At the recent Silicon Valley Cloud Computing Meetup, Netflix presented the lessons learned from migrating its revenue-critical applications to the Amazon cloud. Netflix is the leading online movie service and its business growth has been astonishing. Take a look at its stock chart for the last year.
The presenter was Adrian Cockcroft, the chief cloud architect for Netflix. Netflix is a true cloud pioneer, and this may be the largest revenue-critical application running on Amazon AWS, generating over $2B a year.
We’re proud to say that AppDynamics has been working hand-in-hand with Netflix for the last 12 months to help manage the performance and availability of their highly-distributed cloud application. Adrian shows some of our application monitoring and code-level diagnostic screens during his talk to explain how they identify and resolve performance problems with cloud-based applications.
Click here to watch the recording.
Below are my takeaways from the session. Let me know your thoughts.
Why did Netflix migrate from a physical data center environment to a cloud environment?
The #1 reason he states is “business agility”: the ability to quickly build and release new products (e.g. iPhone/iPad movie streaming) without having to dramatically ramp up expensive capacity in their physical data center. Some new services are capacity intensive, and the ability to provision hundreds or thousands of cloud nodes has sped their time-to-market with new movies and new products.
Netflix is also experiencing tremendous business growth, with 40% year-over-year member growth. Thus, they also need more capacity to serve this higher demand. Adrian stated that some of the demand spikes were hard to predict; thus, the need for elastic capacity.
The #2 reason he states is to avoid “undifferentiated heavy lifting.” By using cloud capacity, they no longer have to do the things in the data center that don’t differentiate Netflix from its competitors. They can focus all of their time and passion on innovation and differentiation.
Note – He doesn’t cite cost-savings as the #1 or #2 reason.
What is different about managing applications in a physical data center vs a cloud environment?
Quick answer: Everything. Adrian made a pretty bold statement – “Datacenter oriented tools don’t work” in the cloud environment.
“More things to manage” by a factor of 10: Whereas the physical data center may have had 40-50 megaservers in the past, the cloud environment is made up of thousands of commodity, low-cost servers.
Thus, an individual server means less. The health of individual servers (CPU utilization, memory utilization) is no longer a reliable proxy for application performance and availability.
Dynamic vs Static: No longer is the same set of megaservers serving traffic each and every day. Cloud servers are easily replaced, and hundreds of instances can be added or dropped in a minute. Thus, any concept of management that relied on a static set of servers, connections, agents, etc. is severely outdated. No longer can management solutions expect that their agents will persist on the same machines for months or years. The lifespan of a node may be 5 days or less.
Reinventing the Agile Release Process: When new capabilities are ready to be released, you no longer need to update or patch the existing servers. You now have the option to put the new release binaries on hundreds of new cloud instances, send traffic to them, verify that they are performing well, and then take down the hundreds of nodes running the old release. “Dark Launch” feedback mechanisms just got even better.
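To make that release flow concrete, here is a minimal sketch in Python. It is only an illustration of the launch, verify, and retire sequence described above, not Netflix's actual tooling: the `Fleet` and `LoadBalancer` classes, the `roll_out` function, and the 1% error threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """A group of identical cloud instances running one release (hypothetical model)."""
    release: str
    size: int
    error_rate: float  # fraction of failed requests, as reported by monitoring


@dataclass
class LoadBalancer:
    """Tracks which fleets currently receive traffic (hypothetical model)."""
    active: list = field(default_factory=list)

    def attach(self, fleet):
        self.active.append(fleet)

    def detach(self, fleet):
        self.active.remove(fleet)


def roll_out(lb, old, new, max_error_rate=0.01):
    """Bring up `new` next to `old`, then keep whichever fleet is healthy."""
    lb.attach(new)                       # send traffic to the new instances
    if new.error_rate <= max_error_rate:
        lb.detach(old)                   # new release verified: retire the old nodes
        return new
    lb.detach(new)                       # new release unhealthy: fall back to the old one
    return old
```

The point of the design is that the old fleet is never modified. It keeps serving traffic until the new fleet has proven itself, so rolling back is just a matter of detaching the new instances.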
Relationships change: Amazon becomes their IT Operations/Infrastructure department and the relationship of App Dev & Architecture for the new cloud apps is with Amazon.
How do APM solutions need to be architected to work in the Amazon Cloud?
Suffice it to say that a lot has to change. Adrian deserves the credit for dozens of features that have gone into AppDynamics 2.x and 3.0 releases. I won’t do a full sales pitch in this blog – but let me highlight two pretty obvious situations that must be handled elegantly in this highly distributed and dynamic environment:
1) The APM solution must be able to monitor thousands of cloud nodes from a single management server to provide end-to-end transaction performance metrics and tracing. If the APM solution can only scale to 200 nodes per management server, you will need multiple consoles and you won’t have a single pane of glass.
2) The APM solution must be able to handle hundreds of nodes being provisioned and de-provisioned. The performance monitoring, metrics, transaction tracing, service dependency modelling, and deep diagnostics all need to work in this extremely dynamic environment. Legacy APM solutions that don’t dynamically adapt to infrastructure changes will become useless quickly.
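Both requirements can be sketched in a few lines of Python. This is a simplified illustration under assumed names, not how AppDynamics is implemented: `TierMetrics`, `consoles_needed`, and the node and tier identifiers are hypothetical. The idea is to key metrics by application tier rather than by node, so that nodes can come and go without losing history.

```python
import math
from collections import defaultdict


class TierMetrics:
    """Aggregates response times per application tier, not per node (hypothetical)."""

    def __init__(self):
        self.nodes = {}                    # node_id -> tier name
        self.samples = defaultdict(list)   # tier name -> response times in ms

    def register(self, node_id, tier):
        # Called when a cloud node is provisioned.
        self.nodes[node_id] = tier

    def deregister(self, node_id):
        # Called when a node is de-provisioned; the tier's history is kept.
        self.nodes.pop(node_id, None)

    def record(self, node_id, response_ms):
        # Samples from unknown (already removed) nodes are ignored.
        tier = self.nodes.get(node_id)
        if tier is not None:
            self.samples[tier].append(response_ms)

    def average_ms(self, tier):
        values = self.samples.get(tier, [])
        return sum(values) / len(values) if values else None


def consoles_needed(node_count, nodes_per_console):
    """How many management consoles a fleet needs at a given per-console limit."""
    return math.ceil(node_count / nodes_per_console)
```

For example, `consoles_needed(1000, 200)` evaluates to 5, which is the multiple-console situation described in point 1.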
Some of our AppDynamics 3.0 cloud innovations are explained here.
If you don’t follow Netflix’s cloud activities or Adrian, you should. Their path into the cloud is one that any company stands to learn a lot from. If you have any questions about our work with Netflix or about how to manage your app performance in the cloud, be sure to let us know.