Cloudfail: Lessons Learned from AWS Outage


The Amazon AWS outage has raised questions about whether AWS (and the cloud in general) is ready to host revenue-critical production applications. The outage lasted more than a day for many popular sites, such as Reddit and Zuora, and it left many people doubting cloud computing.

But before we write off the cloud, let’s review a few lessons we can learn from this outage.

Some survived, many did not
The number one lesson is that not EVERY application running in AWS died. Netflix, one of the biggest web apps running in AWS, survived the outage without any issues, while sites like Reddit and Zuora were down for more than a day. So why did some survive when many did not? Simply because many of these companies forgot that the cloud is not a magical solution to everything. As you move to the cloud, you still have to implement the architectural techniques that have been perfected over years in the physical data center world.

“Multi-Availability Zone” vs “Multi-Region” redundancy

The primary reason Netflix’s architecture survived while many others failed comes down to the difference between multi-Availability Zone and multi-Region redundancy. This is what Amazon’s own docs say:

Amazon EC2 provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of Regions and Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries. The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region. Amazon EC2 is currently available in five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo).
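To put the 99.95% figure in perspective, here is a small arithmetic sketch. The function names are illustrative, and the two-Region number assumes that Region failures are statistically independent, which is an assumption of this example and not something Amazon's documentation states.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours


def combined_availability(availability: float, copies: int) -> float:
    """Availability of `copies` redundant sites, ASSUMING independent failures."""
    return 1.0 - (1.0 - availability) ** copies


single = allowed_downtime_hours(0.9995)
dual = allowed_downtime_hours(combined_availability(0.9995, 2))

print(f"one Region:  {single:.2f} hours/year")
print(f"two Regions: {dual * 3600:.1f} seconds/year (if failures are independent)")
```

A 99.95% commitment leaves room for roughly 4.38 hours of downtime per year in a single Region, so an outage lasting more than a day is far outside it. Under the independence assumption, two Regions would shrink that budget to a few seconds per year; real-world failures are rarely fully independent, so treat that as an upper bound on the benefit.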

Since Amazon promises that Availability Zones within a Region are “engineered to be insulated,” most people rely on multiple Availability Zones within a single Region to protect against failures, rather than spreading across multiple Regions. Yesterday’s outage was a failure of multiple Availability Zones in one Region, and most applications were not engineered for that. But applications that were engineered for multi-Region redundancy, such as Netflix, survived easily.

Of course, multi-Region redundancy comes at a cost, both in network latency and in bandwidth dollars. But if you cut corners on your multi-site failover strategy, whether in the data center or the cloud, you are bound to run into a situation like this at some point. Don’t forget that data centers do and will fail from time to time, and the cloud doesn’t change that fact.
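The difference between the two strategies can be illustrated with a minimal failover sketch. Everything here is hypothetical: the `Endpoint` type, the `pick_endpoint` helper, the `example.com` URLs, and the simulated health check stand in for whatever routing and monitoring a real deployment would use. It is not a description of how Netflix or anyone else actually implements failover.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    region: str  # e.g. "us-east-1"
    zone: str    # e.g. "us-east-1a"
    url: str     # hypothetical application URL


def pick_endpoint(endpoints: Iterable[Endpoint],
                  is_healthy: Callable[[Endpoint], bool]) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    return None


# Priority order: two zones in the primary Region, then a second Region.
ENDPOINTS = [
    Endpoint("us-east-1", "us-east-1a", "https://east-a.example.com"),
    Endpoint("us-east-1", "us-east-1b", "https://east-b.example.com"),
    Endpoint("us-west-1", "us-west-1a", "https://west-a.example.com"),
]


def region_is_up(down_regions: set) -> Callable[[Endpoint], bool]:
    """Build a simulated health check that fails every endpoint in down_regions."""
    return lambda ep: ep.region not in down_regions


# Simulate a Region-wide failure that takes out every zone in us-east-1.
multi_az_only = [ep for ep in ENDPOINTS if ep.region == "us-east-1"]
check = region_is_up({"us-east-1"})

print(pick_endpoint(multi_az_only, check))  # nothing left to fail over to
print(pick_endpoint(ENDPOINTS, check))      # falls through to us-west-1
```

With only multi-Availability Zone redundancy, a Region-wide failure leaves no healthy endpoint. Adding a second Region to the priority list gives the application somewhere to go, at the latency and bandwidth cost described above.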

Why architectural transparency is important
One reason people had trouble understanding the implications of Availability Zones is that Amazon keeps a cloak of secrecy around how they really work. Amazon has never made clear what “engineered to be insulated” really means. Are zones different racks, different floors in the data center, or different buildings? And how are power sources and networking backbones actually isolated? For a radical technology shift like the cloud to really be successful, full architectural transparency is a must, so that application architects have all the information they need to make the correct decisions. This outage shows why demanding transparency from cloud providers is critical.

“Multi-Cloud” Redundancy?
Finally, the outage this week should also open the debate on “multi-cloud” redundancy for critical applications. Many cloud providers do not offer multi-Region support as good as Amazon’s, and if you decide to go with one of those providers, it’s very important to have a multi-cloud failover plan in place.

  • http://twitter.com/jimkaskade jimkaskade

    Application availability needs to be led by the application owner….there’s a lot more work required to make sure you have covered the requirements of “application uptime”. This is not an easy task….which is why most are exposed (e.g. multi-hour snapshots, or continuous data protection with real-time data replication? Warm or Hot standby virtual machines with your application?)

Copyright © 2014 AppDynamics. All rights Reserved.