The Amazon Web Services (AWS) outage has raised questions about whether AWS, and the cloud in general, is ready to host revenue-critical production applications. For many popular sites, such as Reddit and Zuora, the outage lasted more than a day, and it has cast doubt on cloud computing as a whole.
But before we write off the cloud, let’s review a few lessons we can learn from this outage.
The primary reason Netflix’s architecture survived while many others failed is that Netflix relied on multi-Region redundancy, whereas most others relied only on redundancy across multiple Availability Zones within a single Region. This is what Amazon’s own docs say:
Amazon EC2 provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of Regions and Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries. The Amazon EC2 Service Level Agreement commitment is 99.95% availability for each Amazon EC2 Region. Amazon EC2 is currently available in five regions: US East (Northern Virginia), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo).
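To put the quoted 99.95% figure in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the function name is made up for illustration, and it assumes the percentage is measured over a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: availability measured over a 365-day year


def downtime_budget_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an availability percentage (hypothetical helper)."""
    return period_hours * (1 - availability_percent / 100)


budget = downtime_budget_hours(99.95)
print(f"{budget:.2f} hours per year")  # prints "4.38 hours per year"
```

An outage that lasts more than a day is therefore several times larger than the downtime that figure allows for a whole year.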
Since Amazon promises that Availability Zones within a Region are “engineered to be insulated,” most people have relied on multiple Availability Zones within a single Region to protect against failures, rather than on multiple Regions. Yesterday’s outage was a failure of multiple Availability Zones in one Region, and most applications were not engineered for that. Applications that were engineered for multi-Region redundancy, such as Netflix, survived easily.
Of course, multi-Region redundancy has a cost, both in network latency and in bandwidth dollars. But if you cut corners on your multi-site failover strategy, whether in the data center or in the cloud, you are bound to run into a situation like this at some point. Don’t forget that data centers do and will fail from time to time, and the cloud doesn’t change that fact.
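The difference between the two strategies can be sketched in a few lines. This is a simplified illustration, not how Netflix or anyone else actually implements failover: the Endpoint type, the pick_endpoint function, and the health flags are all hypothetical, and the Region and zone names are used only as labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    region: str    # e.g. "us-east-1"
    zone: str      # Availability Zone inside that Region
    healthy: bool  # assumption: filled in by some external health check


def pick_endpoint(endpoints, preferred_region):
    """Return a healthy endpoint, preferring the given Region; None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    # False sorts before True, so endpoints in the preferred Region come first.
    healthy.sort(key=lambda e: e.region != preferred_region)
    return healthy[0]


# Both zones in the primary Region are down, as in a Region-wide outage.
fleet = [
    Endpoint("us-east-1", "us-east-1a", False),
    Endpoint("us-east-1", "us-east-1b", False),
    Endpoint("us-west-1", "us-west-1a", True),
]

# The same deployment without the second Region (multi-Availability Zone only).
multi_az_only = [e for e in fleet if e.region == "us-east-1"]

print(pick_endpoint(fleet, "us-east-1"))          # falls back to the us-west-1 endpoint
print(pick_endpoint(multi_az_only, "us-east-1"))  # None: nothing left to fail over to
```

With only one Region in the list, no amount of zone-level redundancy helps once the whole Region is unhealthy, which is the situation many sites found themselves in.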
One of the reasons people had trouble understanding the implications of Availability Zones is that Amazon keeps a cloak of secrecy around how they really work. Amazon has never made clear what “engineered to be insulated” really means. Does it mean different racks, different floors in the data center, or different buildings? And how are the power sources and networking backbones actually insulated from one another? For a radical technology shift like the cloud to be really successful, full architectural transparency is a must, so that application architects have all the information they need to make the correct decisions. This outage clearly shows why demanding transparency from cloud providers is critical.