If you’re just starting, check out Part 1, Part 2, and Part 3 to get up to speed on your guide to the top 5 performance challenges you might come across managing an AWS Elastic Compute Cloud (EC2), and how to best address them. We kicked off with the ins and outs of running your virtual machines in Amazon’s cloud, and how to navigate your way through a multi-tenancy environment along with managing different storage options. Last week, we went over how to identify the right applications run on EC2 instances for your unique workloads. This week, we’ll wrap up by handling Amazon’s Elastic Load Balancer (ELB) performance and overall AWS outages.
Poor ELB Performance
Amazon’s Elastic Load Balancer (ELB) is Amazon’s answer to load balancing that integrates seamlessly into the AWS ecosystem. Rather than sending calls directly to individual EC2 instances we can instead insert an ELB in front of our EC2 instances, send load to our ELB, and then the ELB will distribute that load across the EC2 instances. This allows us to more easily add and remove EC2 instances from our environment and affords us the optimization of leveraging auto-scaling groups that grow and shrink our EC2 environment based on our rules or based on performance metrics. This relationship is shown in figure 3.
Figure 3. ELB-to-EC2 Relationship
While we may think of an ELB as a stand-alone appliance, like a Cisco LocalDirector or F5 BigIP, under-the-hood ELB is a proprietary load balancing application running on EC2 instances. As such, it benefits from the same elastic capabilities that your own EC2 instances do, but it also suffers from the same constraints of any load balancer running on in a virtual environment, namely it must be sized appropriately. If your application receives substantial load then you need to have enough ELB instances to handle and distribute that load across your EC2 instances. Unfortunately (or fortunately depending on how you look at it) you do not have visibility into the number of ELB instances or their configurations, you must rely on Amazon to manage that complexity for you.
So how do you handle that scale-up requirement for applications that receive substantial load? There are two things that you need to keep in mind:
Pre-warming: if you know that your application will receive substantial load or you are expecting flash traffic then you should contact Amazon and ask them to “pre-warm” the load balancer for you. Amazon will then configure the load balancer to have the appropriate capacity to handle the load that you expect.
Recycling ELBs: for most applications it is advisable to recycle your EC2 instances regularly to clean up memory or other clutter that might appear on a machine, but in the case of ELBs, you do not want to recycle them if at all possible. Because an ELB consists of EC2 instances that have grown, over time, to facilitate your user load, recycling them would effectively reset their capacity back to zero and force them to start over.
Handling AWS Failures and Outages
Amazon failures are very infrequent, but they can and do occur. For example, AWS had an outage in its North Virginia data center (US-EAST-1) for 6-8 hours on September 20, 2015 that affected more than 20 of its services, including EC2. Many big-name clients were affected by the outage, but one notable company that managed to avoid any “significant impact” was Netflix. Netflix has created what it calls its Simian Army, which is a set of processes with colorful names like Chaos Monkey, Latency Monkey, and Chaos Gorilla (get the simian reference?) that regularly wreak havoc on their application and on their Amazon deployment. As a result they have built their application to handle failure, so the loss of an entire data center did not significantly impact Netflix.
AWS runs across more than a dozen data centers in the world:
Amazon divides the world into regions and each region maintains more than one availability zone (AZ), in which each AZ represents a data center. When a data center fails then there are redundant data centers in that same region that can take over. For services like S3 and RDS, the data in a region is safely replicated between AZs so that if one fails then the data is still available. With respect to EC2, you need to choose the AZs to which you deploy your applications so it is advised that you deploy multiple instance of your applications and services to multiple AZs in your region. You are in control of how redundant your application is, but it also means that you need to run more instances (in different AZs), which equates to more cost.
Netflix’s Chaos Gorilla is a tool that simulates a full region outage for Netflix and they have tested their application to ensure that it can sustain a full region outage. Cross AZ data replication is available to you for free in AWS, but if you want to replicate data across regions then the problem is far more complicated. S3 supports cross-region replication but at the cost of transferring data out to and in from other regions. RDS cross-region replication varies based on the database but sometimes at much higher costs. Overall, Adrian Cockcroft, the former chief architect of Netflix, tweeted that the cost to maintain active-active data replication across regions is about 25% of the total cost.
All of this is to say that resiliency and high availability are at odds with both financial costs as well as the performance overhead of data replication, but are all available to the diligent. In order to be successful at handling Amazon failures (and scheduled outages for that matter), you need to architect your application to protect against failure.
Amazon Web Services may have revolutionized computing in the cloud, but it also introduced new concerns and challenges that we need to be able to identify and respond to. This paper presented five challenges that we face when managing an EC2 environment:
Running in a Multi-Tenancy Environment: how do we determine when our virtual machines are running on hardware shared with other virtual machines and those other virtual machines are noisy?
Poor Disk I/O Performance: how do we properly interpret AWS’s IOPS metric and determine when we need to opt for a higher IOPS EBS volume?
The Wrong Tool for the Job: how do we align our application workload with Amazon optimized EC2 instance types?
Poor ELB Performance: how do ELBs work under-the-hood and how do we plan for and manage our ELBs to match expected user load?
Handling AWS Failures and Outages: what do we need to consider when building a resilient application that can sustain AZ or even full region failures?
Hopefully this series gave you some starting points for your own performance management exercises and helped you identify some key challenges in your own environment that may be contributing to performance issues.