Amazon Web Services (AWS) revolutionized production application deployments through elastic scalability and an hourly payment plan. Companies can pay for the infrastructure that they need for any given hour of the day, scaling up to meet high user demand during peak hours and scaling down during off peak hours. One of the core components of AWS is the Elastic Compute Cloud (EC2), which is Amazon’s abstraction of a virtual machine running in the cloud. It has many of the same characteristics as traditional virtual machines, but it is tightly integrated into the AWS ecosystem, which means that it has many characteristics that are new or different than traditional virtual machines.
This blog kicks off a series on your guide to the top 5 performance challenges you might come across managing an AWS EC2 environment, and how to best address them.
Running in a Multi-Tenancy Environment
Amazon EC2 instances are virtual machines that run in Amazon’s cloud and, like all virtual machines, they ultimately run on physical hardware. One side effect of running virtual machines in an environment that you do not own is that you cannot control what other virtual machines run next to you on the same physical hardware and some of your neighbors may be noisier than others. Basically, the performance of EC2 instances can sometimes be spotty and you need to be able to effectively identify when your virtual machines are running on spotty hardware and react accordingly.
So, how do you know if you have noisy neighbors?
Answer: You need to review both the physical runtime characteristics of your virtual machines as well as Amazon’s CloudWatch metrics. You can monitor the AWS CloudWatch metrics, OS and application level metrics and correlate them using an APM tool, such as AppDynamics. When examining the runtime behavior of a virtual machine using tools like top or vmstat, you’ll observe that, in addition to returning the CPU usage and idle percentages, the operating system also returns the amount of “stolen” CPU. Stolen CPU is not as diabolical as it sounds, it is a relative measure of the cycles of a CPU that should have been available to run your processes, but were not available because the hypervisor diverted cycles away from your instance. This may have happened because you have met your allotted quota of CPU usage or because another instance running on the same physical hardware is occupying the available CPU capacity. A little investigation will help you discern between the two.
First, you need to know the underlying CPU powering the hardware running your virtual machine. You can connect into your machine and execute the following command to view the CPU information:
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2651 v2 @ 1.80GHz
stepping : 4
cpu MHz : 1800.083
cache size : 30720 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 apic cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc up pni ssse3 popcnt
bogomips : 3647.67
clflush size : 64
In this example, the EC2 instance is running on a 1.8 GHz Xeon processor. Because an ECU is equivalent to a 1.0-1.2 GHz Xeon processor then 100% ECU utilization would be between 55% and 66% of the physical CPU usage. Next, we need to look at the runtime behavior of our virtual machine, which can be accomplished by using the top or vmstat command:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 1101516 162504 383016 0 0 0 6 1 1 0 0 100 0 0
The “us” metric reports the CPU usage, the “id” metric reports the idle percentage of the CPU, and the “st” metic reports the amount of stolen CPU. The key to identifying noisy neighbors is to observe the metrics over time pay particular attention to the stolen CPU time when your virtual machine is idle. If you see discrepancies in the stolen time when your CPU is idle then that indicates that you are sharing the same hardware with other customers. For example, when your CPU usage is idle and the stolen CPU percentage is 10% once and then another time the stolen CPU usage is 40% then chances are that you are not simply observing hypervisor activity, but rather are observing the behavior of another virtual machine.
In addition to reviewing the operating system CPU utilization and stolen percentage, you need to cross reference this with Amazon’s CloudWatch metrics. The CloudWatch CPUUtilization is defined by Amazon as follows:
The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance. Depending on the instance type, tools in your operating system may show a lower percentage than CloudWatch when the instance is not allocated a full processor core.
The CloudWatch CPUUtilization metric is your source of truth for truly understanding your virtual machine’s processing capability and, combined with the operating system’s CPU metrics, can provide you with enough information to discern the presence of a noisy neighbor. If the CloudWatch CPUUtilization metric is at 100% for your EC2 instance and your operating system is not reporting that you’ve reached 100% of your ECU capacity (one core of your physical CPU type / 1.0-1.2 GHz) and the CPU stolen percentage is high, then you have a noisy neighbor that is draining compute capacity from your virtual machine. For example, in the aforementioned scenario of 1 ECU on a machine with a 2.4 Ghz physical CPU, we would expect 100% ECU usage to be about 50% of the physical CPU. If the operating system reports that we’re at 30% CPU utilization with a high stolen CPU usage and CloudWatch reports that we’re at 100% usage, then we have identified a noisy neighbor.
Figure 1 presents an example of the CloudWatch CPU Utilization as compared to the operating system CPU utilization and stolen percentages, captured graphically over time. This illustrates the discrepancy between how CloudWatch sees your instance and how your instance sees itself.
Figure 1. CloudWatch vs OS CPU Metrics
Once you’ve identified that you have a noisy neighbor, so how should you react? You have a few options, but here’s what we recommend:
- Observe the behavior to determine if it is a systemic problem (is the neighbor constantly noisy?)
- If it is a systemic problem then move. While this might not be the best choice if you live in an apartment building, in Amazon it is simple: start a new AMI instance, register it with your ELB, and decommission the old one. If you get into the habit of viewing EC2 instances as disposable it will help you resolve these types of issues quickly
- Consider implementing a rolling EC2 instance strategy. Again, building on the idea that EC2 instances are disposable, it is a good practice to only keep your EC2 instances around for a short period of time, such as a few hours. Just as your laptop benefits from being restarted regularly, so will your server instances. Over time your instances may add clutter to your memory and, rather than work overly hard to clean them up, it is far easier to replace them with new ones. The best strategy that I have seen in this space is to manage your rolling instances through your elasticity strategy. The cloud allows you to scale up when you need more capacity (peak hours) and scale down when you need less capacity (off-peak hours). Getting into the habit of decommissioning the oldest instances when you scale down can shorten the average lifespan of your EC2 instances and achieve a rolling EC2 instance strategy.