If I had a dollar for every time I heard the phrase “We used to throw hardware at performance issues,” well, I wouldn’t be writing this blog; I’d be sitting on a beach somewhere in Hawaii, retired with millions in the bank and suffering from an addiction to tequila and nachos. Over the years, hardware has become both the problem and the solution to all of life’s application performance issues. You could argue IT Ops has become obsessed with CPU metrics, using the % CPU utilization of a server or process to decide whether an application is having issues or performing correctly. With the Cloud around the corner, CPU and compute resources are once again a top priority for organizations and operations teams planning to migrate their applications into the new elastic fantastic world of IaaS and PaaS.
Let me walk you through a simple example of why it’s better to understand what burns CPU in an application, rather than just looking at the % CPU utilization of servers and making naive decisions about buying more hardware or over-provisioning.
Below is a screenshot showing the current and average % CPU utilization for a real e-commerce application, comprising 50 JVMs in production, that was suffering from performance issues:
Everything looks OK: nearly all JVMs are running at ~30% utilization, so the application must be running fine, right? Well, sure, except for the fact that it wasn’t running fine that day. Things glowing green on a capacity dashboard is not a real or trusted indicator of end user experience.
Here is another screenshot of the same application which shows the average (bottom line) and maximum (top line) % CPU utilization for the JVMs over the space of a day. Looking at the average % CPU utilization (bottom line) shows a similar story to the previous screenshot: the application infrastructure is barely utilized and everything is OK.
If you look at the maximum % CPU utilization, you can clearly see three spikes where CPU resource for the application becomes totally exhausted, causing slow performance as the JVMs and threads contend for CPU cycles to process incoming user requests. Looking at this data, no one is going to get fired for saying “I think we need to buy more hardware so we don’t run out of capacity.” In fact, in most organizations, that is exactly what happens, especially when developers realize the application doesn’t scale in performance testing. This scenario is also a valid use case for auto-scaling in the cloud, where CPU can be provisioned on the fly in minutes. However, as I’m sure many of you are aware, buying more hardware just prolongs the pain and glosses over the real underlying issues, which lie in the application rather than the infrastructure.
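To see why the average line hides exactly what matters, here’s a minimal sketch (using made-up sample values, not the data from the screenshots) of how averaging flattens short CPU spikes that the maximum exposes:

```java
import java.util.Arrays;

public class CpuSpikes {
    // Average of a series of % CPU samples.
    static double avg(double[] samples) {
        return Arrays.stream(samples).average().orElse(0);
    }

    // Maximum of the same series.
    static double max(double[] samples) {
        return Arrays.stream(samples).max().orElse(0);
    }

    public static void main(String[] args) {
        // Hypothetical samples over a day: mostly ~30%, with three
        // short bursts of near-total CPU exhaustion.
        double[] samples = {28, 31, 29, 97, 30, 27, 31, 95, 32, 29, 99, 30};
        System.out.printf("avg=%.1f%% max=%.1f%%%n", avg(samples), max(samples));
        // The average still looks comfortable, while the max reveals
        // the three exhaustion spikes described above.
    }
}
```

The more samples you average over, the more thoroughly a short spike disappears, which is exactly why the average line in the screenshot tells a rosier story than the maximum line.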
A much better question to ask in this example would be: “What in the application is responsible for burning that CPU?” Sounds obvious, right? But it’s not that simple, because the only visibility IT Operations has is which processes (e.g. java.exe) consume the most CPU on each server; they can’t go any deeper than identifying which JVMs are burning more CPU than others. This is where Application Performance Management (APM) tools such as AppDynamics can help.
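For context on where process-level visibility bottoms out: the JVM itself can attribute CPU time to individual threads via the standard java.lang.management API, but even that only tells you which thread is hot, not which business transaction it was serving. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpu {
    // Builds a "threadName=cpuMillis" report for every live thread in
    // this JVM, using the platform ThreadMXBean.
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if unsupported
            if (info != null && cpuNanos >= 0) {
                sb.append(info.getThreadName())
                  .append('=')
                  .append(cpuNanos / 1_000_000)
                  .append("ms\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

This is a step beyond “java.exe is at 90%,” but mapping a hot thread back to the user request it was processing is the part that requires transaction-aware instrumentation.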
Below is another screenshot showing the performance and health of all the business transactions (user requests) that flowed through the e-commerce application in production, along with a classification of how many business transactions were slow, very slow, stalled, or erroneous. The most relevant metric in this story is the “CPU Used” column on the far right, which shows the average CPU burn (in milliseconds) for each business transaction. You can see that one transaction, “Search,” stands out with a high average CPU burn of 797ms per transaction. We can also see Search has several thousand requests classified as slow or very slow, indicating a significant deviation from its normal performance baseline.
Let’s take a look at one slow Search transaction in more detail:
The above screenshot shows that this slow Search business transaction took 12.6 seconds to execute, and over 4.4 seconds of that time was spent burning CPU. That’s 4.4 seconds of CPU for a simple search transaction. How is that possible? Let’s take a look at the hot spots for this search transaction to understand what application code was contributing to this CPU burn:
The above screenshot shows that for each search result, the application retrieves product detail information using standard Java Beans. This all sounds fine, right up until you look at the code execution for each getProduct() method call, which is invoked over 15 times for this Search transaction:
Holy EJB, Appman! You can see every getProduct() call makes several EJB calls, which in turn pull product data from the database using multiple JDBC calls. Unfortunately, the code execution (and latency) above happens for every product in the search results. It’s therefore no surprise that this transaction burns significant CPU cycles, as it uses hundreds of EJB and JDBC calls to return data to the user. Operations could communicate data like this back to developers so they could optimize the application logic and ensure that transactions like these use significantly less CPU, minimizing capacity spikes and any associated business transaction latency along the way. The last thing you want is a crap end user experience because your servers are on their knees processing inefficient application code which developers are responsible for.
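What the call graph above is showing is the classic N+1 query pattern. Here’s a simplified sketch (the class and method names are hypothetical, and an in-memory map stands in for the database) contrasting one round trip per product with a single batched lookup, the kind of fix developers could make once they see this data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SearchNPlusOne {
    // Stand-in for the product table; in the real application each lookup
    // is an EJB call fanning out into JDBC round trips.
    static final Map<Integer, String> DB = Map.of(1, "Boots", 2, "Hat", 3, "Scarf");

    // Counts simulated database round trips so the two shapes can be compared.
    static int queryCount = 0;

    // The N+1 shape: one round trip per product in the search results.
    static List<String> getProductsOneByOne(List<Integer> ids) {
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            queryCount++;            // each getProduct() -> its own DB round trip
            out.add(DB.get(id));
        }
        return out;
    }

    // The batched alternative: a single query for the whole result page
    // (think SELECT ... WHERE id IN (...)).
    static List<String> getProductsBatched(List<Integer> ids) {
        queryCount++;                // one round trip, regardless of result size
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            out.add(DB.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        getProductsOneByOne(List.of(1, 2, 3));
        System.out.println("one-by-one round trips: " + queryCount);
    }
}
```

With 15+ getProduct() calls per search, each fanning out into several EJB and JDBC calls, the per-request cost multiplies quickly; collapsing them into one batched query removes most of that CPU and latency in one change.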
If you’re thinking of moving to the Cloud anytime soon, you might want to check which business transactions in your application are responsible for high CPU burn. If one or two transactions are burning 80% of your total CPU resource, that’s an easy opportunity to save money, rather than spend it in the Cloud.
If you want to know how much CPU your applications and business transactions burn, take a free 30-day trial of AppDynamics Pro and get started today. You might end up making your application scale without spending needless money on hardware – FamilySearch did and managed a 10X improvement in throughput and performance of their application.
So long, suckers!