The Usual Suspects
The reason for this blog is purely down to a real-life incident which one of our e-commerce customers shared with us this week. It’s based around a use case that pretty much anyone can relate to – the moment your checkout transaction spectacularly fails. You sit there, looking at a big fat error message and think “WTF – did my transaction complete or did the company steal my money?” A minute later you’re walking a support team through exactly what happened: “I just clicked Checkout and got an error…honestly…I waited and never got a response.”
What’s different in this story is that the support team had access to AppDynamics while talking to the customer on the phone…and the customer got to find out the real reason their checkout failed. How often does that happen? Never, until now. Here is the story as documented by the customer.
Apple has done a stellar job with their development platform and iOS. In fact, they’ve done a stellar job turning my living room into an Apple showroom. If you asked me 10 years ago whether my laptop, mouse, keyboard, monitor, phone, music player, TV and tablet would be colored white with an Apple logo, I would probably have laughed in your face. The only Microsoft thing left in my house now is an Xbox, and it won’t be long before that turns white as well. Being married also presents a problem in that I now have two of everything, because sharing isn’t caring when it comes to Apple gadgets. With Apple technology being “cool” and widely adopted by millions of users, you can see why every business is migrating their applications to iOS for an improved end user experience. One of our customers recently made the move, and here’s a story of how their new iPhone app crashed their entire mission-critical web application… and I bet you weren’t expecting me to say that, were you?
An unusual spike in performance
Below is a screenshot from AppDynamics showing monitoring data for the customer’s online web application over the last month. The application has approximately 250 IIS instances, a dozen databases, a dozen web services and a distributed cache.
If I had a dollar for every time I heard the phrase “We used to throw hardware at performance issues”–well, I wouldn’t be writing this blog; I’d be sitting on a beach somewhere in Hawaii, retired with millions in the bank and suffering from an addiction to tequila and nachos. Over the years, hardware has become both the problem and the solution to all of life’s application performance issues. You could argue IT Ops has become obsessed with CPU metrics, using the % CPU utilization of a server or process to figure out whether an application is having issues or performing correctly. With the cloud around the corner, CPU and compute resources are once again a top priority for organizations and operations teams planning to migrate their applications into the new elastic fantastic world of IaaS and PaaS.
Let me walk you through a simple example of why it’s better to understand what burns CPU in an application, rather than just looking at the % CPU utilization of servers and making naive decisions about buying more hardware or over-provisioning.
Below is a screenshot showing the current and average % CPU utilization for a real e-commerce application, made up of 50 JVMs in production, that was suffering from performance issues:
Everything looks OK–nearly all JVMs are running at ~30% utilization, so the application must be running fine–right? Well, sure. That is, except for the fact that it wasn’t running fine that day. Everything glowing green on capacity is not a real or trustworthy indicator of end user experience.
Here is another screenshot of the same application which shows the average (bottom line) and maximum (top line) % CPU utilization for the JVMs over the space of a day. Looking at the average % CPU utilization (bottom line) shows a similar story to the previous screenshot: the application infrastructure is barely utilized and everything is OK.
If you look at the maximum % CPU utilization, you can clearly see three spikes where CPU resource for the application becomes totally exhausted, causing slow performance as the JVMs and threads contend for CPU cycles to process incoming user requests. Looking at this data, no one is going to get fired for saying “I think we need to buy more hardware so we don’t run out of capacity.” In fact, in most organizations that is exactly what happens, especially when developers realize the application doesn’t scale in performance testing. This scenario is also a valid use case for auto-scaling in the cloud, where CPU can be provisioned on the fly in minutes. However, as I’m sure many of you are aware, buying more hardware just prolongs the pain and glosses over the real underlying issues, which lie in the application rather than the infrastructure.
A much better question to ask in this example would be: “What in the application is responsible for burning that CPU?” Sounds obvious, right? But it’s not that simple, because the only visibility IT Operations has is which processes (e.g. java.exe) consume the most CPU on each server; they can’t go any deeper than identifying which JVMs are burning more CPU than others. This is where Application Performance Management (APM) tools such as AppDynamics can help.
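To make the idea concrete, here is a minimal sketch (not AppDynamics code) of how you can measure the CPU burned by a single request inside the JVM itself, using the standard ThreadMXBean API. The handleSearch() method is just a hypothetical stand-in for whatever business transaction you want to profile:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Minimal sketch: measure how much CPU one request burns on the current
// thread, rather than watching server-wide % CPU utilization.
public class CpuBurnProbe {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    public static void main(String[] args) {
        if (THREADS.isThreadCpuTimeSupported()) {
            THREADS.setThreadCpuTimeEnabled(true);
        }

        long wallStart = System.nanoTime();
        long cpuStart  = THREADS.getCurrentThreadCpuTime(); // nanoseconds of CPU used by this thread

        handleSearch("red shoes"); // hypothetical business transaction

        long cpuBurnedMs = (THREADS.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        long wallMs      = (System.nanoTime() - wallStart) / 1_000_000;

        System.out.printf("Search: %d ms elapsed, %d ms of CPU burned%n", wallMs, cpuBurnedMs);
    }

    private static void handleSearch(String query) {
        // Placeholder for real application work.
        long sum = 0;
        for (int i = 0; i < 5_000_000; i++) { sum += i % (query.length() + 1); }
        if (sum == -1) System.out.println("unreachable");
    }
}
```

The interesting number is the ratio of CPU time to wall-clock time: a transaction that spends most of its elapsed time burning CPU is a candidate for code optimization, not for more hardware.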
Below is another screenshot showing the performance and health of all the business transactions (user requests) that flowed through the e-commerce application in production, along with a classification of how many business transactions were slow, very slow, stalled or erroneous. The most relevant metric in this story is the “CPU Used” column to the far right, which shows the average CPU burn (in milliseconds) for each business transaction. You can see that one transaction, “Search”, stands out with a high average CPU burn of 797ms per transaction. We can also see that Search has several thousand requests that were classified as either slow or very slow, indicating a significant deviation from its normal performance baseline.
Let’s take a look at one slow Search transaction in more detail:
The above screenshot shows that this slow Search business transaction took 12.6 seconds to execute, and over 4.4 seconds of that time was spent burning CPU. 4.4 seconds of CPU for a simple search transaction. How is that possible?? Let’s take a look at the hot spots for this search transaction to understand what application code was contributing to this CPU burn:
The above screenshot shows that for each search result, the application retrieves product detail information using standard JavaBeans. This all sounds fine–that is, right up until you look at the code execution for each getProduct() method call, which is invoked over 15 times for this Search transaction:
Holy EJB, Appman! You can see every getProduct() call makes several EJB calls, which in turn pull product data from the database using multiple JDBC calls. Unfortunately, this code execution (and its latency) happens for every product in the search results. It’s therefore no surprise that the transaction burns significant CPU cycles, as it uses hundreds of EJB and JDBC calls just to return data to the user. Operations could communicate data like this back to developers so they could optimize the application logic and ensure that transactions like these use significantly less CPU, minimizing capacity spikes and any associated business transaction latency along the way. The last thing you want is a crap end user experience because your servers are on their knees processing inefficient application code that developers are responsible for.
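For illustration only, here is a hypothetical sketch of this anti-pattern and one possible fix, written with plain JDBC rather than the customer’s actual EJB code (the table and column names are made up): the slow path issues one query per product in the search results, while the batched path fetches the whole result page in a single query.

```java
import java.sql.*;
import java.util.*;

// Hypothetical sketch, not the customer's code: an "N+1" lookup versus a batched lookup.
public class ProductLookup {

    // Slow: called once per search result, so 15 results means 15 round trips,
    // 15 statements and 15 result sets, all burning CPU on marshalling.
    static Map<String, Object> getProductOneByOne(Connection conn, long productId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, name, price FROM product WHERE id = ?")) {
            ps.setLong(1, productId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                Map<String, Object> product = new HashMap<>();
                product.put("id", rs.getLong("id"));
                product.put("name", rs.getString("name"));
                product.put("price", rs.getBigDecimal("price"));
                return product;
            }
        }
    }

    // Faster: one query for the whole page of search results.
    static List<Map<String, Object>> getProductsBatched(Connection conn, List<Long> ids) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT id, name, price FROM product WHERE id IN (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i));
            }
            List<Map<String, Object>> products = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Map<String, Object> product = new HashMap<>();
                    product.put("id", rs.getLong("id"));
                    product.put("name", rs.getString("name"));
                    product.put("price", rs.getBigDecimal("price"));
                    products.add(product);
                }
            }
            return products;
        }
    }
}
```

Collapsing per-product lookups into a single batched query (or a cached read) is usually the kind of fix developers make once data like the screenshot above lands on their desk.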
If you’re thinking of moving to the Cloud anytime soon, you might want to check which business transactions in your application are responsible for high CPU burn. If one or two transactions are burning 80% of your total CPU resource, then that’s an easy opportunity to save money–rather than to spend it in the Cloud.
If you want to know how much CPU your applications and business transactions burn, take a free 30-day trial of AppDynamics Pro and get started today. You might end up making your application scale without spending needless money on hardware – FamilySearch did and managed a 10X improvement in throughput and performance of their application.
So long, suckers!
Every day in our lives we rely on services provided by other people: making a phone call, getting a car fixed, or ordering a pizza – and we want those things to happen as quickly as possible, because time often means money. If you take your car to a Mercedes or BMW dealer, you will understand this point better than anyone. Our productivity (and often happiness) is therefore controlled, every day, by different organizations and people. When things slow down or don’t happen, we get upset, frustrated, and sometimes rant on Twitter like these folks:
If your application today follows SOA design principles, is heavily distributed and relies on 3rd party service providers, then you’ve probably become frustrated at some point when your application slows down or crashes. The problem is this: your end user experience and quality of service (QoS) is only as good as the QoS of your service providers. So, unless you monitor QoS you can’t measure QoS–and if you can’t measure QoS, you can’t manage your service providers or your end user experience. For example, take a look at this customer e-commerce application, which has 7 JVMs, 1 database and 7 external web service providers:
This customer recently had a slowdown with their e-commerce production application. After a few minutes browsing AppDynamics, they successfully identified that one of their web service providers was having latency issues (AppDynamics automatically baselines performance and flags deviations for each web service provider as shown in the above screenshot). The customer called their service provider, and sure enough the service provider admitted to having issues. A few hours later the service provider called back and said “we fixed the problem, everything should be back to normal”–yet the customer could clearly see latency issues still occurring in AppDynamics. So they sent their service provider a screenshot showing the evidence. The service provider then checked again, and called back a few minutes later saying “Yes, sorry a few customers are still being impacted.” Without this level of visibility, many organizations are simply blind to how external service providers impact their end user experience and business.
Being able to troubleshoot slow performance in minutes is helpful, but what about being able to report the exact service level you receive–say, from each of your service providers over a period of time? Did your service improve over time or did it regress? How many outages or severity 1 incidents did your service providers cause this week for your application?
Take the below screenshot from AppDynamics, which plots the maximum response time for five different web services consumed by an application over the last week. You can see that three of the five web services (the pink, blue and turquoise lines) consistently deliver sub-second response times and provide a great service level. However, the other two web services (the red and green lines) show performance spikes, with response times of between 14 and 22 seconds. The green web service in particular is very inconsistent, showing several performance spikes within two days.
Below is the response time of another web service (PayPal) for a customer application over the last 3 months. Notice the spikes in response time, and look at the deviation between average and maximum response time over the period. What’s impressive is that despite the occasional service blip, the PayPal service has slowly improved by roughly 14%, from 450 milliseconds to around 385 milliseconds. It has also been very stable over the last few weeks, delivering a consistent service (small deviation between average and maximum response times).
If your application relies on one or more 3rd party web services, you should periodically check and report what level of service you are receiving each week. That way, you can truly understand your service provider QoS and its impact on your end user experience and application performance. You can also keep your service providers honest, with complete visibility of whether QoS is improving or degrading over time as service outages occur and are fixed.
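If you want a starting point before (or alongside) an APM tool, here is a minimal sketch of the idea: time each call to an external web service and flag calls that deviate badly from a rolling baseline. It assumes Java 11+ and a hypothetical provider URL; a real setup would persist the samples so you could report weekly averages and maximums per provider.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: measure the response time of an external web service call and
// compare it against a simple rolling baseline.
public class ProviderQosProbe {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    private static double rollingAverageMs = 400;   // seeded baseline, updated on every call
    private static final double ALPHA = 0.1;        // weight given to the newest sample
    private static final double DEVIATION_FACTOR = 3.0;

    public static void main(String[] args) throws Exception {
        long elapsedMs = timeCall("https://api.example-provider.com/v1/price-check"); // hypothetical URL

        if (elapsedMs > rollingAverageMs * DEVIATION_FACTOR) {
            System.out.printf("ALERT: provider call took %d ms (baseline ~%.0f ms)%n",
                    elapsedMs, rollingAverageMs);
        }
        rollingAverageMs = (1 - ALPHA) * rollingAverageMs + ALPHA * elapsedMs;
    }

    private static long timeCall(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        long start = System.nanoTime();
        CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

An exponentially weighted average is a crude baseline compared with what AppDynamics computes automatically, but it is enough to put hard numbers in front of your service provider when you pick up the phone.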
The next time you experience a slow down or outage in your application, you should first check external web services before you start to troubleshoot your own. The last thing you want to be doing is debugging your own code, when it could be someone else’s service and code that is causing the issue. Using AppDynamics it’s possible to monitor, measure, and manage the QoS from each of your web service providers. You can get started right now by downloading AppDynamics Lite (our free edition) for a single JVM or IIS web server, or you can request a 30-day trial of AppDynamics Pro (our commercial edition) for Java or .NET applications with multiple JVMs and IIS web servers.