The Usual Suspects
Every man and his dog knows that Garbage Collection (GC) is a “stop the world” event. Java memory fills up with redundant objects and data which over time must be collected so that memory can be reclaimed and reused by the JVM. When the GC thread in the JVM kicks in, all other threads grind to a halt, giving the end user the impression that the application is slow. It’s not such a big deal when GC runs for a few seconds, say, every minute, as this is typical for an active application that frequently moves data from databases (disk) into the JVM (memory). The problem is when GC takes more than a few seconds, especially now that JVM heaps can be as large as 16GB – collecting MBs takes seconds, but collecting GBs can take minutes. I was troubleshooting a performance issue with a customer’s application last week and witnessed GC pauses of around 90 seconds. Here is what I found while troubleshooting.
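A quick aside before the story: if you want to measure GC pauses like these yourself, the JVM will log every collection for you. A minimal sketch, assuming a HotSpot JVM (the flag syntax changed between JDK 8 and the unified logging of JDK 9+; the jar name is made up):

    # JDK 8 and earlier
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar app.jar

    # JDK 9 and later (unified logging)
    java -Xlog:gc*:file=gc.log:time,uptime -jar app.jar

The resulting gc.log shows the pause time and heap occupancy of every minor and major collection, which is exactly the data plotted in the charts below.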
The customer reported an application slowdown around 7pm. This application processes on average around 25 million business transactions a day, roughly 18,000 transactions a minute. It has 20+ JVMs, 2 relational databases, an MQ messaging backbone and several 3rd-party web services. Here is the application topology (system map) showing the traffic and performance spike:
The first place I checked was the OS metrics, specifically looking for high CPU, memory, disk I/O and network I/O – the classic KPIs a typical sysadmin might check. Everything looked OK:
So I then looked at JVM metrics to check CPU, memory (heap) and garbage collection activity. This is where I noticed one JVM was spending a significant amount of time in GC – an average of 23 seconds per minute, compared to the others, which were averaging 2-3 seconds. You can see from the screenshot below that one JVM had 17 major collections while most other JVMs had just a single collection:
Next I drilled down into this JVM and plotted its GC time, % CPU busy and % heap utilization before, during and after the reported slowdown at 7pm. You can see in the screenshot below the sudden leap in GC time (top green line) from around 5 seconds per minute to well over 100 seconds per minute. Also notice how the heap (red line) had become exhausted, and that CPU spiked when GC kicked in, as expected.
The question I asked myself was: “what in the application caused this excessive GC?” Sure, I could have just upped the heap size on this JVM to prolong the inevitable death by GC, but this was a mission-critical application we’re talking about, and band-aids are not that fashionable these days. Besides, increasing the JVM heap can often make GC pauses worse when collection eventually kicks in, because there is simply more to collect. The next thing I did was analyze the performance and resource consumption of the business transactions running through the offending JVM.
The screenshot below shows that the Search business transaction (circled) had an average response time of 2 seconds (good) but a maximum response time of 267 seconds (not good). Combined with the fact that this transaction is invoked over 60 times a minute, had hundreds of errors, slow requests and stalls, and was consuming the most CPU (averaging > 600ms per transaction), I decided to investigate it further.
I plotted (see below) the response time (pink line) of the search transaction and its calls per minute (blue line) against the earlier chart of JVM GC time, % CPU busy and % heap utilized. You can see that as calls per minute for the search transaction reach 10, GC time begins to increase along with % CPU busy, and the search transaction response time spikes.
I then took a look at the code execution of a search business transaction to get a sense of what might be causing GC to suddenly kick in. The search transaction below took 38.7 seconds to complete, 40% of which was spent burning CPU in the JVM.
Looking at the code execution and hotspots for this specific search transaction, you can immediately see the excessive number of EJB calls and subsequent JDBC queries. The search transaction was basically creating a ProductDataBean for every search result, with every bean persisting data from the database into JVM memory.
It was actually worse than I thought. I naively assumed each bean would issue a single query to get the data it needed per product. The reality (screenshot below) is that each bean was invoking hundreds of SQL queries. In total, a single search transaction taking 38.7 seconds invoked over 12,000 SQL queries in the JVM. You can see the average response time of each query is around 1 millisecond (very fast), but because many queries were called thousands of times, the cumulative response time was high, as shown.
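To make the pattern concrete, here is a minimal JDBC sketch of what was effectively happening (the table and column names are hypothetical, not the customer’s schema), alongside the single-round-trip alternative:

    import java.sql.*;
    import java.util.Collections;
    import java.util.List;

    public class ProductSearchDao {

        // Anti-pattern: one query per search result bean ("N+1" queries).
        // 300 results means 300+ database round trips and 300+ beans per search.
        static void loadBeansOneByOne(Connection conn, List<Long> ids) throws SQLException {
            for (long id : ids) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT name, price FROM product WHERE id = ?")) {
                    ps.setLong(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) { /* populate one ProductDataBean */ }
                    }
                }
            }
        }

        // Better: one round trip for the whole result page.
        static void loadBeansInOneQuery(Connection conn, List<Long> ids) throws SQLException {
            String in = String.join(",", Collections.nCopies(ids.size(), "?"));
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, name, price FROM product WHERE id IN (" + in + ")")) {
                for (int i = 0; i < ids.size(); i++) ps.setLong(i + 1, ids.get(i));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* populate all beans from one ResultSet */ }
                }
            }
        }
    }

Each individual query in the first version is fast; it is the sheer number of calls, and the objects allocated per call, that burns CPU and fills the heap.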
Yes, the search transaction was slow on this occasion, but the clear issue here was the number of objects being instantiated and the amount of data being persisted by each search transaction. The broader a search, the more results returned, the more objects/data persisted in the JVM, and the more often GC will occur. A big problem with a transaction like search is that many users will simply search again or refresh their search, hoping the new search will be faster. I’ve seen this in CRM applications where call center agents talk to a customer while they try to retrieve the customer’s details. After 10 seconds the agent becomes impatient and performs the same customer search again. This just makes things worse: the user kicks off another transaction which performs the same workload as the previous search. In the customer application above, every search can potentially exhaust the JVM heap, depending on search concurrency and the number of results each search returns. The recommendation for this customer’s application is to optimize the objects being created so as to limit the impact of that concurrency.
Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.
As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.
My wife is a shopaholic and serial checkout killer. Every week she spends several hours browsing and carefully adding items to shopping carts. Every now and then she’ll interrupt my sports program with an important announcement: “My checkout just failed.” Take this example Mrs Appman sent me during the month of September:
The fact that I work in the APM industry means it’s my responsibility to fix these problems immediately (for my wife). With Black Friday coming up, I thought I’d share with you what typically goes wrong under the covers when our customers’ e-commerce applications go bang at a critical time.
It is worth mentioning that nearly all our customers perform some form of performance or load testing prior to the Black Friday period. Many actually replay the load from the previous year on test environments designed to reproduce Black Friday activity. While perhaps 75% of bottlenecks can be caught in testing, a few inevitably surface in production, because applications have become too big and complex to test exhaustively under real-world scenarios. For example, most e-commerce applications these days span more than 500 servers distributed across many networks and service providers.
Here are the top 10 reasons why eCommerce applications will fail this Black Friday:
1. Database Connection Pool
Nearly every checkout transaction will interact with one or more databases. Connections to this resource are therefore sacred, and contention for them can be deadly when transaction concurrency is high. Most application servers ship with default connection pool sizes of between 10 and 20. When you consider that transaction throughput for e-commerce applications can easily exceed 100,000 trx/min, you soon realize that default pool configurations aren’t going to cut it. When a database connection pool becomes exhausted, incoming checkout requests simply wait, or time out, until a connection becomes available. Take this screenshot for example:
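If you manage the pool yourself, sizing it explicitly is a one-time fix. Here is a minimal sketch using HikariCP, one popular JDBC pool (your application server’s built-in pool will have equivalent settings; the URL, credentials and pool size below are made-up examples, not recommendations):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class CheckoutDataSource {
        public static HikariDataSource create() {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl("jdbc:mysql://db-host:3306/shop"); // hypothetical database
            cfg.setUsername("app");
            cfg.setPassword("secret");
            cfg.setMaximumPoolSize(50);      // sized for peak concurrency, not the small default
            cfg.setConnectionTimeout(3000);  // fail fast (ms) rather than queue forever
            return new HikariDataSource(cfg);
        }
    }

A short connection timeout matters as much as the pool size: a checkout that fails fast can be retried, while one that queues silently just looks hung to the user.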
2. Missing Database Indexes
This root cause is closely related to exhausted connection pools. Simply put, slow-running SQL statements hold onto a database connection for longer, so connections are returned to the pool less often as queries take longer. The number 1 root cause of slow SQL statements is missing indexes on database tables, often the result of miscommunication between the developers who write the SQL and the DBAs who configure and maintain the database schemas that hold the data. The classic symptom is the “full table scan” execution plan, where a transaction’s database operation must scan through all the data in a table before a result is returned. Here is an example of what this looks like in AppDynamics:
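You can confirm the diagnosis by asking the database for the query’s execution plan. A minimal sketch, assuming MySQL-style EXPLAIN and a hypothetical orders table:

    import java.sql.*;

    public class SlowQueryCheck {
        static void explain(Connection conn) throws SQLException {
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "EXPLAIN SELECT * FROM orders WHERE customer_email = 'a@b.com'")) {
                while (rs.next()) {
                    // access type "ALL" means a full table scan - no usable index
                    System.out.println(rs.getString("type"));
                }
            }
        }
        // The fix is usually a single piece of DDL, coordinated with your DBA:
        // CREATE INDEX idx_orders_customer_email ON orders (customer_email);
    }

One line of DDL can turn a multi-second scan into a millisecond lookup, and the connection pool recycles accordingly.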
3. Code Deadlock
High transaction concurrency means application server threads have to contend harder for application resources and objects. Most e-commerce applications have some form of atomicity built into their transactions, so that orders and stock volumes are kept in check as thousands of users fight over special offers and low prices. If access to application resources is not properly managed, threads can end up in deadlock, which can cause an application server and all of its user transactions to hang and time out. One example from last year: an e-commerce customer was using a non-thread-safe cache. Three threads tried to perform a get, a set and a remove on the same cache at the same time, causing code deadlock and impacting ~2,500 checkout transactions, as the screenshot below shows.
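A minimal sketch of the fix (not the customer’s actual code): back the cache with a ConcurrentHashMap so that get, set and remove are individually atomic, and use its atomic compound operations for check-then-act logic:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ProductCache {
        private final Map<String, Object> cache = new ConcurrentHashMap<>();

        public Object get(String key)             { return cache.get(key); }
        public void set(String key, Object value) { cache.put(key, value); }
        public void remove(String key)            { cache.remove(key); }

        // Check-then-act must also be atomic, or two threads can still race
        // between the check and the write:
        public Object getOrLoad(String key) {
            return cache.computeIfAbsent(key, k -> loadFromDatabase(k));
        }

        private Object loadFromDatabase(String key) { return new Object(); } // stub
    }

The design point is to let the concurrent collection do the locking at entry granularity, instead of wrapping a plain HashMap in ad-hoc synchronized blocks that are easy to get wrong.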
4. Socket Timeout Exceptions
Server connectivity is an obvious root cause; if you check your server logs with a tool like Sumo Logic or Splunk, you’ll probably see hundreds of these events. They represent network problems or routing failures where a checkout transaction is attempting to contact one or more servers in the application infrastructure. Much of the time the services you are connecting to aren’t your own – a shipping provider, credit card processor, or fraud detector, for example. On high-traffic days like Black Friday it isn’t just your site experiencing a surge in traffic; often entire networks are saturated by the demand. After a period of time (often 30-45 seconds) the checkout transaction will simply give up, time out and return an error to the user. No connectivity = no revenue. Here is an example of what it looks like:
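The defensive fix is to set explicit, aggressive timeouts on every outbound call rather than inheriting 30-45 second defaults. A minimal sketch using the JDK’s built-in HttpClient (JDK 11+; the shipping-provider URL and timeout values are made up):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class ShippingQuote {
        // Short timeouts keep a slow 3rd-party service from pinning checkout
        // threads while the user stares at a spinner.
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        static String fetchQuote() throws Exception {
            HttpRequest req = HttpRequest.newBuilder(
                        URI.create("https://api.shipping-partner.example/quote"))
                    .timeout(Duration.ofSeconds(5)) // overall per-request timeout
                    .build();
            HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
            return resp.body();
        }
    }

Failing in 5 seconds with a friendly “try again” beats failing in 45 seconds with a stack trace.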
5. Garbage Collection
Caches are an easy way to speed up applications. The closer data is to the application logic (in memory), the faster that logic executes. It is therefore no surprise that as memory has become bigger and cheaper, most companies have adopted some form of in-memory caching to eliminate database access for frequently used results. The days of 64GB and 128GB heaps are now upon us, which means the impact of things like garbage collection is deadlier than ever to end users. Maintaining cache data and efficiently creating and destroying user objects in memory becomes paramount for avoiding frequent garbage collection cycles. Just because you have GBs of memory to play with doesn’t mean you can be lazy about how you create, maintain and destroy objects. Here are a few screenshots that show how garbage collection can kill your e-commerce application:
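One common culprit is a cache with no size bound, which grows until it exhausts the heap and forces long major collections. As a minimal sketch, the JDK alone can give you a capped LRU cache (a production system would more likely use a caching library with eviction and TTL support):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public BoundedLruCache(int maxEntries) {
            super(16, 0.75f, true);   // accessOrder=true gives LRU eviction order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;  // evict instead of growing without bound
        }
    }

A bounded cache trades a few extra database hits for a heap that stays flat, which is almost always the right trade on a 100,000 trx/min day.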
6. Transactions with High CPU Burn
It’s no secret that inefficient application logic requires more CPU cycles than efficient logic. Unfortunately, the number 1 solution to slow performance in the past was for eCommerce vendors to buy more servers: more servers = more capacity = more transaction throughput. While this calculation sounds good, the reality is that not all e-commerce transactions are CPU bound. Adding more capacity just masks inefficient code in the short term, and can waste significant amounts of money in the long term. If specific transactions in your eCommerce application hog or burn CPU, you might want to consider tuning those before you whip out your checkbook with Oracle or Dell. For example:
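As a hypothetical example of the kind of hotspot that tuning finds, string concatenation inside a loop allocates and copies a new String on every iteration, burning CPU and creating garbage for GC in one stroke:

    public class CpuBurn {
        // Classic hotspot: each += builds a brand-new String, so the work
        // grows quadratically with the number of rows rendered.
        static String renderSlow(String[] rows) {
            String html = "";
            for (String row : rows) html += "<li>" + row + "</li>";
            return html;
        }

        // Same output from one growing buffer: far fewer cycles, far less garbage.
        static String renderFast(String[] rows) {
            StringBuilder html = new StringBuilder(rows.length * 16);
            for (String row : rows) html.append("<li>").append(row).append("</li>");
            return html.toString();
        }
    }

Fixes like this cost an afternoon of profiling; the extra rack of servers they replace costs rather more.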
7. 3rd Party Web Services
If your e-commerce application is built on a distributed SOA architecture, you’ll have multiple points of failure – especially if several of those services are provided by a 3rd party into which you have no visibility. For example, most payment and credit card authorization services are provided by 3rd-party vendors like PayPal, Stripe, or Braintree. If those services slow down or fail, it’s impossible for checkout transactions to complete. You therefore need to monitor these services religiously, so that when problems occur you can rapidly identify whether it’s your code, your connectivity, or someone else’s outage. Here is an example of how AppDynamics can help you monitor your 3rd-party web services:
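Even without an APM tool, you can at least wrap each 3rd-party call so you know whose time a slow checkout is spending. A minimal sketch (the pool size, one-second threshold and stderr logging are arbitrary placeholders):

    import java.util.concurrent.*;

    public class ThirdPartyCalls {
        private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

        // Runs the call with a hard deadline and records how long the
        // provider actually took, even when it fails.
        static <T> T timed(String service, Callable<T> call, long timeoutMs) throws Exception {
            long start = System.nanoTime();
            Future<T> future = POOL.submit(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // don't leave a thread pinned on a dead service
                throw e;
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > 1000) {
                    System.err.printf("%s took %dms (slow)%n", service, elapsedMs);
                }
            }
        }
    }

When the payment provider has an outage, this kind of per-service timing is the difference between “our checkout is broken” and “their gateway is down.”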
8. Crap Recursive Code
This is similar to #6, but it burns time instead of CPU. For example, many e-commerce transactions will request data from multiple sources (caches, databases, web services) in the course of a single request. Every one of these round trips can be expensive and may involve network time along the way. I’ve seen a single eCommerce search transaction call the same database multiple times instead of performing a single operation using a stored procedure on the database. Recursive remote calls may only take 10-50 milliseconds each, but invoked many times per transaction they can add seconds to your end user experience. For example, here is that search transaction that took x seconds and made 13,000 database calls.
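The remedy is usually to collapse the chatter into one round trip. A minimal sketch of calling a stored procedure via JDBC (the procedure name and parameter are hypothetical):

    import java.sql.*;

    public class SearchDao {
        // One network round trip; the database does the set-based work it is
        // good at, instead of being interrogated thousands of times.
        static void search(Connection conn, String term) throws SQLException {
            try (CallableStatement cs = conn.prepareCall("{call product_search(?)}")) {
                cs.setString(1, term);
                try (ResultSet rs = cs.executeQuery()) {
                    while (rs.next()) { /* populate result beans from one result set */ }
                }
            }
        }
    }

Whether it’s a stored procedure, a join, or a batched IN clause, the goal is the same: pay the network cost once per transaction, not once per row.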
9. Configuration Change
As much as we’d like to think that production environments are “locked down” with change control processes, they are not. Accidents happen, humans make mistakes, and hotfixes occasionally get applied in a hurry at 2am. Application server configuration can be as sensitive as the network or any other piece of the infrastructure. Being able to audit, report and compare configuration changes across your application gives you instant intelligence that a change may have broken your eCommerce application. For example, AppDynamics can record any application server change and show you the time and the values that were updated, helping you correlate change with slowdowns and outages – see the screenshot below.
10. Out of Stock Exception
“I’m sorry, the product you requested is no longer in stock.” This basically means you were too slow, and you’ll need to wait until 2014 for the same offer. Remember to set an alarm for Black Friday next year.
In addition, AppDynamics can also monitor the revenue and performance of your checkout transactions over-time which helps Dev and Ops teams monitor and alert on the health of the business:
The good news is that AppDynamics Pro can identify all of the above defects in minutes. You can take a free trial here and be deployed in production in under 30 minutes! If you send us a few screenshots of your findings in production like the ones above, we’ll send you a $250 Amazon gift certificate for your hard work!
Steve.
The reason for this blog is purely down to a real-life incident that one of our e-commerce customers shared with us this week. It’s based on a use case that pretty much anyone can relate to: the moment your checkout transaction spectacularly fails. You sit there, looking at a big fat error message, and think “WTF – did my transaction complete or did the company steal my money?” A minute later you’re walking a support team through exactly what happened: “I just clicked Checkout and got an error…honestly…I waited and never got a response.”
What’s different in this story is that the support team had access to AppDynamics as they were talking to a customer on the phone…and the customer got to find out the real reason their checkout failed. How often does that happen? Never, until now. Here is the story as documented by the customer.
Apple has done a stellar job with their development platform and iOS. In fact, they’ve done a stellar job turning my living room into an Apple showroom. If you had asked me 10 years ago whether my laptop, mouse, keyboard, monitor, phone, music player, TV and tablet would be colored white with an Apple logo, I would probably have laughed in your face. The only Microsoft thing left in my house now is an Xbox, and it won’t be long before that turns white as well. Being married also presents a problem in that I now have two of everything, because sharing isn’t caring when it comes to Apple gadgets. With Apple technology being “cool” and widely adopted by millions of users, you can see why every business is migrating its applications to iOS for an improved end user experience. One of our customers recently made the move, and here’s the story of how their new iPhone app crashed their entire mission-critical web application…and I bet you weren’t expecting me to say that, were you?
An unusual spike in performance
Below is a screenshot from AppDynamics showing monitoring data for the customer’s online web application over the last month. The application has approximately 250 IIS instances, a dozen databases, a dozen web services and a distributed cache.