Death by Garbage Collection

image_pdfimage_print

Every man and his dog knows that Garbage Collection (GC) is a “stop the world” event. Java memory becomes filled with redundant objects/data which over time must be collected so that memory can be reclaimed and reused by the JVM. When the GC thread in the JVM kicks in all other threads grind to a halt, giving the end user an impression that the application is slow. It’s not such a big deal when GC runs for a few seconds, say, every minute, as this is typical for an active application that persists data frequently from databases (disk) to JVM (memory). The problem is when GC takes more than a few seconds, especially now that JVM memory and heaps can be as large as 16GB – collecting MBs can take seconds, but GBs can take minutes. I was troubleshooting a performance issue with a customer’s application last week and witnessed significant GC times of around 90 seconds. Here is what I found while troubleshooting.

The customer reported an application slowdown around 7pm in the evening. This application processes on average around 25 million business transactions a day, which is roughly 18,000 transactions a minute. It has 20+ JVMs, 2 relational databases, an MQ messaging backbone and several 3rd party  web services. Here is the application topology (system map) showing the traffic and performance spike:

 

The first place I checked was the system OS metrics, specifically looking for high CPU, memory, disk I/O and network I/O – the classic KPIs a typical sys admin might check. Everything looked OK:

So I then looked at JVM metrics to check CPU, memory (heap) and garbage collection activity. This is where I noticed one JVM was spending a significant amount of time in GC, an average of 23 seconds per minute compared, to the others which were averaging 2-3 seconds. You can see from the screenshot below that one JVM had 17 major collections with most other JVMs just having a single collection:

Next I drilled down into this JVM and plotted its GC time, % CPU Busy and % Heap utilization before, during and after the reported slowdown at 7pm. You can see in the screenshot below the sudden leap in GC time (top green line) from around 5 seconds per minute to well over 100 seconds per minute. Also notice how the heap (red line) has because exhausted and that CPU spiked when GC kicked in as expected.

The question I asked myself was “what in the application caused this excessive GC?”. Sure, I could have just upped the heap size on this JVM to prolong the inevitable death by GC, but this was a mission-critical application we’re talking about, and band-aids are not that fashionable these days. Besides, increasing JVM Heap can often make GC worse when it eventually kicks in. The next thing I did was analyze the performance and resource consumption of business transactions that were running through the offending JVM.

The screenshot below shows that the Search business transaction (circled) had an average response time of 2 seconds (good) but it also had a maximum response time of 267 seconds (not good). Combined with the fact that this transaction is invoked over 60 times a minute, had hundreds of errors, slow requests and stalls, and was consuming the most amount of CPU (averaging > 600ms per transaction), I decided to investigate this transaction further.

I plotted (see below) the response time (pink line) of the search transaction and its calls per minute (blue line) with the earlier chart of JVM GC time, % CPU Busy and % Heap utilized. You can see that as calls per minute for the search transaction reaches 10, GC time begins to increase along with % CPU busy and a spike in the search transaction response time.

I then took a look at the code execution of a search business transaction to get a sense of what might be causing GC to suddenly kick in. The search transaction below took 38.7 seconds to complete, of which 40% of that time was spent burning CPU in the JVM.

Looking at the code execution and hotspots for this specific search transaction you can see immediately the excessive number of EJB calls and subsequent JDBC queries. The search transaction was basically creating a ProductDataBean for every search result, with every Bean persisting data from the database into JVM memory.

 It was actually worse than I thought, though. I naively assumed each Bean would be doing a single query to get the data it needed per product. The reality (below screenshot) is that each Bean was invoking hundreds of SQL queries. In total, for a single search transaction taking 38,7 seconds, over 12,000 SQL queries were invoked by the application in the JVM. You can see the average response time of each query is around 1 milliseconds (very fast) but the fact many queries were called thousands of time meant the cumulative response time was high, as shown.

 Yes, the search transaction was slow on this occasion, but the number of objects being instantiated and the amount of data being persisted by each search transaction was the clear issue here. The broader a search, the more search results returned and the more objects/data being persisted in the JVM, the more often GC will occur. A big problem with a transaction like search is that many users will simply search again or refresh their search hoping that their new search will be faster. I’ve seen this in CRM applications where call center agents will be talking to a customer while they try to retrieve the customer’s details. After 10 seconds the agent will become inpatient and perform the same customer search again. This just makes things worse, as the user kicks off another transaction which then performs the same workload as the previous search. In the customer application example above, every search can potentially exhaust the JVM heap depending on the concurrency of the number of results returned by each search. The recommendation for the above customer’s application is to optimize the objects being created to limit the impact of concurrency.

Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

  • Benjamin Wootton

    The first paragraph of this article lays it on a bit thick.

    GC doesn’t have to be stop the world, and even where it is, parallel generational GCs are obviously much more sophisticated than simply “grinding all non-GC threads to a halt” for a long cleanup.

    If you are creating 16gb of heaps of short lived data then you are doing something wrong!

Copyright © 2014 AppDynamics. All rights Reserved.