Everyone and their mother is talking about big data these days – how to manage it, how to analyze it, how to gain insight from it – but very few organizations actually have big data that they have to worry about managing or analyzing. That’s not the case for FamilySearch, the world’s largest genealogy organization. FamilySearch has 10 petabytes of census records, photographs, immigration records, etc. in its database, and its data grows every day as volunteers upload more documents. Ironically, this organization that’s tasked with cataloging our past is now on the forefront of the big data trend, as they’re being forced to find new and innovative ways to manage and scale this data.
From 2011 to 2012, FamilySearch scaled almost every aspect of their application, from data to throughput to user concurrency. According to Bob Hartley, Principal Engineer and Development Manager at FamilySearch, AppDynamics was instrumental in this project. Hartley estimates that FamilySearch saved $4.8 million over two years by using AppDynamics to optimize the application instead of scaling infrastructure. That’s a pretty big number, so we broke it down for you in this infographic:
How FamilySearch Scaled
- From 11,500 tpm to 122,000 tpm
- From 6,000 users per minute to 12,000 users per minute
- From 12 application releases per year to 20 application releases per year
- From 10 PB of data to approaching 20 PB of data
- No additional infrastructure
- Response time reduced from minutes to seconds
- 227 Severity-1 incidents/year took 33 hours each to troubleshoot
- 300 pre-production defects per year took 49 hours each to troubleshoot
- This amounts to a total of 36,891 man-hours spent on troubleshooting every year
FamilySearch estimates that they saved $4.8 million with AppDynamics in two years. That’s a huge number, so let’s break it down:
- FamilySearch would have had to purchase 1,200 servers at approx. $1,000 each, amounting to $1,200,215 in savings
- Those 1,200 servers would cost $2,064,370 in power and air conditioning
- Those 1,200 servers would cost $200,000 in administrative costs over two years
FamilySearch estimates that they’ve reduced troubleshooting time for both pre-production defects and production incidents by 45%, amounting to $885,170 in savings for pre-production and $460,836 in savings for production incidents (based on average salaries for those positions).
To learn more about what FamilySearch accomplished and how they use AppDynamics, check out their case study and Bob Hartley’s video interview on the FamilySearch ROI page.
Imagine you’re an operations guy and you’ve just received a phone call or alert notifying you that the application you’re responsible for is running slow. You bring up your console, check all the related processes, and notice that one java.exe process isn’t using any CPU while the other Java processes are. The average sys admin at this point would just kill and restart the Java process, cross their fingers, and hope everything returns to normal (this actually does work most of the time). An experienced sys admin might perform a kill -3 on the Java process, capture a thread dump, and pass it back to dev for analysis. Now suppose your application returns to normal–end users stop complaining, you pat yourself on the back and beat your chest, and basically resume what you were doing before you were rudely interrupted.
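For reference, kill -3 sends SIGQUIT to the JVM, which prints a thread dump to standard out. The same information is available programmatically through the ThreadMXBean API, which is what most monitoring tools lean on. A minimal sketch (the ThreadDump class and capture method are my own illustration, not from the story):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {

    // Build a jstack-style dump of every live thread in this JVM.
    // Note: ThreadInfo.toString() truncates very deep stack traces.
    public static String capture() {
        StringBuilder sb = new StringBuilder();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString()); // name, state, held locks, stack frames
        }
        return sb.toString();
    }
}
```

The advantage over kill -3 is that the dump lands in your code (or your monitoring tool) rather than in whatever file stdout happens to be redirected to.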
The story I’ve just told may seem contrived, but I’ve witnessed it several times with customers over the years. The stark reality is that no one in operations has the time or visibility to figure out the real business impact behind issues like this. Therefore, little pressure is applied to development to investigate data like thread dumps so that root causes can be found and production slowdowns can be avoided in the future. It’s true that restarting a JVM or CLR will resolve a fair few issues in production, but it’s only a temporary fix that papers over the real problems in the application’s logic and configuration.
Now imagine for one minute that operations could actually figure out the business impact of production issues, along with identifying the root cause, and communicate this information to Dev so problems could be fixed rapidly. Sounds too good to be true, right? Well, a few weeks ago an AppDynamics customer did just that and the story they told was quite compelling.
Code Deadlock in a Distributed E-Commerce Application
The customer application in question was a busy e-commerce retail website in the US. The architecture was heavily distributed, with several hundred application tiers that included JVMs, LDAP servers, CMS servers, message queues, databases, and third-party web services. Here is a quick glimpse of what that architecture looked like from a high level:
Detecting Code Deadlock
If we look at the AppDynamics problem pane (right) as the customer saw things, it shows the severity of their issues. During the day the application was experiencing just over 4,000 business transactions per minute, which works out at just under 1 million transactions a day. Approximately 2.5% of these transactions were impacted by the slowdown, which was the result of the 92 code deadlocks you see here that occurred during peak hours.
AppDynamics is able to dynamically baseline the performance of every business transaction type before classifying each execution as normal, slow, very slow or stalled depending on its deviation from its unique performance baseline. This is critical for understanding the true business impact of every issue or slowdown because operations can immediately see how many user requests were impacted relative to the total requests being processed by the application.
From this pane, operations were able to drill down into the 92 code deadlocks and see the events that took place as each code deadlock occurred. As you can see from the screenshot (below left), the sys admins during the slowdown kept restarting the JVMs (as shown) to try and make the issues go away. Unfortunately, this didn’t work given that the application was experiencing high concurrency under peak load.
By drilling into each code deadlock event, operations were able to analyze the various thread contentions and locate the root cause of the issue: an application cache that wasn’t thread-safe. If you look at the screenshot below, showing the final execution of the threads in deadlock accessing the cache, you can see that one thread was trying to remove an item, another was trying to get an item, and a third was trying to put an item. Three threads performing a put, a get, and a remove at the same time caused a deadlock on cache access, hanging the related JVM until those threads were released via a restart.
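The post doesn’t show the customer’s cache code, but the conventional fix for a cache that isn’t thread-safe is to back it with ConcurrentHashMap, whose individual operations are atomic and whose internal locking cannot deadlock concurrent put, get, and remove calls against each other. A hypothetical sketch (the SafeCache class is illustrative, not the customer’s actual code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class SafeCache<K, V> {

    // ConcurrentHashMap uses fine-grained internal locking, so concurrent
    // put/get/remove calls cannot deadlock against each other.
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();

    public V get(K key)           { return map.get(key); }
    public void put(K key, V val) { map.put(key, val); }
    public V remove(K key)        { return map.remove(key); }

    // Atomic check-then-act: avoids the race where two threads both miss
    // on the same key and both try to populate it under separate locks.
    public V getOrLoad(K key, Function<K, V> loader) {
        return map.computeIfAbsent(key, loader);
    }
}
```

The key design point is that no caller ever holds an external lock while touching the map, which removes the lock-ordering problem that produces deadlocks in hand-rolled synchronized caches.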
Analyzing Thread Dumps
Below you can see the thread dump that AppDynamics collected for one of the code deadlocks, which clearly shows where each thread was deadlocked. By copying the full thread dumps to clipboard, operations were able to see the full stack trace of each thread, thus identifying which business transactions, classes, and methods were responsible for cache access.
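The JVM can also report deadlocks programmatically: ThreadMXBean.findDeadlockedThreads() returns the IDs of threads stuck waiting on each other, which is essentially what a monitoring tool inspects before raising a code deadlock event. A small sketch with a self-contained demo (the deadlock manufactured in demoDeadlockCount is deliberately artificial, for illustration only):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {

    // Returns info for threads caught in a monitor/synchronizer deadlock,
    // or an empty array if the JVM reports none.
    public static ThreadInfo[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        return ids == null ? new ThreadInfo[0]
                           : mx.getThreadInfo(ids, true, true);
    }

    // Deliberately deadlock two daemon threads on a pair of locks,
    // then poll until the JVM detects them. Returns the count found.
    public static int demoDeadlockCount() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        Thread t1 = new Thread(() -> {
            synchronized (lockA) { pause(100); synchronized (lockB) { } }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) { pause(100); synchronized (lockA) { } }
        });
        t1.setDaemon(true); // daemon threads won't keep the JVM alive
        t2.setDaemon(true);
        t1.start();
        t2.start();
        for (int i = 0; i < 50; i++) { // poll for up to ~5 seconds
            ThreadInfo[] deadlocked = findDeadlocks();
            if (deadlocked.length > 0) return deadlocked.length;
            pause(100);
        }
        return 0;
    }

    private static void pause(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}
```

Note that the JVM only detects a deadlock; it cannot break one. The stuck threads stay stuck until the process is restarted, which is exactly what this customer’s sys admins were forced to do.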
The root cause for this production slowdown may have been identified and passed to dev for resolution, but the most compelling conclusion from this customer story was related to them identifying the real business impact that occurred. The application was clearly running slow, but what did the end user experience during the slowdown and what impact would this have had on the business?
What was the Actual Business Impact?
The screenshot below shows all business transactions that were executing on the e-commerce web application during the five-hour window before, during, and after the slowdown occurred.
Here are some hard-hitting facts for the two most important business transactions inside this e-commerce application:
- 46,463 Checkouts processed
- 482 returned an error, 1,325 were slow, 576 were very slow, and 111 stalled
- 3,956 Payments processed
- 12 returned an error, 242 were slow, 96 were very slow and 79 stalled
- Error – the transaction failed with an exception.
- Slow – the business transaction deviated from its baseline by more than 3 standard deviations.
- Very Slow – the business transaction deviated from its baseline by more than 4 standard deviations.
- Stalled – the transaction timed out.
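AppDynamics’ actual baselining is more sophisticated than this (it learns a moving baseline per business transaction), but the threshold rule in the legend above can be sketched in a few lines. The class name and example numbers below are illustrative only:

```java
public class BaselineClassifier {

    private final double meanMillis; // baseline mean response time
    private final double stdMillis;  // baseline standard deviation

    public BaselineClassifier(double meanMillis, double stdMillis) {
        this.meanMillis = meanMillis;
        this.stdMillis = stdMillis;
    }

    // Classify one response time by its deviation from the baseline.
    // (Stalls are a separate case: the transaction simply timed out.)
    public String classify(double responseMillis) {
        double deviations = (responseMillis - meanMillis) / stdMillis;
        if (deviations > 4) return "VERY_SLOW";
        if (deviations > 3) return "SLOW";
        return "NORMAL";
    }
}
```

For example, with a 200 ms baseline and a 50 ms standard deviation, a 360 ms response is 3.2 deviations out and classifies as slow, while a 450 ms response is 5 deviations out and classifies as very slow.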
If you take these raw facts and assume the average revenue per order is $100, then the potential revenue at risk from this slowdown easily runs into six digits once you consider the end-user experience for checkout and payment. Even if you count only the 482 errors and 111 stalls on the Checkout business transaction, (482 + 111) × $100 works out to roughly $59,300 of revenue at risk. And that’s a fairly conservative estimate!
If you add up all the errors, slow, very slow, and stalled transactions in the screenshot above, you start to picture how serious this issue was in production. The harsh reality is that incidents like this happen every day in production environments, but no one has visibility into their true business impact, meaning little pressure is applied to development to fix “glitches.”
Agile isn’t about Change, It’s about Results
If development teams want to be truly agile, they need to forget about constant change and focus on the impact their releases have on the business. The next time your application slows down or crashes in production, ask yourself one question: “What impact did that just have on the business?” I guarantee just thinking about the answer will make you feel cold. If development teams found out more often the real business impact of their work, they’d learn pretty quickly how fast, reliable, and robust their application code really is.
I’m pleased to say no developers were injured or fired during the making of this real-life customer story; they were simply educated on what impact their non-thread safe cache had on the business. Failure is OK–that’s how we learn and build better applications.
App Man.
What happens when mission-critical Java applications slow down or keep crashing in production? The vast majority of IT Operations (Ops) teams today bury their heads in log files. Why? Because that’s what they’ve been doing since IBM invented the mainframe. Diving into the weeds feels good; everyone feels productive looking at log entries, hoping that one will eventually explain the unexplainable. IT Ops may also look at system and network metrics, which tell them how server resources and network bandwidth are being consumed. Again, looking at lots of metrics feels good, but what is causing those server and network metrics to change in the first place? Answer: the application.
IT Ops monitor the infrastructure that applications run on, but they lack visibility into how applications actually work and utilize that infrastructure. To get this visibility, Ops must monitor the application run-time. A quick way to get started is to use the free tools that come with it. In the case of Java applications, both JConsole and VisualVM ship with the standard JDK and have proved popular choices for monitoring Java applications. When we built AppDynamics Lite, we felt there was a void of free application monitoring solutions for IT Ops: the market had plenty of tools aimed at developers, but many were just too verbose and intrusive for IT Ops to use in production. If we take a look at how JConsole, VisualVM, and AppDynamics Lite compare, we’ll see just how different free application monitoring solutions can be.
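JConsole and VisualVM read their numbers over JMX from the platform MXBeans, and the same beans are available in-process. A minimal sketch of the kind of snapshot those tools chart (the JvmStats class is my own illustration, not part of either tool):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmStats {

    // Snapshot of the same core metrics JConsole and VisualVM chart
    // over JMX: heap usage, live threads, and loaded classes.
    public static String snapshot() {
        MemoryUsage heap =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        int threads =
            ManagementFactory.getThreadMXBean().getThreadCount();
        int classes =
            ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
        return String.format("heap=%d/%d bytes, threads=%d, classes=%d",
                heap.getUsed(), heap.getMax(), threads, classes);
    }
}
```

Attaching JConsole to a running JVM shows these same heap, thread, and class-loading charts with no code changes at all, which is exactly why it’s such a popular starting point for Ops.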