Root cause diagnosis remains one of the hardest problems in modern distributed application environments. A performance issue such as slow end-user response time or slow system-to-system response time could originate in hundreds, even thousands, of places within your environment. How can you narrow down the search and discover the root cause?
When issues lie outside your application, whether in the cloud, the Microsoft OS, a third-party application, or any of a myriad of other places, search and diagnosis become a daunting task. Without AppDynamics and Machine Snapshots, which capture the state of a server at a specific moment in time, these issues would be nearly impossible to detect: nothing in your code or backend would show a problem, and you would simply see an inexplicably slow response time.
Making matters worse, you can’t replicate the problem.
Two approaches to root cause discovery
Start with the user experience and transaction response time
Using Machine Snapshots, you can see which processes are contributing to the response time spike and investigate what’s happening on each node. These snapshots give you x-ray vision into how your environment is performing at any given time and let you drill down with code-level visibility.
If necessary, drill down to the infrastructure level
Attempting to diagnose at the infrastructure level first is an archaic method that can consume an enormous amount of time. Previously, IT professionals would start at the infrastructure level and work inwards, becoming more granular. With SOA and modern environments this method is inefficient because there can be thousands of nodes at the infrastructure level. However, Machine Snapshots allow you to drill down to the process level and make tuning decisions that improve overall performance.
I decided to try out my hypothesis in a test environment to show how much Machine Snapshots can help with diagnosis in a real-life situation.
In this test scenario, we’ve created several environment problems: degraded server health, a response time spike, a drop in throughput, and an increase in error rates. All of these appeared at the same time.
When you drill down to the error details, you see that all transactions are affected and failing due to a connection timeout to the database, which runs on the same server.
With every Snapshot, our Machine Agents collect all metrics including memory and CPU usage. In this instance, you can see the hardware metrics showing extremely high CPU usage.
We’d like to drill down into the machine and find out what’s going on, globally, to cause the high CPU usage. Using Machine Snapshots, we can get granular analysis into every process to see which is the root cause.
As you can see, several processes are using 100% CPU at the same time. The backup processes are killing the server, causing the overall performance issue.
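To make the idea of ranking processes within a snapshot concrete, here is a minimal sketch in Python. It is not the AppDynamics API or the Machine Agent's data format: the ProcessSample type, the top_cpu_consumers helper, and the sample values are all hypothetical, chosen only to mirror the scenario above where backup processes saturate the CPU.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessSample:
    """One process as it appeared at the moment of the snapshot (hypothetical shape)."""
    pid: int
    name: str
    cpu_percent: float
    memory_mb: float


def top_cpu_consumers(snapshot: List[ProcessSample], limit: int = 3) -> List[ProcessSample]:
    """Return up to `limit` processes, busiest first by CPU usage."""
    return sorted(snapshot, key=lambda p: p.cpu_percent, reverse=True)[:limit]


# Illustrative data only; these numbers are made up for the example.
snapshot = [
    ProcessSample(101, "java", 12.0, 2048.0),
    ProcessSample(202, "backup-agent", 100.0, 310.0),
    ProcessSample(203, "backup-agent", 100.0, 295.0),
    ProcessSample(304, "mysqld", 8.5, 4096.0),
]

for proc in top_cpu_consumers(snapshot):
    print(f"{proc.pid:>6} {proc.name:<14} {proc.cpu_percent:5.1f}% CPU")
```

Sorting a point-in-time list like this is what lets the backup processes stand out immediately, even though nothing in the application code points to them.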
The chart below shows that when CPU reaches 100%, your customers abandon their purchases, hurting your bottom line. For an Ops or DevOps professional, revenue is the only metric that will matter to your suited business counterpart, and in this instance it’s important to note that as response time rises, your revenue drops.
Your business no longer runs on software; your business is software.
Finally, the root cause is discovered — it ends up being a backup utility causing the entire application to run slowly. With AppDynamics and Machine Snapshots you’re able to monitor your environment beyond your typical applications. As one of my customers said, “it’s like having a task manager on the remote server at the moment when the CPU or memory usage are high.”
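The "task manager at the moment when usage is high" idea can be sketched as a threshold-triggered capture. This is an illustration under stated assumptions, not how the Machine Agent is implemented: capture_when_hot, the 90% threshold, and the two injected callables (one that reads current CPU usage, one that lists processes) are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List, Optional


def capture_when_hot(
    read_cpu_percent: Callable[[], float],
    list_processes: Callable[[], List[Dict[str, object]]],
    threshold: float = 90.0,
) -> Optional[Dict[str, object]]:
    """Record the process list only when CPU usage is at or above `threshold`.

    Both callables are supplied by the caller (hypothetical hooks), so the
    sketch stays independent of any particular monitoring library.
    """
    cpu = read_cpu_percent()
    if cpu < threshold:
        return None  # Machine is healthy; nothing worth capturing.
    return {"cpu_percent": cpu, "processes": list_processes()}


# Stand-in data sources for demonstration; real ones would query the OS.
def fake_processes() -> List[Dict[str, object]]:
    return [{"pid": 202, "name": "backup-agent", "cpu_percent": 100.0}]


calm = capture_when_hot(lambda: 35.0, fake_processes)
busy = capture_when_hot(lambda: 97.0, fake_processes)

print(calm)  # no capture while CPU is below the threshold
print(busy)  # capture includes the process list taken during the spike
```

Capturing only at the moment of the spike is the point: by the time someone logs in to the server, the offending process may already have finished.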
With Machine Snapshots, a previously tedious and painstaking task becomes a little bit easier to manage. Save time, save money, and get back to doing your job.
Test drive a FREE trial of AppDynamics and see the power of Machine Snapshots today!