As application health monitoring becomes more automated, APM is going beyond the boundaries of human capability. Now, administrators can use the power of AI in root cause analysis to monitor the performance of applications 24/7.
What is root cause analysis (RCA)?
Imagine your application is 100 haystacks that represent tiers, and in those haystacks somewhere there’s a needle that’s hurting your end user experience. As an administrator, you need to find it and eliminate it as quickly as possible. The problem is, each haystack has over half a million pieces of hay representing lines of code in your application. It’s therefore no surprise that organizations can take days or weeks to find the root cause of performance issues in today’s complex, distributed environments.
And that’s why it’s no longer enough to identify unhappy users (EUM), slow business transactions (application mapping), and problematic haystacks (tiers) in your application. You need to find the needles, and that takes code-level visibility into every layer of the stack, from the application, business, and user experience down to the infrastructure and network. EUM and application mapping will help you isolate the performance pain, but they won’t tell you the root cause so you can resolve it. You want to understand not just what happened, but why it happened.
The solution is called root cause analysis (RCA), a concept commonly traced back to Sakichi Toyoda’s “five whys” technique, which became part of Toyota’s manufacturing process and has since been adopted by almost every industry out there, from publishing to engineering. In the case of application performance management, it’s a step in the APM process designed to reduce mean time to resolution (MTTR) for application performance problems. RCA follows anomaly detection in the process of triaging and resolving performance issues. After detecting the issue, stakeholders can begin RCA in one of two ways:
- By creating a war room to investigate the current and historical state of the system, recreate the timeline of when the anomaly began and what occurred afterward, and sort through multiple errors to figure out which underlying defect most likely caused the event.
- By using artificial intelligence (AI) and machine learning (ML) to create a complete anomaly timeline automatically, monitor data streams in real time, and use historical and contextual correlation to quickly pinpoint the cause, so teams can move on to identifying the fix or reconfiguration required to resolve the issue.
Using AI/ML to automatically point to the source of the problem empowers IT professionals to troubleshoot faster and takes much of the guesswork out of determining what is causing problems with your app’s health. ML can see and correlate across your entire IT environment to identify a wider range of problems than humans could realistically capture. And it provides the much-needed context behind application and business health.
The TL;DR: IT professionals use root cause analysis to identify and resolve issues, and they apply AI/ML to find and fix those issues faster, ideally before they affect the end user. For the purposes of this post, we’ll explore how automation supports the process.
What’s the process of root cause analysis?
It’s all very well resolving problems, but first you have to know what constitutes a problem and filter out false-positive alerts for events that don’t meet that definition. Is that slow response time in that key business transaction caused by a real issue, like an unexpected increase in traffic, or a known issue, like an increase in traffic during the busy season?
That’s why anomaly detection comes first. Anomaly detection uses machine learning algorithms to automatically define, and learn over time, what constitutes “normal” application behavior. That way you remove the stress of manual threshold-setting and can automatically filter out noise related to false positives, thereby preventing alert storms.
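To make the idea of a learned baseline concrete, here is a minimal sketch that flags a data point when it strays too far from the recent average. It assumes a simple trailing-window z-score; the function name `detect_anomalies`, the window size, the threshold, and the sample response times are all illustrative, and commercial APM tools use far richer models (seasonality, multiple metrics, and so on).

```python
from statistics import mean, stdev

def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the trailing-window
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]   # the "normal" learned from recent history
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical response times in milliseconds, with one obvious spike.
response_ms = [120, 118, 125, 122, 119, 121, 480, 123, 120]
spikes = detect_anomalies(response_ms)
print(spikes)
```

Because the baseline is recomputed from recent data rather than set by hand, nobody has to pick a fixed "slow" threshold. One known weakness of this simple version is that a spike temporarily inflates the baseline that follows it.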
Once an anomaly is accepted as real, it’s time to really get to work.
Root cause analysis also uses machine learning — to determine the root cause of the performance problems revealed by anomaly detection. Where anomaly detection focuses on the symptoms, RCA focuses on the cause.
This is when machine learning starts to investigate further and show you the suspected causes for an anomaly. Maybe that slow response time was caused by slow third-party code. RCA discovers this in a two-step approach:
- Fault domain isolation: ML zeroes in on the fault domain to identify where the problem is located, without you having to trawl through logs.
- Impacted component analysis: Analysis of logs, snapshots, traces, infrastructure, and so on to determine the affected components.
Your APM solution should clearly expose the offending anomalies along with the top suspected causes and any contributing tiers, exit calls, or inter-tier network issues, so that you can more accurately diagnose the behavior and reduce repair time.
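The two steps above can be sketched on a toy dependency map. This is a deliberately simplified illustration, not how any particular APM product works: it assumes the fault domain is an anomalous tier whose own downstream calls all look healthy, and that impacted components are the tiers that call it directly or transitively. The tier names and both function names are hypothetical.

```python
def isolate_fault_domain(calls, anomalous):
    """Return anomalous tiers that have no anomalous downstream dependency.

    calls: dict mapping each tier to the list of tiers it calls.
    anomalous: set of tiers flagged by anomaly detection.
    """
    suspects = []
    for tier in sorted(anomalous):
        downstream = calls.get(tier, [])
        if not any(dep in anomalous for dep in downstream):
            suspects.append(tier)
    return suspects

def impacted_components(calls, suspect):
    """Return every tier that directly or transitively calls the suspect."""
    # Invert the call map so we can walk from a tier back to its callers.
    callers = {}
    for tier, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(tier)
    impacted, stack = set(), [suspect]
    while stack:
        node = stack.pop()
        for caller in callers.get(node, ()):
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    return impacted

# Hypothetical application map: web -> checkout -> payments -> third-party API.
calls = {
    "web": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["third_party_api"],
    "inventory": [],
    "third_party_api": [],
}
anomalous = {"web", "checkout", "payments", "third_party_api"}

suspects = isolate_fault_domain(calls, anomalous)
impacted = impacted_components(calls, suspects[0])
print(suspects, sorted(impacted))
```

In this example four tiers look slow, but only the third-party API has no unhealthy dependency of its own, which matches the earlier scenario of a slow response time caused by slow third-party code.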
The point of using ML, rather than manual methods, is to triage issues to the right teams so they can take action at the right time. Good APM tools display these insights in a way that makes it easy to drill down into the problem to better understand where it came from and either rule the issue out or take action, whether that’s CI/CD validation, cloud right-sizing, network optimization, or security enforcement.
Then you can get back to doing what matters most: innovating and improving the digital experience.
Why AI-powered root cause analysis is critical for problem-solving
There are many reasons to use AI-powered RCA:
- Aligns teams: Helps break down operational silos between application, infrastructure, and network teams by showing you exactly what is causing the issue and informing who should get involved. No more monitoring with blind spots.
- Lowers costs and saves time: By locating the exact line of code responsible for a performance issue and taking the guesswork out of where to fix, what to fix, and who should fix it, you can significantly reduce your MTTR, troubleshooting in minutes as opposed to hours or days, and get back time and energy better spent on innovation. Catching problems early also reduces cost and helps you maintain an agile environment.
- Provides context: Modern APM tools put application issues into business context to help you prioritize remediation based on what most impacts the business and solve business pain faster, setting off a feedback loop that informs future problem-solving.
- Implements long-lasting solutions: When you follow RCA all the way through to its conclusion (more on this below), you focus on long-term prevention and resolving technical debt: solutions, not just speedy workarounds. This forward thinking fosters a proactive and productive culture.
- Grows your business: Effective RCA keeps customers happy, prevents lost revenue, and enables continuous development velocity and organizational efficiency that helps build a more resilient business and tech stack over time.
Getting started with automated root cause analysis
AI can’t do all the work. To ensure the process is efficient and meaningful:
Get started quickly
RCA should be done as soon after the incident as possible, when it’s fresh in everyone’s mind. The right data and metrics are important, because you need enough information about the system to move forward. But so are human intelligence and different perspectives, because ultimately finding the root cause (which can vary in severity) requires methodical, organizational diligence and the right frame of mind.
Approach with an open mind
RCA challenges our assumptions about how the application works, what the network of dependencies looks like, and the likely cause of an incident, and so it should. Assumptions get in the way: what you think you know about the application can cause you to ignore evidence that contradicts your theory, making it impossible or time-consuming to find the root cause. Instead, focus on getting the information you need to quickly form and verify a hypothesis. Stay open-minded and curious about what the root cause might be, and you’re more likely to approach it pragmatically, using evidence to support your hypotheses. It’s also important that teams recognize that processes cause problems, not people, and placing blame achieves nothing.
Cast a wide and deep net
You’ll want to use ML to uncover as many possible factors as you can: not just the type of change, for example, but also a wide time frame, in case the root cause occurred well before the incident. ML can then drill down with granularity. The more granular your data, the more easily you’ll be able to identify and rectify the problem.
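As a rough illustration of why the time frame matters, the sketch below collects change events that fall inside a lookback window before the incident. The event records, the `candidate_changes` function, and the window sizes are assumptions made for this example; a real tool would pull changes from deployment and configuration history.

```python
from datetime import datetime, timedelta

def candidate_changes(changes, incident_start, lookback=timedelta(days=7)):
    """Return changes made within `lookback` before the incident,
    most recent first."""
    window_start = incident_start - lookback
    in_window = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(in_window, key=lambda c: c["at"], reverse=True)

# Hypothetical change log.
incident = datetime(2024, 3, 10, 14, 0)
changes = [
    {"at": datetime(2024, 3, 10, 13, 30), "what": "deploy checkout v2.4"},
    {"at": datetime(2024, 3, 7, 9, 0), "what": "connection pool size lowered"},
    {"at": datetime(2024, 2, 1, 8, 0), "what": "OS patch"},
]

narrow = candidate_changes(changes, incident, lookback=timedelta(hours=1))
wide = candidate_changes(changes, incident, lookback=timedelta(days=7))
print([c["what"] for c in narrow])
print([c["what"] for c in wide])
```

With a one-hour window only the latest deployment is a suspect. Widening the window to a week also surfaces the configuration change made three days earlier, which a narrow search would never consider.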
Understand the context
Context is critical. RCA tools need to not only capture and present data on how individual components of a system work, but also surface meaningful insights into how they interact with each other. Trace those correlations to find the root cause, uncover connections between seemingly unrelated events, and create a map of these dependencies so you can understand exactly why a change in performance occurred and better avoid it in the future. The dependencies in modern applications are complicated and dynamic, and especially in larger organizations, technologists understand less about the application than they think.
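One simple way to surface connections between seemingly unrelated signals is to rank other metrics by how closely they move with the one that degraded. The sketch below uses a hand-written Pearson correlation; the metric names and values are hypothetical, and correlation alone does not prove causation, so treat the ranking as a list of leads to investigate.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)

def rank_correlated(target, metrics):
    """Rank metric names by absolute correlation with the target series."""
    scores = {name: pearson(target, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical samples taken at the same six points in time.
checkout_latency_ms = [120, 125, 130, 300, 310, 128]
metrics = {
    "cpu_percent": [40, 42, 41, 43, 40, 42],
    "db_pool_wait_ms": [5, 6, 5, 80, 85, 6],
}

ranked = rank_correlated(checkout_latency_ms, metrics)
print(ranked[0][0])
```

Here the database pool wait time tracks the latency spike almost exactly while CPU barely moves, which points the investigation toward the database connection pool rather than the host.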
Find solutions for the long term
Just knowing what the problem is and what caused it is not enough. A critical part of RCA is finding solutions, whether corrective or preventative. And it’s not just about correcting the initial issue, either. It’s about developing strategies to correct or prevent similar issues in the future, and taking the 30,000-foot view of how to fix the overarching problem.
Avoid knowledge silos
Probably the most common mistake is the over-reliance on knowledge silos. This happens when you don’t have robust observability tools in place to illuminate the big picture while zeroing in on the exact problem so the right teams can take action. But all this work is moot if you don’t share it with key stakeholders. It’s like gathering evidence from a crime scene and then never giving it to the proper authorities to make an arrest. Your APM solution should make it easy to report the right information to different audiences.
Close the loop and continuously improve
When all this is done, it isn’t the end. RCA done the right way is an iterative process. Quarterly or annual reviews of RCAs, action items, and results make the work even more meaningful. You should also revisit your RCA process from time to time and look for ways to improve it. A data-driven approach will increase the team’s understanding of how the application works and ensure that each fresh mystery is solved in a way that makes the application more resilient over time.
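A periodic review is easier when there is a number to track. As one possible starting point, the sketch below computes mean time to resolution (MTTR) per quarter from detection and resolution timestamps. The incident data and the `mttr_by_quarter` function are made up for illustration, and teams define the MTTR start and end points differently.

```python
from collections import defaultdict
from datetime import datetime

def mttr_by_quarter(incidents):
    """Return {(year, quarter): mean minutes from detection to resolution}."""
    durations = defaultdict(list)
    for detected, resolved in incidents:
        quarter = (detected.month - 1) // 3 + 1
        minutes = (resolved - detected).total_seconds() / 60
        durations[(detected.year, quarter)].append(minutes)
    return {key: sum(vals) / len(vals) for key, vals in durations.items()}

# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 11, 30)),  # 90 min
    (datetime(2024, 3, 2, 8, 0), datetime(2024, 3, 2, 10, 30)),     # 150 min
    (datetime(2024, 4, 20, 14, 0), datetime(2024, 4, 20, 14, 30)),  # 30 min
    (datetime(2024, 6, 5, 9, 0), datetime(2024, 6, 5, 9, 50)),      # 50 min
]

mttr = mttr_by_quarter(incidents)
print(mttr)
```

Comparing the figure quarter over quarter gives the review a concrete signal of whether changes to the RCA process are actually shortening resolution times.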
Root cause analysis plays an important role not just in general operations but also in continuous improvement, especially with regard to the customer experience. The goal is to get to the bottom of an outage, slowdown, or other problem to protect the business and better understand how the application really works. It’s important to simplify root cause analysis so that you spend less time fixing problems and can resolve them before they become major production issues. And this saves more than money: the skills learned during RCA can be carried over to any other problem or field of IT that supports continuous improvement and innovation.