Part one of our series on deploying AIOPs identified how an anomaly breaks into two broad areas: problem time and solution time. Part two described the first deployment phase, which focuses on reducing problem time. With trust in the AIOps systems growing, we’re now ready for part three: taking on solution time by automating actions.
Applying AIOps to Mean Time to Fix (MTTFix)
The power of AIOps comes from continuous enhancement of machine learning powered by improved algorithms and training data, combined with the decreasing cost of processing power. A measured example was Google’s project for accurately reading street address numbers from its street image systems—a necessity in countries where address numbers don’t run sequentially but rather are based on the age of the buildings. Humans examining photos of street numbers have an accuracy of 98%. Back in 2011, the available algorithms and training data produced a trained model with 91% accuracy. By 2013, improvements and retraining boosted this number to 97.5%. Not bad, though humans still had the edge. In 2015, the latest ML models passed human capability at 98.1%. This potential for continuous enhancement makes AIOps a significant benefit for operational response times.
You Already Trust AI/ML with Your Life
If you’ve flown commercially in the past decade, you’ve trusted the autopilot for part of that flight. At some major airports, even the landings are automated, though taxiing is still left to pilots. Despite already trusting AI/ML to this extent, enterprises need more time to trust AI/ML in newer fields such as AIOps. Let’s discuss how to build that trust.
Apprenticeships allow new employees to learn from experienced workers and avoid making dangerous mistakes. They’ve been used for ages in multiple professions; even police departments have a new academy graduate ride along with a veteran officer. In machine learning, ML frameworks need to see meaningful quantities of data in order to train themselves and create nested neural networks that form classification models. By treating automation in AIOps like an apprenticeship, you can build trust and gradually weave AIOps into a production environment.
By this stage, you should already be reducing problem time by deploying AIOps, which delivers significant benefits before adding automation to the mix. These advantages include the ability to train the model with live data, as well as observe the outcomes of baselining. This is the first step towards building trust in AIOps.
Stage One: AIOps-Guided Operations Response
With AIOps in place, operators can address anomalies immediately. At this stage, operations teams are still reviewing anomaly alerts to ensure their validity. Operations is also parsing the root cause(s) identified by AIOps to select the correct issue to address. While remediation is manual at this stage, you should already have a method of tracking common remediations.
In stage one, your operations teams oversee the AIOps system and simultaneously collect data to help determine where auto-remediation is acceptable and necessary.
Stage Two: Automate Low Risk
Automated computer operations began around 1964 with IBM’s OS/360 operating system allowing operators to combine multiple individual commands into a single script, thus automating multiple manual steps into a single command. Initially, the goal was to identify specific, recurring manual tasks and figure out how to automate them. While this approach delivered a short-term benefit, building isolated, automated processes incurred technical debt, both for future updates and eventual integration across multiple domains. Ultimately it became clear that a platform approach to automation could reduce potential tech debt.
Automation in the modern enterprise should be tackled like a microservices architecture: Use a single domain’s management tool to automate small actions, and make these services available to complex, cross-domain remediations. This approach allows your investment in automation to align with the lifespan of the single domain. If your infrastructure moves VMs to containers, the automated services you created for networking or storage are still valid.
You will not automate every single task. Selecting what to automate can be tricky, so when deciding whether to fully automate an anomaly resolution, use these five questions to identify the potential value:
- Frequency: Does the anomaly resolution occur often enough to warrant automation?
- Impact: Are you automating the solution to a major issue?
- Coverage: What proportion of the real-world process can be automated?
- Probability: Does the process always produce the desired result, or can it be impacted by environmentals?
- Latency: Will automating the task achieve a faster resolution?
Existing standard operating procedures (SOPs) are a great place to start. With SOPs, you’ve already decided how you want a task performed, have documented the process, and likely have some form of automation (scripts, etc.) in place. Another early focus is to address resource constraints by adding front-end web servers when traffic is high, or by increasing network bandwidth. Growing available resources is low risk compared to restarting applications. While bandwidth expansion may impact your budget, it’s unlikely to break your apps. And by automating resource constraint remediations, you’re adding a rapid response capability to operations.
In stage two, you augment your operations teams with automated tasks that can be triggered in response to AIOps-identified anomalies.
Stage Three: Connect Visibility to Action (Trust!)
As you start to use automated root cause analysis (RCA), it’s critical to understand the probability concept of machine learning. Surprisingly, for a classical computer technology, ML does not output a binary, 0 or 1 result, but rather produces statistical likelihoods or probabilities of the outcome. The reason this outcome sometimes looks definitive is that a coder or “builder” (the latter if you’re AWS’s Andy Jassy) has decided an acceptable probability will be chosen as the definitive result. But under the covers of ML, there is always a percentage likelihood. The nature of ML means that RCA sometimes will result in a selection of a few probable causes. Over time, the system will train itself on more data and probabilities and grow more accurate, leading to single outcomes where the root cause is clear.
Once trust in RCA is established (stage one), and remediation actions are automated (stage two), it’s time to remove the manual operator from the middle. The low-risk remediations identified in stage two can now be connected to the specific root cause as a fully automated action.
The benefits of automated operations are often listed as cost reduction, productivity, availability, reliability and performance. While all of these apply, there’s also the significant benefit of expertise time. “The main upshot of automation is more free time to spend on improving other parts of the infrastructure,” according to Google’s SRE project. The less time your experts spend in MTTR steps, the more time they can spend on preemption rather than reaction.
Similar to DevOps, AIOps will require a new mindset. After a successful AIOps deployment, your team will be ready to transition from its existing siloed capabilities. Each team member’s current specialization(s) will need to be accompanied with broader skills in other operational silos.
AIOps augments each operations team, including ITOps, DevOps and SRE. By giving each team ample time to move into preemptive mode, AIOps ensures that applications, architectures and infrastructures are ready for the rapid transformations required by today’s business.