Slowdown is the New Outage (SINTO)

Common application outage sources have been addressed by implementing Agile, DevOps and CI/CD processes. The resulting increase in system uptime allows site reliability engineers (SREs) to shift their focus to performance tuning, and for good reason. While outage-driven news headlines can cause stock prices to plummet in the short term, performance-driven reputation loss is a slow burn that costs customers over the long term.

Whether customers arrive via web browsers, smartphones or Internet of Things devices, slowdowns drive them to abandon shopping carts and consider competitors. Slowdowns lead to reputation loss for enterprises—a loss that may even follow an engineer from job to job. If you were considering hiring an SRE, how much weight would you give to the company’s reputation for poor or unpredictable customer experiences?

Just as high blood pressure is a silent killer of people, slowdown is the silent killer of reputations.

Slowdowns vs Outages

Consider the significant difference between outages and slowdowns: an outage is loud and unmistakable, while a slowdown, like the high blood pressure it resembles, stays silent and does its damage before anyone notices.
Slowdowns are commonly the result of a resource constraint: either you don’t have enough of a resource, or you’re using it poorly and causing contention. Too many network transactions on narrow bandwidth, or system memory filled with unnecessary locked pages, can each produce a slowdown. In a prior life managing hospital data centers, I saw invalid HL7 messages generate recurring error records in message queues, choking inter-hospital communications. Nurses had to run results between laboratories and wards by hand because the needless error messages slowed the delivery of genuine laboratory results. We know outages lose customers, but when there are no outages, what will drive customer loss?
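
To make the pattern concrete, here is a minimal watchdog sketch for the kind of queue contention described above. It is illustrative only: the queue depth and error counts are hypothetical stand-ins for whatever your messaging layer actually exposes.

    # Hypothetical watchdog: flags a slowdown risk when a message queue
    # keeps growing and error records dominate the traffic on it.
    from collections import deque

    class QueueWatchdog:
        def __init__(self, window=10, growth=1.5, error_ratio=0.5):
            self.depths = deque(maxlen=window)   # recent queue-depth samples
            self.growth = growth                 # backlog growth that worries us
            self.error_ratio = error_ratio       # share of errors that worries us

        def observe(self, queue_depth, error_count, total_count):
            self.depths.append(queue_depth)
            if len(self.depths) < self.depths.maxlen:
                return None  # not enough history to judge yet
            growing = self.depths[-1] > self.growth * max(self.depths[0], 1)
            noisy = error_count / max(total_count, 1) > self.error_ratio
            if growing and noisy:
                return "slowdown risk: backlog driven by error messages"
            return None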

Slowdown is the new outage. #slowdownisthenewoutage #SINTO

Insight vs Observability

DevOps methodologies came with a minimum requirement for monitoring application performance in production.


In turn, SRE comes with the requirement for observability—the capacity to reach into the code and answer an unpredictable question.

While observability supports diagnosis, insight is needed for resolution. SRE implementations create a team of engineers delivering a platform of products and processes that developers use to ensure the highest availability. In addition, SRE moves the focus from reaction to proaction, generating a requirement to spot the early predictors of slowdown. This creates the need for a way to observe what code is doing while running in production. Observable metrics need context to become actionable insight.

AIOps delivers the ML-driven automatic baselines and contextual correlation to allow SRE teams to engage preemptively. Once a predictor anomaly is triggered, the SRE team can respond by updating a SQL query, coding a new function call, or scaling up resources to prevent the slowdown from escalating into a threat to the business. Post-response, the SRE team can then pass the details back to the application owners for longer-term resolutions. 

While DTrace or manual breakpoints may be great for single applications on single machines, they will “often fall short while debugging distributed systems as a whole,” notes Cindy Sridharan in Distributed Systems Observability. When trying to diagnose a complete customer experience that relies on multiple business transactions across distributed, multi-cloud production applications, observability falls short of insight. The good news is that if you implemented monitoring as part of your DevOps rollout, the APM used to react to outages can be expanded to observe and diagnose slowdowns.

Finding Insight on Top of Observability

Neither monitoring nor observability is an end unto itself. For slowdown detection, we must see the broader picture of the total user experience. We must be able to take a step back from our usual I-shaped technical silos and apply T-shaped skills to seek insight into the causes of slowdowns. 

Supporting observability can overload applications with additional code that creates metrics for APM to capture. Observability only requires that the individual metrics be present within the code; it does not correlate them into the overall customer experience.

Delivering insight requires several key functions: 

  • Baselines identifying normal performance
  • Segmented metrics of customer business transactions to identify weak points
  • Levers to isolate code portions within the production environment
  • Common trusted metric sources that span technology silos
  • Overhead minimization when performance is normal 
  • Noise filtering through ML-trained anomaly detection

Creating observability within each application individually incurs technical debt, while an SRE-supporting APM solution can deliver observability across multiple applications. Moving to a DevOps or SRE model is problematic if you lack an understanding of how to observe metrics and gain insight from them. Read more on how APM applies to DevOps.

Remember, it is the metric you don’t watch that bites you.

How to Future-Proof Your App Environment: The Four Plays for IT Success

As your enterprise IT infrastructure grows more distributed by the minute, you might know what it feels like to be treading water in an ocean of data.

And you’re not alone. For most companies developing and monitoring apps to accommodate increasingly fragmented customer experiences, innovation comes with complexity in the form of fragmented data and analytics.

Last year, we conducted a survey to probe deeper into the problem and noticed a worrying trend. Under the weight of mounting data, IT leaders are spending too much time reacting to alerts and not enough time building solutions that can optimize performance proactively. At the same time, we’re headed for big changes in the world of application performance management — a world that’s fast outgrowing data distribution.

That’s why we felt it was time to distill everything we know about today’s IT challenges, and how we’re helping our customers solve them, into a four-part playbook for success. It maps out the four most critical areas to focus on when priming your environment to withstand the data explosion and take advantage of advanced APM.

Rest assured these are plays we’ve tried and proven with our customers as they’ve successfully built future-proof environments. If you’re ready to join them in building a truly automated, proactive approach to managing business and IT performance, this playbook is for you. Download a free copy here. If you’re not sure, let me give you a quick intro to the four plays, why they’ve proven so critical, and what to expect from the playbook.

The AIOps play

Our survey found that as many as 42% of IT leaders use monitoring and analytics tools to resolve technical issues reactively. Add the fact that they’re resolving system-wide issues in silos, and it’s no wonder their mean time to resolution (MTTR) now exceeds an average of one business day, which by our calculations costs an average of $402,542. Yep. That’s for a single outage.

Revenue aside, this approach clearly has no place in a world where consumers have higher than ever expectations for flawless application performance.

With this in mind, most survey respondents (74%) said they want to start using monitoring tools proactively to lower MTTR and protect their bottom line. This is where AIOps comes in.

AIOps is an emerging movement that applies machine learning and AI to IT operations — specifically by enabling self-healing before revenue-impacting problems arise. It’s no substitute for good development, of course, but self-healing is a capability Google considers critically important to the enterprise.

What’s interesting is that AIOps still isn’t widely adopted by IT teams. Only 15% of our survey respondents identified it as a two-year priority, so there’s no better time to adopt it for a competitive edge. Get guidance on how to deploy it in the playbook.

The cloud play

Cloud adoption has become a strategic imperative for enterprises grappling with vast amounts of data, and adoption has never been higher. According to the 2018 Cisco Global Cloud Index, 95% of all compute workloads will run in public or private clouds by 2021.

Does that mean just because everyone else is doing it, you should too?

Consider the two main reasons (from a technical perspective) for why companies are moving to the cloud:

  1. To manage as little infrastructure as possible, whether consolidating workloads onto fewer physical servers, moving virtualized workloads onto someone else’s servers, or going completely serverless.
  2. To facilitate as much scalability and innovation within applications as possible.

At AppDynamics, our customers have also found that cloud migration creates more opportunities to demonstrate the value of APM by revealing how applications perform end-to-end before and after the shift to the cloud. Conversely, this transaction-level visibility helps derisk the migration itself and translates into quite a few business benefits.

We go over these in the playbook, as well as a choose-your-own-adventure roadmap for APM-supported migration based on your company’s readiness for the cloud.

The digital experience monitoring (DEM) play

Delivering great customer experiences isn’t as simple as it once was. Customers expect apps to just… work. In seconds. Every time. That means you’re tasked with keeping tabs on every facet of their experience — and in cases of abandoned shopping carts and other adverse user behaviors, reacting quickly. But as the customer experience grows more complex, it becomes harder to track end-to-end.

A desire for cross-functional visibility into complex digital experiences is in part what’s driven digital transformation, which like cloud adoption is imperative to the innovation that makes those experiences amazing. Only 27% of technologists feel ready for digital transformation, but most recognize it as an urgent challenge to overcome.

One way is through digital experience monitoring (DEM).

DEM monitors the “operational excellence and behavior of a digital agent, human or machine, as it interacts with enterprise applications and services,” as defined in Gartner’s 2019 Magic Quadrant for Application Performance Monitoring. In other words, DEM reframes the “customer” experience to include every human- or machine-generated interaction across your digital footprint, correlating these with application performance and business KPIs to capture the overall business value of your apps.

Gartner predicts enterprises will quadruple their APM functionality through 2021 to accommodate DEM. In the playbook, we explain how to create your own strategy.

The DevOps play

Another key tenet of digital transformation is DevOps, a movement designed to support agile software development.

When dev teams are siloed off from ops, you can expect ongoing performance problems in production. What you really want is ongoing application innovation and availability. In today’s digital business, the only way to continually meet customer needs with new features is to transform your software development lifecycle to a DevOps approach that promotes collaboration business-wide for faster, higher-quality releases.

And the rise of DevOps has already had a huge impact on the evolution of IT. As the demand for more and faster innovation increases, it’s helping organizations deploy more code more efficiently.

Still, few are evaluating the holistic impact of new code on application performance. In our survey, 91% of respondents said their monitoring tools only reveal how each release drives the performance of their own area of responsibility, drastically reducing visibility into and preparation for potential system-wide issues.

Monitoring continuous improvement more effectively requires making it simpler for teams to get the data they need for their role in the DevOps lifecycle. The playbook provides a DevOps toolkit you can use for better visibility.

Download the playbook for step-by-step strategies

So, could your business benefit from one or more of these plays?

Download the playbook to dive into the strategic stuff. You’ll also find mini case studies and real-world advice from AppDynamics customers.

3 Practical Applications of the Central Nervous System for IT

In a bid to meet customer demand, businesses today are innovating and scaling in faster, more cost-effective ways than ever before. But with this new scale and efficiency also comes greater operational complexity, reducing visibility across the entire technology stack.

As a result, most organizations don’t fully understand the connection between the changes they make and the impact these changes have on customer experience and business performance.

But by embracing an AIOps mindset—and leveraging an AI-powered APM solution—organizations can drive deep cross-domain visibility, insights and automation to deliver an exceptional customer experience.

Introducing the Central Nervous System (CNS)

Similar to how the human central nervous system processes and interprets sensory input to decide what actions should be taken, the Central Nervous System (CNS) for IT—the new vision for AIOps from Cisco and AppDynamics—helps organizations sift through volumes of data to derive real-time insights and orchestrate targeted actions.

So, how are organizations using the CNS in their day-to-day IT practices to accelerate their journeys to AIOps?

Three Real-World Applications of the CNS

The new Central Nervous System Use Case Guide details three practical ways enterprise IT teams are leveraging the combined power of AppDynamics and Cisco:

#1 – Intelligent Auto-Scaling

While IT teams need to keep a close watch over traffic and load to know when to scale an application up or down, many monitoring tools unfortunately require complex, manually intensive workflows.

With AppDynamics’ lightweight agent installed across the application ecosystem, metrics such as load, response time, and errors are automatically collected. These data points are then used to create a dynamic baseline and detect performance anomalies. And through its integration with Cisco Data Center products, when AppDynamics detects an anomaly, it correlates the app data with network metrics from Cisco ACI to identify root cause and recommend an action.
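
The integration itself is product-specific, but the decision flow is easy to sketch. This illustrative Python fragment (the metric names and functions are hypothetical, not AppDynamics or ACI APIs) shows how a detected response-time anomaly might be joined with network metrics to pick a recommended action:

    # Illustrative decision flow: join an application anomaly with network
    # health to choose a remediation. All names here are hypothetical.
    def recommend_action(app, network):
        anomalous = app["response_time_ms"] > 2 * app["baseline_ms"]
        if not anomalous:
            return "no action"
        if network["packet_loss_pct"] > 1.0:
            # Correlation points at the fabric, not the app tier.
            return "escalate: likely network root cause"
        if app["cpu_pct"] > 80:
            return "scale out: add an application instance"
        return "collect diagnostics: no obvious resource cause"

    print(recommend_action(
        {"response_time_ms": 950, "baseline_ms": 300, "cpu_pct": 88},
        {"packet_loss_pct": 0.1},
    ))  # -> scale out: add an application instance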

#2 – Automatic Workload Optimization

To benefit from easier deployment and scalability, 86 percent of enterprises today have adopted a multi-cloud strategy. But managing a multi-cloud or hybrid environment brings with it a new set of challenges.

To tame the resulting complexity, AppDynamics can be paired with Cisco’s Workload Optimization Manager (CWOM). Real-time performance insights from AppDynamics are fed to CWOM, which can instantly scale resources up or down to alleviate or prevent performance issues. As a result, IT teams can optimize and orchestrate workloads, while also ensuring that infrastructure is never the cause of application performance problems.

#3 – Network Triage

Application and network teams too often operate in silos, which can lead to application slowdowns and longer outage-resolution times. Network engineers, for example, often have no application context when troubleshooting network performance issues, making it difficult to isolate the problem impacting end users.

Using AppDynamics with Cisco’s Application Centric Infrastructure (ACI) provides an integrated application-to-network view—from code to underlying infrastructure—of business applications running across multiple clouds and data centers. This gives AppOps and NetOps admins visibility into the entire IT environment, making it easier and faster to identify root cause and troubleshoot the problem in the network.

Dave Wilson, senior director of IT infrastructure and architecture at Paychex, sums up the practical value his organization has seen from the CNS best: “Meeting the demands of our 650,000 customers by delivering a flawless digital experience is our team’s number one priority. However, as we adapt new technologies like AI, chatbots, and self-service tools, the complexity of these new technologies makes delivering that seamless experience to our millions of monthly users even more critical. AppDynamics and Cisco’s vision for the future helps provide a deeper level of visibility and insight into our application environments and can also take automated actions to swiftly improve our digital experience for users.”

Read the Guide

To dive deeper into these three practical applications of the CNS and learn how this powerful new platform can help you on your journey to AIOps, read the Central Nervous System Use Case Guide.

MIT Review: Global Business Leaders Turning to AIOps to Drive IT Performance

When a U.S. company experiences an application outage, the average cost to the business is well over $400,000.

So it makes sense that both IT and business leaders would do everything in their power to avoid such costly outages. Yet in a recent global survey of more than 6,000 IT leaders, 97% of respondents reported that their organization has had at least one outage in the last six months alone. For IT and business leaders alike, this is a troubling data point. But when considered in the context of increasingly complex application environments, the situation takes on a new level of urgency. The takeaway becomes clear: Businesses can’t afford to react to performance problems; they have to solve them proactively.

But how?

The new frontier for IT operations

Today’s IT landscape is different than it was even just a few years ago. Increasingly complex application environments and distributed IT systems have made it more challenging to ensure reliability, as well as effectively monitor and manage the performance of applications and systems. Sticking with the same approach to application performance monitoring (APM) just doesn’t cut it anymore.

A new report from MIT Technology Review reveals that, to stay competitive, global leaders like FedEx are increasingly turning to AIOps to drive IT and business performance. And they’re not alone: According to Gartner, 30% of large corporations are projected to exclusively use AIOps tools to monitor applications and infrastructure by 2023, up from just 5% in 2018. But before we can understand the true value of this emerging trend and why you might need it, we’ve got to define what it is.

What is AIOps?

AIOps — a term coined by Gartner and short for “artificial intelligence for IT operations” —refers to the use of artificial intelligence (AI) and machine learning (ML) to automate data correlation, enable root cause analysis, and deliver predictive insights for both IT teams and businesses. AIOps solutions leverage ML to not only automate routine tasks, but also gather and interpret large volumes of historical data to identify potential problems before they manifest themselves in IT environments.

If a server is reaching capacity, for example, AIOps technology can alert the IT team, giving them the opportunity to take action before it even impacts the end user. AIOps can also help IT departments predict capacity for data centers, measure the effectiveness of an organization’s main business applications, and perform cause-and-effect analysis of peak usage traffic patterns.
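
As a toy illustration of that kind of capacity prediction (not how any particular AIOps product implements it), fitting a linear trend to recent utilization samples is enough to estimate when a server will hit its limit:

    # Toy capacity forecast: fit a linear trend to recent utilization
    # samples and estimate hours until the server reaches 100%.
    def hours_until_full(samples):
        """samples: list of (hour, percent_used) tuples, oldest first."""
        n = len(samples)
        mean_x = sum(x for x, _ in samples) / n
        mean_y = sum(y for _, y in samples) / n
        slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / \
                sum((x - mean_x) ** 2 for x, _ in samples)
        if slope <= 0:
            return None  # usage flat or shrinking; nothing to alert on
        last_x, last_y = samples[-1]
        return (100.0 - last_y) / slope  # hours left at the current trend

    usage = [(0, 62.0), (1, 63.1), (2, 64.4), (3, 65.2)]
    print(hours_until_full(usage))  # alert the team if this drops below, say, 48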

Why IT teams need an AIOps solution

There’s no arguing that traditional monitoring tools still play a vital role in any APM strategy. But today, IT leaders should also be thinking about building a more proactive approach to performance via AIOps technology.

With its AI and machine learning capabilities, AIOps empowers IT teams to drive innovation, reduce MTTR, and optimize business outcomes. By leveraging company-wide data, monitoring operational and usage statistics, and proactively solving performance problems, AIOps streamlines IT operations and has the potential to prevent outages that could damage a company’s reputation and bottom line.

AIOps in the real world

Given the capabilities of AIOps solutions, the MIT report points out that, “IT executives are increasingly seeking them out as a means to help organizations retrieve, analyze, and extract value from IT operations data.”

FedEx and Stromberg & Forbes are two organizations that are using AIOps to build a more proactive approach to performance monitoring. While still in the early stages of their AIOps journeys, the technology has already delivered clear wins for both organizations.

At Stromberg & Forbes, CIO Steve Sommer estimates that AIOps shaves off “one-third to one-half” of the hours the company spends on routine maintenance and troubleshooting.

“AIOps is mission control and command central. [Without the technology] I as a human could spend days and weeks trying to navigate through large, complex data sets trying to find solutions to issues,” Sommer points out.

Similarly, AIOps has allowed FedEx to accelerate issue resolution and significantly reduce the manual intervention required to troubleshoot and solve performance problems. Sergio Puga, Senior Technical Program Manager at FedEx Services, says it best: “If our team investigated the same CPU utilization problem using current monitoring tools, it would take six-to-10 full-time employees two-to-five hours to find the source and perform remediation.”

Puga sums up the benefits of AIOps technology this way: “It’s going to drive innovation, streamline FedEx’s operations and make us more reliable, efficient, and competitive—and make my job a lot easier.”

The future of APM

The use of AIOps solutions is poised for exponential growth over the next five years.  And it’s easy to see why: AIOps has the power to deliver transformative benefits for both IT and the business.

Using artificial intelligence and analytics, AIOps can help IT teams avoid expensive outages and reduce MTTR. And because AIOps is designed to uncover insights more efficiently, it can help the business improve the bottom line and preserve the end-user experience.

So, are you ready to usher in this new era of AIOps?

Get the free report from MIT Technology Review for deeper insights into why AIOps is the key to managing increasingly complex application environments.

The Power of Real-Time: How the On-Demand Revolution Is Changing Performance Monitoring

How brands engage with customers and drive revenue depends on the digital experiences powered by applications.

Today, thanks to companies like Uber — which recently had the biggest IPO of 2019 — along with the likes of Amazon, Airbnb, and more, consumers can order a ride, buy groceries, transfer money, or book a place to stay — all in just a few clicks within an app. In this new world, how brands engage with customers, and how they drive revenue, increasingly depends on the digital experiences powered by applications. In many ways, the app isn’t just a part of the business — it is the business.

While it’s true that this shift has afforded many brands the opportunity to build closer relationships with consumers, those relationships are at risk when user experience is poor and performance issues strike.

That’s because the on-demand revolution has altered people’s expectations for the customer experience.

Nowadays, people expect the apps they use each day to just work. And the users who depend on them aren’t content to wait minutes, hours, or days until a resolution is found. 

Sound harsh? Maybe so.

But in this new on-demand economy, the rules have changed. Consumers have more choices than ever before, and competition is fierce. For every great offering like Uber, there’s a strong alternative like Lyft. In these market conditions, experience and performance are compelling differentiators.

Why Real-Time Matters More to Your Performance Monitoring 

In the on-demand world, customer experiences happen in seconds.

If someone can’t book a ride because your app is slow, or make a purchase on your e-commerce site because a critical page in the checkout process won’t load, the battle for their attention — and their business — may already be lost. 

Surprisingly, many IT organizations still operate in reactive mode, waiting for problems to surface and losing valuable time — and revenue — in the process. In fact, according to recent research from AppDynamics, 58% of IT teams find out about performance issues from users calling or emailing their organization’s help desk, and 55% find out from an executive or non-IT team member at their company who informs the IT org.

So, what should businesses and IT leaders do to chart a course forward?

  • Leverage real-time insights to drive customer experience wins. Don’t wait for weekly business intelligence reports to tell you where you need to improve the customer experience. Use real-time insights to monitor performance in relationship to revenue, campaign conversion rates, and overall user engagement so you can make enhancements on the fly, and delight customers in the process.
  • Personalize experiences to drive value for the end user. You can have the best product in the world, but if your promotional codes don’t work, or your website won’t load, it won’t make a difference. To maximize your investment in digital experiences, you must track application performance as well.
  • Put application performance front and center. Leaders should be tracking application performance as it relates to specific lines of business on a real-time basis. This provides valuable context for prioritizing optimizations and helps you lay the groundwork for a proactive approach to running your production environment.


Businesses invest in sophisticated marketing activities, thoughtful product development, and value-driven sales efforts. But as applications become a critical part of the customer experience, investments must be made in performance optimization as well.

From Operating to Innovating: The Changing Performance Landscape

The on-demand revolution, prompted by the rise of companies like Uber, Airbnb, Amazon and others, has made instant gratification a part of our lives and elevated the importance of the customer experience. In this new world, it’s not enough to manage performance reactively. Instead, businesses must take a proactive approach that helps them win and retain customers and drive growth, all in real time.

3 New Technologies Enterprises Should Consider for Advanced APM

The game remains the same, but the rules are changing for technologists of today’s digital enterprise. While cloud adoption hasn’t altered their commitment to end-to-end customer satisfaction, it has made delivering that satisfaction more complex from a performance and reliability perspective. That’s why the proliferation of cloud means a corresponding increase in the importance of application performance monitoring, and in the power of strong alignment between technology and key customer touch points.

To handle the mounting complexity of application environments as well as the data generated by them, enterprises are now shifting their focus to handling rapid data growth while maintaining service availability.

But this requires prioritizing new — but crucial — capabilities when it comes to performance monitoring. According to a new report from 451 Research, IT leaders should look for three key elements to better prepare themselves for this shift.

Here’s a quick summary of the report’s recommendations.

APM + ACI

From its research, 451 found that network admins are increasingly siloed, leading to application slowdowns and longer outage-resolution times. For the admins themselves, of course, that organizational structure makes it difficult to connect the dots in the data.

The solution, 451 Research says, is to integrate application-level data with that of the wider network. This is something we’ve been working on with Cisco ACI, for example. By correlating ACI’s network intelligence with our own visibility into applications and infrastructure monitoring, we can find performance problems related to issues in our customers’ networks.

At the same time, this solution gets application and network teams on the same page. With a shared understanding of how apps work together, they can solve issues faster and ultimately deliver positive user experiences.

Serverless Monitoring

It’s no surprise everyone’s so excited about serverless (or FaaS). Services like AWS Lambda, which commands a solid 70% share of the serverless market, are great for building and deploying apps with more agility and without worrying about the nuts and bolts of infrastructure.

But maintaining performance? That comes with its fair share of challenges. While still in their early days, third-party serverless architectures come with a trade-off: lack of control over system downtime, functionality, or unexpected limits.

The problem isn’t unsolvable; it just poses new technical hurdles for APM vendors and exciting opportunities for innovation. According to 451 Research, vendors are working on innovative ways to bring their conventional agents into this serverless paradigm. Lambda monitoring, for example, can provide more visibility into application performance by tracing transactions end-to-end through the system architecture.
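
To give a feel for the approach, here’s a hand-rolled sketch rather than any vendor’s agent: wrap the Lambda handler so each invocation reports its own duration. The emit_metric function is a hypothetical stand-in for whatever telemetry exporter you actually use.

    # Sketch of handler wrapping for serverless visibility. emit_metric is
    # a hypothetical placeholder for a real telemetry backend.
    import functools
    import time

    def emit_metric(name, value, tags=None):
        print(f"metric {name}={value} tags={tags}")  # stand-in exporter

    def traced(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            start = time.time()
            try:
                return handler(event, context)
            finally:
                emit_metric("lambda.duration_ms",
                            (time.time() - start) * 1000,
                            tags={"function": context.function_name})
        return wrapper

    @traced
    def handler(event, context):
        return {"statusCode": 200}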

Machine Learning-Driven Monitoring

Similarly, advanced analytics has enhanced user experiences across industries from streaming (à la Netflix recommendations) to self-driving cars. Digital companies can now think bigger about technological innovation without managing all the complexities of reporting.

The most recent use case underway in our space is machine learning-powered infrastructure monitoring.

By shifting to an artificial intelligence for IT operations (or AIOps) model, APM vendors can sift business-impacting problems from massive amounts of data not only in ways businesses can easily understand, but also in ways that prevent problems from happening in the future. In other words, machine learning can automate root cause analysis. For digital enterprises, that means increased agility in the face of potential downtime.

Want more details on the next wave of APM tech? Download the report

We’re barely scratching the surface of what these new technologies can do for performance monitoring. Big data shows no signs of slowing any time soon, and customer expectations are rising just as quickly. The question is how technology will keep up with the level of service expected, and when, not if, enterprises can keep pace too.

Check out the full report for 451 Research’s take on these three capabilities and how ours stack up against our competitors (complete with a SWOT analysis you might find useful). You can download your copy of the report here.

Successfully Deploying AIOps, Part 3: The AIOps Apprenticeship

Part one of our series on deploying AIOps identified how an anomaly breaks into two broad areas: problem time and solution time. Part two described the first deployment phase, which focuses on reducing problem time. With trust in the AIOps system growing, we’re now ready for part three: taking on solution time by automating actions.

Applying AIOps to Mean Time to Fix (MTTFix)

The power of AIOps comes from continuous enhancement of machine learning powered by improved algorithms and training data, combined with the decreasing cost of processing power. A measured example was Google’s project for accurately reading street address numbers from its street image systems—a necessity in countries where address numbers don’t run sequentially but rather are based on the age of the buildings. Humans examining photos of street numbers have an accuracy of 98%. Back in 2011, the available algorithms and training data produced a trained model with 91% accuracy. By 2013, improvements and retraining boosted this number to 97.5%. Not bad, though humans still had the edge. In 2015, the latest ML models passed human capability at 98.1%. This potential for continuous enhancement makes AIOps a significant benefit for operational response times.

You Already Trust AI/ML with Your Life

If you’ve flown commercially in the past decade, you’ve trusted the autopilot for part of that flight. At some major airports, even the landings are automated, though taxiing is still left to pilots. Despite already trusting AI/ML to this extent, enterprises need more time to trust AI/ML in newer fields such as AIOps. Let’s discuss how to build that trust.

Apprenticeships allow new employees to learn from experienced workers and avoid making dangerous mistakes. They’ve been used for ages in multiple professions; even police departments have a new academy graduate ride along with a veteran officer. In machine learning, ML frameworks need to see meaningful quantities of data in order to train themselves and create nested neural networks that form classification models. By treating automation in AIOps like an apprenticeship, you can build trust and gradually weave AIOps into a production environment.

By this stage, you should already be reducing problem time by deploying AIOps, which delivers significant benefits before adding automation to the mix. These advantages include the ability to train the model with live data, as well as observe the outcomes of baselining. This is the first step towards building trust in AIOps.

Stage One: AIOps-Guided Operations Response

With AIOps in place, operators can address anomalies immediately. At this stage, operations teams are still reviewing anomaly alerts to ensure their validity. Operations is also parsing the root cause(s) identified by AIOps to select the correct issue to address. While remediation is manual at this stage, you should already have a method of tracking common remediations.

In stage one, your operations teams oversee the AIOps system and simultaneously collect data to help determine where auto-remediation is acceptable and necessary.

Stage Two: Automate Low Risk

Automated computer operations began around 1964 with IBM’s OS/360 operating system allowing operators to combine multiple individual commands into a single script, thus automating multiple manual steps into a single command. Initially, the goal was to identify specific, recurring manual tasks and figure out how to automate them. While this approach delivered a short-term benefit, building isolated, automated processes incurred technical debt, both for future updates and eventual integration across multiple domains. Ultimately it became clear that a platform approach to automation could reduce potential tech debt.

Automation in the modern enterprise should be tackled like a microservices architecture: Use a single domain’s management tool to automate small actions, and make these services available to complex, cross-domain remediations. This approach allows your investment in automation to align with the lifespan of the single domain. If your infrastructure moves VMs to containers, the automated services you created for networking or storage are still valid.

You will not automate every single task. Selecting what to automate can be tricky, so when deciding whether to fully automate an anomaly resolution, use these five questions to identify the potential value:

  • Frequency: Does the anomaly resolution occur often enough to warrant automation?
  • Impact: Are you automating the solution to a major issue?
  • Coverage: What proportion of the real-world process can be automated?
  • Probability: Does the process always produce the desired result, or can it be affected by environmental factors?
  • Latency: Will automating the task achieve a faster resolution?

Existing standard operating procedures (SOPs) are a great place to start. With SOPs, you’ve already decided how you want a task performed, have documented the process, and likely have some form of automation (scripts, etc.) in place. Another early focus is to address resource constraints by adding front-end web servers when traffic is high, or by increasing network bandwidth. Growing available resources is low risk compared to restarting applications. While bandwidth expansion may impact your budget, it’s unlikely to break your apps. And by automating resource constraint remediations, you’re adding a rapid response capability to operations.
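
A sketch of that composability, with hypothetical single-domain actions rather than a real API, might look like this. Each small action wraps one domain’s management tool; the remediation simply composes them.

    # Illustrative low-risk remediation composed from single-domain actions.
    # add_web_server and raise_bandwidth are hypothetical wrappers around
    # your orchestrator's and network tool's own automation.
    def add_web_server(pool):
        print(f"scaling out pool '{pool}'")

    def raise_bandwidth(link, mbps):
        print(f"raising {link} to {mbps} Mbps")

    def remediate(anomaly_type):
        if anomaly_type == "high_frontend_traffic":
            add_web_server(pool="frontend")        # growing resources: low risk
        elif anomaly_type == "network_saturation":
            raise_bandwidth(link="wan-1", mbps=2000)
        else:
            print("no automated remediation; routing to operations")

    remediate("high_frontend_traffic")

If the infrastructure later moves from VMs to containers, only the wrapper bodies change; the cross-domain remediation that calls them stays intact.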

In stage two, you augment your operations teams with automated tasks that can be triggered in response to AIOps-identified anomalies.

Stage Three: Connect Visibility to Action (Trust!)

As you start to use automated root cause analysis (RCA), it’s critical to understand machine learning’s probabilistic nature. Surprisingly for a classical computing technology, ML does not output a binary 0-or-1 result, but rather produces statistical likelihoods of each outcome. The reason this output sometimes looks definitive is that a coder or “builder” (the latter if you’re AWS’s Andy Jassy) has decided that an acceptable probability will be presented as the definitive result. But under the covers of ML, there is always a percentage likelihood. This means RCA will sometimes produce a selection of a few probable causes. Over time, the system will train itself on more data and probabilities and grow more accurate, leading to single outcomes where the root cause is clear.
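
A tiny illustration of that thresholding decision (the candidate causes and likelihoods below are made up, and real RCA models are far richer):

    # A classifier emits likelihoods, not certainties. The "builder" decides
    # when one likelihood is high enough to present as the definitive cause.
    def present_root_cause(probabilities, cutoff=0.85):
        """probabilities: dict of candidate cause -> model likelihood."""
        best = max(probabilities, key=probabilities.get)
        if probabilities[best] >= cutoff:
            return [best]  # confident enough to show a single root cause
        # Otherwise surface the top candidates for a human to review.
        ranked = sorted(probabilities, key=probabilities.get, reverse=True)
        return ranked[:3]

    print(present_root_cause({"db_lock": 0.48, "gc_pause": 0.37, "net": 0.15}))
    # -> ['db_lock', 'gc_pause', 'net']: no single cause clears the cutoff yet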

Once trust in RCA is established (stage one), and remediation actions are automated (stage two), it’s time to remove the manual operator from the middle. The low-risk remediations identified in stage two can now be connected to the specific root cause as a fully automated action.

The benefits of automated operations are often listed as cost reduction, productivity, availability, reliability and performance. While all of these apply, there’s also the significant benefit of expertise time. “The main upshot of automation is more free time to spend on improving other parts of the infrastructure,” according to Google’s SRE project. The less time your experts spend in MTTR steps, the more time they can spend on preemption rather than reaction.

Similar to DevOps, AIOps will require a new mindset. After a successful AIOps deployment, your team will be ready to transition from its existing siloed capabilities. Each team member’s current specialization(s) will need to be accompanied with broader skills in other operational silos.

AIOps augments each operations team, including ITOps, DevOps and SRE. By giving each team ample time to move into preemptive mode, AIOps ensures that applications, architectures and infrastructures are ready for the rapid transformations required by today’s business.

Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here: applying the same anomaly resolution processes used in production will assist developers in navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the benefits to time and resources still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as customers never visit them. Understanding performance changes between application updates is critical to a successful deployment. Remember, because the test and QA environments won’t have the production workload available, it’s best to recreate it with simulated workloads through synthetic testing.
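
A bare-bones synthetic driver can be as simple as replaying a representative request on a steady schedule. This sketch uses only Python’s standard library, and the endpoint is a placeholder:

    # Minimal synthetic workload: replay a representative transaction at a
    # steady rate so pre-production baselines have traffic to learn from.
    import time
    import urllib.request

    ENDPOINT = "https://qa.example.com/checkout"  # placeholder URL

    def run(requests_per_minute=30, minutes=60):
        interval = 60.0 / requests_per_minute
        for _ in range(requests_per_minute * minutes):
            start = time.time()
            try:
                with urllib.request.urlopen(ENDPOINT, timeout=10) as resp:
                    status = resp.status
            except OSError as exc:
                status = f"error: {exc}"
            print(f"{time.time() - start:.3f}s status={status}")
            time.sleep(max(0.0, interval - (time.time() - start)))

Real synthetic testing tools drive whole business transactions, not single URLs, but the principle is the same: steady, representative traffic for the ML to baseline.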

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.
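
A stripped-down version of time-based baselining, far simpler than a real AIOps model but enough to show the mechanics: learn per-hour-of-week normal ranges from history, then flag live samples that fall well outside them.

    # Stripped-down time-based baseline: learn per-hour-of-week normal
    # ranges from history, then flag live samples that fall outside them.
    import statistics
    from collections import defaultdict

    def build_baseline(history):
        """history: list of (weekday, hour, value) tuples."""
        buckets = defaultdict(list)
        for weekday, hour, value in history:
            buckets[(weekday, hour)].append(value)
        return {k: (statistics.mean(v), statistics.stdev(v))
                for k, v in buckets.items() if len(v) >= 2}

    def is_anomaly(baseline, weekday, hour, value, sigmas=3.0):
        if (weekday, hour) not in baseline:
            return False  # no learned "normal" yet for this time slot
        mean, stdev = baseline[(weekday, hour)]
        return abs(value - mean) > sigmas * max(stdev, 1e-9)

    history = [(0, 9, 900.0), (0, 9, 950.0), (0, 9, 1010.0)]  # weekday, hour, value
    baseline = build_baseline(history)
    print(is_anomaly(baseline, 0, 9, 4200.0))  # True: far outside Monday-9am normal

Because the buckets are keyed by time, the Monday-morning login spike trains its own “normal” and never pages anyone, which is exactly the point of time-based correlation.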

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify the root cause. By pinpointing this cause, we can move on to identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.

Later in your rollout when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention.  Part three of this series will discuss connecting this visibility and insight to action.

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The sheer quantity of metrics across development, operations and infrastructure makes operations a perfect partner for machine learning. Given this general acceptance of AI/ML, it is surprising that organizations are lagging in applying machine learning to operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied by monitoring of live streaming metrics to spot when the environment exhibits abnormal behavior.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data triggered by DevOps and Continuous Integration and Continuous Delivery (CI/CD) in the process of automating the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s, manual builds and testing were identified as the impediment, leading to DevOps and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—and particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, after which they would manually push through a solution. Standard operating procedures (SOPs) were introduced to prevent the operator from accidentally making a situation worse for recurring repairs. There were pockets of successful automation (e.g., tuning the network) but mostly the entire response was still reactive. AIOps is now stepping up to allow operations to survive in this complex environment, as DevOps did for the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet been addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally, and when did the anomaly begin?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.
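
In the terms used across this series, one simple decomposition (an approximation, since phases can overlap) is:

    MTTR = MTTD + MTTK + MTTFix + MTTV

Detection (MTTD) and knowing the root cause (MTTK) make up problem time; fixing (MTTFix) and verifying return-to-normal (MTTV) make up solution time. Measure each term separately, before and after deployment, so improvements can be attributed to the right phase of the rollout.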

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.

Cognition Engine Unifies AIOps and Application Intelligence

When we welcomed Perspica into the AppDynamics family in 2017, I knew we were going to change the application performance monitoring industry in a big way. And that’s why today is so important for us.

Earlier this morning, we launched Cognition Engine – the next evolution of application performance monitoring that will give customers new levels of insight for a competitive edge in today’s digital-first economy.

When our customers told us that they would spend hours – sometimes days and weeks – to identify the root cause of performance issues, we knew we needed to bring a product to market that would alleviate this pain. And with Cognition Engine, that’s precisely the goal.

You can think of Cognition Engine as a culmination of the best features we’ve brought to market in the past — coupled with new and cutting-edge diagnostic capabilities leveraging the latest in AI/ML technology made possible by our Perspica acquisition. Now, IT teams no longer have to chase symptoms to find the root cause because the top suspects are automatically surfaced.

This level of insight from Cognition completely changes the game for IT, freeing them of tedious tasks and empowering them to focus on projects that will have great business impact. Below are some of Cognition Engine’s core benefits and features:

Avoid Customer-Impacting Performance Issues with Anomaly Detection

Cognition Engine ingests, processes, and analyzes millions of records per second, automatically understanding how metrics correlate, and detecting problems within minutes – giving IT a head start on fixing the problem before it impacts customers.

  • Using ML models, Anomaly Detection automatically evaluates healthy behavior for your application so that you don’t have to manually configure health rules.
  • Get alerts for key Business Transactions to deliver swift diagnostics, root-cause analysis, and remediation down to the line of code, function, thread, or database causing problems.
  • Cognition evaluates data in real time as it enters the system using streaming analytics technology, allowing teams to analyze metrics and their associated behaviors to evaluate the health of the entire Business Transaction.

Achieve Fastest MTTR with Automated Root Cause Analysis

Cognition Engine automatically isolates metrics that deviate from normal behavior and presents the top suspects for the root cause of any application issue – drastically reducing the time spent identifying the root cause of performance issues.

  • Reduce MTTR from minutes to seconds by automating the knowledge of exactly where and when to initiate a performance fix.
  • Understand the contextual insights about application and business health, predict performance deviations, and get alerts before serious customer impact.
  • Self-learning agents take full snapshots of performance anomalies—including code, database calls, and infrastructure metrics—making it easy to determine root-cause.

What Cognition Engine Means for the Enterprise

Cognition Engine ultimately empowers enterprises to embrace an AIOps mindset – valuing proaction over reaction, answers over investigation and, most importantly, never losing focus on customer experience or business performance.

Learn more about Cognition Engine now.