The Power of Real-Time: How the On-Demand Revolution Is Changing Performance Monitoring

Today, thanks to companies like Uber — which recently had the biggest IPO of 2019 — along with the likes of Amazon, Airbnb, and more, consumers can order a ride, buy groceries, transfer money, or book a place to stay — all in just a few clicks within an app. In this new world, how brands engage with customers, and how they drive revenue, increasingly depends on the digital experiences powered by applications. In many ways, the app isn’t just a part of the business — it is the business.

While it’s true that this shift has afforded many brands the opportunity to build closer relationships with consumers, those relationships are at risk when user experience is poor and performance issues strike.

That’s because the on-demand revolution has altered people’s expectations for the customer experience.

Nowadays, people expect the apps they use each day to just work. And the users who depend on them aren’t content to wait minutes, hours, or days until a resolution is found. 

Sound harsh? Maybe so.

But in this new on-demand economy, the rules have changed. Consumers have more choices than ever before, and competition is fierce. For every great offering like Uber, there’s a strong alternative like Lyft. In these market conditions, experience and performance are compelling differentiators.

Why Real-Time Matters More to Your Performance Monitoring 

In the on-demand world, customer experiences happen in seconds.

If someone can’t book a ride because your app is slow, or make a purchase on your e-commerce site because a critical page in the checkout process won’t load, the battle for their attention — and their business — may already be lost. 

Yet many IT organizations still operate in a reactive mode, waiting for problems to be surfaced and losing valuable time (and revenue) in the process. In fact, according to recent research from AppDynamics, 58% of IT teams find out about performance issues from users calling or emailing their organization’s help desk, and 55% find out from an executive or non-IT colleague who informs the IT org.

So, what should businesses and IT leaders do to chart a course forward?

  • Leverage real-time insights to drive customer experience wins. Don’t wait for weekly business intelligence reports to tell you where you need to improve the customer experience. Use real-time insights to monitor performance in relationship to revenue, campaign conversion rates, and overall user engagement so you can make enhancements on the fly, and delight customers in the process.
  • Personalize experiences to drive value for the end user. You can have the best product in the world, but if your promotional codes don’t work, or your website won’t load, it won’t make a difference. To maximize your investment in digital experiences, you must track application performance as well.
  • Put application performance front and center. Leaders should be tracking application performance as it relates to specific lines of business on a real-time basis. This provides valuable context for prioritizing optimizations and helps you lay the groundwork for a proactive approach to running your production environment.

Businesses invest in sophisticated marketing activities, thoughtful product development, and value-driven sales efforts. But as applications become a critical part of the customer experience, investments must be made in performance optimization as well.

From Operating to Innovating: The Changing Performance Landscape

The on-demand revolution, prompted by the rise of companies like Uber, Airbnb, Amazon and others, has made instant gratification a part of our lives, and elevated the importance of the customer experience. In this new world, it’s not enough to manage performance reactively. Instead, businesses must take a proactive approach that helps them win and retain customers, and drive growth, all in real-time.

3 New Technologies Enterprises Should Consider for Advanced APM

The game remains the same, but the rules are changing for technologists in today’s digital enterprise. Cloud adoption hasn’t altered their commitment to end-to-end customer satisfaction, but it has made delivering on that commitment more complex from a performance and reliability perspective. That’s why the proliferation of cloud brings a corresponding increase in the importance of application performance monitoring, and in the value of strong alignment between technology and key customer touchpoints.

To handle the mounting complexity of application environments, and of the data those environments generate, enterprises are now shifting their focus to managing rapid data growth while maintaining service availability.

But this requires prioritizing new — but crucial — capabilities when it comes to performance monitoring. According to a new report from 451 Research, IT leaders should look for three key elements to better prepare themselves for this shift.

Here’s a quick summary of the report’s recommendations.

APM + ACI

From its research, 451 found that network admins are increasingly siloed, leading to application slowdowns and longer outage-resolution times. For the admins themselves, of course, that organizational structure makes it difficult to connect the dots in the data.

The solution, 451 Research says, is to integrate application-level data with that of the wider network. This is something we’ve been working on with Cisco ACI, for example. By correlating ACI’s network intelligence with our own visibility into applications and infrastructure, we can find performance problems related to issues in our customers’ networks.

At the same time, this solution gets application and network teams on the same page. With a shared understanding of how apps work together, they can solve issues faster and ultimately deliver positive user experiences.

Serverless Monitoring

It’s no surprise everyone’s so excited about serverless (or FaaS). Services like AWS Lambda, which commands a solid 70% share of the serverless market, are great for building and deploying apps with more agility and without worrying about the nuts and bolts of infrastructure.

But maintaining performance comes with its fair share of challenges. While still in its early days, third-party serverless architecture involves a trade-off: less control over system downtime, functionality, and unexpected platform limits.

The problem isn’t unsolvable. It simply poses new technical hurdles for APM vendors, along with exciting opportunities for innovation. According to 451 Research, vendors are working on innovative ways to bring their conventional agents into the serverless paradigm. Lambda monitoring, for example, can provide more visibility into application performance by tracing transactions end-to-end through the system architecture.
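As a loose illustration of the idea (not any vendor’s actual agent; all names here are hypothetical), a Lambda-style handler can be wrapped so that each invocation emits a timing record a tracing backend could stitch into an end-to-end transaction:

```python
import functools
import time
import uuid

TRACES = []  # stand-in for an APM backend that would receive trace records

def trace_handler(handler):
    """Wrap a Lambda-style handler so each invocation emits a trace record."""
    @functools.wraps(handler)
    def wrapper(event, context):
        record = {"trace_id": str(uuid.uuid4()), "function": handler.__name__}
        start = time.perf_counter()
        try:
            result = handler(event, context)
            record["status"] = "ok"
            return result
        except Exception:
            record["status"] = "error"
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            TRACES.append(record)  # ship the span to the monitoring backend
    return wrapper

@trace_handler
def checkout(event, context):
    # hypothetical business logic for an e-commerce checkout function
    return {"statusCode": 200, "total": sum(event["cart"])}
```

In a real deployment, the record would also carry a trace ID propagated from the caller, which is what lets a backend correlate per-function spans into a single end-to-end transaction.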

Machine Learning-Driven Monitoring

Advanced analytics has already enhanced user experiences across industries, from streaming (à la Netflix recommendations) to self-driving cars. Digital companies can now think bigger about technological innovation without managing all the complexities of reporting themselves.

The most recent use case underway in our space is machine learning-powered infrastructure monitoring.

By shifting to an artificial intelligence for IT operations (or AIOps) model, APM vendors can sift business-impacting problems from massive amounts of data not only in ways businesses can easily understand, but also in ways that prevent problems from happening in the future. In other words, machine learning can automate root cause analysis. For digital enterprises, that means increased agility in the face of potential downtime.

Want more details on the next wave of APM tech? Download the report

We’re barely scratching the surface of what these new technologies can do for performance monitoring. Data growth shows no signs of slowing any time soon, and customer expectations are rising just as quickly. The question is not if but when enterprises will keep pace, and how their technology will keep up with the level of service expected.

Check out the full report for 451 Research’s take on these three capabilities and how our own stack up against the competition (complete with a SWOT analysis you might find useful). You can download your copy of the report here.

Successfully Deploying AIOps, Part 3: The AIOps Apprenticeship

Part one of our series on deploying AIOps identified how an anomaly breaks into two broad areas: problem time and solution time. Part two described the first deployment phase, which focuses on reducing problem time. With trust in the AIOps system growing, we’re now ready for part three: taking on solution time by automating actions.

Applying AIOps to Mean Time to Fix (MTTFix)

The power of AIOps comes from continuous enhancement of machine learning powered by improved algorithms and training data, combined with the decreasing cost of processing power. A measured example was Google’s project for accurately reading street address numbers from its street image systems—a necessity in countries where address numbers don’t run sequentially but rather are based on the age of the buildings. Humans examining photos of street numbers have an accuracy of 98%. Back in 2011, the available algorithms and training data produced a trained model with 91% accuracy. By 2013, improvements and retraining boosted this number to 97.5%. Not bad, though humans still had the edge. In 2015, the latest ML models passed human capability at 98.1%. This potential for continuous enhancement makes AIOps a significant benefit for operational response times.

You Already Trust AI/ML with Your Life

If you’ve flown commercially in the past decade, you’ve trusted the autopilot for part of that flight. At some major airports, even the landings are automated, though taxiing is still left to pilots. Despite already trusting AI/ML to this extent, enterprises need more time to trust AI/ML in newer fields such as AIOps. Let’s discuss how to build that trust.

Apprenticeships allow new employees to learn from experienced workers and avoid making dangerous mistakes. They’ve been used for ages in multiple professions; even police departments have a new academy graduate ride along with a veteran officer. In machine learning, ML frameworks need to see meaningful quantities of data in order to train themselves and create nested neural networks that form classification models. By treating automation in AIOps like an apprenticeship, you can build trust and gradually weave AIOps into a production environment.

By this stage, you should already be reducing problem time by deploying AIOps, which delivers significant benefits before adding automation to the mix. These advantages include the ability to train the model with live data, as well as observe the outcomes of baselining. This is the first step towards building trust in AIOps.

Stage One: AIOps-Guided Operations Response

With AIOps in place, operators can address anomalies immediately. At this stage, operations teams are still reviewing anomaly alerts to ensure their validity. Operations is also parsing the root cause(s) identified by AIOps to select the correct issue to address. While remediation is manual at this stage, you should already have a method of tracking common remediations.

In stage one, your operations teams oversee the AIOps system and simultaneously collect data to help determine where auto-remediation is acceptable and necessary.

Stage Two: Automate Low Risk

Automated computer operations began around 1964, when IBM’s OS/360 operating system let operators combine multiple individual commands into a single script, turning a series of manual steps into one command. Initially, the goal was to identify specific, recurring manual tasks and figure out how to automate them. While this approach delivered a short-term benefit, building isolated, automated processes incurred technical debt, both for future updates and for eventual integration across multiple domains. Ultimately it became clear that a platform approach to automation could reduce potential tech debt.

Automation in the modern enterprise should be tackled like a microservices architecture: Use a single domain’s management tool to automate small actions, and make these services available to complex, cross-domain remediations. This approach allows your investment in automation to align with the lifespan of the single domain. If your infrastructure moves VMs to containers, the automated services you created for networking or storage are still valid.
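A minimal sketch of that platform approach (all action names and behaviors below are invented for illustration): small, single-domain actions are registered once as reusable services, then composed into cross-domain remediations:

```python
from typing import Callable, Dict, List

# Registry of small, single-domain remediation actions. Each action is a
# service that larger, cross-domain runbooks can reuse.
ACTIONS: Dict[str, Callable[[dict], str]] = {}

def action(name: str):
    """Decorator that registers a remediation action under a domain-scoped name."""
    def register(fn):
        ACTIONS[name] = fn
        return fn
    return register

@action("network.increase_bandwidth")
def increase_bandwidth(ctx: dict) -> str:
    return f"bandwidth raised to {ctx.get('target_gbps', 10)} Gbps"

@action("compute.add_web_server")
def add_web_server(ctx: dict) -> str:
    return f"web server added to pool {ctx.get('pool', 'frontend')}"

def run_remediation(steps: List[str], ctx: dict) -> List[str]:
    """Execute a cross-domain remediation as a sequence of small actions."""
    return [ACTIONS[name](ctx) for name in steps]
```

Because each action lives in one domain’s scope, swapping VMs for containers would invalidate only the compute actions; the network and storage services remain usable, which is the lifespan alignment described above.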

You will not automate every single task. Selecting what to automate can be tricky, so when deciding whether to fully automate an anomaly resolution, use these five questions to identify the potential value:

  • Frequency: Does the anomaly resolution occur often enough to warrant automation?
  • Impact: Are you automating the solution to a major issue?
  • Coverage: What proportion of the real-world process can be automated?
  • Probability: Does the process always produce the desired result, or can it be affected by environmental factors?
  • Latency: Will automating the task achieve a faster resolution?

Existing standard operating procedures (SOPs) are a great place to start. With SOPs, you’ve already decided how you want a task performed, have documented the process, and likely have some form of automation (scripts, etc.) in place. Another early focus is to address resource constraints by adding front-end web servers when traffic is high, or by increasing network bandwidth. Growing available resources is low risk compared to restarting applications. While bandwidth expansion may impact your budget, it’s unlikely to break your apps. And by automating resource constraint remediations, you’re adding a rapid response capability to operations.

In stage two, you augment your operations teams with automated tasks that can be triggered in response to AIOps-identified anomalies.

Stage Three: Connect Visibility to Action (Trust!)

As you start to use automated root cause analysis (RCA), it’s critical to understand the probabilistic nature of machine learning. Perhaps surprisingly for a classical computing technology, ML does not output a binary 0-or-1 result; it produces statistical likelihoods, or probabilities, of an outcome. The reason an outcome sometimes looks definitive is that a coder or “builder” (the latter if you’re AWS’s Andy Jassy) has decided that an acceptable probability will be presented as the definitive result. But under the covers of ML, there is always a percentage likelihood. This means RCA sometimes will surface a selection of a few probable causes. Over time, the system will train itself on more data and probabilities and grow more accurate, leading to single outcomes where the root cause is clear.
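To make the probability point concrete, here is a toy sketch (causes, probabilities, and the cutoff are all invented) of how a builder-chosen threshold turns ranked likelihoods into what looks like a single definitive answer:

```python
def pick_root_cause(probabilities: dict, cutoff: float = 0.8):
    """Return one cause if it clears the cutoff, else the ranked candidate list."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    top_cause, top_p = ranked[0]
    if top_p >= cutoff:
        return top_cause   # presented to the operator as "the" root cause
    return ranked          # a short list of probable causes, as described above

# Early in training, likelihoods are spread across several candidates;
# with more data, one candidate dominates.
early = {"db_connection_pool": 0.45, "gc_pause": 0.35, "network": 0.20}
mature = {"db_connection_pool": 0.92, "gc_pause": 0.05, "network": 0.03}
```

The `cutoff` value is exactly the builder’s judgment call mentioned above: raise it and the system admits uncertainty more often; lower it and more answers look definitive.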

Once trust in RCA is established (stage one), and remediation actions are automated (stage two), it’s time to remove the manual operator from the middle. The low-risk remediations identified in stage two can now be connected to the specific root cause as a fully automated action.

The benefits of automated operations are often listed as cost reduction, productivity, availability, reliability and performance. While all of these apply, there’s also the significant benefit of expertise time. “The main upshot of automation is more free time to spend on improving other parts of the infrastructure,” according to Google’s SRE project. The less time your experts spend in MTTR steps, the more time they can spend on preemption rather than reaction.

Similar to DevOps, AIOps will require a new mindset. After a successful AIOps deployment, your team will be ready to transition from its existing siloed capabilities. Each team member’s current specialization(s) will need to be accompanied with broader skills in other operational silos.

AIOps augments each operations team, including ITOps, DevOps and SRE. By giving each team ample time to move into preemptive mode, AIOps ensures that applications, architectures and infrastructures are ready for the rapid transformations required by today’s business.

Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here. Applying the anomaly resolution processes seen in production will assist developers navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the benefits to time and resources still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, as the test and QA environments will not have the production workload available, it’s best to recreate simulated workloads through synthetics testing.

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.
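The time-based side of baselining can be illustrated with a toy model that learns a per-hour normal range from history and flags live values outside it. Real AIOps baselining is far richer than this mean-and-deviation sketch, and all the numbers below are invented:

```python
import statistics
from collections import defaultdict

class HourlyBaseline:
    """Learn a per-hour 'normal' range from history, then flag outliers."""

    def __init__(self):
        self.history = defaultdict(list)

    def observe(self, hour: int, value: float):
        self.history[hour].append(value)

    def is_anomaly(self, hour: int, value: float, k: float = 3.0) -> bool:
        samples = self.history[hour]
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)
        return abs(value - mean) > k * stdev

baseline = HourlyBaseline()
for day in range(14):                 # two weeks of training data
    baseline.observe(9, 1000 + day)   # the 9 a.m. login spike is "normal"
    baseline.observe(3, 50 + day)     # overnight traffic is low

spike_at_9 = baseline.is_anomaly(9, 1010)   # within the learned 9 a.m. range
spike_at_3 = baseline.is_anomaly(3, 1010)   # far outside the overnight norm
```

This is why the morning login spike raises no alert while the identical traffic level at 3 a.m. does: “normal” is defined relative to the time context, not as a single static threshold.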

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move onto identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
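The return-to-normal check can be sketched as follows (the band, stream, and threshold are illustrative): once the metric holds inside its baseline band for a few consecutive samples, the anomaly is declared over:

```python
def anomaly_cleared(stream, low, high, required=3):
    """Return the index at which `required` consecutive in-band samples end,
    or None if the metric never settles back into its baseline band."""
    run = 0
    for i, value in enumerate(stream):
        run = run + 1 if low <= value <= high else 0
        if run >= required:
            return i
    return None

# Latency samples during and after an incident; the baseline is 100-200 ms.
latency_ms = [480, 510, 160, 450, 140, 150, 155, 148]
cleared_at = anomaly_cleared(latency_ms, low=100, high=200)
```

Requiring several consecutive in-band samples avoids prematurely closing the anomaly on a single lucky reading, which matters once remediation is fully automated and no human is double-checking the all-clear.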

Later in your rollout, when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention. Part three of this series will discuss connecting this visibility and insight to action.

Successfully Deploying AIOps, Part 1: Deconstructing MTTR

Somewhere between waking up today and reading this blog post, AI/ML has done something for you. Maybe Netflix suggested a show, or DuckDuckGo recommended a website. Perhaps it was your photos application asking you to confirm the tag of a specific friend in your latest photo. In short, AI/ML is already embedded into our lives.

The sheer quantity of metrics across development, operations and infrastructure makes operations a perfect partner for machine learning. Given this general acceptance of AI/ML, it is surprising that organizations are lagging in implementing machine learning for operations automation, according to Gartner.

The level of responsibility you will assign to AIOps and automation comes from two factors:

  • The level of business risk in the automated action
  • The observed success of AI/ML matching real world experiences

The good news is this is not new territory; there is a tried-and-true path for automating operations that can easily be adjusted for AIOps.

It Feels Like Operations is the Last to Know

The primary goal of the operations team is to keep business applications functional for enterprise customers or users. They design, “rack and stack,” monitor performance, and support infrastructure, operating systems, cloud providers and more. But their ability to focus on this prime directive is undermined by application anomalies that consume time and resources, reducing team bandwidth for preemptive work.

An anomaly deviates from what is expected or normal. A crashing application is clearly an anomaly, yet so too is one that was updated and now responds poorly or inconsistently. Detecting an anomaly requires a definition of “normal,” accompanied by monitoring of live streaming metrics to spot when the environment exhibits abnormal behavior.

The majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem, according to a recent AppDynamics survey of 6,000 global IT leaders. This disappointing outcome can be traced to three trends:

  • Exponential growth of uncorrelated log and metric data triggered by DevOps and Continuous Integration and Continuous Delivery (CI/CD) in the process of automating the build and deployment of applications.
  • Exploding application architecture complexity with service architectures, multi-cloud, serverless, isolation of system logic and system state—all adding dynamic qualities defying static or human visualization.
  • Siloed IT operations and operational data within infrastructure teams.

Complexity and data growth overload development, operations and SRE professionals with data rather than insight, while siloed data prevents each team from seeing the full application anomaly picture.

Enterprises adopted agile development methods in the early 2000s to wash away the time and expense of waterfall approaches. This focus on speed came with technical debt and lower reliability. In the mid-2000s, manual builds and testing were identified as the impediment, a realization that led to DevOps and later to CI/CD.

DevOps allowed development to survive agile and extreme approaches by transforming development—and particularly by automating testing and deployment—while leaving production operations basically unchanged. The operator’s role in maintaining highly available and consistent applications still consisted of waiting for someone or something to tell them a problem existed, after which they would manually push through a solution. Standard operating procedures (SOPs) were introduced to prevent the operator from accidentally making a situation worse for recurring repairs. There were pockets of successful automation (e.g., tuning the network) but mostly the entire response was still reactive. AIOps is now stepping up to allow operations to survive in this complex environment, as DevOps did for the agile transformation.

Reacting to Anomalies

DevOps automation removed a portion of production issues. But in the real world there’s always the unpredictable SQL query, API call, or even the forklift driving through the network cable. The good news is that the lean manufacturing approach that inspired DevOps can be applied to incident management.

To understand how to deploy AIOps, we need to break down the “assembly line” used to address an anomaly. The time spent reacting to an anomaly can be broken into two key areas: problem time and solution time.

Problem time: The period when the anomaly has not yet been addressed.

Anomaly management begins with time spent detecting a problem. The AppDynamics survey found that 58% of enterprises still find out about performance issues or full outages from their users. Calls arrive and service tickets get created, triggering professionals to examine whether there really is a problem or just user error. Once an anomaly is accepted as real, the next step generally is to create a war room (physical or Slack channel), enabling all the stakeholders to begin root cause analysis (RCA). This analysis requires visibility into the current and historical system to answer questions like:

  • How do we recreate the timeline?
  • When did things last work normally, or when did the anomaly begin?
  • How are the application and underlying systems currently structured?
  • What has changed since then?
  • Are all the errors in the logs the result of one or multiple problems?
  • What can we correlate?
  • Who is impacted?
  • Which change is most likely to have caused this event?

Answering these questions leads to the root cause. During this investigative work, the anomaly is still active and users are still impacted. While the war room is working tirelessly, no action to actually rectify the anomaly has begun.

Solution time: The time spent resolving the issues and verifying return-to-normal state.

With the root cause and impact identified, incident management finally crosses over to spending time on the actual solution. The questions in this phase are:

  • What will fix the issue?
  • Where are these changes to be made?
  • Who will make them?
  • How will we record them?
  • What side effects could there be?
  • When will we do this?
  • How will we know it is fixed?
  • Was it fixed?

Solution time is where we solve the incident rather than merely understanding it. Mean time to resolution (MTTR) is the key metric we use to measure the operational response to application anomalies. After deploying the fix and verifying return-to-normal state, we get to go home and sleep.

Deconstructing MTTR

MTTR originated in the hardware world as “mean time to repair”— the full time from error detection to hardware replacement and reinstatement into full service (e.g., swapping out a hard drive and rebuilding the data stored on it). In the software world, MTTR is the time from software running abnormally (an anomaly) to the time when the software has been verified as functioning normally.

Measuring the value of AIOps requires breaking MTTR into subset components. Different phases in deploying AIOps will improve different portions of MTTR. Tracking these subdivisions before and after deployment allows the value of AIOps to be justified throughout.
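As a worked example of that subdivision (all timestamps invented for illustration), MTTR for a single incident can be tracked as the sum of its component intervals, matching the series’ MTTD, MTTK, MTTFix and MTTV breakdown:

```python
from datetime import datetime

# Illustrative timeline for one incident, from anomaly start to verified fix.
incident = {
    "start":    datetime(2019, 6, 1, 9, 0),   # anomaly begins
    "detected": datetime(2019, 6, 1, 9, 20),  # alert raised
    "known":    datetime(2019, 6, 1, 10, 5),  # root cause identified
    "fixed":    datetime(2019, 6, 1, 10, 35), # remediation deployed
    "verified": datetime(2019, 6, 1, 10, 45), # back-to-normal confirmed
}

def minutes(a: str, b: str) -> float:
    """Elapsed minutes between two named incident milestones."""
    return (incident[b] - incident[a]).total_seconds() / 60

breakdown = {
    "MTTD": minutes("start", "detected"),
    "MTTK": minutes("detected", "known"),
    "MTTFix": minutes("known", "fixed"),
    "MTTV": minutes("fixed", "verified"),
}
mttr = sum(breakdown.values())
```

Averaging these per-incident breakdowns before and after an AIOps rollout shows exactly which interval each deployment phase shortened, which is how the investment gets justified over time.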

With this understanding and measurement of existing processes, the strategic adoption of AIOps can begin, which we discuss in part two of this series.

Cognition Engine Unifies AIOps and Application Intelligence

When we welcomed Perspica into the AppDynamics family in 2017, I knew we were going to change the application performance monitoring industry in a big way. And that’s why today is so important for us.

Earlier this morning, we launched Cognition Engine – the next evolution of application performance monitoring that will give customers new levels of insight for a competitive edge in today’s digital-first economy.

When our customers told us that they would spend hours – sometimes days and weeks – to identify the root cause of performance issues, we knew we needed to bring a product to market that would alleviate this pain. And with Cognition Engine, that’s precisely the goal.

You can think of Cognition Engine as a culmination of the best features we’ve brought to market in the past — coupled with new and cutting-edge diagnostic capabilities leveraging the latest in AI/ML technology made possible by our Perspica acquisition. Now, IT teams no longer have to chase symptoms to find the root cause because the top suspects are automatically surfaced.

This level of insight from Cognition completely changes the game for IT, freeing them of tedious tasks and empowering them to focus on projects that will have great business impact. Below are some of Cognition Engine’s core benefits and features:

Avoid Customer-Impacting Performance Issues with Anomaly Detection

Cognition Engine ingests, processes, and analyzes millions of records per second, automatically understanding how metrics correlate, and detecting problems within minutes – giving IT a head start on fixing the problem before it impacts customers.

  • Using ML models, Anomaly Detection automatically evaluates healthy behavior for your application so that you don’t have to manually configure health rules.
  • Get alerts for key Business Transactions to deliver swift diagnostics, root-cause analysis, and remediation down to the line of code, function, thread, or database causing problems.
  • Cognition evaluates data in real time as it enters the system using streaming analytics technology, allowing teams to analyze metrics and their associated behaviors to evaluate the health of the entire Business Transaction.
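The core idea behind baseline-driven anomaly detection can be sketched in a few lines. This is a generic illustration, not Cognition Engine's actual ML models: a metric is flagged when its latest value deviates from its rolling baseline by more than a few standard deviations.

```python
import statistics

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it deviates from the rolling baseline of
    `history` by more than `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Response times (ms) hover around 100ms, then spike
baseline = [101, 99, 102, 98, 100, 103, 97, 100]
print(is_anomalous(baseline, 100))  # normal reading
print(is_anomalous(baseline, 450))  # spike well outside the baseline
```

Production systems learn baselines continuously and per time-of-day, but the principle, comparing live values against learned normal behavior instead of hand-tuned thresholds, is the same one that removes the need to manually configure health rules.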

Achieve Faster MTTR with Automated Root Cause Analysis

Cognition Engine automatically isolates metrics that deviate from normal behavior and presents the top suspects of root cause for any application issue – drastically reducing the time spent identifying the root cause of performance issues.

  • Reduce MTTR from minutes to seconds by automating the knowledge of exactly where and when to initiate a performance fix.
  • Gain contextual insights into application and business health, predict performance deviations, and get alerts before customers are seriously impacted.
  • Self-learning agents take full snapshots of performance anomalies—including code, database calls, and infrastructure metrics—making it easy to determine root cause.
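The "top suspects" idea described above can be sketched generically: score every metric by how far it deviates from its own baseline during the incident window, and surface the largest deviations first. The metric names and scoring below are illustrative, not AppDynamics' actual ranking algorithm.

```python
import statistics

def top_suspects(metrics, n=3):
    """Rank metrics by deviation from their own baseline (measured in
    standard deviations) and return the top `n` as root-cause suspects.

    `metrics` maps a metric name to (baseline_samples, incident_value)."""
    scores = {}
    for name, (history, value) in metrics.items():
        stdev = statistics.pstdev(history) or 1.0  # avoid division by zero
        scores[name] = abs(value - statistics.mean(history)) / stdev
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical incident: checkout latency and GC pauses spike, DB is normal
observed = {
    "checkout.response_ms": ([100, 102, 98, 100], 480),
    "db.connections":       ([50, 52, 48, 50], 51),
    "jvm.gc_pause_ms":      ([5, 6, 5, 4], 90),
}
print(top_suspects(observed, n=2))
```

Normalizing by each metric's own variability is what lets a small GC-pause spike outrank a numerically larger but routine fluctuation elsewhere.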

What Cognition Engine Means for the Enterprise

Cognition Engine ultimately empowers enterprises to embrace an AIOps mindset – valuing proaction over reaction, answers over investigation and, most importantly, never losing focus on customer experience or business performance.

Learn more about Cognition Engine now.

The New Serverless APM for AWS Lambda

To control costs and reduce the burden of infrastructure management in the cloud, more companies are using services like AWS Lambda to deploy serverless functions. Due to the unpredictable nature of end-user demand in today’s digital-first world, serverless functions that can be spun up as needed can also help resolve unplanned scaling issues.

But that’s not to say these serverless workloads don’t impact the overall performance of your application environment. In fact, since these workloads are transient in nature, they represent a real challenge for teams who need to correlate an issue across their application environment, or see the impact that serverless applications are having on end users—or even on the business itself.

How AppDynamics Helps

Today, we’re announcing a new family of application agents that help our customers who use serverless microservices gain more visibility and insight into the performance of their application and its impact on the broader ecosystem.

In the same way that we collect and baseline metrics and events for traditional applications, we can now help serverless users gain deep insight into response times, throughput, and exception rates in applications built on any mixture of serverless and conventional runtimes. This brings our industry-leading ability to visualize end-user and business impact into the serverless realm, helping teams prioritize issue-resolution efforts and optimize the performance of these ephemeral workloads.

What We Do

The first iteration of AppDynamics’ Serverless Agent family targets Java microservices running in AWS Lambda, and is available as a beta program for qualified customers. Here’s how it works:

The Serverless APM for AWS Lambda allows our customers to instrument their Lambda code at entry (when it is invoked from an external request source) and at exit (when it invokes an external downstream service), and to ingest incoming or populate outgoing correlation headers. Our streamlined approach to collecting metrics and events from serverless functions also means you never have to worry about missing an important data point, or about slowing down your otherwise healthy serverless functions.
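The entry/exit instrumentation pattern can be illustrated with a generic Python wrapper. This is a sketch of the general technique, not the AppDynamics agent API: the header name `X-Correlation-Id` and the `instrument` decorator are stand-ins for illustration only.

```python
import functools
import time
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # illustrative name, not AppDynamics' header

def instrument(handler):
    """Wrap a Lambda-style handler: pick up an incoming correlation id
    (or mint a new one), time the invocation, and propagate the id
    downstream by stamping it on the response."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        headers = event.get("headers") or {}
        corr_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
        start = time.monotonic()
        try:
            response = handler(event, context)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # a real agent would report elapsed_ms, throughput, and errors here
        response.setdefault("headers", {})[CORRELATION_HEADER] = corr_id
        return response
    return wrapper

@instrument
def handler(event, context=None):
    return {"statusCode": 200, "body": "ok"}

resp = handler({"headers": {"X-Correlation-Id": "abc-123"}})
```

Carrying the same correlation id across conventional and serverless hops is what lets a transient Lambda invocation be stitched into a single end-to-end Business Transaction.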

You can find out more and sign up for AppDynamics’ AWS Lambda beta program on our community site.

Our Vision for AIOps: The Central Nervous System for IT

Exactly two years ago, Cisco announced its intent to acquire AppDynamics, and to say that it’s been quite a ride is a huge understatement.

Since the acquisition, we welcomed Perspica to the family to enrich our machine learning capabilities, expanded product coverage into areas like Business iQ, .NET Core, Kubernetes, SAP, and Mainframe, and leveraged new routes to market through Cisco and our partner programs – all of which helped us accelerate the hyper growth in our business. It’s been an amazing journey that has increased our workforce by 50% and made us one of Glassdoor’s Best Places to Work in 2019.

But while there has been a lot of change, there has always been one constant: Commitment to our customers. Together with Cisco, our mission is to empower Agents of Transformation – great leaders who have the ambition and determination to drive positive change for their customers, and in turn, their organizations, teams, and personal careers.

AppDynamics has been empowering Agents of Transformation since the day we were founded, and with Cisco, our ability to inspire change has been multiplied. That’s why today, I couldn’t be more excited to share the next chapter in the AppDynamics and Cisco story: The Central Nervous System – our vision for AIOps.

AIOps is a Mindset

AIOps enables organizations to leverage artificial intelligence and machine learning to derive real-time insights and begin automating tasks to augment technology operations teams.

But much like DevOps, AIOps will require an internal cultural change – a mindset shift for teams to move away from siloed monitoring tools and reject the notion of emergency war rooms.

When teams embrace an AIOps mindset, endless debugging tasks become a thing of the past. AI-based systems help identify root cause, predict performance, recommend optimizations, and automate fixes in real time. The time once spent on mundane tasks can then be refocused on driving new innovation for the business.

The Central Nervous System

A critical element of embracing the AIOps mindset is to have a platform that can take input from various data sources, analyze it, and automate action in real-time.

Similar to how the central nervous system takes input from all the senses and coordinates action throughout the human body, the Cisco and AppDynamics AIOps strategy is to deliver the “Central Nervous System” for IT operations. This gives customers broader visibility of their complex environments, derives AI-based insights, and automates IT tasks to free up resources to drive new innovation.

Bringing the Central Nervous System to Life

For a system of intelligence to work effectively, it needs to understand how application performance impacts business outcomes and customer experience – and that’s exactly what our Business iQ solution makes possible.

Now, powered by the robust data set generated by our APM and Business iQ offerings, the AppDynamics Cognition Engine brings real-time insights to mission-critical application and business performance, using machine learning to go beyond problem detection to root cause identification. Once the root cause is identified and exposed through an API, IT teams can start to develop an automation framework for faster remediation and resource optimization.

And then there’s Cisco – a critical piece needed to power the Central Nervous System. It starts with the breadth and depth of the data. Cisco connects and monitors billions of network devices, lights up data centers for hundreds of thousands of customers, blocks over 20 billion security threats per day, and collects hundreds of trillions of application metrics per year. With a rich automation roadmap in place, bringing this massive data set together for cross-domain correlation with machine learning and AI will deliver insights that no other company can provide.

But that’s not all. Cisco’s diverse partner ecosystem will also help develop innovative offers and scale go-to-market efforts. And it’s all of these elements that will fuel the AIOps journey for our customers.

Empowering Agents of Transformation

Together with Cisco, we’re committed to helping our customers at every stage of their AIOps journey. We want to empower great leaders to drive real business transformation and make them Agents of Transformation for their organization and their industry. And with an AIOps mindset that values prediction over reaction, answers over investigation, and actions over analysis – I know that we will.

How AppDynamics’ Diverse Partner Ecosystem Helps Power the Central Nervous System

Today, we announced the Cisco and AppDynamics AIOps strategy to deliver the “Central Nervous System” for IT operations, which is built on three core pillars:

  • Give customers broad visibility of their complex environments
  • Derive AI-based insights
  • Take action on these insights to optimize IT environments and automate tasks, freeing up resources to drive new innovation

In this post, we’ll dive into the third pillar, Action, and explain four use cases for how AppDynamics integrates with strategic partners to act on insights and automate IT tasks through incident response, event correlation, workload optimization, and communication facilitation.

Incident Response

Most incidents are the result of a change, and understanding and resolving the incident is the most critical part of operations – especially as we layer on levels of abstraction. The complexity of modern systems is beyond what the human mind can track, making it critical for IT teams to automate what they can to reduce the likelihood of human error.

To help IT teams get their apps back up and running as quickly as possible when incidents occur, AppDynamics integrates with typical workflows to manage incidents, problems, and changes including ServiceNow ITSM, Cherwell, BMC Remedy, and configuration management systems such as Evolven.

By combining AppDynamics’ granular visibility of applications with incident management capabilities, teams can triage performance-impacting events before customers are affected.

Event Correlation

Today’s IT teams are inundated with monitoring tools, according to a recent poll conducted at Gartner’s 2018 IT Infrastructure, Operations & Cloud Strategies Conference. Of the more than 200 respondents, 35% said they had over 30 monitoring tools – and this overload only grows over time, with each tool generating its own alerts and making it harder to find the true root cause of issues.

Event Management tools were created to help with this challenge by correlating and analyzing alerts from these disparate systems. AppDynamics integrates with the most commonly used Event Management systems, including ServiceNow Event Management and MoogSoft, which imports AppDynamics’ topological information (like flow maps and business transactions) to make more informed correlation decisions.

AppDynamics can also create events based on our machine learning baselines and anomaly detection. Paired with health rule violations, these provide more substantive alerts that show user experience degradation, which often leads to customer complaints.
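The basic mechanics of event correlation can be sketched generically: alerts from different tools that concern the same service and arrive close together in time are folded into one incident. This is an illustrative sketch, not how ServiceNow Event Management or Moogsoft actually correlate; the field names and five-minute window are assumptions.

```python
def correlate(alerts, window_s=300):
    """Group raw alerts into correlated incidents: alerts for the same
    service arriving within `window_s` seconds of the incident's latest
    alert are treated as one incident."""
    incidents = []
    open_incident = {}  # service name -> index of its open incident
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        svc = alert["service"]
        idx = open_incident.get(svc)
        if idx is not None and alert["ts"] - incidents[idx][-1]["ts"] <= window_s:
            incidents[idx].append(alert)  # same ongoing incident
        else:
            incidents.append([alert])     # new incident for this service
            open_incident[svc] = len(incidents) - 1
    return incidents

# Hypothetical alert stream from two tools (timestamps in seconds)
raw = [
    {"ts": 0,   "service": "checkout", "source": "APM",     "msg": "latency spike"},
    {"ts": 40,  "service": "checkout", "source": "network", "msg": "packet loss"},
    {"ts": 90,  "service": "billing",  "source": "APM",     "msg": "error rate"},
    {"ts": 900, "service": "checkout", "source": "APM",     "msg": "latency spike"},
]
groups = correlate(raw)
```

Real event management systems correlate on much richer signals (topology from flow maps, shared hosts, Business Transactions), but the payoff is the same: one actionable incident instead of a flood of duplicate alerts.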

Workload Optimization

Under every application there are various layers of infrastructure, and new layers are constantly being added on top of them.

For example, in most enterprise data centers, we have physical servers with a virtualization layer on top, and increasingly a private cloud or orchestrated containers implemented on top of that. These systems allow for easier deployment, scalability, and management – but this convenience comes at a cost: the added complexity makes it difficult to ensure the technologies are delivering the right business outcomes.

To help address this challenge, the AppDynamics platform integrates with and monitors technologies such as Pivotal Cloud Foundry, RedHat OpenShift, and the open source Kubernetes platform. Many of these systems can also take telemetry in the form of metrics and events from AppDynamics to make better decisions on how to scale, when to scale, and what the results are. Additionally, AppDynamics integrates with Turbonomic’s platform to help customers optimize and orchestrate workloads on virtualized servers, private clouds, and public cloud environments.
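The kind of scaling decision such platforms can make from APM telemetry can be sketched generically. The thresholds, metric names, and policy below are illustrative assumptions, not how Kubernetes, Turbonomic, or AppDynamics actually compute scaling targets:

```python
def scale_decision(current_replicas, avg_response_ms, error_rate,
                   target_ms=200, max_errors=0.05,
                   min_replicas=2, max_replicas=20):
    """Recommend a replica count from APM telemetry: scale out when
    response time or error rate breaches its target, scale in when the
    service is comfortably under target, otherwise hold steady."""
    if avg_response_ms > target_ms or error_rate > max_errors:
        desired = current_replicas + max(1, current_replicas // 2)  # scale out ~50%
    elif avg_response_ms < target_ms * 0.5 and error_rate < max_errors / 2:
        desired = current_replicas - 1                              # scale in gently
    else:
        desired = current_replicas
    return max(min_replicas, min(max_replicas, desired))

print(scale_decision(4, avg_response_ms=450, error_rate=0.01))   # breach: scale out
print(scale_decision(4, avg_response_ms=80,  error_rate=0.001))  # idle: scale in
```

Driving decisions like this from user-facing metrics (response time, error rate) rather than raw CPU is what ties the scaling behavior back to business outcomes.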

As organizations continue to invest heavily in the cloud, workload optimization is more critical than ever. According to a Forrester Consulting survey with over 700 respondents, 86% said their organization has a multi-cloud strategy, and almost half of enterprises report at least $50 million in annual cloud spending. These statistics make one thing clear: multi-cloud or hybrid management is not an option, but a requirement.

Facilitate Communication

Many organizations are implementing new ways to move code to production faster through the use of open source systems such as Spinnaker or Jenkins, or more advanced commercial offerings such as Microsoft TFS, Gitlab, CircleCI, or Harness. AppDynamics has integrations with these products to automatically tag and track when code is pushed to systems.

However, as teams accelerate their deployments, this can often result in overlooking processes that ensure additional checks, such as a handoff for testing and verification. In fact, in the 12th annual State of Agile Survey, while the use of continuous deployment increased from 35% in 2017 to 37% in 2018, continuous integration dropped from 61% to 54%.

As a result, this lack of integration can lead to high business risk. To avoid this, teams must improve the way they communicate, coordinate, and execute on any issues. Since each organization has different structures with varying roles and expertise, matching up the right experts to resolve the incident is often a requirement.

With the Central Nervous System, we integrate with solutions such as PagerDuty, xMatters, and OpsGenie to help facilitate communication and ensure the coordination of subject matter experts or owners. Ownership and accountability are key elements when going through cultural change and implementing one of the most important of the Three Ways of DevOps: feedback and collaboration.

Thank you, Partners!

AppDynamics is proud to work with such an amazing breadth of partners who help us optimize the IT environment by using orchestration and automation systems for intelligent workload placement, cloud cost optimization, incident response, and even security enforcement.
And as we continue to innovate on our offerings and solutions for our customers, I can’t wait to see what other partners will join the AppDynamics partner ecosystem.

AppDynamics and Cisco To Host Virtual Event on AIOps and APM
To mark the two-year anniversary of Cisco’s intent to acquire AppDynamics, the worldwide leader in IT, networking, and cybersecurity solutions will join AppDynamics for a one-of-a-kind virtual launch event on January 23, 2019. At AppDynamics Transform: AIOps and the Future of Performance Monitoring, David Wadhwani, CEO of AppDynamics, will share what’s next for the two companies and lead a lively discussion with Cisco executives, Okta’s Chief Information Officer, Mark Settle, and Nancy Gohring, Senior Analyst at 451 Research. At the event, we’ll talk through the challenges leaders face and how they’re preparing for the future of performance monitoring.

Technology Leaders to Weigh In On the Impact of AI and the Future of Performance Monitoring

Today, application infrastructure is increasingly complex. Organizations are building and monitoring public, private, and hybrid cloud infrastructure alongside microservices and third party integrations. And while these developments have made it easier for businesses to scale quickly, they’ve introduced a deluge of data into the IT environment, making it challenging to identify issues and resolve them quickly.

APM solutions like AppDynamics continue to lead the way when it comes to providing real-time business insights to power mission critical business decisions. However, recent research has revealed a potential blind spot for IT teams: A massive 91% of global IT leaders say that monitoring tools only provide data on the performance of their own area of responsibility. For IT teams that want to mitigate risk as a result of performance problems, and business leaders who want to protect their bottom line, this blind spot represents a huge opportunity for improvement.

The Next Chapter in the AppDynamics and Cisco Story

As application environments continue to grow in complexity, so does the need for more comprehensive insight into performance. But technology infrastructure is simply too large and too dynamic for IT operations teams to manage manually. Automation for remediation and optimization is key – and that’s where innovations in artificial intelligence (AI) have the potential to make a huge difference in monitoring activities.

So, what does the future of performance monitoring look like?

Join us at the virtual event on January 23, 2019, to find out. David Wadhwani, alongside Cisco executives, will make an exciting announcement about our next chapter together. During the broadcast, we’ll also feature industry analysts and customers as we engage in a lively conversation about the emerging “AIOps” category, and what impact it will have on the performance monitoring space.

You won’t want to miss this unique virtual event.

Register now for AppDynamics Transform