Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still define, select and implement anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources. AIOps can deliver significant benefits here. Applying the anomaly resolution processes seen in production will assist developers navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the benefits to time and resources still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, as the test and QA environments will not have the production workload available, it’s best to recreate simulated workloads through synthetics testing.

Once you have built trust in AIOps by applying it to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to bring these benefits to production. Let’s analyze where you’ll find these initial benefits.
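As a rough illustration of how these three measures might be tracked, the sketch below averages each phase across a set of incident records. The Incident fields, the sample timestamps, and the choice to measure MTTV from fix applied to fix verified are assumptions made for this example, not definitions taken from this series or from any product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical timestamps; the field names are illustrative assumptions.
    started: datetime
    detected: datetime
    cause_known: datetime
    fix_applied: datetime
    verified: datetime


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def phase_means(incidents):
    """Return mean time to detect, know, and verify, in minutes."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTK": mean_minutes([i.cause_known - i.detected for i in incidents]),
        "MTTV": mean_minutes([i.verified - i.fix_applied for i in incidents]),
    }


# Two invented incidents, expressed as minute offsets from a common start.
t0 = datetime(2019, 1, 7, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70),
             t0 + timedelta(minutes=100), t0 + timedelta(minutes=120)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50),
             t0 + timedelta(minutes=80), t0 + timedelta(minutes=90)),
]
print(phase_means(incidents))
```

Tracking the phases separately in test and QA shows which one shrinks as ML is switched on, before the same measurements are applied to production.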

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.
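To make time-based baselining concrete, here is a minimal sketch that learns a separate range of normal for each hour of the day from a week of samples. It uses a simple mean plus or minus k standard deviations, which is an assumed stand-in for illustration and not a description of how AppDynamics or any other product builds its baselines. The function names, the k parameter and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(samples, k=3.0):
    """Learn a (low, high) 'normal' band per hour of day.

    samples: iterable of (hour_of_day, value) pairs from a training window.
    k: band half-width in standard deviations (an assumed tuning knob).
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        mu = mean(values)
        sigma = stdev(values) if len(values) > 1 else 0.0
        baseline[hour] = (mu - k * sigma, mu + k * sigma)
    return baseline


def is_anomalous(baseline, hour, value):
    """True when value falls outside the learned band for that hour."""
    low, high = baseline[hour]
    return value < low or value > high


# One week of hypothetical logins per minute: quiet at 03:00, busy at 09:00.
training = [(3, v) for v in (10, 12, 11, 9, 10, 11, 12)]
training += [(9, v) for v in (480, 510, 495, 505, 490, 500, 520)]
baseline = build_baseline(training)

print(is_anomalous(baseline, 9, 500))  # morning spike is normal for 09:00
print(is_anomalous(baseline, 3, 500))  # same load at 03:00 is flagged
```

Because the band is keyed by hour, the morning login spike raises no alert, while the same load in the middle of the night does.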

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.
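One way to picture a KPI layered on top of an automatic baseline is sketched below. The evaluate function, the band values and the 300 ms ceiling are hypothetical, and the learned band is assumed to be supplied by whatever baselining the APM tool performs.

```python
def evaluate(value, baseline_band, health_rule_max=None):
    """Classify one streaming sample.

    baseline_band: (low, high) range learned automatically (assumed given).
    health_rule_max: optional fixed KPI ceiling layered on top.
    """
    low, high = baseline_band
    violations = []
    if not low <= value <= high:
        violations.append("baseline")
    if health_rule_max is not None and value > health_rule_max:
        violations.append("health_rule")
    return violations


band = (120.0, 380.0)  # hypothetical learned band for response time, in ms
stream = [150, 240, 395, 310, 900]
for ms in stream:
    print(ms, evaluate(ms, band, health_rule_max=300))
```

In this sample stream, the 310 ms reading sits inside the learned band but still trips the health rule, which is the case a business KPI is meant to catch.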

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move onto identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.
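The idea of an automatically built timeline can be sketched as ordering metrics by the moment each first left its baseline. The example below reuses the memory-leak scenario mentioned earlier. The metric names, readings and baseline bands are invented for illustration, and a real platform would use far richer correlation than this.

```python
def first_deviation(series, band):
    """Return the first timestamp at which a metric leaves its baseline band."""
    low, high = band
    for ts, value in series:
        if not low <= value <= high:
            return ts
    return None


def build_timeline(metrics, bands):
    """Order metrics by when each first left its baseline.

    metrics: {name: [(timestamp, value), ...]} sorted by timestamp.
    bands:   {name: (low, high)} baseline ranges, assumed already learned.
    """
    events = []
    for name, series in metrics.items():
        ts = first_deviation(series, bands[name])
        if ts is not None:
            events.append((ts, name))
    return sorted(events)


# Hypothetical minute-by-minute readings leading up to a crash at minute 5.
metrics = {
    "heap_used_pct": [(0, 55), (1, 62), (2, 81), (3, 90), (4, 97), (5, 99)],
    "gc_pause_ms": [(0, 20), (1, 22), (2, 25), (3, 140), (4, 300), (5, 800)],
    "error_rate_pct": [(0, 0.1), (1, 0.1), (2, 0.2), (3, 0.2), (4, 0.3), (5, 40.0)],
}
bands = {
    "heap_used_pct": (30, 75),
    "gc_pause_ms": (5, 60),
    "error_rate_pct": (0.0, 1.0),
}

for ts, name in build_timeline(metrics, bands):
    print(f"minute {ts}: {name} left its baseline")
```

Here the error rate, which is what a log search would surface first, is the last metric to deviate. Heap usage left its baseline three minutes earlier, which points toward the leak as the origin.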

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.
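A small worked example shows how a ranked list of probable causes can narrow as evidence accumulates. The sketch applies a plain Bayesian update, chosen here purely for illustration; it is not a claim about the models inside any AIOps product. The candidate causes and all likelihood values are invented.

```python
def update(priors, likelihoods):
    """One Bayesian update: posterior is proportional to prior * likelihood."""
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}


def ranked(beliefs):
    """Candidate causes ordered from most to least likely."""
    return sorted(beliefs, key=beliefs.get, reverse=True)


# Hypothetical candidate causes, all equally likely before any evidence.
beliefs = {"memory_leak": 1 / 3, "slow_database": 1 / 3, "network_loss": 1 / 3}

# Each dict gives P(observation | cause) for one new piece of evidence.
# The numbers are invented for illustration.
evidence = [
    {"memory_leak": 0.7, "slow_database": 0.6, "network_loss": 0.2},  # latency up
    {"memory_leak": 0.8, "slow_database": 0.3, "network_loss": 0.3},  # GC pauses up
    {"memory_leak": 0.9, "slow_database": 0.1, "network_loss": 0.1},  # heap growing
]

for likelihoods in evidence:
    beliefs = update(beliefs, likelihoods)
    print({c: round(p, 2) for c, p in beliefs.items()})

print(ranked(beliefs))
```

After the first observation, two causes are nearly tied, so the output is best read as a priority list. By the third observation, one cause holds most of the probability.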

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification (MTTV)

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
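Verification against a baseline can be sketched as waiting for a run of in-range samples before declaring the anomaly over. The function below is a hypothetical illustration. The requirement of three consecutive normal readings is an assumed guard against a single lucky sample, and the band and readings are invented.

```python
def recovery_index(values, band, required=3):
    """Index of the sample at which recovery is confirmed, or None.

    Recovery is confirmed once `required` consecutive samples sit inside
    the baseline band; the streak length is an assumed guard against
    declaring victory on a single lucky reading.
    """
    low, high = band
    streak = 0
    for i, value in enumerate(values):
        streak = streak + 1 if low <= value <= high else 0
        if streak == required:
            return i
    return None


band = (120, 380)  # hypothetical baseline for response time, in ms
after_fix = [900, 640, 350, 410, 300, 280, 260, 250]
print(recovery_index(after_fix, band))
```

The brief dip to 350 ms does not count as recovery because the next reading leaves the band again. Recovery is confirmed only once three readings in a row are back inside the baseline.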

Later in your rollout, when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention. Part three of this series will discuss connecting this visibility and insight to action.

AppDynamics and Cisco To Host Virtual Event on AIOps and APM


To mark the two-year anniversary of Cisco’s intent to acquire AppDynamics, the worldwide leader in IT, networking, and cybersecurity solutions will join AppDynamics for a one-of-a-kind virtual launch event on January 23, 2019. At AppDynamics Transform: AIOps and the Future of Performance Monitoring, David Wadhwani, CEO of AppDynamics, will share what’s next for the two companies, and lead a lively discussion with Cisco executives, Okta’s Chief Information Officer, Mark Settle, and Nancy Gohring, Senior Analyst at 451 Research. At the event, we’ll talk through what challenges leaders face and how they’re preparing for the future of performance monitoring.

Technology Leaders to Weigh In On the Impact of AI and the Future of Performance Monitoring

Today, application infrastructure is increasingly complex. Organizations are building and monitoring public, private, and hybrid cloud infrastructure alongside microservices and third party integrations. And while these developments have made it easier for businesses to scale quickly, they’ve introduced a deluge of data into the IT environment, making it challenging to identify issues and resolve them quickly.

APM solutions like AppDynamics continue to lead the way when it comes to providing real-time business insights to power mission-critical business decisions. However, recent research has revealed a potential blind spot for IT teams: A massive 91% of global IT leaders say that monitoring tools only provide data on the performance of their own area of responsibility. For IT teams that want to mitigate risk as a result of performance problems, and business leaders who want to protect their bottom line, this blind spot represents a huge opportunity for improvement.

The Next Chapter in the AppDynamics and Cisco Story

As application environments continue to grow in complexity, so does the need for more comprehensive insight into performance. But technology infrastructure is simply too large and too dynamic for IT operations teams to manage manually. Automation for remediation and optimization is key, and that’s where innovations in artificial intelligence (AI) have the potential to make a huge difference in monitoring activities.

So, what does the future of performance monitoring look like?

Join us at the virtual event on January 23, 2019, to find out. David Wadhwani, alongside Cisco executives, will make an exciting announcement about our next chapter together. During the broadcast, we’ll also feature industry analysts and customers as we engage in a lively conversation about the emerging “AIOps” category, and what impact it will have on the performance monitoring space.

You won’t want to miss this unique virtual event.


The Rise of AIOps: How Data, Machine Learning, and AI Will Transform Performance Monitoring

Over the last decade, application environments have exploded in complexity.

Gone are the days of managing monoliths. Today’s IT professionals are tasked with ensuring the performance and reliability of distributed systems across virtualized and multi-cloud environments. And while it may be true that the emergence of this modern application environment has provided the speed and flexibility professionals demand, these numerous services have unleashed a deluge of data on the enterprise IT environment.

Application performance monitoring (APM) solutions have proven essential in helping leaders take back control by providing the real-time insights needed to take action. But as the volume of data in IT ecosystems increases, many professionals are finding it challenging to take a proactive approach to managing it all. While automating tasks has helped teams free up some bandwidth for operations and planning, automation alone is no match for today’s increasingly complex environments. What’s needed is a strategy focused on reducing the burden of mounting IT operations responsibilities, and surfacing the insights that matter the most so that businesses can take the right action.

So, what are forward-thinking IT professionals doing to stay ahead of the curve?

Many are applying what’s being called an AIOps approach to the challenge of application environment complexity. This approach leverages advances in machine learning and artificial intelligence (AI) to proactively solve problems that arise in the application environment. Even though relatively new, the approach is gaining momentum. And for good reason: Using AI to identify potential challenges within the application environment doesn’t just help IT professionals get ahead of problems — it helps companies avoid revenue-impacting outages that jeopardize the customer experience, the business, and the brand.

In order to fully understand the rise of AIOps and why it has developed the momentum it has, we wanted to dig deeper to uncover the actual challenges faced by IT professionals, and how they’re managing them in an increasingly complex application environment. To accomplish that, AppDynamics undertook a study of 6,000 global IT leaders in Australia, Canada, France, Germany, the United Kingdom, and the United States. Their responses answered three key questions about the shift in the performance space:

(1) What’s the current enterprise approach to managing increasing application environment complexity?

(2) How are global IT leaders taking a proactive approach to identifying problems in the application environment?

(3) How broadly is AI identified as a potential solution to reducing complexity in IT ecosystems?

Let’s see what the research revealed.

The Demand for Proactive Application Performance Monitoring Tools

Today, midsize to large companies use an average of eight different cloud providers for various enterprise applications and services. As a result, IT professionals are managing an ever-increasing set of tasks that have the potential to become disconnected if not managed properly. What’s more, within these highly distributed systems, IT leaders must grapple with the impact of new code being deployed, as well as the virtually infinite potential outcomes associated with doing so. Without a unified view of how all of these elements interact, there’s significant potential for issues to arise that impact performance — and, ultimately — the customer experience.

New research from AppDynamics underscores the cause for concern: 48% of enterprises surveyed say they’re releasing new features or code at least monthly, but their current approach to monitoring only provides a siloed view on the quality and impact of each release. In fact, of those enterprises that release on that cadence, a massive 91% say that monitoring tools only provide data on how each release drives the performance of their own area of responsibility.

Research from AppDynamics indicates performance monitoring remains siloed.

Should these findings raise eyebrows? Absolutely.

That’s because they indicate that for the vast majority of those surveyed, a holistic view of business and customer value is still difficult to achieve. And that puts innovation — as well as modern, best-in-class software development practices like continuous delivery — at serious risk.

But that’s where leveraging data about the application environment using machine learning, as well as AI, can make a massive difference. Instead of merely ingesting data from every dimension of the application environment, these tools can help IT professionals build a more proactive approach to APM.

And, by all accounts, that’s what most global IT leaders want.

According to research findings from AppDynamics, 74% of those surveyed said they want to use monitoring and analytics tools proactively to detect emerging business-impacting issues, optimize user experience, and drive business outcomes like revenue and conversion. But according to our research, 42% of respondents are still using monitoring and analytics tools reactively to find and resolve technical issues. There are indications, however, that this approach is extremely problematic for businesses. Beyond serving as a pain point for IT professionals in terms of capacity and resource planning, reactive monitoring can, in some cases, cost businesses hundreds of thousands of dollars in lost revenue.

The majority of IT professionals want to use monitoring tools more proactively.

How Reactive Monitoring Hurts Performance, Revenue, and Brand

From e-commerce to banking, booking flights to watching movies on Netflix, applications have spread into every part of people’s lives. As a result, consumers have high expectations for application performance that businesses must deliver on. If not, they risk jeopardizing brand loyalty and, as our research revealed, their bottom line.

“As the broader technology landscape undergoes its own dramatic change, forcing businesses to double down on their customer focus, managing the performance of applications has never been more critical to the bottom line.” — Jason Bloomberg, The Rebirth of Application Performance Management

IT professionals have long relied on the mean time to repair (MTTR) metric to evaluate the overall health of an application environment. The longer it takes to resolve an issue, the greater the potential for it to turn into a significant business problem, particularly in an increasingly fast-paced digital world. However, in this latest AppDynamics research, we made a startling discovery: most organizations are grappling with a high average MTTR. Respondents reported that it took an average of one business day, or seven hours, to resolve a system-wide issue.

But that wasn’t the most alarming finding.

Our research also revealed that many enterprise IT teams weren’t notified about performance issues via monitoring tools at all. In fact:

  • 58% find out from users calling or emailing their organization’s help desk
  • 55% find out from an executive or non-IT team member at their company who informs IT
  • 38% find out from users posting on social networks

AppDynamics research reveals how performance problems are being discovered in the enterprise.

To fully appreciate the impact of a seven-hour MTTR on a business, AppDynamics asked survey respondents to report the total number of dollars lost during an hour-long outage, and used that figure to extrapolate the typical cost of an average, day-long outage. For the United States and United Kingdom, the cost of an average outage totals $402,542 USD and $212,254 USD, respectively (the cost of an outage in the United Kingdom was converted into United States dollars).
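The text does not spell out the extrapolation method. Assuming it is a simple multiplication of the reported hourly loss by the seven-hour average MTTR, the implied hourly figures can be backed out from the published totals, as in this short sketch:

```python
MTTR_HOURS = 7  # the average MTTR reported in the survey

# Published cost of an average outage, in US dollars.
outage_cost_usd = {
    "United States": 402_542,
    "United Kingdom": 212_254,
}

for country, total in outage_cost_usd.items():
    hourly = total / MTTR_HOURS  # assumes a linear, per-hour extrapolation
    print(f"{country}: ${hourly:,.0f} per hour x {MTTR_HOURS} hours = ${total:,}")
```

Under that assumption, both totals divide evenly by seven, giving $57,506 per hour for the United States and $30,322 per hour for the United Kingdom.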

United States

AppDynamics research revealed that companies in the United States on average lose $402,542 for a single service outage.

United Kingdom

The high cost of a performance outage in the United Kingdom.

It’s important to note that these figures reflect the total cost for a single outage in the enterprise — if a company has more than one, that figure can rise dramatically. In fact, a substantial 97% of global IT leaders surveyed said they’d had performance issues related to business-critical applications in the last six months alone.

Of the 6,000 IT professionals AppDynamics surveyed, 97% said they’d experienced a service outage in the last six months.

In addition to the impact on a company’s bottom line, global IT leaders reported that reactive performance monitoring had created stressful war room situations and damaged their brand. 36% said they had to pull developers and other teams off other work to analyze and fix problems as they presented themselves, and nearly a quarter of respondents said slow root cause analyses drained resources.

The takeaway here is clear: global IT leaders need to build a more proactive approach to APM in order to lower MTTR and protect their bottom line. But in today’s increasingly complex application environment, that’s easier said than done.

Unless, of course, you’re developing an AIOps strategy to manage it.

The Risk of Not Adopting an AIOps Strategy

AppDynamics research showed that the overwhelming majority of IT professionals want a more proactive approach to APM, but one of the main ways of achieving that, the adoption of an AIOps strategy, isn’t being widely pursued by global IT teams in the near term.

In fact, the global IT leaders AppDynamics surveyed reported that although they believe AIOps will be critical to their monitoring strategy, only 15% identified it as a top priority for their business in the next two years.

AppDynamics research reveals that the vast majority of IT professionals surveyed don’t have an AIOps strategy in place in the near-term.

What’s more, the capabilities that respondents identified as essential to APM in the next 5 years are precisely those that AIOps has the potential to help provide. For example:

Intelligent alerting that can be trusted to indicate an emerging issue.
49% of respondents identified this feature as core to their performance monitoring capabilities in the next five years. By ingesting data from any application environment, AIOps platforms and technology can play a pivotal role in not just automating existing IT tasks, but identifying and managing new ones based on potential problems detected in the application environment.

Automated root cause analysis and business impact assessment.
44% of respondents said solving problems quickly and understanding their impact on the business would play a crucial part in their performance management in the years ahead. With the help of AIOps technology, this can be achieved, providing increased agility in the face of potential service disruptions or threats, and without additional drain on resources.

Automated remediation for common issues.
42% of survey respondents said that they needed to build automated remediation into their strategy for performance monitoring. With AIOps, it’s easy to automate remediation not only for known issues, but for unknown issues, too. That’s because it not only ingests data from your application environment, but also provides more intelligent insights as a result.

Leading The Way With AIOps Strategy and Platforms  

Despite increasingly complex application environments, few of the global IT leaders surveyed are prioritizing the development of an AIOps strategy, which would allow them to implement the platforms and practices to permit proactive identification of issues before they become system-wide problems. Instead, global IT leaders report an average MTTR rate that hovers at a full business day, and has the potential to cost companies hundreds of thousands of dollars in lost revenue with each incident.

What’s more, AppDynamics research findings also make it clear that many global IT leaders are struggling to integrate monitoring activities into the purview of the broader business. This can cause significant delays in MTTR, as noted, as well as make companies vulnerable to service disruptions that can cause irreparable harm to the customer experience, and the enterprise as a whole.

While IT leaders have expressed a desire for a more proactive approach to monitoring, this research indicates that there’s still plenty of work to be done on numerous fronts. But the first step is clear: IT leaders must prioritize the development of an AIOps strategy and related technology. In doing so, they’ll simplify the demands of an increasingly complex application environment, and build a stronger connection from IT to the business as a whole.


Editor’s Note: In this piece, the term “global IT leaders” refers to the respondents surveyed for this report. The term “IT professionals” refers to people in the IT or related professions as a whole.

Using Gartner’s Peer Insights to Evaluate APM Providers

Application performance management (APM) companies are quickly becoming some of the most important players in the IT landscape as enterprises strive to deliver best-in-class customer experiences.

Last year, the application performance management market was valued at $4.6 billion, and that number is projected to nearly double within just a few years, reaching $8.7 billion by 2023. But as the APM market continues to explode, it’s becoming harder than ever to navigate the APM vendor landscape.

With so many options, where do you start your search for an APM vendor that can meet your needs? And how do you narrow the increasingly large and complex field of APM providers to find the right solution for your organization?

The Role of Peer Reviews in the APM Buying Process

Just as you might look to Yelp for a new restaurant recommendation, or wouldn’t dream of purchasing anything on Amazon without at least a 3-star rating, peer reviews have dramatically changed the way we buy things—both in the consumer and business worlds.

Research shows that 63 percent of B2B software buyers already use reviews to help them create a shortlist of products to evaluate, and 62 percent of B2B buyers say that today they rely more on peer recommendations than they did in the past. And it’s easy to see why tech buyers are putting more stock in peer reviews: utilizing peer insights when evaluating technology providers can reduce business risk and optimize the final purchase decision.

For these reasons, it makes sense that relying on the insight and real-world experience of fellow IT professionals is a great place to start when it comes to creating a shortlist of APM providers to consider. And with Gartner’s new APM peer insights report—based on nearly 1,000 reviews and ratings published by real APM end users over the last year—enterprises now have access to the data they need to improve and streamline the APM vendor evaluation and selection processes.

APM Peer Insights Vetted by Gartner

Filled with peer review data that has been rigorously vetted, verified, and analyzed by Gartner, the report can be used to inform your APM buying process with authentic, trustworthy customer feedback.

Here are just a few of the invaluable insights the report covers in detail:

  • How your peers rate APM vendors on each step in the buying process: evaluation and contracting, integration and deployment, and service and support

  • Which vendors APM technology buyers typically evaluate alongside each other during the consideration stage

  • The only 5 APM providers to receive the coveted 2018 Customers’ Choice distinction (spoiler alert: one was AppDynamics)

  • What percentage of each vendor’s customers are willing to recommend their chosen solution to other organizations

  • Reviewer demographics—such as industry and company size—so you can prioritize feedback from peers that may have similar goals and challenges to yours

Ready to dive in? Get the free Gartner report now to access the peer review data you need to navigate the APM technology landscape, make better shortlisting decisions, and simplify the vendor selection process.

Gartner Peer Insights ‘Voice of the Customer’: Application Performance Monitoring Suites Market, 9 October 2018

Why Application Intelligence + Network Intelligence Equals Better Business Outcomes

As more enterprises distribute applications not only between data centers, but also across data centers and multiple clouds, the application footprint is growing in size and complexity. And with companies increasingly relying on better end-to-end performance as a key requirement for business success, performance implications for these highly distributed and scalable applications are greater than ever.

Indeed, application performance in today’s hyper-connected social world directly impacts a business’s brand, revenues and customer stickiness. Application performance monitoring (APM) is critically important, of course, but APM can provide far better results when application and business performance metrics are leveraged to program the underlying network policy. The end result can be application-driven, end-to-end control that’s highly effective regardless of the underlying network/cloud infrastructure.

In this blog—the first in a series—we’ll examine the pain points associated with the lack of application and network correlation, and discuss the benefits of APM when correlated with underlying network visibility and monitoring. We’ll explore how business and application performance metrics and policy, when correlated with underlying network information, can provide the fastest root cause analysis (RCA). We’ll also look at how this integration between application and network performance can reduce the risk of unexpected application outages, simplify application deployment, and boost trust and understanding across teams. These benefits ultimately will lead to better customer experiences and business outcomes for critical application and business transactions.

The Benefits of Modern Apps

Most applications developed in recent years are highly distributed from the ground up. Traditional client-server models have given way to containerized, virtualized, distributed apps built using state-of-the-art frameworks, technologies and specialized third-party services. A modern app may even be written as a wrapper/enclosure for a legacy application in application-modernization projects. And the use of agile DevOps methods to develop and operate these apps can mean frequent rollouts and changes to production environments.

Modern apps are growing in complexity and scale. They’re capable of running in multiple environments and are accessible via myriad devices, including PCs, mobile gadgets and IoT endpoints. These apps traverse a variety of networks, from traditional data centers to multiple WAN links to the cloud. Within the datacenter (DC)—whether a private DC or a public cloud colo facility—the size and complexity of the underlying network is growing to support modern application deployment models and to scale as needed. All of this is driving the need for faster root-cause identification of problems.

But Modern Apps Can Bring Pain, Too

In contrast to the growing complexity of modern-app deployment, the end-user experience requires great simplicity. Complicating matters is the fact that users demand flawless app performance 24/7. Unsurprisingly, many pain points are associated with achieving this goal.

Let’s examine traditional network issues that adversely affect application performance; understanding them is critical to finding root cause faster. As you’re aware, application slowdowns or failures lead to a poor end-user experience. These incidents can be caused by a number of network-related issues, including:

  • Incorrect network configuration for the application’s needs; something as simple as the duplex or speed of a switch port can cause big problems.

  • Firewall or load balancer misconfiguration—not allowing traffic for a particular application component.

  • Improper permissions that block good traffic from accessing an application service or,  conversely, allow bad traffic to access an app component or service.

  • Packet loss due to overwhelming load on a network device, insufficient bandwidth, or other factors.

  • Packet loops or extra inefficient hops in the network.

  • Network policies that inadvertently impact application performance, such as those discussed below.

A large portion of modern enterprise application traffic can be classified as east-west—in a datacenter environment, that’s traffic moving between application servers, databases, firewalls, load balancers and enterprise storage devices. Some network issues are unique to modern data centers and can adversely impact both application performance and the end-user experience. Examples include:

  • Wrong mapping of application requirement (policy) to underlying switch fabric/ports.

  • Incorrect switch configuration, causing fabric loops for data between systems, or incorrect drops.

  • Wrong or outdated storage access policy or configuration.

  • Inefficient virtual machine-to-physical port configuration, i.e., wrong virtual-to-physical (v-to-p) or physical-to-virtual (p-to-v) mappings.

  • Cabling issues on top-of-rack (TOR) or end-of-row switches (EOR).

  • Inefficient or wrong power budget, and other factors.

Cloud-related network issues can also impact app performance, including incorrect configuration of virtual private gateway, security group, virtual router capacity, and traditional DC and cloud DC gateway settings.

The Problem with IT Silos

Application outages and slowdowns are often technological in nature, although many are exacerbated by organizational issues. Most IT organizations evolve from silo-based org structures and skill sets, including app ops, datacenter network, wide area network, security, desktops, cloud, and so on. In many cases, these siloed organizations don’t communicate or work well together.

Furthermore, these silos often use their own set of tools for performance monitoring and troubleshooting—different tools for network monitoring of routers, switches, firewalls and load-balancers, for instance. And while these tools may do a decent job of detecting problems, they solve siloed problems for their respective domains.

Another issue is that these tools don’t provide cross-domain correlation, nor are they able to map application slowdowns to specific network issues. And while some tools attempt to do this, they don’t map from business transactions—how an end-user interacts with or uses the application all the way through the network—without extensive war-room involvement.

In production environments (where there is tremendous pressure from the business), these balkanized orgs and tools focus on silo-specific, “not-my-problem” outcomes that fail to resolve end user or customer problems. This phenomenon, known as mean-time-to-innocence (MTTI), saps time, effort and energy from companies, resulting in a loss of productivity and customer stickiness.

How the Integration of Network, APM and Troubleshooting Brings Value to Ops Teams

The ability to see application performance issues in near-real time, correlated to underlying network performance, is exceptionally valuable. Mapping application changes and policies to underlying data center policy can go a long way toward driving efficiencies inside an organization, as more than three-fourths of data center traffic is east-west, according to Cisco’s Global Cloud Index.

The ability to dynamically discover application topology, as well as proactively identify application performance bottlenecks all the way down to a specific data center or network segment, can prove very beneficial to an organization.

This integration of network, APM and troubleshooting offers many benefits. Some key ones are:

  • Fastest app-to-network root cause analysis: Fast and flexible mapping of application changes to the underlying DC network. By mapping application policy to underlying network policy, network ops teams can receive application-driven information quickly. This increases productivity by avoiding war-room scenarios, and is by far the biggest benefit in modern networks and data centers where enterprise apps are deployed.

  • Reduced risk of unexpected application outages: When app ops can provide proactive alerts to network ops on specific network or data center slowdowns involving an application or business transaction, network ops can focus on the root cause to prevent further performance degradation and/or outages.

  • Simplified application deployment: The ability to generate network policy based on application topology (the whitelist model) helps simplify app deployment.
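To make the whitelist model in the last bullet concrete, here is a minimal sketch of deriving allow rules from an observed application topology. Everything in it is an assumption made for illustration: the tier names, ports, and the `buildWhitelist` and `isAllowed` helpers are hypothetical and are not part of any AppDynamics or Cisco API.

```javascript
// Hypothetical topology: each entry is a dependency observed between tiers.
const observedFlows = [
  { from: "web", to: "app", port: 8080 },
  { from: "app", to: "db", port: 5432 },
  { from: "web", to: "app", port: 8080 }, // same flow observed twice
];

// Whitelist model: emit one allow rule per distinct observed flow.
// Any traffic that matches no rule is treated as denied.
function buildWhitelist(flows) {
  const seen = new Set();
  const rules = [];
  for (const { from, to, port } of flows) {
    const key = `${from}->${to}:${port}`;
    if (!seen.has(key)) {
      seen.add(key);
      rules.push({ action: "allow", from, to, port });
    }
  }
  return rules;
}

// Check a candidate connection against the generated rules.
function isAllowed(rules, from, to, port) {
  return rules.some((r) => r.from === from && r.to === to && r.port === port);
}

const rules = buildWhitelist(observedFlows);
console.log(rules);
console.log(isAllowed(rules, "web", "db", 5432)); // web never talks to db directly
```

The point of the design is that the policy follows the application: when the topology changes, the rules are regenerated rather than edited by hand.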

Finally, from an organizational perspective, correlated views can reduce mean-time-to-innocence. This helps app ops work better with network ops when reporting slowdowns to the business. A common dashboard with important KPIs makes this effort a lot easier. This cooperation not only promotes trust between app ops/devOps and network ops teams, it also provides a better operational view for the business.

A Major Win for App Ops, Network Ops, and the Business

The correlation of application performance metrics—from business transaction and end-user experience all the way through the underlying network—is critical for business and operational excellence. This shared view of application and network performance delivers key benefits such as reduced mean-time-to-innocence, better cross-team collaboration, and a simplified operational business model. Having an app-centric and business-level view of underlying network performance bottlenecks leads to greater customer satisfaction overall.

Schedule a demo to learn how AppDynamics and Cisco are working together to bring this visibility to enterprises everywhere.

Healthcare Reform and Application Performance Monitoring

Regardless of your political views, healthcare reform is truly, and no pun intended, reforming healthcare in the United States. Everyone is probably familiar with the Affordable Care Act (ACA) of 2010, or “Obamacare,” which was enacted to increase the quality and affordability of healthcare in the United States. Another piece of legislation that affects the healthcare industry was enacted in 2009 and is commonly known as the “Stimulus.” Among the many provisions of the “Stimulus,” formally the American Recovery and Reinvestment Act (ARRA), are new regulations around Healthcare IT (HIT), chief among them Meaningful Use (MU).

Broken out into three stages, the MU program provides financial incentives for the “meaningful use” of certified Electronic Health Record (EHR) technology. To receive an EHR incentive payment, providers have to show that they are using their certified EHR technology by meeting certain measurement thresholds that range from recording patient information as structured data to exchanging summary care records.

The HIT Industry is Slow to Change

While the ARRA provides financial incentives to hospitals and eligible professionals to automate medical records (let’s call this the carrot), it also penalizes hospitals and eligible professionals that do not demonstrate and attest to MU by reducing Medicare and Medicaid reimbursements over time (let’s call this the stick).

MU will require change and, based on my years of experience as an HIT consultant and application provider, the healthcare industry is slow to adopt change.

Prior to joining AppDynamics, I participated in a major system upgrade for a large hospital system. This upgrade was necessary for Meaningful Use attestation. The hospital was three major releases behind the current release of its EHR software, and the features required for MU attestation were only available in the latest release. By the way, this upgrade latency is not uncommon in HIT.

Because of the severity of the change and the complexity of the environment, contingency plans were put in place to assure the hospital could continue to care for patients should any of the EHR components fail to upgrade properly. However, nothing could prepare the team for what happened next. 

A Problem Arises

Two days after the final upgrade outage, and just as everyone was ready to head home after a number of sleepless nights, a frantic call came into the upgrade command center. The call came from the nurse shift supervisor of the emergency department (ED). If you have ever met an ED nurse, you will understand it when I say that an ED nurse is not someone you want to upset.

The vast array of people caring for patients in an emergency department can be overwhelming. In order to provide visibility and bring order to a very intense operation, the ED relies on a number of critical tools. One such tool is the Tracking Board application. It provides visibility into length of stay, staff assignment, room assignment, lab order tracking, patient criticality and many more vital data points, all of which can have a major effect on patient safety, the top priority for any healthcare professional.

The ED tracking board was unusable and the ED was operating in the dark. Without the visibility provided by the Tracking Board application, the ED was at a standstill. While far from ideal, in such situations the ED shuts down. But because this particular hospital is the only Level I trauma center in the region, this wasn’t an option. And because the software’s other modules were working properly and had technical dependencies on one another, a downgrade wasn’t an option either.

The command center became a war room. Clinical analysts from the hospital, project managers from both sides of the implementation, a large ensemble of high-level clinical staff and executives from the hospital, the entire infrastructure team, DBAs, interface engine administrators, and developers from three different continents were all locked in and given a clear instruction: “Don’t leave until this issue is resolved.”

Minutes turned into hours, then into days. The situation in the emergency room was coincidentally turning into an emergency itself. The Chief Medical Information Officer (CMIO) of the hospital brought the vendor project manager to tears and the ED nurses were gathering their torches and pitchforks and marching against the IT department. All appeared lost, and after days of outage and close to 1,000 man-hours spent trying to find the root cause of the problem, everyone was ready to walk out.

The patients, however, could not walk away. Many of them had life-threatening conditions, and the queue outside the ED was only growing longer. Patient safety is job #1 for everyone in healthcare, and the unavailability of the ED Tracking Board application was affecting every patient’s safety!

AppDynamics to the Rescue! 

Clearly, it was time for an intervention. Unlike the TV reality series, this intervention didn’t come in the form of over-emotional family members, but in the form of an APM solution. AppDynamics was deployed and quickly generated a flow map of the entire application environment. Within minutes, business transactions (BT) from within the software itself and from all adjacent systems that interface with the Tracking Board application began to pour in.

A business transaction (BT) is a key feature of AppDynamics which, in simple terms, allows the users of AppDynamics to map the application based on how its users experience it.

A key BT captured by AppDynamics was one happily named “UpdateCycle.” As reported by the Tracking Board application vendor, the “UpdateCycle” BT was responsible for querying its own database, the interface engine, and a variety of disjointed data sources, and then updating an operational dashboard displayed via digital signage throughout the ED.

As the team monitored the application via AppDynamics, looking for clues as to why the application was failing, we noticed that the UpdateCycle transaction volume was 100x what was expected. In general, the Tracking Board dashboards update every minute for each of the viewers. Considering there were ~10 viewers at any given time, the system was designed to support tens of transactions per minute, and it was failing because we were receiving thousands of transactions per minute.
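The arithmetic behind that observation can be sketched in a few lines of JavaScript. This is an illustration only, not how AppDynamics computes baselines: the function names and the 10x alert factor are assumptions, and the observed rate of 1,000 per minute is simply a stand-in for “thousands.”

```javascript
// Hypothetical sketch: compare an observed transaction rate with the rate
// the design implies. None of these names come from AppDynamics.

// Expected load: each viewer triggers one UpdateCycle per refresh.
function expectedPerMinute(viewers, refreshesPerViewerPerMinute) {
  return viewers * refreshesPerViewerPerMinute;
}

// Flag the load when it exceeds the expectation by `factor` (assumed: 10x).
function isOverloaded(observedPerMinute, expected, factor = 10) {
  return observedPerMinute > expected * factor;
}

const expected = expectedPerMinute(10, 1); // ~10 viewers, one refresh per minute
const observed = 1000; // assumed stand-in for "thousands" per minute

console.log(`expected ${expected}/min, observed ${observed}/min, ratio ${observed / expected}x`);
console.log(isOverloaded(observed, expected) ? "overloaded" : "within design");
```

With roughly 10 viewers refreshing once a minute, the design load is about 10 transactions per minute, so a rate in the thousands is about two orders of magnitude too high.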

A faulty client-side configuration was overloading the server and causing it to generate slow responses back to the clients. The listener was working overtime, getting a response back every few seconds and trying to update the ED tracking board constantly. The result was constant updates to the signage stations and webpages, which made the system inoperable.

Using AppDynamics, the team was able to locate the root cause within one hour of deployment and the change itself took less than 5 minutes. The web server was restarted and all was calm in the kingdom of the ED.

From the moment “yours truly” recommended the use of AppDynamics until everyone left the war room for good, less than 4 hours had elapsed. Let me say this again: 4 hours was all it took to download the software, install it, capture traffic and resolve the issue!

Take a few minutes to get complete visibility into the performance of your production applications with AppDynamics today.

Introducing Nodetime for Node.js Monitoring

Node.js is rapidly becoming one of the most popular platforms for building fast, scalable web applications. According to a W3C tech survey, adoption of Node.js doubled in the last year alone, and the Node.js application server is currently the 14th most popular in the world. Today a range of organizations use Node.js to power their mobile applications, including LinkedIn, Walmart and Klout. With Nodetime you can monitor, troubleshoot and diagnose performance issues in your Node.js applications.

Nodetime reveals the internals of your application and infrastructure through profiling and proactive monitoring, enabling detailed analysis, fast troubleshooting, and performance and capacity optimization. Monitor the real-time and historical state of the application by following multiple application metrics. Over three thousand organizations use Nodetime to monitor their Node.js applications, including Condé Nast, Kabam and Change.org.

“We’re very excited to see AppDynamics pursuing the latest and most innovative technologies with this acquisition,” said Nic Johnson, Web Architect at FamilySearch, a customer of both AppDynamics and Nodetime. “Both Nodetime and AppDynamics are essential parts of our toolset, and we’re very excited to see them united in the same product and the same great company. This will mean great things both for AppDynamics customers and for the Node.js community as a whole.”

Nodetime Dashboard for at-a-glance metrics of application health:

CPU profiler showing a backtrace to find the root cause of a performance problem:

Nodetime metrics cover operating system state, garbage collection activity, application capacity, transactions and database calls for supported libraries, such as HTTP, File System, Socket.io, Redis, MongoDB, MySQL, PostgreSQL, Memcached and Cassandra.

Explore any application or OS metric via the Nodetime metric browser:

Get started for free with Node.js performance monitoring with Nodetime.

Nodetime installation is extremely easy:

npm install nodetime

Once the Nodetime module is available, simply add the following to your Node application:


require('nodetime').profile({
  accountKey: 'xxx', // placeholder: use your own Nodetime account key
  appName: 'MyApp'
});

Adding application performance monitoring for Node.js applications has never been easier. Enjoy!

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Thoughts? Let us know on Twitter @AppDynamics!

Intelligent Alerting for Complex Applications – PagerDuty & AppDynamics

Today AppDynamics announced integration with PagerDuty, a SaaS-based provider of IT alerting and incident management software that is changing the way IT teams are notified, and how they manage incidents in their mission-critical applications. By combining AppDynamics’ granular visibility of applications with PagerDuty’s reliable alerting capabilities, customers can make sure the right people are proactively notified when business impact occurs, so IT teams can get their apps back up and running as quickly as possible.

You’ll need a PagerDuty and AppDynamics license to get started – if you don’t already have one, you can sign up for free trials of PagerDuty and AppDynamics online.  Once you complete this simple installation, you’ll start receiving incidents in PagerDuty created by AppDynamics out-of-the-box policies.

Once an incident is filed it will have the following list view:

When the ‘Details’ link is clicked, you’ll see the details for this particular incident including the Incident Log:

If you are interested in learning more about the event itself, simply click ‘View message’ and all of the AppDynamics event details are displayed, showing which policy was breached, the violation value, the severity, and so on:

Let’s walk through some examples of how our customers are using this integration today.

Say Goodbye to Irrelevant Notifications

Is your work email address included in some sort of group email alias at work and you get several, maybe even dozens, of notifications a day that aren’t particularly relevant to your responsibilities or are intended for other people on your team?  I know I do.  Imagine a world where your team only receives messages when the notifications have to do with their individual role and only get sent to people that are actually on call.  With AppDynamics & PagerDuty you can now build in alerting logic that routes specific alerts to specific teams and only sends messages to the people that are actually on-call.  App response time way above the normal value?  Send an alert to the app support engineer that is on call, not all of his colleagues.  Not having to sift through a bunch of irrelevant alerts means that when one does come through you can be sure it requires YOUR attention right away.
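The routing idea above can be sketched as a small lookup. This is a hypothetical illustration and not the PagerDuty or AppDynamics API: the alert categories, team names, responder names, and the `routeAlert` helper are all assumptions.

```javascript
// Hypothetical routing tables; in practice these would come from your
// alerting configuration and on-call schedule.
const teamForCategory = new Map([
  ["response-time", "app-support"],
  ["node-down", "infrastructure"],
]);
const onCallForTeam = new Map([
  ["app-support", "alice"],
  ["infrastructure", "bob"],
]);

// Route an alert to the one person on call for the owning team,
// rather than to everyone on a group alias.
function routeAlert(alert) {
  const team = teamForCategory.get(alert.category);
  if (team === undefined) return null; // no owner configured for this category
  return { team, notify: onCallForTeam.get(team) };
}

console.log(routeAlert({ category: "response-time", message: "Response time far above normal" }));
```

Only the on-call app support engineer is notified about the response time alert; nobody else on the team receives it.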

Automatic Escalations

If you are only sending a notification and assigning an incident to one person, what happens if that person is out of the office or doesn’t have access to the internet / phone to respond to the alert?  Well, the good thing about the power of PagerDuty is that you can build in automatic escalations.  So, if you have a trigger in AppDynamics to fire off a PagerDuty alert when a node is down, and the infrastructure manager isn’t available, you can automatically escalate and re-assign / alert a backup employee or admin.
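As a rough sketch of how an escalation chain behaves, the snippet below picks the responder based on how long an incident has gone unacknowledged. It is not PagerDuty’s implementation or API: the policy shape, the 15-minute timeouts, and the `whoToNotify` helper are assumptions made for illustration.

```javascript
// Hypothetical escalation policy: ordered levels, each with a timeout after
// which an unacknowledged incident moves to the next level.
const escalationPolicy = [
  { responder: "infrastructure manager", timeoutMinutes: 15 },
  { responder: "backup admin", timeoutMinutes: 15 },
];

// Return who should be alerted after the incident has gone
// unacknowledged for the given number of minutes.
function whoToNotify(policy, minutesUnacknowledged) {
  let elapsed = 0;
  for (const level of policy) {
    elapsed += level.timeoutMinutes;
    if (minutesUnacknowledged < elapsed) return level.responder;
  }
  // Past the final timeout: stay with the last level in the chain.
  return policy[policy.length - 1].responder;
}

console.log(whoToNotify(escalationPolicy, 5));  // first level is still responsible
console.log(whoToNotify(escalationPolicy, 20)); // escalated to the backup
```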

The Sky is Falling!  Oh Wait – We’re Just Conducting Maintenance…

Another potentially annoying situation for IT teams is the flood of alerts that get fired off during a maintenance window. PagerDuty has the concept of a maintenance window so your team doesn’t get a bunch of doomsday messages during maintenance. You can even set up a maintenance window with one click if you prefer to go that route.

Either way, no new incidents will be created during this time period… meaning your team will be spared having to open, read, and file the alerts and update / close out the newly-created incidents in the system.
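The suppression logic amounts to a time-range check before an incident is created. The sketch below is a hypothetical illustration, not PagerDuty’s API: the window times, alert shape, and the `inMaintenance` and `recordAlert` helpers are assumptions.

```javascript
// Hypothetical maintenance window, expressed as epoch milliseconds.
const maintenanceWindows = [
  { start: Date.parse("2013-04-16T02:00:00Z"), end: Date.parse("2013-04-16T04:00:00Z") },
];

// True when the timestamp falls inside any window (start inclusive, end exclusive).
function inMaintenance(windowList, timestamp) {
  return windowList.some((w) => timestamp >= w.start && timestamp < w.end);
}

// Create an incident only when the alert fires outside every window.
function recordAlert(alert, windowList, incidents) {
  if (inMaintenance(windowList, alert.at)) return false; // suppressed
  incidents.push(alert);
  return true;
}

const incidents = [];
recordAlert({ name: "node down", at: Date.parse("2013-04-16T03:00:00Z") }, maintenanceWindows, incidents);
recordAlert({ name: "node down", at: Date.parse("2013-04-16T05:00:00Z") }, maintenanceWindows, incidents);
console.log(incidents.length); // only the alert outside the window became an incident
```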

We’re confident this integration of the leading application performance management solution with the leading IT incident management solution will save your team time and make them more productive.  Check out the AppDynamics and PagerDuty integration today!

Introducing AppDynamics for PHP

It’s been about 12 years since I last scripted in PHP. I pretty much paid my way through college building PHP websites for small companies that wanted a web presence. Back then PHP was the perfect choice, because nearly all the internet service providers had PHP support for free if you registered domain names with them. Java and .NET weren’t an option for a poor smelly student like me, so I just wrote standard HTML with embedded scriptlets of PHP code and, bingo, I had dynamic web pages.

Today, 244 million websites run on PHP which is almost 75% of the web. That’s a pretty scary statistic. If only I’d kept coding PHP back when I was 21, I’d be a billionaire by now! PHP is a pretty good example of how open-source technology can go viral and infect millions of developers and organizations world-wide.

Turnkey APMaaS by AppDynamics

Since we launched our Managed Service Provider program late last year, we’ve signed up many MSPs that were interested in adding Application Performance Management-as-a-Service (APMaaS) to their service catalogs.  Wouldn’t you be excited to add a service that’s easy to manage but more importantly easy to sell to your existing customer base?

Service providers like Scicom definitely were (check out the case study), because they are being held responsible for the performance of their customers’ complex, distributed applications, but oftentimes don’t have visibility inside the actual application. That’s like being asked to officiate an NFL game with your eyes closed.

The sad truth is that many MSPs still think that high visibility in app environments equates to high configuration, high cost, and high overhead.

Thankfully this is 2013.  People send emails instead of snail mail, play Call of Duty instead of Pac-Man, listen to Pandora instead of cassettes, and can have high visibility in app environments with low configuration, low cost, and low overhead with AppDynamics.

Not only do we have a great APM service to help MSPs increase their Monthly Recurring Revenue (MRR), we make it extremely easy for them to deploy this service in their own environments, which, to be candid, is half the battle.  MSPs can’t spend countless hours deploying a new service.  It takes focus and attention away from their core business, which in turn could endanger the SLAs they have with their customers.  Plus, it’s just really annoying.

Introducing: APMaaS in a Box

Here at AppDynamics, we take pride in delivering value quickly.  Most of our customers go from nothing to full-fledged production performance monitoring across their entire environment in a matter of hours in both on-premise and SaaS deployments.  MSPs are now leveraging that same rapid SaaS deployment model in their own environments with something that we like to call ‘APMaaS in a Box’.

At a high level, APMaaS in a Box is a large cardboard box with air holes and a fragile sticker wherein we pack a support engineer, a few management servers, an instruction manual, and a return label…just kidding…sorry, couldn’t resist.

Simply put, APMaaS in a Box is a set of files and scripts that allows MSPs to provision multi-tenant controllers in their own data center or private cloud and provision AppDynamics licenses for customers themselves…basically it’s the ultimate turnkey APMaaS.

By utilizing AppDynamics’ APMaaS in a Box, MSPs across the world are leveraging our quick deployment, self-service license provisioning, and flexibility in the way we do business to differentiate themselves and gain net new revenue.

Quick Deployment

Within 6 hours, MSPs like NTT Europe who use our APMaaS in a Box capabilities will have all the pieces they need in place to start monitoring the performance of their customers’ apps. Now that’s some rapid time to value!

Self-Service License Provisioning

MSPs can provision licenses directly through the AppDynamics partner portal.  This gives you complete control over who gets licenses and makes it very easy to manage this process across your customer base.

Flexibility

An MSP can get started on a month-to-month basis with no commitment. Only paying for what you sell eliminates the cost of shelfware. MSPs can also sell AppDynamics however they would like to position it and can float licenses across customers. NTT Europe uses a 3-tier service offering so customers can pick and choose the APM services they’d like to pay for. Feel free to get creative when packaging this service for customers!

Conclusion

As more and more MSPs move up the stack from infrastructure management to monitoring the performance of their customers’ distributed applications, choosing an APM partner that understands the Managed Services business is of utmost importance. AppDynamics’ APMaaS in a Box capabilities align well with internal MSP infrastructures, and our pricing model aligns with the business needs of Managed Service Providers. We’re a perfect fit.

MSPs who continue to evolve their service offerings to keep pace with customer demands will be well positioned to reap the benefits and future revenue that comes along with staying ahead of the market.  To paraphrase The Great One, MSPs need to “skate where the puck is going to be, not where it has been.”  I encourage all you MSPs out there to contact us today to see how we can help you skate ahead of the curve and take advantage of the growing APM market with our easy to use, easy to deploy APMaaS in a Box.  If you don’t, your competition will…