Successfully Deploying AIOps, Part 2: Automating Problem Time

In part one of our Successfully Deploying AIOps series, we identified how an anomaly breaks into two broad areas: problem time and solution time. The first phase in deploying AIOps focuses on reducing problem time, with some benefit in solution time as well. This simply requires turning on machine learning within an AIOps-powered APM solution. Existing operations processes will still be defining, selecting and implementing anomaly rectifications. When you automate problem time, solution time commences much sooner, significantly reducing an anomaly’s impact.

AIOps: Not Just for Production

Anomalies in test and quality assurance (QA) environments cost the enterprise time and resources, and AIOps can deliver significant benefits here. Applying the anomaly resolution processes used in production will assist developers in navigating the deployment cycle.

Test and QA environments are expected to identify problems before production deployment. Agile and DevOps approaches have introduced rapid, automated building and testing of applications. Though mean time to resolution (MTTR) is commonly not measured in test and QA environments (which aren’t as critical as those supporting customers), the time and resource savings still pay off.

Beginning your deployment in test and QA environments allows a lower-risk, yet still valuable, introduction to AIOps. These pre-production environments have less business impact, as they are not visited by customers. Understanding performance changes between application updates is critical to successful deployment. Remember, as the test and QA environments will not have the production workload available, it’s best to recreate simulated workloads through synthetic testing.

With trust in AIOps built from first applying AIOps to mean time to detect (MTTD), mean time to know (MTTK) and mean time to verify (MTTV) in your test and QA environments, your next step will be to apply these benefits to production. Let’s analyze where you’ll find these initial benefits.

Apply AI/ML to Detection (MTTD)

An anomaly deviates from what is expected or normal. Detecting an anomaly requires a definition of “normal” and a monitoring of live, streaming metrics to see when they become abnormal. A crashing application is clearly an anomaly, as is one that responds poorly or inconsistently after an update.

With legacy monitoring tools, defining “normal” was no easy task. Manually setting thresholds required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, containers, databases, operating systems, virtual machines, hypervisors and underlying storage.

AIOps removes the stress of threshold-setting by letting machine learning baseline your environment. AI/ML applies mathematical algorithms to different data features seeking correlations. With AppDynamics, for example, you simply run APM for a week. AppDynamics observes your application over time and creates baselines, with ML observing existing behavioral metrics and defining a range of normal behavior with time-based and contextual correlation. Time-based correlation removes alerts related to the normal flow of business—for example, the login spike that occurs each morning as the workday begins; or the Black Friday or Guanggun Jie traffic spikes driven by cultural events. Contextual correlation pairs metrics that track together, enabling anomaly identification and alerts later when the metrics don’t track together.
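To make the idea concrete, here is a toy sketch of time-based baselining (not AppDynamics’ actual algorithm): group a week of metric samples by hour of day, compute a mean/standard-deviation band per slot, and flag a value only when it deviates from what is normal for that slot.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Group (hour_of_day, value) samples and compute a per-hour
    (mean, stdev) band -- a toy stand-in for time-based correlation."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) > 1}

def is_anomalous(baseline, hour, value, sigmas=3.0):
    """A value is judged only against its own time slot, so the 9 a.m.
    login spike is 'normal' at 9 a.m. but the same load at 3 a.m. is not."""
    mu, sd = baseline[hour]
    return abs(value - mu) > sigmas * max(sd, 1e-9)
```

With a baseline built from a week of samples, heavy morning traffic stays quiet while the identical value in the middle of the night raises an alert.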

AIOps will define “normal” by letting built-in ML watch the application and automatically create a baseline. So again, install APM and let it run. If you have specific KPIs, you can add these on top of the automatic baselines as health rules. With baselines defining normal, AIOps will watch metric streams in real time, with the model tuned to identify anomalies in real time, too.

Apply AI/ML to Root Cause Analysis (MTTK)

The first step to legacy root cause analysis (RCA) is to recreate the timeline: When did the anomaly begin, and what significant events occurred afterward? You could search manually through error logs to uncover the time of the first error. This can be misleading, however, as sometimes the first error is an outcome, not a cause (e.g., a crash caused by a memory overrun is the result of a memory leak running for a period of time before the crash).

In the midst of an anomaly, multiple signifiers often will indicate fault. Logs will show screeds of errors caused by stress introduced by the fault, but fail to identify the underlying defect. The operational challenge is unpacking the layers of resultant faults to identify root cause. By pinpointing this cause, we can move onto identifying the required fix or reconfiguration to resolve the issue.

AIOps creates this anomaly timeline automatically. It observes data streams in real time and uses historical and contextual correlation to identify the anomaly’s origin, as well as any important state changes during the anomaly. Even with a complete timeline, it’s still a challenge to reduce the overall noise level. AIOps addresses this by correlating across domains to filter out symptoms from possible causes.

There’s a good reason why AIOps’ RCA output may not always identify a single cause. Trained AI/ML models do not always produce a zero or one outcome, but rather work in a world of probabilities or likelihoods. The output of a self-taught ML algorithm will be a percentage likelihood that the resulting classification is accurate. As more data is fed to the algorithm, these outcome percentages may change if new data makes a specific output classification more likely. Early snapshots may indicate a priority list of probable causes that later refine down to a single cause, as more data runs through the ML models.
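A minimal illustration of why the output is a likelihood ranking rather than a verdict (the candidate-cause names and simple voting scheme here are hypothetical, not how any particular product scores evidence): each correlated symptom adds weight to a candidate cause, and the percentages shift as more data streams in.

```python
from collections import Counter

class CauseRanker:
    """Toy probabilistic RCA output: each correlated symptom 'votes'
    for a candidate cause, and the ranking is expressed as likelihoods
    that refine as more evidence arrives."""
    def __init__(self):
        self.votes = Counter()

    def observe(self, candidate_cause):
        self.votes[candidate_cause] += 1

    def likelihoods(self):
        total = sum(self.votes.values())
        return {c: n / total for c, n in self.votes.most_common()}
```

An early snapshot may rank two causes at 50/50; as more correlated data points accumulate, one candidate's likelihood climbs and the list refines toward a single probable cause.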

RCA is one area where AI/ML delivers the most value, and the time spent on RCA is the mean time to know (MTTK). While operations is working on RCA, the anomaly is still impacting customers. The pressure to conclude RCA quickly is why war rooms get filled with every possible I-shaped professional (a deep expert in a particular silo of skills) in order to eliminate the noise and get to the signal.

Apply AI/ML to Verification

Mean time to verify (MTTV) is the remaining MTTR portion automated in phase one of an AIOps rollout. An anomaly concludes when the environment returns to normal, or even to a new normal. The same ML mechanisms used for detection will minimize MTTV, as baselines already provide the definition of normal you’re seeking to regain. ML models monitoring live ETL streams of metrics from all sources provide rapid identification when the status returns to normal and the anomaly is over.
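One simple way to express this verification step in code (a sketch; the window count and sigma bound are assumed placeholders, not product defaults): reuse the detection baseline and declare the anomaly over once the metric holds inside its normal band for several consecutive samples.

```python
def anomaly_resolved(stream, mu, sd, sigmas=3.0, stable_windows=5):
    """Declare the anomaly over once `stable_windows` consecutive
    samples fall back inside the baseline band -- the same definition
    of 'normal' used for detection, reused for verification (MTTV)."""
    run = 0
    for value in stream:
        run = run + 1 if abs(value - mu) <= sigmas * sd else 0
        if run >= stable_windows:
            return True
    return False
```

Requiring a sustained run of in-band samples avoids prematurely closing an anomaly that is merely flapping.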

Later in your rollout when AIOps is powering fully automated responses, this rapid observation and response is critical, as anomalies are resolved without human intervention.  Part three of this series will discuss connecting this visibility and insight to action.

Why Your Business Needs a ‘Wrapper’ to Transform Itself

How do you treat your most precious items? Chances are you wrap them up nicely to protect them for longevity. Since your applications are the most precious items to your business, consider wrapping them with a solution that protects them for the long haul.

Modern applications are changing the way we conduct our day-to-day lives. Disruptive companies like Uber and Lyft are capitalizing on the consumer need for better, faster access to in-demand products and services. Could they have done this with a poor performing application? Most likely, no. This transformation is causing a ripple effect, as companies change their internal processes to develop projects faster. But despite the critical need to fully understand the complete software lifecycle—from planning and design to rollout and business impact—companies are using lackluster monitoring tools that provide only a siloed glimpse of their total environment, not a comprehensive view.

Business leaders increasingly are keen to see how their software impacts the bottom line. But problems arise when independent factions actively involved in the software development lifecycle are unable to see how their actions impact other teams. This shows the critical need for a cultural shift inside corporations, one that tightly aligns multiple teams throughout the entire development lifecycle. By changing the culture to enable cohesive team interaction and full lifecycle visibility, companies will find it far easier to verify if development changes are positively impacting the business. Conversely, when changes are not beneficial, firms will be able to quickly course-correct to reduce or eliminate negative trends.

Your Team Affects My Team

In recent years, companies have made tremendous strides in improving processes to enable faster software development, releases and MTTR. And yet many groups within these organizations remain unaware of how their actions, changes and implementations affect other teams. Say, for example, an automobile insurance provider releases a new application, a crucial component of its digital transformation. This application proves wildly successful, and many groups within the company develop an interest in its performance. For example:

  • The development team leverages the newest container technology to ensure proper scalability.
  • The infrastructure and network teams enable additional capacity and cloud capabilities.
  • The security team keeps a close eye on fraud and hacking.
  • Multiple teams ensure the best user experience.
  • Lastly, the business, keen on revenue performance, sees the application as a big revenue driver that will lower the cost of customer acquisition.

Ideally this leads to closer scrutiny of each group’s performance, which ultimately leads to greater customer satisfaction. This poses a problem, though, when each group operates within its own silo. For instance, when the network team fixes a problem or makes an upgrade or enhancement, it may not be aware which groups along the application flow are being impacted. Conversely, other groups may see an impact to their application without knowing the reason for the change.

Granted, most of us have change management procedures in place. But full visibility enables you to quickly triage and understand how all teams in the organization are being impacted, both positively and negatively. This visibility has become a fundamental requirement of today’s digital transformational efforts, and is essential for every team following the path of the application. Even modifications to marketing campaigns can cause a flurry of team activity if the company doesn’t quickly see the gains it’s after.  We’ve all sat in conference rooms to draw out a lifecycle that resembles the diagram below, where each group is part of the overall flow: A never-ending cycle of dependencies.

As part of this ongoing process, each group enacts changes to enhance its efficiency. The DevOps movement, for example, is a culture shift designed to help companies deploy applications faster and respond more adeptly to customer expectations. But ultimately, connecting every team within an organization requires a “wrapper” of sorts around the entire workflow process—one tying all domains together, including the lines of business.

This is easier said than done in some organizations, however, particularly those that have operated under a tried-and-true process and culture for many years. But with today’s business environments evolving at breakneck speed, companies must adapt much faster to survive. This brings us back to the concept of a wrapper—a comprehensive tool covering multiple domains to help provide full visibility of the application and the user journey throughout your business environment. By delivering these real-time insights, the wrapper ensures your business is moving in the right direction, enabling you to justify budgetary needs for future investment.

This is where AppDynamics comes in. Think about the demand placed on IT and business leaders, and the need to transform the enterprise. The critical element here is the need for the right tools. One of the first steps to consider: how can you gain a full view of your development, testing, implementation, production and business systems? The best solution must provide multiple benefits to ensure success, enabling you to detect both technology and business-related problems. It should help you understand how your end users are impacted by your technology, and even deliver insights to help you determine where to prioritize future enhancements.

By leveraging AppDynamics, you’ll gain a full view of your critical applications across all stacks, as well as deep insights into how well your business is performing. A successful AIOps strategy with automated root cause analysis will provide the core framework for understanding all the working intricacies in your environment—a major first step toward maintaining a competitive edge.

Turnkey APMaaS by AppDynamics

Since we launched our Managed Service Provider program late last year, we’ve signed up many MSPs that were interested in adding Application Performance Management-as-a-Service (APMaaS) to their service catalogs.  Wouldn’t you be excited to add a service that’s easy to manage but more importantly easy to sell to your existing customer base?

Service providers like Scicom definitely were (check out the case study), because they are being held responsible for the performance of their customers’ complex, distributed applications, but oftentimes don’t have visibility inside the actual application. That’s like being asked to officiate an NFL game with your eyes closed.


The sad truth is that many MSPs still think that high visibility in app environments equates to high configuration, high cost, and high overhead.

Thankfully this is 2013.  People send emails instead of snail mail, play Call of Duty instead of Pac-Man, listen to Pandora instead of cassettes, and can have high visibility in app environments with low configuration, low cost, and low overhead with AppDynamics.

Not only do we have a great APM service to help MSPs increase their Monthly Recurring Revenue (MRR), we make it extremely easy for them to deploy this service in their own environments, which, to be candid, is half the battle.  MSPs can’t spend countless hours deploying a new service.  It takes focus and attention away from their core business, which in turn could endanger the SLAs they have with their customers.  Plus, it’s just really annoying.

Introducing: APMaaS in a Box

Here at AppDynamics, we take pride in delivering value quickly.  Most of our customers go from nothing to full-fledged production performance monitoring across their entire environment in a matter of hours in both on-premise and SaaS deployments.  MSPs are now leveraging that same rapid SaaS deployment model in their own environments with something that we like to call ‘APMaaS in a Box’.

At a high level, APMaaS in a Box is large cardboard box with air holes and a fragile sticker wherein we pack a support engineer, a few management servers, an instruction manual, and a return label…just kidding…sorry, couldn’t resist.


Simply put, APMaaS in a Box is a set of files and scripts that allows MSPs to provision multi-tenant controllers in their own data center or private cloud and provision AppDynamics licenses for customers themselves…basically it’s the ultimate turnkey APMaaS.

By utilizing AppDynamics’ APMaaS in a Box, MSPs across the world are leveraging our quick deployment, self-service license provisioning, and flexibility in the way we do business to differentiate themselves and gain net new revenue.

Quick Deployment

Within 6 hours, MSPs like NTT Europe that use our APMaaS in a Box capabilities will have all the pieces they need in place to start monitoring the performance of their customers’ apps. Now that’s some rapid time to value!

Self-Service License Provisioning

MSPs can provision licenses directly through the AppDynamics partner portal.  This gives you complete control over who gets licenses and makes it very easy to manage this process across your customer base.

Flexibility

An MSP can get started on a month-to-month basis with no commitment. Only paying for what you sell eliminates the cost of shelfware. MSPs can also sell AppDynamics however they would like to position it and can float licenses across customers. NTT Europe uses a 3-tier service offering so customers can pick and choose the APM services they’d like to pay for. Feel free to get creative when packaging this service for customers!

Conclusion

As more and more MSPs move up the stack from infrastructure management to monitoring the performance of their customers’ distributed applications, choosing an APM partner that understands the Managed Services business is of utmost importance. AppDynamics’ APMaaS in a Box capabilities align well with internal MSP infrastructures, and our pricing model aligns with the business needs of Managed Service Providers – we’re a perfect fit.

MSPs who continue to evolve their service offerings to keep pace with customer demands will be well positioned to reap the benefits and future revenue that comes along with staying ahead of the market.  To paraphrase The Great One, MSPs need to “skate where the puck is going to be, not where it has been.”  I encourage all you MSPs out there to contact us today to see how we can help you skate ahead of the curve and take advantage of the growing APM market with our easy to use, easy to deploy APMaaS in a Box.  If you don’t, your competition will…

AppDynamics & Splunk – Better Together

A few months ago I saw an interesting partnership announcement from Foursquare and OpenTable.  Users can now make OpenTable reservations at participating restaurants from directly within the Foursquare mobile app.  My first thought was, “What the hell took you guys so long?” That integration makes sense on so many levels, I’m surprised it hadn’t already been done.

So when AppDynamics recently announced a partnership with Splunk, I viewed that as another no-brainer.  Two companies with complementary solutions making it easier for customers to use their products together – makes sense right?  It does to me, and I’m not alone.

I’ve been demoing a prototype of the integration for a few months now at different events across the country, and at the conclusion of each walk-through I’d get some variation of the same question, “How do I get my hands on this?”  Well, I’m glad to say the wait is over – the integration is available today as an App download on Splunkbase.  You’ll need a Splunk and AppDynamics license to get started – if you don’t already have one, you can sign up for free trials of Splunk and AppDynamics online.

How Monitoring Analytics can make DevOps more Agile

The word “analytics” is an interesting and often abused term in the world of application monitoring. For the sake of correctness, I’m going to reference Wikipedia in how I define analytics:

Analytics is the discovery and communication of meaningful patterns in data.

Simply put, analytics should make IT’s life easier. Analytics should point out the bleeding obvious from all the monitoring data available, and guide IT so they can effectively manage the performance and availability of their application(s). Think of analytics as “doing the hard work” or “making sense” of the data being collected, so IT doesn’t have to spend hours figuring out for themselves what is being impacted and why.

Discovery
This is about how effectively a monitoring solution can self-learn the environment it’s deployed in, so it’s able to baseline what is normal and abnormal for the environment. This is really important as every application and business transaction is different. A key reason why many monitoring solutions fail today is that they rely on users to manually define what is normal and abnormal using static or simplistic global thresholds. The classic “alert me if server CPU > 90%” and “alert me if response times are > 2 seconds,” both of which normally result in a full inbox (which everyone loves) or an alert storm for IT to manage.
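A toy comparison shows the problem. For a server that normally runs hot (say, a nightly batch host), the static "CPU > 90%" rule fires on every sample while a baseline-relative rule stays quiet; the numbers below are illustrative.

```python
def static_alerts(values, threshold=90):
    """The classic rule: alert whenever CPU > 90%."""
    return sum(1 for v in values if v > threshold)

def baseline_alerts(values, mu, sd, sigmas=3.0):
    """Baseline-relative rule: alert only when a value deviates
    from what is normal *for this particular server*."""
    return sum(1 for v in values if abs(v - mu) > sigmas * sd)
```

For a host whose learned baseline is 95% CPU with a standard deviation of 2, an ordinary night fills the inbox under the static rule and produces zero alerts under the baseline rule.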

Communication
The communication bit of analytics is equally as important as the discovery bit. How well can IT interpret and understand what the monitoring solution is telling them? Is the data shown actionable–or does it require manual analysis, knowledge or expertise to arrive at a conclusion? Does the user have to look for problems on their own or does the monitoring solution present problems by itself? A monitoring solution should provide answers rather than questions.

One thing we did at AppDynamics was make analytics central to our product architecture. We’re about delivering maximum visibility through minimal effort, which means our product has to do the hard work for our users. Our customers today are solving issues in minutes versus days thanks to the way we collect, analyze and present monitoring data. If your applications are agile, complex, distributed and virtual then you probably don’t want to spend time telling a monitoring solution what is normal, abnormal, relevant or interesting. Let’s take a look at a few ways AppDynamics Pro is leveraging analytics:

Seeing The Big Picture
Seeing the bigger picture of application performance allows IT to quickly prioritize whether a problem is impacting an entire application or just a few users or transactions. For example, in the screenshot to the right we can see that in the last day the application processed 19.2 million business transactions (user requests), of which 0.1% experienced an error. 0.4% of transactions were classified as slow (> 2 SD), 0.3% were classified as very slow (> 3 SD) and 94 transactions stalled. The interesting thing here is that AppDynamics used analytics to automatically discover, learn and baseline what normal performance is for the application. No static, global or user-defined thresholds were used – the performance baselines are dynamic and relative to each type of business transaction and user request. So if a credit card payment transaction normally takes 7 seconds, then this shouldn’t be classified as slow relative to other transactions that may only take 1 or 2 seconds.
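The per-transaction classification described above can be sketched as follows. The 2 SD / 3 SD cutoffs come from the text; the hard stall cutoff is an assumed placeholder, not a documented product value.

```python
def classify(response_time, mu, sd, stall_cutoff=45.0):
    """Classify a request against its own transaction type's baseline:
    'slow' beyond 2 standard deviations, 'very slow' beyond 3, and
    'stalled' past a hard cutoff. Thresholds are relative, not global."""
    if response_time >= stall_cutoff:
        return "stalled"
    if response_time > mu + 3 * sd:
        return "very slow"
    if response_time > mu + 2 * sd:
        return "slow"
    return "normal"
```

Because the band is per transaction type, a 7-second credit card payment whose baseline is 7 seconds stays "normal," while 7 seconds for a search that baselines at 1 second is "very slow."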

The big picture here is that application performance generally looks OK, with 99.3% of business transactions having a normal end user experience with an average response time of 123 milliseconds. However, if you look at the data shown, 0.7% of user requests were either slow or very slow, which is almost 140,000 transactions. This is not good! The application in this example is an e-commerce website, so it’s important we understand exactly what business transactions were impacted out of those 140,000 that were classified as slow or very slow. For example, a slow search transaction isn’t the same as a slow checkout or order transaction – different transactions, different business impact.

Understanding the real Business Impact
The below screenshot shows business transaction health for the e-commerce application sorted by number of very slow requests. Analytics is used in this view by AppDynamics so it can automatically classify and present to the user which business transactions are erroneous, slow, very slow and stalling relative to their individual performance baseline (which is self-learned). At a quick glance, you can see two business transactions–“Order Calculate” and “OrderItemDisplayView”–are breaching their performance baseline.

This information helps IT determine the true business impact of a performance issue so they can prioritize where and what to troubleshoot. You can also see that the “Order Calculate” transaction had 15,717 errors. Clicking on this number would reveal the stack traces of those errors, thus allowing the APM user to easily find the root cause. In addition, we can see the average response time of the “Order Calculate” transaction was 576 milliseconds and the maximum response time is just over 64 seconds, along with 10,393 very slow requests. If AppDynamics didn’t show how many requests were erroneous, slow or very slow, then the user could spend hours figuring out the true business impact of such an incident. Let’s take a look at those very slow requests by clicking on the 10,393 link in the user interface.

Seeing individual slow user business transactions
As you can probably imagine, using average response times to troubleshoot business impact is like putting a blindfold over your eyes. If your end users are experiencing slow transactions, then you need to see those transactions to effectively troubleshoot them. For example, AppDynamics uses real-time analytics to detect when business transactions breach their performance baseline, so it’s able to collect a complete blueprint of how those transactions executed across and inside the application infrastructure. This enables IT to identify root cause rapidly.

In the screenshot above you can see all “OrderCalculate” transactions have been sorted in descending order by response time, making it really easy for the user to drill into any of the slow user requests. You can also see from the summary column that AppDynamics continuously monitors the response time of business transactions using moving averages and standard deviations to identify real business impact. Given the results our customers are seeing, we’d say this is a pretty proven way to troubleshoot business impact and application performance. Let’s drill into one of those slow transactions…

Visualizing the flow of a slow transaction
Sometimes a picture says a thousand words, and that’s exactly what visualizing the flow of a business transaction can do for IT. IT shouldn’t have to look through pages of metrics, or GBs of log files, to correlate and guess why a transaction may be slow. AppDynamics does all that for you! Look at the screenshot below that shows the flow of an “OrderCalculate” transaction, which takes 63 seconds to execute across 3 different application tiers. You can see the majority of time is spent calling the DB2 database and an external 3rd-party HTTP web service. Let’s drill down to see what is causing that high amount of latency.

Automating Root Cause Analysis
Finding the root cause of a slow transaction isn’t trivial, because a single transaction can invoke several thousand lines of code–kind of like finding a needle in a haystack. Call graphs of transaction code execution are useful, but it’s much faster and easier if the user can shortcut to hotspots. AppDynamics uses analytics to do just that by presenting code hotspots to the user automatically so they can pinpoint the root cause in seconds. You can see in the below screenshot that almost 30 seconds (18.8+6.4+4.1+0.6) was spent in a web service call “calculateTaxes” (which was called 4 times) with another 13 seconds being spent in a single JDBC database call (user can click to view SQL query). Root cause analysis with analytics can be a powerful asset for any IT team.
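Conceptually, a hotspot view collapses the call graph into per-callee totals and ranks them. The sketch below uses the timings quoted above; the method names other than `calculateTaxes` are hypothetical.

```python
from collections import defaultdict

def hotspots(calls, top=3):
    """Collapse a transaction's call segments into per-callee totals
    and rank them, the way a hotspot view shortcuts a full call graph."""
    totals = defaultdict(float)
    for callee, seconds in calls:
        totals[callee] += seconds
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Feeding in the four `calculateTaxes` invocations (18.8 + 6.4 + 4.1 + 0.6 seconds) and the 13-second JDBC call surfaces the web service as the top hotspot immediately, with no manual call-graph spelunking.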

Verifying Server Resource or Capacity
It’s true that application performance can be impacted by server capacity or resource constraints. When a transaction or user request is slow, it’s always a good idea to check what impact OS and JVM resources are having. For example, was the server maxed out on CPU? Was Garbage Collection (GC) running? If so, how long did GC run for? Was the database connection pool maxed out? All these questions require a user to manually look at different OS and JVM metrics to understand whether resource spikes or exhaustion occurred during the slowdown. This is pretty much what most sysadmins do today to triage and troubleshoot servers that underpin a slow-running application. Wouldn’t it be great if a monitoring solution could answer these questions in a single view, showing IT which OS and JVM resources deviated from their baseline during the slowdown? With analytics it can.

AppDynamics introduced a new set of analytics in version 3.4.2 called “Node Problems” to do just this. The above screenshot shows this view whereby node metrics (e.g. OS, JVM and JMX metrics) are analyzed to determine if any were breaching their baseline and contributing to the slow performance of the “OrderCalculate” transaction. The screenshot above shows that % CPU idle, % memory used and MB memory used have deviated slightly from their baseline (denoted by blue dotted lines in the charts). Server capacity on this occasion was therefore not a contributing factor to the slow application performance. Hardware metrics that did not deviate from their baseline are not shown, thus reducing the amount of data and noise the user has to look at in this view.
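The filtering idea behind a "Node Problems" style view can be sketched in a few lines, assuming each metric already has a learned (mean, standard deviation) baseline; metric names and numbers here are illustrative.

```python
def node_problems(metrics, baselines, sigmas=3.0):
    """Show only the node metrics that deviate from their baseline
    during the slowdown; everything in band is hidden to cut noise."""
    flagged = {}
    for name, value in metrics.items():
        mu, sd = baselines[name]
        if abs(value - mu) > sigmas * sd:
            flagged[name] = value
    return flagged
```

Only the breaching metrics survive the filter, which is exactly the noise reduction described above: in-band hardware metrics never reach the operator's screen.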

Analytics makes IT more Agile
If a monitoring solution is able to discover abnormal patterns and communicate these effectively to a user, then this significantly reduces the amount of time IT has to spend managing application performance, thus making IT more agile and productive. Without analytics, IT can become a slave to data overload, big data, alert storming and silos of information that must be manually stitched together and analyzed by teams of people. In today’s world, “manually” isn’t cool or clever. If you want to be agile then you need to automate the way you manage application performance, or you’ll end up with the monitoring solution managing you.

If your current monitoring solution requires you to manually tell it what to monitor, then maybe you should be evaluating a next generation monitoring solution like AppDynamics.

App Man.


Finding the Root Cause of Application Performance Issues in Production

The most enjoyable part of my job at AppDynamics is to witness and evangelize customer success. What’s slightly strange is that for this to happen, an application has to slow down or crash.

It’s a bittersweet feeling when end users, operations, developers and the business all suffer application performance pain. Outages cost the business money, and sometimes they cost people their jobs–which is truly unfortunate. However, when people solve performance issues, they become overnight heroes with a great sense of achievement, pride, and, obviously, relief.

To explain the complexity of managing application performance, imagine your application is 100 haystacks that represent tiers, and somewhere a needle is hurting your end user experience. It’s your job to find the needle as quickly as possible! The problem is, each haystack has over half a million pieces of hay, and they each represent lines of code in your application. It’s therefore no surprise that organizations can take days or weeks to find the root cause of performance issues in large, complex, distributed production environments.

End User Experience Monitoring, Application Mapping and Transaction Profiling will help you identify unhappy users, slow business transactions, and problematic haystacks (tiers) in your application, but they won’t find needles. To do this, you’ll need x-ray visibility inside haystacks to see which pieces of hay (lines of code) are holding the needle (root cause) that is hurting your end users. This x-ray visibility is known as “Deep Diagnostics” in application monitoring terms, and it represents the difference between isolating performance issues and resolving them.

For example, AppDynamics has great End User Monitoring, Business Transaction Monitoring, Application Flow Maps and very cool analytics all integrated into a single product. They all look and sound great (honestly they do), but they only identify and isolate performance issues to an application tier. This is largely what Business Transaction Management (BTM) and Network Performance Management (NPM) solutions do today. They’ll tell you what and where a business transaction slows down, but they won’t tell you the root cause so you can resolve the issues.

Why Deep Diagnostics for Production Monitoring Matters

A key reason why AppDynamics has become very successful in just a few years is because our Deep Diagnostics, behavioral learning, and analytics technology is 18 months ahead of the nearest vendor. A bold claim? Perhaps, but it’s backed up by bold customer case studies such as Edmunds.com and Karavel, who compared us against some of the top vendors in the application performance management (APM) market in 2011. Yes, End User Monitoring, Application Mapping and Transaction Profiling are important–but these capabilities will only help you isolate performance pain, not resolve it.

AppDynamics has the ability to instantly show the complete code execution and timing of slow user requests or business transactions for any Java or .NET application, in production, with incredibly small overhead and no configuration. We basically give customers a metal detector and X-Ray vision to help them find needles in haystacks. Locating the exact line of code responsible for a performance issue means Operations and Developers solve business pain faster, and this is a key reason why AppDynamics technology is disrupting the market.

Below is a small collection of needles that customers found using AppDynamics in production. The simple fact is that complete code visibility allows customers to troubleshoot in minutes as opposed to days and weeks. Monitoring with blind spots and configuring instrumentation are a thing of the past with AppDynamics.

Needle #1 – Slow SQL Statement

Industry: Education
Pain: Key Business Transaction with 5 sec response times
Root Cause: Slow JDBC query with full-table scan
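Why does a full-table scan hurt so much? Without an index, the database must touch every row to answer a point lookup; with one, it jumps straight to the match. Here’s a minimal sketch of that difference in plain Java (a hypothetical in-memory “table” and “index”, not the customer’s actual schema):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScanVsIndex {
    // A "table" of (id, value) rows, and an "index" on id.
    static final List<long[]> TABLE = new ArrayList<>();
    static final Map<Long, Long> INDEX = new HashMap<>();

    static {
        for (long id = 0; id < 100_000; id++) {
            TABLE.add(new long[] {id, id * 2});
            INDEX.put(id, id * 2);
        }
    }

    // Full-table scan: O(n) comparisons per lookup, like a query
    // whose WHERE clause has no supporting index.
    public static long fullScan(long id) {
        for (long[] row : TABLE) {
            if (row[0] == id) return row[1];
        }
        throw new IllegalArgumentException("not found: " + id);
    }

    // Indexed lookup: O(1) on average, like the same query after
    // CREATE INDEX on the filtered column.
    public static long indexedLookup(long id) {
        return INDEX.get(id);
    }
}
```

Both return the same answer; only the work done differs, which is exactly why the fix for this needle was an index rather than a code change.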

Needle #2 – Slice of Death in Cassandra

Industry: SaaS Provider
Pain: Key Business Transaction with 2.5 sec response times
Root Cause: Slow Thrift query in Cassandra

Needle #3 – Slow & Chatty Web Service Calls

Industry: Media
Pain: Several Business Transactions with 2.5 min response times
Root Cause: Excessive Web Service Invocation (5+ per transaction)
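Chatty service calls add one network round trip per invocation, and those round trips dominate response time. The usual fix is batching: carry all the keys in one request. A minimal sketch, with a counter standing in for real round trips and hypothetical method names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchingDemo {
    static final AtomicInteger remoteCalls = new AtomicInteger();

    // Simulated remote service: each invocation costs one round trip.
    static Map<String, String> remoteLookup(List<String> keys) {
        remoteCalls.incrementAndGet();
        Map<String, String> out = new HashMap<>();
        for (String k : keys) out.put(k, "value-" + k);
        return out;
    }

    // Chatty anti-pattern: one remote call per key (5+ per transaction).
    public static int chattyCallCount(List<String> keys) {
        remoteCalls.set(0);
        for (String k : keys) remoteLookup(List.of(k));
        return remoteCalls.get();
    }

    // Batched fix: a single call carrying all keys.
    public static int batchedCallCount(List<String> keys) {
        remoteCalls.set(0);
        remoteLookup(keys);
        return remoteCalls.get();
    }
}
```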

Needle #4 – Extreme XML Processing

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 17 sec response times
Root Cause: XML serialization over the wire

Needle #5 – Mail Server Connectivity

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 20 sec response times
Root Cause: Slow Mail Server Connectivity

Needle #6 – Slow ResultSet Iteration

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 30+ sec response times
Root Cause: Querying too much data
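“Querying too much data” usually means the predicate lives in application code instead of the WHERE clause, so every row crosses the wire only to be discarded during ResultSet iteration. A minimal sketch of the difference, assuming a hypothetical 100,000-row table where only one row in a thousand is actually wanted:

```java
import java.util.ArrayList;
import java.util.List;

public class OverfetchDemo {
    static final int TABLE_ROWS = 100_000;

    // Anti-pattern: SELECT * then filter in Java. Every row is fetched,
    // transferred, and iterated; returns the number of rows transferred.
    public static int rowsFetchedNaive() {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < TABLE_ROWS; i++) all.add(i);  // full result set
        int kept = 0;
        for (int row : all) if (row % 1000 == 0) kept++;  // app-side filter
        return all.size();
    }

    // Fix: push the predicate into the WHERE clause so only
    // matching rows travel; returns the number of rows transferred.
    public static int rowsFetchedFiltered() {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < TABLE_ROWS; i += 1000) matches.add(i);
        return matches.size();
    }
}
```

Same answer, three orders of magnitude less data iterated – which is the gap between a 30-second and a sub-second transaction.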

Needle #7 – Slow Security 3rd Party Framework

Industry: Education
Pain: All Business Transactions with > 3 sec response times
Root Cause: Slow 3rd party code

Needle #8 – Excessive SQL Queries

Industry: Education
Pain: Key Business Transactions with 2 min response times
Root Cause: Thousands of SQL queries per transaction
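The classic shape of this needle is the N+1 query pattern: one query for a list of parent rows, then one more query per row. A minimal sketch, with hypothetical DAO methods and a counter standing in for real SQL round trips:

```java
import java.util.ArrayList;
import java.util.List;

public class NPlusOneDemo {
    static int queryCount = 0;

    // Simulated DAO layer: each method stands in for one SQL round trip.
    static List<Integer> selectOrderIds(int rows) {
        queryCount++;                          // SELECT id FROM orders
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < rows; i++) ids.add(i);
        return ids;
    }

    static String selectCustomerForOrder(int orderId) {
        queryCount++;                          // SELECT ... WHERE order_id = ?
        return "customer-" + orderId;
    }

    static List<String> selectCustomersForOrders(List<Integer> ids) {
        queryCount++;                          // SELECT ... WHERE order_id IN (...)
        List<String> out = new ArrayList<>();
        for (int id : ids) out.add("customer-" + id);
        return out;
    }

    // Anti-pattern: one query for the list, then one per row (N+1 total).
    public static int nPlusOneQueries(int rows) {
        queryCount = 0;
        for (int id : selectOrderIds(rows)) selectCustomerForOrder(id);
        return queryCount;
    }

    // Fix: fetch all children with a single IN (...) or JOIN (2 total).
    public static int batchedQueries(int rows) {
        queryCount = 0;
        selectCustomersForOrders(selectOrderIds(rows));
        return queryCount;
    }
}
```

At a thousand parent rows the naive version issues 1,001 queries where two would do, which is how a transaction ends up with thousands of SQL queries behind it.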

Needle #9 – Commit Happy

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 25+ sec response times
Root Cause: Unnecessary use of commits and transaction management
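Each commit typically forces the database to flush its transaction log to disk, so committing after every row turns one cheap batch into thousands of synchronous flushes. A minimal sketch of the commit-per-row anti-pattern versus batched commits, with a counter in place of a real JDBC connection:

```java
public class CommitDemo {
    static int commits = 0;

    static void insertRow(int i) { /* simulated INSERT */ }

    static void commit() {
        commits++;   // each commit forces a log flush in a real database
    }

    // Anti-pattern: commit after every single row.
    public static int commitPerRow(int rows) {
        commits = 0;
        for (int i = 0; i < rows; i++) {
            insertRow(i);
            commit();
        }
        return commits;
    }

    // Fix: group rows into batches and commit once per batch.
    public static int commitPerBatch(int rows, int batchSize) {
        commits = 0;
        for (int i = 0; i < rows; i++) {
            insertRow(i);
            if ((i + 1) % batchSize == 0) commit();
        }
        if (rows % batchSize != 0) commit();   // flush the final partial batch
        return commits;
    }
}
```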

Needle #10 – Locking under Concurrency

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 5+ sec response times
Root Cause: Non-Thread safe cache forces locking for read/write consistency
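When a cache isn’t thread-safe, the usual band-aid is one big lock around it, so every reader queues behind every writer under load. The standard fix in Java is a lock-striped structure like `ConcurrentHashMap`. A minimal sketch contrasting the two (hypothetical cache, same results, very different contention profile):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheDemo {
    // Coarse-grained locking: every reader and writer serializes
    // on a single monitor, so throughput collapses under concurrency.
    static final Map<String, String> lockedCache = new HashMap<>();

    public static synchronized String getLocked(String key) {
        return lockedCache.computeIfAbsent(key, k -> "value-" + k);
    }

    // Lock-striped alternative: readers proceed without blocking each
    // other, and writers only contend on the same hash segment.
    static final ConcurrentHashMap<String, String> concurrentCache =
            new ConcurrentHashMap<>();

    public static String getConcurrent(String key) {
        return concurrentCache.computeIfAbsent(key, k -> "value-" + k);
    }
}
```

Functionally identical, but only the first version forces read/write consistency through a global lock – which is the bottleneck this needle describes.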

Needle #11 – Slow 3rd Party Search Service

Industry: SaaS Provider
Pain: Key Business Transaction with 2+ min response times
Root Cause: Slow 3rd Party code

Needle #12 – Connection Pool Exhaustion

Industry: Financial Services
Pain: Several Business Transactions with 7+ sec response times
Root Cause: DB Connection Pool Exhaustion caused by excessive connection pool invocation & queries
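Pools run dry when connections are borrowed faster than they’re returned – often because a code path skips the release on an exception. A minimal sketch, modeling a hypothetical five-connection pool as semaphore permits to show the leaky pattern against the release-in-finally pattern (try-with-resources in real JDBC):

```java
import java.util.concurrent.Semaphore;

public class PoolDemo {
    // A pool of 5 "connections", modeled as semaphore permits.
    static final Semaphore pool = new Semaphore(5);

    // Leaky pattern: a connection is acquired but never released,
    // so each call permanently shrinks the pool until it runs dry.
    public static boolean leakyRequest() {
        return pool.tryAcquire();   // permit is never given back
    }

    // Safe pattern: always release in finally, so the pool can
    // serve any number of sequential requests.
    public static boolean safeRequest() {
        if (!pool.tryAcquire()) return false;
        try {
            return true;            // run the query here
        } finally {
            pool.release();
        }
    }
}
```

Five leaky requests exhaust the pool for good; the safe version can be called indefinitely, which is why the fix for this needle was plugging the leak rather than enlarging the pool.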

Needle #13 – Excessive Cache Usage

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 50+ sec response times
Root Cause: Cache Sizing & Configuration

If you want to manage and troubleshoot application performance in production, you should seriously consider AppDynamics. We’re the fastest-growing on-premises and SaaS-based APM vendor in the market right now. You can download our free product, AppDynamics Lite, or take a free 30-day trial of AppDynamics Pro, our commercial product.

Now go find those needles that are hurting your end users!

App Man.

Travel Company Karavel Boosts Application Performance by 20% with AppDynamics

The X-Ray competition winner from last quarter came from an online travel company in France called Karavel, which kindly documented its success with AppDynamics Pro. Karavel has been using AppDynamics extensively for custom dashboards, proactive alerting, and optimizing slow business transactions in its production environment.

Here is Karavel’s X-Ray case study as they documented it:

Application Performance Management On-Demand

I thought it would be good to start blogging about my experiences with customers just so you get an idea of how important Application Performance Management (APM) has become.

A few weeks back I met with a customer who had issues; the expressions on their faces said it all. The meeting started with an apology that several people couldn’t make it. Why? Because they were investigating a production outage. You might think I’ve just made that up, but I can assure you it was real, and a frequent event I’ve witnessed many times. It can be especially annoying when you’ve travelled many miles to chat with a customer expecting a productive meeting, and then the alarm bells ring. However, an outage in this scenario simply validates the reason you’re there in the first place.

AppDynamics helps Insurance Customer avoid Production Outage

Last week I published my winning Customer X-Ray of the Quarter, which showed how AppDynamics helped a media customer solve a production issue that had plagued their application for over two years. This week I’m posting the runner-up X-Ray entry. It describes how AppDynamics helped an insurance customer avoid a production outage by spotting a major bottleneck during performance testing, as the application was migrated from development to pre-production. All of the X-Rays published on this blog were written by customers, so the stories you read are real, factual, and credible.