Getting to the Root of Swisscom Application Performance

Here’s a guest blog from one of AppDynamics’ international partners, Stefan Zoltai from sysPerform. Stefan wanted to write about how he used AppDynamics to solve a performance problem for a major telecom company in Switzerland—and we said, sure!  Take it away, Stefan…

I’d like to talk about how we used AppDynamics for a major production troubleshooting exercise—and how AppDynamics passed with flying colors.

Swisscom is the leading telecommunications company in Switzerland, with about 5.7 million mobile customers and 1.8 million broadband connections. Swisscom is present on the Swiss market with a full portfolio of wireless, wireline, and IP-based data and voice communication services.

Swisscom’s (Internet) Messaging had engaged sysPerform to assist with the analysis of their Tomcat 6 / Java 1.6 based WebMail application. WebMail has been under scrutiny for about a year now—ever since it manifested both performance and stability problems. Prior analysis efforts, conducted with a number of available tools, did not lead to the determination of the actual root cause(s) since the aforementioned problems only occurred in production under load and could not be reproduced in other environments. WebMail is rated at a throughput of 300 tx/sec.

We realized immediately that without a deep, detailed view into the application’s runtime, in production and under load, we would not be able to determine the actual root cause.

To analyze the application, we selected AppDynamics’ application performance management solution.  Since this solution has been developed specifically for high throughput, distributed production environments, we were able to obtain a high-level overview of the application as well as conduct a deep root cause analysis down to code-level execution without generating measurable overhead. Again, we did all of this at 300 transactions per second of throughput.

Thanks to AppDynamics’ ability to create a dynamic baseline of application performance, we were able to isolate the major bottlenecks on the first day and discuss a solution with the developers at Swisscom.  We were able to quickly learn the application’s performance and stability characteristics — and after only 5 days of development, we deployed a specific, major fix to address the main issue and massively improve performance.  At the moment, we are continuing our analysis efforts since stability and performance are the focus of an ongoing quality process.

[UPDATE: For Swisscom’s perspective on the use of AppDynamics, check out Mika Borner’s blog]

This example clearly demonstrates that operating a modern, distributed application without an adequate monitoring solution is effectively the same as “flying blind.” 60%-80% of all performance problems are caused by the application itself, and need to be analyzed from the inside out. We can confirm these numbers from many of our other engagements with similar customers. External causes like hardware or network issues have become increasingly rare; it’s the problems deep inside the application that truly matter.

Intelligent application performance management, however, is not an end in itself; it must be evaluated in economic terms as well. Our experience indicates that an APM solution shows an ROI within just a few months. One reason for such a quick ROI is the extremely fast root cause analysis described above.

If you’re reading this in Switzerland, feel free to contact me with questions!

— Stefan Zoltai, Founder, SysPerform GmbH



The ROI of APM: Application Performance and Impatient End Users

How much time will your customers spend on your web site if it slows to a crawl? Will they linger patiently, or will they immediately surf away to a competitor?

According to a 2009 research study by Forrester Consulting, 47% of users expect a web page to load in two seconds or less, and 40% will abandon a web page if it takes more than three seconds to load. The load times users will tolerate are roughly half of those reported in a similar study published in 2006; as transfer speeds have increased, users expect their web sites to keep pace.

From these numbers, it’s easy to understand how much revenue is lost for every second that an application performs poorly. Unacceptable performance is a surefire way to cause the end user to surf away and perhaps never return. What is the cost of a transaction taking 4 seconds, when it should only take 1 second? Again, according to the 2009 Forrester study, the average online shopper expects a web page to load within a matter of seconds. Can you measure customer frustration in terms of its actual revenue impact upon your organization?

The way to measure lost revenue due to poor performance is to determine the application’s Service Level Agreements (SLAs), attach revenue to them, and evaluate how much your organization can preserve through maintaining and even improving those SLAs. For example: how long should the “check out” transaction take? How long should the “add to cart” transaction take?

Once you determine these time frames, you can attach dollar numbers to each of them. For example, let’s say that your SLA is to have the “pay my bill” transaction on a banking site take 1 second. Let’s say that each time that SLA is violated, the company loses 5 dollars in revenue, as the aggregate result of some users abandoning their sessions due to slow performance.

If the organization is able to determine how often the SLA for the “pay my bill” transaction is being violated, it can assign a revenue number to the ability to maintain that SLA over time: say, a 99.99% success rate at maintaining the SLAs of 50,000 transactions over a three month period. If the previous success rate had only been 90%, that means roughly 5,000 transactions (4,995, to be exact) have been rescued from an SLA violation. At 5 dollars a transaction, this becomes about $25,000 in revenue that never leaves the bottom line.
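To make the arithmetic explicit, here is a minimal Python sketch of the calculation. It uses only the numbers from the example (50,000 transactions, a 90% versus 99.99% success rate, 5 dollars lost per violation); the function and variable names are made up for illustration and are not part of any APM product.

```python
from decimal import Decimal

# Figures taken from the "pay my bill" example above.
TOTAL_TRANSACTIONS = 50_000
LOSS_PER_VIOLATION = Decimal("5")  # dollars lost per SLA violation


def sla_violations(total: int, success_rate_percent: str) -> Decimal:
    """Number of transactions that miss the SLA at a given success rate.

    The rate is passed as a string so Decimal can represent it exactly
    (hypothetical helper, for illustration only).
    """
    failure_fraction = (Decimal(100) - Decimal(success_rate_percent)) / Decimal(100)
    return total * failure_fraction


violations_before = sla_violations(TOTAL_TRANSACTIONS, "90")
violations_after = sla_violations(TOTAL_TRANSACTIONS, "99.99")

# Transactions that no longer violate the SLA, and the revenue kept.
rescued = violations_before - violations_after
revenue_preserved = rescued * LOSS_PER_VIOLATION

print(f"Violations at 90%:    {int(violations_before)}")
print(f"Violations at 99.99%: {int(violations_after)}")
print(f"Rescued transactions: {int(rescued)}")
print(f"Revenue preserved:    ${int(revenue_preserved)}")
```

The exact results are 4,995 rescued transactions and $24,975 in preserved revenue, which round to the figures of about 5,000 and $25,000 used above.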

Obviously, determining and maintaining SLAs can take a lot of work. But the right application performance management system can assist you by learning the behavior of the application, and by creating dynamic baselines that can dramatically help reduce your time to develop SLAs. And once you develop those SLAs, and attach revenue numbers to them, you will quickly see how managing application performance on a proactive basis can help protect your company’s revenue stream.

Monitoring versus Management

Many people are confused by the terms “monitoring” and “management.”  This is because only application monitoring solutions have typically been available in the marketplace, so that’s what people are used to seeing.  If a solution comes along that calls itself “management,” buyers don’t necessarily know the difference.

AppDynamics has embraced the APM label, but we’re aware that people may consider us a monitoring solution. And that’s fine–if they’re looking for that, they’ll get that.  Plus the bonus package.

So what’s the difference?

A monitoring solution collects data points from hardware/software systems and displays them. If you’re monitoring 100 systems and each system has 1,000 metrics, the tool will collect and display those 100,000 metrics.  Then, it’s up to the user to look at the data manually, try to find if problems are occurring, and determine what might be the root cause.  It’s possible to set up alerts that send notifications when a certain metric crosses a threshold.

In short–you get a lot of data. But you have to piece together the data yourself, figure out what story it’s trying to tell, and then take your own action.  It’s useful–but only as much as getting a ride from a friend, who drops you off a mile from your destination and forces you to walk the rest of the way.

Tech blogger Lori MacVittie writes on the subject:

“For a very long time now APM (Application Performance Management) has been a misnomer. It’s always really been application performance monitoring, with very little management occurring outside of triggering manual processes requiring the attention of operators and developers…it has rarely been the case that APM solutions have really been about, well, managing application performance. Certainly they’ve attempted to provide the data necessary to manually manage applications and do an excellent job of correlation and even in some cases deep trouble-shooting of root-cause performance problems. But real-time, dynamic, on-demand performance management? Not so much.”

A performance management solution closes that gap.  It collects data points from hardware/software systems, analyzes them intelligently, and proactively identifies the problems before they impact business users.  It identifies the level and nature of the business impact, pinpoints the root cause of the problem for rapid resolution, and simplifies or even automates remediation.

For example, let’s say your application needs extra resources during peak loads.  A true APM solution will know your application’s historical performance, and it will know when to provision additional resources.  In this instance, the tool is acting on your behalf, helping manage the application without manual intervention.

But application management also works when it helps warn users ahead of time that something bad is about to happen, allowing proactive remediation.  For example, by knowing exactly when performance deviates from the standard baseline, it can alert the user when memory leaks threaten to bring down a machine. That allows the application owner to create a workflow that directs traffic away from the machine, restarts the machine, and then brings traffic back to the machine.
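As a rough illustration of what “deviating from the baseline” can mean, here is a minimal Python sketch. It is an assumption made for this post, not a description of how AppDynamics computes its baselines: it treats the baseline as the mean of recent samples and flags a new reading that lies more than a chosen number of standard deviations away. The function name, the sample heap figures, and the tolerance of 3 are all hypothetical.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, tolerance=3.0):
    """Return True if `current` is further than `tolerance` standard
    deviations from the mean of `history` (illustrative heuristic only)."""
    baseline = mean(history)
    spread = stdev(history)  # needs at least two samples
    if spread == 0:
        # A perfectly flat history: any change counts as a deviation.
        return current != baseline
    return abs(current - baseline) > tolerance * spread


# Hypothetical heap usage samples in MB for one machine.
heap_history_mb = [512.0, 520.0, 508.0, 515.0, 511.0, 517.0, 509.0, 514.0]

normal_reading = deviates_from_baseline(heap_history_mb, 516.0)
leak_suspected = deviates_from_baseline(heap_history_mb, 640.0)

print("516 MB deviates:", normal_reading)
print("640 MB deviates:", leak_suspected)
```

With these sample values, the 516 MB reading stays within the baseline while the 640 MB reading is flagged. In a real workflow, a flagged reading would be the trigger for draining traffic, restarting the machine, and restoring traffic; those steps are not modeled here.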

Knowing where dangers lie ahead of time–and having a clear path to not only root cause analysis, but root cause problem resolution–is a far cry from a typical monitoring solution, which flashes an array of confusing alerts when it’s already too late to avoid hitting the iceberg.

There’s nothing wrong with monitoring.  But until it’s matched with true performance management, it’s going to offer extremely limited utility to the application owner.