Apdex is Fatally Flawed

I’ve been looking into a lot of different statistical methods and algorithms lately and one particularly interesting model is Apdex. If you haven’t heard of it yet, Apdex has been adopted by many companies that sell monitoring software. The latest Apdex specification (v1.1) was released on January 22, 2007 and can be read by clicking here.

Here is the purpose of Apdex as quoted from the specification document referenced above: “Apdex is a numerical measure of user satisfaction with the performance of enterprise applications, intended to reflect the effectiveness of IT investments in contributing to business objectives.” While this is a noble goal, Apdex can lead you to believe your application is working fine when users are really frustrated. If Apdex is all you have to analyze your application performance then it is better than nothing but I’m going to show you the drawbacks and a better way to manage your application performance.

The Formula

The core fatal flaw in the Apdex specification is the formula used to derive your Apdex index. Apdex is an index from 0 to 1 derived using the formula shown below.

Apdex = (Satisfied count + (Tolerating count / 2)) / Total samples

At the heart of Apdex is a static threshold (defined as T). The T threshold is set as a measure of the ideal response time for the application, and it is set for the entire application. If you’ve been around application performance for a while you should immediately realize that different application functions perform at very different levels. For example, my “Search for Flights” function response time will be much greater than my “View Cart” response time. These two distinctly different functions should not be subjected to the same base threshold (T) for analysis and alerting purposes.

Another thing that application performance veterans should pick up on here is that static thresholds stink. You either set them too high or too low and have to keep adjusting them to a level you are comfortable with. Static thresholds are notorious for causing alert storms (set too low) or for missing problems (set too high). This manual adjustment philosophy leads to a lack of consistency and ultimately makes historical Apdex charts useless if there has been any manual modification of “T”.

So let’s take a look at the Apdex formula now that we understand “T” and its multiple drawbacks:

  • Satisfied count = the number of transactions with a response time between 0 and T
  • Tolerating count = the number of transactions with a response time between T and F (F is defined next)
  • F = 4 times T (if T=3 seconds then F=12 seconds)
  • Frustrated count (not shown in the formula but represented in Total samples) = the number of transactions with a response time greater than F
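The definitions above can be sketched in a few lines of Python. This is only an illustration of the formula, not code from the Apdex specification: the function name `apdex`, the sample data, and the choice to count a response time of exactly T as satisfied and exactly F as tolerating are my assumptions.

```python
def apdex(response_times, t):
    """Return an Apdex score (0.0 to 1.0) for response times given in seconds."""
    if not response_times:
        raise ValueError("need at least one sample")
    f = 4 * t  # frustration threshold: F = 4 times T
    # Assumption: boundary values count toward the better bucket.
    satisfied = sum(1 for rt in response_times if rt <= t)
    tolerating = sum(1 for rt in response_times if t < rt <= f)
    # Frustrated samples (rt > f) add nothing to the numerator;
    # they only appear in the total sample count.
    return (satisfied + tolerating / 2) / len(response_times)


samples = [1.0, 2.5, 3.0, 5.0, 13.0]  # made-up response times in seconds
score = apdex(samples, t=3.0)         # 3 satisfied, 1 tolerating, 1 frustrated
print(f"Apdex = {score:.2f}")
```

Notice that the only tuning knob is T. Everything else, including F, follows from it.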

When my colleague Ian Withrow looked at the formula above he had some interesting thoughts on Apdex: “When was the last time you ‘tolerated’ a site loading 4 times slower than you expected it to? If T=1 I’ll tolerate 4 seconds but if T=3 no way am I waiting 12 seconds on a fast connection… which brings me to another point. The notion of a universal threshold for satisfied/tolerating is bunk even for the same page (let alone different pages). My tolerance for how a website loads at work on our fast connection is paper thin. It better be 1-2 seconds or I’m reloading/going elsewhere. On the train to work where I know signal is spotty on my iPhone I’ll load something and then wait for a few seconds.”

Getting back to the Apdex formula itself: assuming every transaction completes in less than the static threshold (T), the Apdex will have a value of 1.00, which is perfect. The real world is never perfect, though, so let’s explore a more plausible scenario.

An Example

You run an e-Commerce website. Your customers perform 15 clicks while shopping for every 1 click of the checkout functionality. During your last agile code release the response time of the checkout function went from 4 seconds to over 40 seconds due to an issue with your production configuration. Users are abandoning your website and shopping elsewhere since their checkouts are lost in oblivion. What does Apdex say about this scenario? Let’s see…

T = 4 seconds
Total samples = 1500
Satisfied count = 1300 (the assumed number of transactions that completed in under 4 seconds)
Tolerating count = 100 (the assumed number of transactions between 4 and 16 seconds, since F = 4T = 16 seconds)
Frustrated count = 100 (the remaining transactions, slower than 16 seconds)

Apdex = (1300 + 100/2) / 1500 = 0.90
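If you want to check the arithmetic yourself, here is a small Python sketch using the counts from the example. The helper name `apdex_from_counts` is made up for illustration, and the second calculation rests on my assumption that the 100 remaining (frustrated) samples are exactly the broken checkout requests.

```python
def apdex_from_counts(satisfied, tolerating, total):
    """Apdex computed directly from bucket counts."""
    return (satisfied + tolerating / 2) / total


# Whole application, using the numbers from the example above.
overall = apdex_from_counts(satisfied=1300, tolerating=100, total=1500)

# 1500 - 1300 - 100 = 100 samples are frustrated.
# Assumption: those are the checkout requests taking over 40 seconds,
# so a checkout-only score has nothing in the satisfied or tolerating buckets.
checkout_only = apdex_from_counts(satisfied=0, tolerating=0, total=100)

print(f"overall Apdex:       {overall:.2f}")
print(f"checkout-only Apdex: {checkout_only:.2f}")
```

Under that assumption the application-wide score is 0.90 while the one function that takes the money scores 0.00, which is exactly the kind of problem a single aggregate index hides.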

Apdex Index Table

According to the Apdex specification, anything from 0.85 to 0.93 is considered “Good”. We know there is a horrible issue with checkout that is crushing our revenue stream but Apdex doesn’t care. Apdex is not a good way to tell if your application is servicing your business properly. Instead it is a high level reporting mechanism that shows overall application trends via an index. The usefulness of this index is questionable because it can be manipulated simply by changing the T threshold by hand. Apdex, like most other forms of analytics, must be understood for what it is and applied to the appropriate problem.
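To see how sensitive the index is to T, here is a rough Python sketch that scores the same set of response times against three different thresholds. The data is synthetic and chosen by me to resemble the example (1300 fast requests, 100 slower ones, 100 broken checkouts), and the `apdex` helper is an illustrative implementation, not code from any monitoring product.

```python
def apdex(response_times, t):
    """Apdex score for response times in seconds, with F = 4 * T."""
    f = 4 * t
    satisfied = sum(1 for rt in response_times if rt <= t)
    tolerating = sum(1 for rt in response_times if t < rt <= f)
    return (satisfied + tolerating / 2) / len(response_times)


# Synthetic samples (assumption): 1300 requests at 2 s, 100 at 8 s,
# and 100 checkout requests at 41 s.
samples = [2.0] * 1300 + [8.0] * 100 + [41.0] * 100

# Same data, three different choices of T.
scores = {t: apdex(samples, t) for t in (4.0, 10.0, 12.0)}
for t, score in scores.items():
    print(f"T={t:>4} s  Apdex={score:.2f}")
```

Nothing about the application changed between those three runs. With this data the score climbs from 0.90 to roughly 0.93 and then 0.97 purely because someone edited T.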

A Better Way – Dynamic Baselines and Analytics

In my experience (and I have a lot of enterprise production monitoring experience) static thresholds and high level overviews of application health have limited value when you want your application to be fast and stable regardless of the functionality being used. The highest value methodology I have seen in a production environment is automatic baselining with alerts based upon deviation from normal behavior. With this method you never have to set a static threshold unless you have a really good reason to (like ensuring you don’t breach a static SLA). The monitoring system automatically figures out normal response time, compares each transaction’s actual response time to the baseline and classifies each transaction as normal, slow, or very slow. Alerts are based upon how far each transaction has deviated from normal behavior instead of an index value based upon a rigid 4 times the global static threshold, as with Apdex.
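To make the idea concrete, here is a minimal Python sketch of baseline-and-deviation classification. It is not how any particular product computes its baselines: using the mean and standard deviation of recent history, the 2 and 3 standard deviation cutoffs, and the names `classify` and `checkout_history` are all assumptions for illustration.

```python
from statistics import mean, stdev


def classify(response_time, history, slow_sigmas=2.0, very_slow_sigmas=3.0):
    """Label a response time relative to this transaction's own history."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)   # what "normal" looks like for this transaction
    spread = stdev(history)    # how much normal usually varies
    deviation = response_time - baseline
    if deviation > very_slow_sigmas * spread:
        return "very slow"
    if deviation > slow_sigmas * spread:
        return "slow"
    return "normal"


# Made-up history for the checkout transaction, in seconds.
checkout_history = [3.8, 4.1, 4.0, 3.9, 4.2, 4.0]

for rt in (4.1, 4.35, 40.0):
    print(f"{rt:>5} s -> {classify(rt, checkout_history)}")
```

Because the baseline is learned per transaction, “Search for Flights” and “View Cart” each get judged against their own normal behavior, and the 40 second checkout stands out immediately. A real system would also need to handle perfectly flat history (zero spread) and time-of-day patterns, which this sketch ignores.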

baselines

Load and response time charts. Baselines are represented by dashed lines.

business baseline

Automatic baseline of business metrics (books sold). Baseline represented by dashed line.

If you want to make yourself look great to your bosses (momentarily) then set your T threshold to 10 seconds and let the Apdex algorithm make your application look awesome. If you want to be a hero to the business then show them the problems with Apdex and take a free trial of AppDynamics Pro today. AppDynamics will show you how every individual business transaction is performing and can even tell you if your application changes are making or costing you money.

  • Alexander Podelko

    All that is true, but I don’t think that “baselining with alerts” is a better way – I believe that they are just completely different approaches serving different purposes. Apdex is a single-metric high-level health indicator (I guess mainly for business purpose – not designed for IT troubleshooting for sure), “baselining with alerts” is IT approach for production monitoring. You compare a single metric with the stream of metrics / alerts – I don’t think that it is a fair comparison. Perhaps it is possible to say that full analysis of all end user response times with analysis of their location, devices, browsers, etc. would be even more helpful – in case if you get a location-related problem, for example.

    Apdex was an elegant idea for the time (with all its limitations which you so meticulously listed). Don’t forget that it was created in 2005 – probably more with corporate applications in mind and when real end-to-end response times were almost completely unavailable. Well, I guess it is not ideal for today – which probably is the reason that we don’t hear about it recently and the reason why you haven’t heard about it. But I believe that the idea to have a very simple metric for business representing overall health of the system has its merit and maybe we need a way to summarize response times into a simple metric – in addition, of course, to more detailed production monitoring used to run the system.

    • Jim Hirschauer

      Thank you for your well thought out comment Alexander. I agree with you that Apdex is a single-metric, high-level health indicator. The problem I am describing is one where monitoring vendors are stating that it is something more and that it can be used to uncover application performance issues.

      Apdex was an excellent attempt at normalizing application response across many disparate applications but it is definitely a remnant of the past that has been replaced by better methodologies.

      Application response times in and of themselves don’t provide the whole picture without the associated business context. After all, why does IT exist in the first place?

      Like I stated in my blog, Apdex is better than nothing but you need to understand what it really is and what it is actually telling you. Complex applications with hundreds of business transactions connected via many services require a more modern approach.

      • Alexander Podelko

        Who / where stated that Apdex “can be used to uncover application performance issues”? The Apdex purpose you quoted in the very beginning states something completely different.

        To my surprise Apdex popped up several times during my conversations at Velocity NY. Several vendors stated that business needs a high-level metric such as Apdex. So it may need to be improved / replaced by something better, but the need for such a metric definitely exists. And it can’t be replaced with the low level / detailed metrics which you need “to uncover application performance issues” – it is just intended for a completely different audience, executives.

        • http://www.appdynamics.com/ Jim Hirschauer

          Here is an example of someone stating that ApDex can be used for performance diagnostic… “The Apdex scores provide first level performance problem diagnostics.” from the following link http://www.networkworld.com/community/node/37322

          There are many such references scattered around the internet if you search for them. I’m sure you’ll understand why I’m not going to provide a link directly to a competitor’s blog site :-)

          Since you brought up executives here is my stance on that… My position is that the business needs business metrics. They should be provided by the IT staff directly from the running applications in as near real time as possible. Providing a nebulous index that can be controlled and contrived by changing the T value should not be accepted by any organization. We are capable of doing much better and I don’t think an executive that would accept one easily manipulated metric as a measure of business performance is worth very much.

  • http://bluesmoon.info Philip Tellis

    ApDex: No users were involved in this measure of user satisfaction. ;)

    PS: Congratulations to Ian on joining AppD and to AppD on snatching him up.

  • Anonymous

    What about having separate thresholds (T) for the different key transactions on your site to deal with the ‘universal threshold’ issue?

  • Matt Watson

    Using baselines of performance makes sense. But what if the baseline performance isn’t good to begin with? One benefit to Apdex is it sets a goal and measures against it. Using trends and baselines alone does not tell you if the application is performing well. Only if it performs like normal… which could be slow and full of errors… like normal. Measuring against baselines with something like standard deviation is useful to understand if a change in performance has occurred. So I see value in both methodologies for different reasons. Use Apdex for your goals and trending to understand change.

    • http://www.appdynamics.com/ Jim Hirschauer

      You raise an interesting point Matt. In practice (I used to do production operations for a living), deviation from normal was what provided the most value for identifying problems in our operational environment.

      Now if you are talking about setting goals and measuring against them why would you want to muddy the picture by adding an algorithm on top of your goal? To me, setting a measurable performance goal is equivalent to saying “Our login transaction will never take longer than 1 second”. If that is the case then you can just get an alert every time that static threshold is breached (or get a report every day of all breaches).

      At one point a long time ago I was a fan of Apdex but now I consider it to be an outdated notion replaced by better methods.

      What would be really interesting to me is an update of the Apdex algorithm where you can’t manipulate the results by changing a threshold at will.

      • Matt Watson

        People can always manipulate things. The only way to solve that is to hire people you trust and quickly fire people who you can’t.

Copyright © 2014 AppDynamics. All rights Reserved.