Here at AppDynamics we’re very proud of how easy our solution is to deploy and maintain. We also tout the fact that in many cases there is no configuration required to gain the insight needed to solve complex application problems. All of this is absolutely true, but does it mean that AppDynamics doesn’t have enough functionality to support complex deployments that make little to no use of common frameworks? Absolutely not! But don’t take our word for it; see for yourself in this presentation by Orbitz (Geoff Kramer and Nick Jasieniecki) at one of our Chicago User Group meetings. In case you don’t know what Orbitz does, I think this quote from Geoff Kramer sums it up quite well… “Orbitz unlocks the joy of travel for our customers. You can’t do that if you are having site problems.”
If you don’t have 50 minutes to spare right now, I will summarize the video’s key points in this blog post.
Insanely Fast Deployment to Production
Wow!!! 2700 agents deployed and monitoring 2700 JVMs in a total of 15 days. In my life, I’ve never heard of any APM tool being deployed to so many production nodes so fast. Someone call Guinness, this must be a world record. This deployment was not without challenges, though. As Geoff pointed out, they were testing AppDynamics in production with their rolling deployment over 15 days. Orbitz was able to measure the impact of AppDynamics on the running applications by comparing JVMs that were monitored with JVMs that had not yet been cycled through. They bumped into a few minor issues (jar file deployed to the wrong location, heap setting too low, etc.) that had quick and painless resolutions, and they had their entire application environment monitored within 15 days of making the purchase.
After deploying AppDynamics, Orbitz bumped into a problem we see with many of our customers: “We spent the first week trying to figure out why one host was talking to another when it wasn’t ever supposed to. Turns out, they actually talk together!”
Complexity Rears Its Ugly Head
So far everything sounds pretty straightforward, but here is where things start to get more interesting. Orbitz has an extremely complicated architecture. Their environment consists of approximately 140 interconnected applications, globally distributed, many of which make calls to external service providers. The Orbitz website processes millions of requests per day, which are made up of hundreds of thousands of different business transactions.
All of these factors add up to a very tall task for any APM product. This is where Orbitz took advantage of advanced functionality within the AppDynamics user interface. They didn’t need to go out to the agent configuration files and manually change their instrumentation like they would be required to do with many legacy APM products. Instead they were able to make all of the necessary configuration updates through the standard user interface and apply these changes to all of the relevant agents.
And as if this wasn’t enough, Orbitz is a very heavy user of site experimentation. They have 48 different permutations of their web pages that are experimented with throughout the course of any given day to see which ones convert the most users. The URLs of these pages do not change, so Orbitz had to capture metadata for all of their slow or failing transactions to understand which permutation had a problem. Again, Orbitz modified the agent configuration through the user interface and pushed these updates to the appropriate agents from the central AppDynamics controller.
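To make the idea concrete, here is a minimal sketch of why that captured metadata matters: when every permutation shares one URL, a tag on each transaction is the only way to group problems by permutation. This is not AppDynamics code or its API. The field names (`permutation`, `response_ms`, `failed`) and the 2000 ms threshold are assumptions made up for illustration.

```python
from collections import Counter

def problem_counts_by_permutation(transactions, slow_threshold_ms=2000):
    """Count slow or failed transactions per page permutation.

    Each transaction is a dict with hypothetical keys:
    'permutation' (str), 'response_ms' (number), 'failed' (bool).
    """
    counts = Counter()
    for txn in transactions:
        is_slow = txn["response_ms"] > slow_threshold_ms
        if is_slow or txn["failed"]:
            counts[txn["permutation"]] += 1
    return counts

# Same URL for every request; only the captured tag tells them apart.
sample = [
    {"permutation": "A", "response_ms": 300, "failed": False},
    {"permutation": "B", "response_ms": 4500, "failed": False},
    {"permutation": "B", "response_ms": 250, "failed": True},
    {"permutation": "C", "response_ms": 900, "failed": False},
]
problem_counts = problem_counts_by_permutation(sample)
```

In this made-up sample, only permutation "B" shows up in the result, which is the kind of answer the URL alone could never give.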
Errors and exceptions will occur with any application. Some of these might be perfectly acceptable, while others may be the reason for your application outage or poor user experience. Since Orbitz relies upon external data sources to build its web page results, a single failed connection is not supposed to impact the response to the end user. Orbitz has retry logic built into their applications, so they recover automatically from most connection errors. What’s most important to Orbitz is the rate and overall percentage of errors…
“50 errors per minute means that nothing is wrong.”
“We throw 12 million errors per day when everything is working right.”
Error rates and percentages are interesting from multiple perspectives. Tracking this information from a Business Transaction perspective lets the support team understand exactly what functionality is impacted at any given time. Tracking errors from a Node perspective enables the support team to identify which developers are responsible for the impacted application component.
Orbitz is also taking the same approach with the response time of business transactions and nodes. Responding to, or alerting upon, every slow transaction is impractical in their massive environment. By converting the total number of slow transactions into a percentage and rate, Orbitz can avoid alert storms and focus on the issues that have a significant impact on their business.
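The rate-and-percentage approach can be sketched in a few lines. This is only an illustration of the idea, not how AppDynamics policies are implemented. The `WindowStats` type and both thresholds (5% slow, 500 errors per minute) are assumptions chosen for the example; the only number taken from the talk is that 50 errors per minute is normal for Orbitz.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Counts observed over one monitoring window (hypothetical shape)."""
    total_calls: int
    slow_calls: int
    error_calls: int
    minutes: float

def percentage(count, total):
    """Share of calls, as a percentage; 0.0 when there was no traffic."""
    return 100.0 * count / total if total > 0 else 0.0

def rate_per_minute(count, minutes):
    """Events per minute; 0.0 for an empty window."""
    return count / minutes if minutes > 0 else 0.0

def should_alert(stats, max_slow_pct=5.0, max_errors_per_min=500.0):
    """Alert on aggregate behaviour, never on a single slow call or error."""
    slow_pct = percentage(stats.slow_calls, stats.total_calls)
    error_rate = rate_per_minute(stats.error_calls, stats.minutes)
    return slow_pct > max_slow_pct or error_rate > max_errors_per_min

# 50 errors per minute and 1% slow calls: business as usual, no alert.
normal = WindowStats(total_calls=100_000, slow_calls=1_000, error_calls=50, minutes=1.0)
# 8% slow calls in the same window: worth waking someone up.
degraded = WindowStats(total_calls=100_000, slow_calls=8_000, error_calls=50, minutes=1.0)
```

With these assumed thresholds, `should_alert(normal)` is false and `should_alert(degraded)` is true, even though both windows contain thousands of individual slow transactions.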
It’s All About the Business
And speaking of the business… Tracking and alerting based upon real business metric anomalies is one of the signs of monitoring maturity. Orbitz is taking advantage of the “Information Point” functionality within AppDynamics to track the actual sales totals and averages over time. They also set up policies to alert if these metrics are deviating from normal behavior. There is no better way to determine if there is business impact from ANY source than through using actual business metrics. All of your technology may be functioning perfectly, but if some external source is impacting your business, someone needs to know about it.
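As a rough illustration of what “deviating from normal behavior” can mean for a business metric, here is a generic standard-deviation check. The talk does not describe how AppDynamics computes its baselines, so treat this as a stand-in technique; the function name, the three-standard-deviation band, and the sample sales figures are all assumptions.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, current, num_stdevs=3.0):
    """Return True when `current` falls outside the band around the mean.

    `history` holds past values of one business metric, for example
    sales totals for the same hour on previous days (hypothetical data).
    """
    if len(history) < 2:
        return False  # not enough data to define "normal"
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > num_stdevs * spread

# Made-up hourly sales totals, in thousands of dollars.
past_sales = [100, 102, 98, 101, 99]
ordinary_hour = deviates_from_baseline(past_sales, 101)
bad_hour = deviates_from_baseline(past_sales, 60)
```

Here `ordinary_hour` is false and `bad_hour` is true. A check like this fires when sales drop no matter what caused the drop, which is the point of alerting on the business metric itself.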
As you can see, Orbitz is really putting AppDynamics through its paces. It’s great that most of our customers don’t need to take advantage of our advanced configuration functionality but it’s awesome when the customers who need it most are so successful.