Black Friday and Cyber Monday thru the eyes of an APM solution


A week has passed since Black Friday, so I thought it would be a good idea to summarise what we saw at AppDynamics from monitoring one of several e-commerce applications in production.

Firstly, things went pretty well for our customers, who saw transaction volume on their applications increase by between 300% and 500% over the holiday period. That’s a pretty big spike in traffic for any application, so it’s always good to look at those spikes and see what impact they had on application performance.

Here’s a screenshot which shows the load (top) and response time (bottom) of a major e-commerce production application during the Thanksgiving period. The dotted line in both charts represents the dynamic baseline of normal activity. You can see on Black Friday (23rd) and Cyber Monday (26th) that transaction throughput was peaking between 24,000 and 31,000 tpm on the application, spiking between 150 and 200% over the normal load the application experiences throughout the rest of the year.

Application response time during the period had one blip during the first minutes of Black Friday (9pm PST/Midnight EST), with no major performance issues following through into Cyber Monday. The blip was caused by the web container thread pool becoming exhausted during peak load when the Black Friday promotions went live. Below you can see throughput was hitting 23,000 tpm.

Two business transactions, “Product Display” and “Checkout”, were breaching their performance baselines during that period. Looking at the average response times of 516ms and 733ms tells one story; looking at the maximum response time and the number of slow/very slow transactions (calculated using standard deviation) tells a completely different story.
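To make the standard-deviation idea concrete, here is a minimal sketch of how a transaction might be labelled slow or very slow relative to a baseline. The thresholds (3 and 4 standard deviations) and the baseline samples are assumptions chosen for illustration, not AppDynamics’ actual settings or algorithm; only the 516ms and 66 second figures come from this post.

```python
from statistics import mean, pstdev

# Hypothetical thresholds: how many standard deviations above the
# baseline average a response must be to count as slow / very slow.
SLOW_SD = 3.0
VERY_SLOW_SD = 4.0


def classify(response_ms, baseline_ms):
    """Label one response time against a list of baseline samples."""
    avg = mean(baseline_ms)
    sd = pstdev(baseline_ms)
    if response_ms > avg + VERY_SLOW_SD * sd:
        return "very slow"
    if response_ms > avg + SLOW_SD * sd:
        return "slow"
    return "normal"


# Made-up baseline samples (ms) for a "Product Display" style transaction.
baseline = [400, 500, 600, 500, 500]

for response_ms in (516, 700, 66_000):
    print(response_ms, classify(response_ms, baseline))
```

An average close to the baseline can hide a handful of extreme outliers, which is why counting the slow and very slow executions gives a different picture from the mean alone.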

Let’s take a look at the execution of one individual “Product Display” business transaction that was classified as very slow with a 66 second response time.

When we drill into the code execution and SQL activity, we can see a simple SELECT SQL query had a response time of 588ms. The problem in this transaction was that this query was invoked 102 times, resulting in a whopping 59.9 seconds of latency. It’s therefore no surprise that thread concurrency inside the JVM was high, with threads waiting for transactions like these to complete. If these queries are simply pulling back product data, then there is no reason why a distributed cache can’t be used to store the data instead of making expensive calls to a remote database like DB2.
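The effect of caching on a transaction like this can be sketched in a few lines. This is an illustration only: it is written in Python for brevity rather than the Java the application runs on, `FakeProductDb` and `CachedProductLookup` are hypothetical names, an in-process dictionary stands in for a real distributed cache, and the assumption that the 102 lookups touch only 6 distinct products is invented. The 588ms and 102 figures are the ones from the transaction above.

```python
QUERY_MS = 588   # per-query response time reported above
LOOKUPS = 102    # number of times the query ran in one transaction


class FakeProductDb:
    """Stand-in for the remote database; it only counts round trips."""

    def __init__(self):
        self.calls = 0

    def select_product(self, product_id):
        self.calls += 1
        return {"id": product_id, "name": f"product-{product_id}"}


class CachedProductLookup:
    """Cache-aside lookup: query the database only on a cache miss."""

    def __init__(self, db):
        self._db = db
        self._cache = {}

    def get(self, product_id):
        if product_id not in self._cache:
            self._cache[product_id] = self._db.select_product(product_id)
        return self._cache[product_id]


def estimated_db_ms(db):
    """Rough database latency: round trips times per-query latency."""
    return db.calls * QUERY_MS


# Without a cache, every lookup is a database round trip.
uncached_db = FakeProductDb()
for i in range(LOOKUPS):
    uncached_db.select_product(i % 6)

# With a cache, only the first lookup per product reaches the database.
cached_db = FakeProductDb()
lookup = CachedProductLookup(cached_db)
for i in range(LOOKUPS):
    lookup.get(i % 6)

print("uncached:", uncached_db.calls, "calls,", estimated_db_ms(uncached_db), "ms")
print("cached:  ", cached_db.calls, "calls,", estimated_db_ms(cached_db), "ms")
```

The uncached estimate lands at roughly the 60 seconds of query latency seen in the snapshot. A real cache would also need an eviction and invalidation policy, which this sketch leaves out.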

Let’s look at the other transaction, “Checkout”, which was also breaching during the performance spike. Here is a checkout which took 9.1 seconds and deviated significantly from its performance baseline. You can see from the screenshot below that the latency, or bottleneck, is again coming from the DB2 database:

Hardly surprising given most application scalability issues these days still relate to data persistence between the JVM and database. So let’s drill down into the JVM for this transaction and understand what exactly is being invoked in the DB2 database:

Above is the code execution of that transaction and you immediately see 8.5 seconds of latency is spent in an EJB call which is performing an update. Let’s take a look at the invoked queries as part of that update:

Nice: a simple update query was taking 8.4 seconds. Notice all the other SQL queries associated with a single execution of the “Checkout” transaction. The application during this performance spike was clearly database bound, and as a result a few code changes were made overnight that reduced the number of database calls the application was making. We had one retail e-commerce customer last year who found a similar bottleneck; a fix was applied that reduced the number of database calls per minute from 500,000 to a little under 150,000. While the problem may at first appear to be a database issue (for the DBA), it was actually application logic, and the developers were responsible for resolving the issue.
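This post doesn’t say which code changes were made, so the following is only one common way of cutting database calls: replacing a query per item with a single batched query. It is a sketch using Python’s built-in `sqlite3` module with an in-memory database; the `product` table, its columns and the function names are hypothetical and have nothing to do with the customer’s DB2 schema.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory database with a made-up product table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany(
        "INSERT INTO product (id, price) VALUES (?, ?)",
        [(i, 10.0 + i) for i in range(1, 21)],
    )
    return conn


def load_prices_one_by_one(conn, product_ids):
    """One database call per product: N ids means N round trips."""
    prices = {}
    for product_id in product_ids:
        row = conn.execute(
            "SELECT price FROM product WHERE id = ?", (product_id,)
        ).fetchone()
        if row is not None:
            prices[product_id] = row[0]
    return prices


def load_prices_batched(conn, product_ids):
    """One database call for all products, using an IN (...) list."""
    ids = list(product_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, price FROM product WHERE id IN ({placeholders})", ids
    ).fetchall()
    return dict(rows)


conn = make_demo_db()
basket = [3, 7, 11, 15]
print(load_prices_one_by_one(conn, basket))  # 4 database calls
print(load_prices_batched(conn, basket))     # 1 database call
```

Both functions return the same data; the difference is the number of round trips, which is what matters when each call to a remote database carries its own latency.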

You can see in the first screenshot that application response time was stable throughout the rest of the Thanksgiving period; no spikes or outages occurred for this customer and all was well. While every customer will do their best to catch performance defects in pre-production and test, sometimes it’s not possible to reproduce or simulate real application usage or patterns, especially in large-scale, high-throughput production environments. This is where Application Performance Management (APM) solutions like AppDynamics can help, by monitoring your application in production so you can see what’s happening. Get started today with a free 30-day trial.

Appman

  • http://twitter.com/luisjotapepe Luis Pizarro

    Wel done guys… love the detail

Copyright © 2014 AppDynamics. All rights Reserved.