TAG | Cyber Monday
My wife is a shopoholic and serial checkout killer. Every week she spends several hours browsing and carefully adding items to shopping carts. Every now and then she’ll interrupt my sports program with an important announcement “My Checkout just failed”. Take this example Mrs Appman sent me during the month of September:
The fact I work in the APM industry means it is my responsibility to fix these problems immediately (for my wife). As Black Friday is coming up I thought I’d share with you what typically goes wrong under the covers when our customer’s e-commerce applications go bang during a critical time.
It is worth mentioning that nearly all our customers perform some form of performance or load testing prior to the Black Friday period. Many actually simulate the load from the previous year on test environments designed to reproduce Black Friday activity. While 75% of bottlenecks can be avoided in testing, unfortunately a few surface in production as a result of applications being too big and complex to manage to test under real world scenarios. For example, most e-commerce applications these days span more than 500 servers distributed across many networks and service providers.
Here are the top 10 reasons why eCommerce applications will fail this Black Friday:
1. Database Connection Pool
Nearly every checkout transaction will interact with one or more databases. Connections to this resource are therefore sacred and can often be deadly when transaction concurrency is high. Most application servers come with default connection pool configurations of between 10 and 20. When you consider that transaction throughput for e-commerce applications can easily exceed 100,000 trx/min you soon realize that default pool configurations aren’t going to cut it. When a database connection pool becomes exhausted incoming checkout requests simply wait or timeout until a connection becomes available. Take this screenshot for example:
2. Missing Databases Indexes
This root cause is somewhat related to the exhausted connection pools. Simply put, slow running SQL statements hold onto a database connection for longer, therefore connection pools aren’t recycled as often as they should be as queries take longer. The number 1 root cause of slow SQL statements is missing indexes on database tables, which is often caused by miss-communication between developers who write SQL, and the DBAs who configure and maintain the database schemas which hold the data. The classic “full table scan” query execution where a transaction and its database operation must scan through all the data in a table before a result is returned. Here is an example of what such looks like in AppDynamics:
3. Code Deadlock
High transaction concurrency often means application server threads have to contend more for application resource and objects. Most e-commerce applications have some form of atomicity build in to their transactions, so that order and stock volumes are kept in check as thousands of users fight over special offers and low prices. If access to application resource is not properly managed some threads can end up in deadlock, which can often cause an application server and all its user transactions to hang and timeout. One example of this was last year where an e-commerce customer was using a non-thread safe cache. Three threads tried to perform a get, set and remove on the same cache at the same time causing code deadlock to occur, impacting over ~2,500 checkout transactions as the below screenshot shows.
4. Socket Timeout Exceptions
Server connectivity is an obvious root cause, if you check your server logs using a Sumologic or Splunk then you’ll probably see hundreds of these events. They represent network problems or routing failures where a checkout transaction is attempting to contact one or more servers in the application infrastructure. Most of the time the services you are connecting to aren’t your own, for example a shipping provider, credit card processor, or fraud detector. On high traffic days like Black Friday it isn’t just your site experiencing a surge in traffic – often times entire networks are saturated due to intense demand. After a period of time (often 30-45 secs) the checkout transaction will just give up, timeout and return an error to the user. No Connectivity = No Revenue. Here is an example of what it looks like:
5. Garbage Collection
Caches are an easy way to speed up applications. The closer data is to application logic (in memory) the faster it executes. It is therefore no surprise that as memory has gotten bigger and cheaper most companies have adopted some form of in-memory caching to eliminate database access for frequent used results. The days of 64GB and 128GB heaps are now upon us which means the impact of things like Garbage Collection are more deadly to end users. Maintaining cache data and efficiently creating/persisting user objects in memory becomes paramount for eliminating frequent garbage collection cycles. Just because you have GB’s of memory to play with doesn’t mean you can be lazy in how you create, maintain and destroy objects. Here is are a few screenshots that show how garbage collection can kill your e-commerce application:
6. Transactions with High CPU Burn
Its no secret than inefficient application logic will require more CPU cycles than efficient logic. Unfortunately the number 1 solution to slow performance in the past was for eCommerce vendors to buy more servers. More servers = More Capacity = More Transaction Throughput. While this calculation sounds good, the reality is that not all e-commerce transactions are CPU bound. Adding more capacity just masks inefficient code in the short term, and can waste you significant amounts of money in the long term. If you have specific transactions in your eCommerce application that hog or burn CPU then you might want to consider tuning those before you whip out your check book with Oracle or Dell. For example:
7. 3rd Party Web Services
If your e-commerce application is built around a distributed SOA architecture then you’ll have multiple points of failure. Especially if several of those services are provided by a 3rd party where you have no visibility. For example, most payment and credit card authorization services are provided by 3rd party vendors like PayPal, Stripe, or Braintree. If these services slow down or fail then its impossible for checkout transactions to complete. You therefore need to monitor these services religiously so when problems occur you can rapidly identify whether it is your code or connectivity or someone else’s outage. Here is example of how AppDynamics can help you monitor your 3rd party web services:
8. Crap Recursive Code
This is similar to #6 but burns time instead of resources. For example, many e-commerce transactions will request data from multiple sources (caches, databases, web services) at the same time. Every one of these round trips could be expensive and may involve network time along the way. I’ve seen a single eCommerce search transaction call the same database multiple times instead of performing a single operation using a stored procedure on the database. Recursive remote calls may only take 10-50 millisecond each, but if they are invoked multiple times per transaction they can add seconds to your end user experience. For example, here is that search transaction that took x seconds and made 13,000 database calls.
9. Configuration Change
As much as we’d like to think that production environments are “locked down” with change control process, they are not. Accidents happen, humans make mistakes and hotfixes occasionally get applied in a hurry at 2am Application server configuration can be sensitive just like networks, or any other pieces of the infrastructure. Being able to audit, report and compare configuration change across your application gives you instant intelligence that a change may have caused your eCommerce application to break. For example, AppDynamics can record any application server change and show you the time and values that were updated to help you correlate change with slowdowns and outages, see below screenshot.
10. Out of Stock Exception
“I’m sorry, the product you requested is no longer in stock”. This basically means you were too slow and you’ll need to wait until 2014 for the same offer. Remember to set an alarm next year for Black Friday
In addition, AppDynamics can also monitor the revenue and performance of your checkout transactions over-time which helps Dev and Ops teams monitor and alert on the health of the business:
The good news is that AppDynamics Pro can identify all of the above defects in minutes. You can take a free trial here and be deployed in production in under 30 minutes! If you send us a few screenshots of your findings in production like the above we’ll send you a $250 Amazon gift certificate for your hard work!
Steve.Link to this post:
The reason for this blog is purely down to a real-life incident which one of our e-commerce customers shared with us this week. It’s based around a use case that pretty much anyone can relate to – the moment your checkout transaction spectacularly fails. You sit there, looking at a big fat error message and think “WTF – did my transaction complete or did the company steal my money?” A minute later you’re walking a support team through exactly what happened: “I just clicked Checkout and got an error…honestly…I waited and never got a response.”
What’s different in this story is that the support team had access to AppDynamics as they were talking to a customer on the phone…and the customer got to find out the real reason their checkout failed. How often does that happen? Never, until now. Here is the story as documented by the customer.
A week has passed since Black Friday, so I thought it would be a good idea to summarise what we saw at AppDynamics from monitoring one of several e-commerce applications in production.
Firstly, things went pretty well for our customers who experienced between 300 and 500% increase in transaction volume over the holiday period on their applications. Thats a pretty big spike in traffic for any application so its always good to look at those spikes and see what impact they had on application performance.
Here’s a screenshot which shows the load (top) and response time (bottom) of a major e-commerce production application during the thanksgiving period. The dotted line in both charts represents the dynamic baseline of normal activity. You can see on Black Friday (23rd) and Cyber Monday (26th) that transaction throughput was peaking between 24,000 and 31,000 tpm on the application, spiking between 150 and 200% over the normal load the application experiences throughput the rest of the year.
Application response time during the period had one blip during the first minutes of Black Friday (9pm PCT/Midnight EST) with no major performance issues following thru into Cyber Monday. The blip in the application related to the web container thread pool becoming exhausted during peak load when the Black Friday promotions went live. Below you can see throughput was hitting 23,000 tpm.
Two business transactions “Product Display” and “Checkout” were breaching their performance baselines during that period. Looking at the average response times of 516ms and 733ms tells one story, looking at the maximum response time and number of slow/very slow transactions (calculated using SD) tells a completely different story.
Let’s take a look at the execution of one individual “Product Display” business transaction that was classified as very slow with a 66 second response time.
When we drill into the code execution and SQL activity we can see a simple SELECT SQL query had a response time of 588ms, the problem in this transaction was that this query was invoked 102 times resulting in a whopping 59.9 seconds of latency, its therefore no surprise that thread concurrency inside the JVM was high waiting for transactions like these to complete. If these queries are simply pulling back product data then there is no reason why a distributed cache can’t be used to store the data instead of expensive calls to a remote database like DB2.
Let’s look at the other “Checkout” transaction which was breaching during the performance spike. Here is a checkout which took 9.1 seconds and deviated significantly from its performance baseline. You can see from the screenshot below the latency or bottleneck is again coming from the DB2 database:
Hardly surprising given most application scalability issues these days still relate to data persistence between the JVM and database. So let’s drill down into the JVM for this transaction and understand what exactly is being invoked in the DB2 database:
Above is the code execution of that transaction and you immediately see 8.5 seconds of latency is spent in an EJB call which is performing an update. Let’s take a look at the invoked queries as part of that update:
Nice, a simple update query was taking 8.4 seconds, notice all the other SQL queries associated with a single execution of the “Checkout” transaction. The application during this performance spike was clearly database bound and as a result a few code changes were made overnight that reduced the amount of database calls the application was making. We had one retail e-commerce customer last year who found a similar bottleneck, a fix was applied that reduced the number of database calls per minute from 500,000 to a little under 150,000. While the problem may at first appear to be a database issue (for the DBA) it was actually application logic and the developers who were responsible for resolving the issue.
You can see in the first screenshot that application response time was stable throughout the rest of the thanksgiving period , no spikes or outages occurred for this customer and all was well. While every customer will do their best to catch performance defects in pre-production and test, sometimes its not possible to reproduce or simulate real application usage or patterns, especially in large scale high throughput production environments. This is where Application Performance Management (APM) solutions like AppDynamics can help – by monitoring your application in production so you can see whats happening. Get started today with a free 30-day trial.
AppmanLink to this post: