My wife is a shopoholic and serial checkout killer. Every week she spends several hours browsing and carefully adding items to shopping carts. Every now and then she’ll interrupt my sports program with an important announcement “My Checkout just failed”. Take this example Mrs Appman sent me during the month of September:
The fact I work in the APM industry means it is my responsibility to fix these problems immediately (for my wife). As Black Friday is coming up I thought I’d share with you what typically goes wrong under the covers when our customer’s e-commerce applications go bang during a critical time.
It is worth mentioning that nearly all our customers perform some form of performance or load testing prior to the Black Friday period. Many actually simulate the load from the previous year on test environments designed to reproduce Black Friday activity. While 75% of bottlenecks can be avoided in testing, unfortunately a few surface in production as a result of applications being too big and complex to manage to test under real world scenarios. For example, most e-commerce applications these days span more than 500 servers distributed across many networks and service providers.
Here are the top 10 reasons why eCommerce applications will fail this Black Friday:
1. Database Connection Pool
Nearly every checkout transaction will interact with one or more databases. Connections to this resource are therefore sacred and can often be deadly when transaction concurrency is high. Most application servers come with default connection pool configurations of between 10 and 20. When you consider that transaction throughput for e-commerce applications can easily exceed 100,000 trx/min you soon realize that default pool configurations aren’t going to cut it. When a database connection pool becomes exhausted incoming checkout requests simply wait or timeout until a connection becomes available. Take this screenshot for example:
2. Missing Databases Indexes
This root cause is somewhat related to the exhausted connection pools. Simply put, slow running SQL statements hold onto a database connection for longer, therefore connection pools aren’t recycled as often as they should be as queries take longer. The number 1 root cause of slow SQL statements is missing indexes on database tables, which is often caused by miss-communication between developers who write SQL, and the DBAs who configure and maintain the database schemas which hold the data. The classic “full table scan” query execution where a transaction and its database operation must scan through all the data in a table before a result is returned. Here is an example of what such looks like in AppDynamics:
3. Code Deadlock
High transaction concurrency often means application server threads have to contend more for application resource and objects. Most e-commerce applications have some form of atomicity build in to their transactions, so that order and stock volumes are kept in check as thousands of users fight over special offers and low prices. If access to application resource is not properly managed some threads can end up in deadlock, which can often cause an application server and all its user transactions to hang and timeout. One example of this was last year where an e-commerce customer was using a non-thread safe cache. Three threads tried to perform a get, set and remove on the same cache at the same time causing code deadlock to occur, impacting over ~2,500 checkout transactions as the below screenshot shows.
4. Socket Timeout Exceptions
Server connectivity is an obvious root cause, if you check your server logs using a Sumologic or Splunk then you’ll probably see hundreds of these events. They represent network problems or routing failures where a checkout transaction is attempting to contact one or more servers in the application infrastructure. Most of the time the services you are connecting to aren’t your own, for example a shipping provider, credit card processor, or fraud detector. On high traffic days like Black Friday it isn’t just your site experiencing a surge in traffic – often times entire networks are saturated due to intense demand. After a period of time (often 30-45 secs) the checkout transaction will just give up, timeout and return an error to the user. No Connectivity = No Revenue. Here is an example of what it looks like:
5. Garbage Collection
Caches are an easy way to speed up applications. The closer data is to application logic (in memory) the faster it executes. It is therefore no surprise that as memory has gotten bigger and cheaper most companies have adopted some form of in-memory caching to eliminate database access for frequent used results. The days of 64GB and 128GB heaps are now upon us which means the impact of things like Garbage Collection are more deadly to end users. Maintaining cache data and efficiently creating/persisting user objects in memory becomes paramount for eliminating frequent garbage collection cycles. Just because you have GB’s of memory to play with doesn’t mean you can be lazy in how you create, maintain and destroy objects. Here is are a few screenshots that show how garbage collection can kill your e-commerce application:
6. Transactions with High CPU Burn
Its no secret than inefficient application logic will require more CPU cycles than efficient logic. Unfortunately the number 1 solution to slow performance in the past was for eCommerce vendors to buy more servers. More servers = More Capacity = More Transaction Throughput. While this calculation sounds good, the reality is that not all e-commerce transactions are CPU bound. Adding more capacity just masks inefficient code in the short term, and can waste you significant amounts of money in the long term. If you have specific transactions in your eCommerce application that hog or burn CPU then you might want to consider tuning those before you whip out your check book with Oracle or Dell. For example:
7. 3rd Party Web Services
If your e-commerce application is built around a distributed SOA architecture then you’ll have multiple points of failure. Especially if several of those services are provided by a 3rd party where you have no visibility. For example, most payment and credit card authorization services are provided by 3rd party vendors like PayPal, Stripe, or Braintree. If these services slow down or fail then its impossible for checkout transactions to complete. You therefore need to monitor these services religiously so when problems occur you can rapidly identify whether it is your code or connectivity or someone else’s outage. Here is example of how AppDynamics can help you monitor your 3rd party web services:
8. Crap Recursive Code
This is similar to #6 but burns time instead of resources. For example, many e-commerce transactions will request data from multiple sources (caches, databases, web services) at the same time. Every one of these round trips could be expensive and may involve network time along the way. I’ve seen a single eCommerce search transaction call the same database multiple times instead of performing a single operation using a stored procedure on the database. Recursive remote calls may only take 10-50 millisecond each, but if they are invoked multiple times per transaction they can add seconds to your end user experience. For example, here is that search transaction that took x seconds and made 13,000 database calls.
9. Configuration Change
As much as we’d like to think that production environments are “locked down” with change control process, they are not. Accidents happen, humans make mistakes and hotfixes occasionally get applied in a hurry at 2am 😉 Application server configuration can be sensitive just like networks, or any other pieces of the infrastructure. Being able to audit, report and compare configuration change across your application gives you instant intelligence that a change may have caused your eCommerce application to break. For example, AppDynamics can record any application server change and show you the time and values that were updated to help you correlate change with slowdowns and outages, see below screenshot.
10. Out of Stock Exception
“I’m sorry, the product you requested is no longer in stock”. This basically means you were too slow and you’ll need to wait until 2014 for the same offer. Remember to set an alarm next year for Black Friday 😉
In addition, AppDynamics can also monitor the revenue and performance of your checkout transactions over-time which helps Dev and Ops teams monitor and alert on the health of the business:
The good news is that AppDynamics Pro can identify all of the above defects in minutes. You can take a free trial here and be deployed in production in under 30 minutes! If you send us a few screenshots of your findings in production like the above we’ll send you a $250 Amazon gift certificate for your hard work!
Steve.