Apple has done a stellar job with its development platform and iOS. In fact, it has done a stellar job of turning my living room into an Apple showroom. If you had asked me 10 years ago whether my laptop, mouse, keyboard, monitor, phone, music player, TV and tablet would all be white with an Apple logo, I would probably have laughed in your face. The only Microsoft thing left in my house now is an Xbox, and it won’t be long before that turns white as well. Being married also presents a problem in that I now have two of everything, because sharing isn’t caring when it comes to Apple gadgets. With Apple technology being “cool” and adopted by millions of users, you can see why so many businesses are migrating their applications to iOS for an improved end user experience. One of our customers recently made the move, and here’s the story of how their new iPhone app crashed their entire mission-critical web application... and I bet you weren’t expecting me to say that, were you?
An unusual spike in response time
Below is a screenshot from AppDynamics that shows monitoring data for the customer’s online web application over the last month. The application has approximately 250 IIS instances, a dozen databases, a dozen web services and a distributed cache.
If we look at the response time of this website, you can see that on 4th November it experienced a big spike. The website’s normal response time is around 100ms, yet during the spike it rose to over 1 second. By zooming in on this spike (screenshot below) we can see the day and time the issue occurred. Load began to decrease around 5pm and didn’t recover until around 7pm, representing almost two hours of downtime. You can also see that the response time of the application as a whole spiked massively at 5pm, causing the whole application to grind to a halt.
Understanding the real business impact
The screenshot below shows which business transactions were executing before, during and after the spike. The information is sorted by “stalled transactions”, which in layman’s terms are transactions that timed out. What’s interesting about this screenshot is that all SLA violations (denoted by red) relate to business transactions that begin with “/iphonedata/”, which, as you might expect, represent user requests from the company’s new iPhone application, which the customer had recently launched. You can also see that the first iPhone business transaction had an average response time of 9.2 seconds, 778 very slow transactions and 1,228 stalls, indicating that this transaction had a major problem, unlike all the other transactions, which weren’t impacted.
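To make the “sorted by stalled transactions” view concrete, here is a minimal sketch of that kind of roll-up. It is not how AppDynamics computes its metrics: the thresholds, the `summarize` function and the sample request values are all assumptions made up for illustration, and “/iphonedata/feed” is a hypothetical transaction name.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical thresholds; a real monitoring tool derives its own.
VERY_SLOW_MS = 1_000   # anything at or above this counts as "very slow"
STALL_MS = 45_000      # anything at or above this counts as a stall (timeout)


@dataclass
class Summary:
    calls: int = 0
    total_ms: float = 0.0
    very_slow: int = 0
    stalls: int = 0

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.calls if self.calls else 0.0


def summarize(requests):
    """Group (transaction_name, response_ms) samples and rank by stall count."""
    summaries = defaultdict(Summary)
    for name, response_ms in requests:
        summary = summaries[name]
        summary.calls += 1
        summary.total_ms += response_ms
        if response_ms >= STALL_MS:
            summary.stalls += 1
        elif response_ms >= VERY_SLOW_MS:
            summary.very_slow += 1
    # Worst offenders (most stalls) first, like the screenshot's sort order.
    return sorted(summaries.items(), key=lambda item: item[1].stalls, reverse=True)


# Made-up samples, not the customer's data.
SAMPLE = [
    ("/iphonedata/feed", 259_000),
    ("/iphonedata/feed", 9_000),
    ("/home", 100),
    ("/home", 120),
]
RANKED = summarize(SAMPLE)
```

With samples like these, the transaction under “/iphonedata/” sorts to the top because it is the only one with stalls, which is the pattern the screenshot shows.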
Stalled iPhone transactions
Let’s drill into the first iPhone transaction. The screenshot below shows individual user requests for this problematic iPhone transaction, with response times as high as 259 seconds before the transactions eventually stalled and timed out. We’ll pick one iPhone request that took 259 seconds and work out what was responsible for the latency.
Below is the transaction flow showing how the 259 seconds were spent across the customer’s application infrastructure. Notice that the transaction enters the IIS web server from the iPhone device and makes three calls: two to a distributed cache and one to the database using ADO.NET. The next step is to drill down into the IIS web server and see what application logic was responsible for the latency and the calls to the distributed cache.
Problems with Memcached
The screenshot below shows the code hotspots of the iPhone transaction, highlighting the classes and method calls responsible for the latency in the cache. You can see that this individual iPhone transaction, or user request, made multiple calls to memcached (the cache), which experienced high latency for both insert and get invocations. Also notice the Threading.waitHandle() API calls, which show locking latency for access to the cache.
Understanding what code is responsible
Let’s take a look at the code execution (call graph) to understand what server-side logic is responsible for this data layer bottleneck. The screenshot below shows the code execution responsible for each cache and database call. Notice that the data access layer begins with a namespace called iPhoneDataController.Models, which indicates that all iPhone transactions have their own server-side logic, separate from the logic that services other web application traffic coming through web browser clients. What’s interesting about this system-wide outage is that it was directly caused by iPhone transactions and their server-side logic, which ended up hogging the distributed caches that all other transactions and application logic were also accessing, causing the entire web application to grind to a halt.
When a transaction stalls in an application, AppDynamics takes a thread dump so the user can understand what resource or API call the thread of execution was waiting on before the transaction timed out. Below is the output of the thread dump from this iPhone transaction, which clearly shows that the thread (and the iPhone business transaction) was waiting in System.Threading.Monitor.Enter for a memcached node that was busy. The iPhone server-side logic in this instance was making multiple insert and get calls to the distributed cache, which in turn caused contention and locking on the cache resource because other business transactions were trying to access the same resource.
While the root cause of this issue was related to iPhone server-side logic, the problem ended up having a cascading impact on each distributed cache, which was also being accessed by other business transactions and logic from the native web application. This unfortunately caused the entire web application to grind to a halt until all web servers were restarted to remediate the problem.
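The contention pattern is easy to reproduce in miniature. The sketch below is an illustration only, written in Python rather than the customer’s .NET code: `SharedCacheNode` is a hypothetical stand-in for one cache node guarded by a single lock, and the hold time, timeouts and key names are invented. One “chatty” caller holds the lock while a second caller, standing in for ordinary web traffic, gives up waiting.

```python
import threading
import time


class SharedCacheNode:
    """Hypothetical stand-in for a cache node that serializes access with one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value, timeout, hold_seconds=0.0, on_acquired=None):
        # Wait up to `timeout` seconds for the lock; report failure instead of hanging.
        if not self._lock.acquire(timeout=timeout):
            return False
        try:
            if on_acquired is not None:
                on_acquired()          # signal that this caller now owns the lock
            time.sleep(hold_seconds)   # simulate a slow insert on a busy node
            self._data[key] = value
            return True
        finally:
            self._lock.release()


def simulate_contention():
    node = SharedCacheNode()
    holding = threading.Event()

    # The "iPhone" caller grabs the node and keeps it busy for 0.3 seconds.
    chatty = threading.Thread(
        target=node.set,
        args=("iphone:feed", "payload"),
        kwargs={"timeout": 1.0, "hold_seconds": 0.3, "on_acquired": holding.set},
    )
    chatty.start()
    holding.wait(timeout=1.0)

    # An unrelated "web" caller only waits 0.05 seconds, so it times out.
    web_during = node.set("web:home", "page", timeout=0.05)

    chatty.join()
    # Once the node is free again, the same call goes through.
    web_after = node.set("web:home", "page", timeout=0.05)
    return web_during, web_after
```

Calling `simulate_contention()` returns `(False, True)`: the web request fails while the lock is held and succeeds afterwards. Without a timeout the second caller would simply block, which is what a thread parked in a monitor wait looks like in a thread dump.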
Don’t put all your eggs in one basket
One solution to this problem would be to move to a service-oriented architecture (SOA) that separates the iPhone server-side logic and data access from the native server-side logic and data access that drive the main web application, so that any failure or bottleneck in the iPhone server-side logic stays local and doesn’t end up impacting other parts of the web application. Putting all your application logic in the same place is like putting all your eggs in one basket: when one egg explodes, everything gets very messy very quickly. Unfortunately, in this example, the iPhone app and its transactions killed a mission-critical web application that generates revenue for the business, even though the problem itself wasn’t directly related to logic running on the iPhone device.
If you’re migrating your applications to support mobile platforms like iOS, make sure the transactions and logic you introduce don’t result in an iOutage.
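Short of a full move to separate services, the same “separate baskets” idea can be sketched inside one process as a bulkhead: each class of traffic gets its own capped number of slots for the shared resource. This is a smaller, swapped-in technique rather than the SOA approach itself, and everything in it (the `Bulkhead` class, the slot counts, the `cache_call` helper) is a hypothetical illustration, not the customer’s fix.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """Caps how many callers of one traffic class may use a shared resource at once."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self, timeout=0.05):
        # Fail fast when this class has used up its share, instead of queueing forever.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical limits: iPhone traffic can never take more than 2 slots.
BULKHEADS = {
    "iphone": Bulkhead("iphone", max_concurrent=2),
    "web": Bulkhead("web", max_concurrent=8),
}


def cache_call(traffic_class, operation):
    """Run `operation` (any callable) inside the bulkhead for its traffic class."""
    with BULKHEADS[traffic_class].slot():
        return operation()


def demo():
    iphone = BULKHEADS["iphone"]
    # Pretend two slow iPhone requests are already in flight.
    with iphone.slot(), iphone.slot():
        try:
            cache_call("iphone", lambda: "feed")
            iphone_result = "served"
        except TimeoutError:
            iphone_result = "rejected"
        # Web traffic has its own slots, so it is unaffected.
        web_result = cache_call("web", lambda: "home page")
    return iphone_result, web_result
```

Here `demo()` returns `("rejected", "home page")`: the third iPhone call is turned away quickly while the web call still completes. The trade-off is that some iPhone requests fail early, but the failure stays in one basket.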