Most enterprise databases today run on shared storage volumes (SAN, NAS, etc.) accessed over the network or via Fibre Channel. Shared storage helps keep storage infrastructure and management costs relatively low, but it also creates cross-silo finger-pointing when performance issues arise. In this blog post we will explore a real-world example of how to skip the finger-pointing and get right down to figuring out how to fix the problem.
One Rotten Apple Can Ruin The Whole Bunch
This story dates back to June of 2012, but I just came across it, so it is new to me. One of our customers had an event that impacted the performance of multiple databases, all of which were connected to the same NetApp storage array. When database performance degrades, the DBAs will often point the finger at the storage team, and the storage team will tell the DBA team that everything looks good on their side. This finger-pointing between silos is a common occurrence among the various groups (network, storage, database, application support, etc.) within enterprise organizations.
In the chart below (screen grab taken from AppDynamics for Databases) you can see that there was a significant increase in I/O activity on dw_logvol. This issue impacted the performance of the entire NetApp storage array.
As it turns out, dw_logvol was used as a temporary storage location for web logs. A process would copy log files to this location, decompress them, and insert them into an Oracle data warehouse for long-term storage. This process normally would not impact the performance of anything else connected to the same storage array, but in this case there happened to be corrupted log files that could not be properly decompressed. The result was multiple attempts to retransmit and decompress the same files.
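To make the failure mode concrete, here is a minimal sketch (hypothetical names and retry policy, not the customer's actual code) of how a copy/decompress/load loop can multiply I/O on shared storage when a file is corrupted: every failed decompression triggers a full retransmit of the file from the array.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

public class LogLoader {
    private static final int MAX_RETRIES = 5; // hypothetical retry policy

    static void load(Path remoteLog, Path scratchDir) throws IOException {
        Path local = scratchDir.resolve(remoteLog.getFileName());
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            // Full copy from the shared array on every attempt.
            Files.copy(remoteLog, local, StandardCopyOption.REPLACE_EXISTING);
            try (InputStream in = new GZIPInputStream(Files.newInputStream(local))) {
                // Stand-in for the real decompress-and-insert into the warehouse.
                in.transferTo(OutputStream.nullOutputStream());
                return; // success
            } catch (IOException corrupt) {
                // A corrupted archive fails here every time, so the loop
                // retransmits and re-decompresses the same file again.
            }
        }
        throw new IOException("Giving up on corrupted log: " + remoteLog);
    }
}
```

With a handful of corrupted files and no backoff between attempts, a retry loop like this alone can account for the kind of I/O spike shown in the chart.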
Context and Collaboration to the Rescue
Storage teams normally don't have access to application context, and application teams normally don't have access to storage metrics. In this case, though, both teams were able to collaborate and quickly pinpoint the problem because they had a monitoring solution that was available to everyone. The fix was easy: remove the corrupted files and replace them with uncorrupted versions. You can see activity return to normal in the chart below.
Modern application architectures require collaboration across all silos in order to identify and fix issues in a timely manner. One of the key enablers of cross-silo collaboration is intelligent monitoring at each layer of the application and of the infrastructure components that provide the underlying resources. AppDynamics provides end-to-end visibility in an analytics-based solution that helps you identify, isolate, and remediate issues. Try AppDynamics for Databases and Storage for free today and bring a new level of collaboration to your organization.
In the third part of this series I discussed preparing to launch a mobile application with load testing and beta testing and highlighted the differences between the iOS and Android ecosystems. In this post I will dive into monitoring your production mobile application.
Production consideration: Crash and Error Reporting
Crash and error reporting is a requirement not only for the development of your application, but also for testing and production. There are quite a few crash-reporting tools available, including AppDynamics, Crashlytics, Crittercism, New Relic, BugSense, HockeyApp, InstaBug, and TestFlight. All of these tools are capable of reporting fatal errors in your application to help developers track down the root cause of bugs. Both the Apple App Store and the Google Play Store also provide basic crash reporting metrics. The limitation of crash and error reporting is that it only tracks issues after they have affected users.
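Under the hood, most JVM and Android crash reporters hook fatal errors in roughly the same way: by registering a default uncaught-exception handler. A minimal sketch of that mechanism follows; reportCrash is a hypothetical stand-in for whatever call an actual SDK uses to ship the report.

```java
public class CrashHook {
    public static void install() {
        final Thread.UncaughtExceptionHandler previous =
                Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            reportCrash(thread, throwable);
            if (previous != null) {
                // Chain to the prior handler so the platform still crashes normally.
                previous.uncaughtException(thread, throwable);
            }
        });
    }

    // Hypothetical: a real SDK would persist the report and upload it on next launch.
    private static void reportCrash(Thread t, Throwable e) {
        System.err.println("FATAL in thread " + t.getName() + ": " + e);
        e.printStackTrace();
    }
}
```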
The harsh reality is that mobile applications have a fickle audience that is heavily reliant on curated app stores and reviews. Reviews can make or break a mobile application, as can being featured in an app store. A common best practice is to allow in-app feedback to preempt negative reviews by engaging the user early on. There are a variety of services that make it easy to provide in-app feedback like Apptentive, Appboy, and Helpshift.
This is why being proactive with quality assurance and production monitoring has a significant impact on the success of an application. Not only must the application work as designed, but the experience must also be polished. The expectation in the mobile community is significantly higher than on the web.
Production consideration: Analytics & Instrumentation
Smart executives are data driven, and mobile applications can be a rich source of business intelligence. When it comes to instrumentation, the earlier you instrument and the more metrics you track, the better informed you will be. Analytics and instrumentation are crucial for making informed, smart decisions about your business.
Who is your audience? What platforms and devices do they use? What user flows are the most common? Why do users abandon? Where are users located? What is the performance of your application?
Tracking important demographics of your audience like operating systems, devices, carriers, application versions, and geography of users is key. These metrics allow you to target your limited resources to where they are needed most. There are quite a few analytics platforms built for mobile including Google Analytics, Flurry, Amazon Analytics, FlightPath, MixPanel, KissMetrics, Localytics, and Kontagent.
All of these tools will give you better insight into your audience and enable you to make smarter decisions. Important metrics to track include the total number of installations, average session lifetime, engagement and growth, and the geography of your users. Once you have basic user demographics you can use MixPanel or KissMetrics to track user activity with custom event tracking. The more instrumentation you add to your application, the more metrics and customer intelligence you will have to work with.
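As a sketch of what custom event tracking looks like in practice (the tracker interface, event name, and properties here are all hypothetical, but SDKs like MixPanel expose a similar track-an-event-with-properties call):

```java
import java.util.Map;

// Hypothetical, vendor-neutral tracker; swap in your analytics SDK of choice.
interface EventTracker {
    void track(String eventName, Map<String, Object> properties);
}

class CheckoutFlow {
    private final EventTracker tracker;

    CheckoutFlow(EventTracker tracker) {
        this.tracker = tracker;
    }

    void onPurchaseCompleted(String sku, double price) {
        // One event per meaningful user action; properties enable segmentation later.
        tracker.track("purchase_completed", Map.of(
                "sku", sku,
                "price", price,
                "app_version", "3.1.0" // hypothetical build metadata
        ));
    }
}
```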
Production consideration: Application Performance Monitoring
Application performance management (APM) tools enable you to discover the performance of your mobile and server-side applications in production. APM allows you to understand your application topology, third-party dependencies, and the performance of your production application on both the client side and the server side. Modern APM solutions like AppDynamics track crashes and errors and the performance of the mobile application, and correlate that performance with your backend platform, all while providing rich user demographics and metrics on the end user experience. With modern business reporting tools you can evaluate the health of your application and be proactive when performance starts to deteriorate.
End user monitoring allows you to understand the application experience through the eyes of your real users. There are quite a few solutions for monitoring the end user experience in the marketplace. AppDynamics, Crittercism, New Relic, and Compuware allow you to instrument your application and gain visibility into production performance problems.
Business consideration: Real-time business metrics
Once you have launched a successful mobile experience you need to understand how that experience affects your business. If you have a business-critical application like the Apple Store checkout application or the FedEx package management application, your business is dependent on the performance and use of your application. You can gain valuable insight into your business if you track and correlate the right metrics. For example, how does performance affect revenue? What is the average price of a checkout transaction? Understand your core business metrics and correlate them to your mobile experience for maximum business impact.
Business consideration: Monetization
If your plan is to retire off this application, you need a monetization strategy. The most common ways to make money from applications are pay to play (charge a fee for your application), freemium (offer a free version and a pro upgrade), in-app purchases (levels, tokens, and credits), and traditional advertising. There are many services that enable mobile advertising, including Apple's iAd, Google's AdMob, Amazon's Mobile Ads, Flurry, InMobi, Millennial Media, and MoPub. All of these strategies require precise execution, and some work better for specific types of apps. Experiment with multiple strategies and do what works best for your business.
It is no longer enough just to have a presence on the web. In fact more and more companies are going mobile first. The mobile landscape is constantly evolving and the mobile market is seeing continued growth year over year.
Want to start monitoring your iOS or Android application today? Sign up for our beta program to get the full power of AppDynamics for your mobile apps. Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.
As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.
Every man and his dog knows that garbage collection (GC) is a "stop the world" event. Java memory fills up with redundant objects/data, which over time must be collected so that memory can be reclaimed and reused by the JVM. When the GC thread in the JVM kicks in, all other threads grind to a halt, giving the end user the impression that the application is slow. It's not such a big deal when GC runs for a few seconds, say, every minute, as this is typical for an active application that frequently loads data from databases (disk) into the JVM (memory). The problem is when GC takes more than a few seconds, especially now that JVM heaps can be as large as 16GB – collecting MBs can take seconds, but GBs can take minutes. I was troubleshooting a performance issue with a customer's application last week and witnessed significant GC times of around 90 seconds. Here is what I found while troubleshooting.
The customer reported an application slowdown around 7pm. This application processes on average around 25 million business transactions a day, roughly 18,000 a minute. It has 20+ JVMs, 2 relational databases, an MQ messaging backbone, and several 3rd-party web services. Here is the application topology (system map) showing the traffic and performance spike:
The first place I checked was the system OS metrics, specifically looking for high CPU, memory, disk I/O and network I/O – the classic KPIs a typical sys admin might check. Everything looked OK:
So I then looked at JVM metrics to check CPU, memory (heap), and garbage collection activity. This is where I noticed one JVM was spending a significant amount of time in GC, an average of 23 seconds per minute compared to the others, which were averaging 2-3 seconds. You can see from the screenshot below that one JVM had 17 major collections while most other JVMs had just a single collection:
Next I drilled down into this JVM and plotted its GC time, % CPU busy, and % heap utilization before, during, and after the reported slowdown at 7pm. You can see in the screenshot below the sudden leap in GC time (top green line) from around 5 seconds per minute to well over 100 seconds per minute. Also notice how the heap (red line) has become exhausted and that CPU spiked when GC kicked in, as expected.
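As an aside, if you don't have an APM tool collecting these GC KPIs, the JVM exposes the same raw counters through its standard management beans. A minimal sketch – sample it once a minute and diff the values to get GC seconds per minute:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {
    public static void main(String[] args) {
        // Cumulative counts/times per collector (e.g. young vs. old generation).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```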
The question I asked myself was “what in the application caused this excessive GC?”. Sure, I could have just upped the heap size on this JVM to prolong the inevitable death by GC, but this was a mission-critical application we’re talking about, and band-aids are not that fashionable these days. Besides, increasing JVM Heap can often make GC worse when it eventually kicks in. The next thing I did was analyze the performance and resource consumption of business transactions that were running through the offending JVM.
The screenshot below shows that the Search business transaction (circled) had an average response time of 2 seconds (good) but a maximum response time of 267 seconds (not good). Combined with the fact that this transaction is invoked over 60 times a minute, had hundreds of errors, slow requests, and stalls, and was consuming the most CPU (averaging > 600ms per transaction), I decided to investigate this transaction further.
I plotted (see below) the response time (pink line) of the search transaction and its calls per minute (blue line) against the earlier chart of JVM GC time, % CPU busy, and % heap utilized. You can see that as calls per minute for the search transaction reaches 10, GC time begins to increase along with % CPU busy, and the search transaction's response time spikes.
I then took a look at the code execution of a search business transaction to get a sense of what might be causing GC to suddenly kick in. The search transaction below took 38.7 seconds to complete, of which 40% was spent burning CPU in the JVM.
Looking at the code execution and hotspots for this specific search transaction, you can immediately see the excessive number of EJB calls and subsequent JDBC queries. The search transaction was basically creating a ProductDataBean for every search result, with every Bean loading data from the database into JVM memory.
It was actually worse than I thought, though. I naively assumed each Bean would run a single query to get the data it needed per product. The reality (below screenshot) is that each Bean was invoking hundreds of SQL queries. In total, for a single search transaction taking 38.7 seconds, over 12,000 SQL queries were invoked by the application in the JVM. You can see the average response time of each query is around 1 millisecond (very fast), but the fact that many queries were called thousands of times meant the cumulative response time was high, as shown.
Yes, the search transaction was slow on this occasion, but the real issue was the number of objects being instantiated and the amount of data being loaded into memory by each search transaction. The broader the search, the more results returned, the more objects/data held in the JVM, and the more often GC will occur. A big problem with a transaction like search is that many users will simply search again or refresh their search, hoping the new search will be faster. I've seen this in CRM applications where call center agents talk to a customer while trying to retrieve the customer's details. After 10 seconds the agent becomes impatient and performs the same customer search again. This just makes things worse, as the user kicks off another transaction that performs the same workload as the previous search. In the customer application above, every search can potentially exhaust the JVM heap, depending on how many searches run concurrently and how many results each returns. The recommendation for this customer's application was to optimize the objects being created so as to limit the impact of concurrency.
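To illustrate the pattern and one possible fix, here is a hedged reconstruction (hypothetical table, column, and class names, not the customer's actual code) of a per-result lookup alongside a batched alternative that cuts both the query count and the number of transient objects per search:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ProductSearch {

    // Anti-pattern: one query per product, so a broad search fans out into
    // thousands of fast queries whose results are all materialized on the
    // heap -- fuel for the GC behavior described above.
    static List<String> perRowLookup(Connection conn, List<Long> ids) throws SQLException {
        List<String> names = new ArrayList<>();
        for (long id : ids) {
            try (PreparedStatement ps =
                         conn.prepareStatement("SELECT name FROM products WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) names.add(rs.getString(1));
                }
            }
        }
        return names;
    }

    // One batched query replaces N round trips: fewer statements, fewer
    // transient objects, and far less GC pressure per search.
    static List<String> batchedLookup(Connection conn, List<Long> ids) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT name FROM products WHERE id IN (" + placeholders + ")";
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) ps.setLong(i + 1, ids.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) names.add(rs.getString(1));
            }
        }
        return names;
    }
}
```

Capping the number of results fetched per search (pagination) would bound heap usage in the same way.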
Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.
As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.
In the second part of this series I discussed developing a mobile application, choosing a backend platform, and building for various network conditions. In this post I will dive into some considerations when launching a mobile application.
Mobile app audiences are a notoriously fickle bunch, and a poor first impression often results in a very harsh app store review that will negatively impact your app's growth. When an app store rating can make or break your application, you have to be diligent in making sure every user has a stellar experience. The best way to do this is to thoroughly test your mobile experience and load test your backend to ensure you can handle peak traffic.
The key to a successful launch is great planning and testing. Launching a mobile application is significantly more difficult than launching a common web application. Not only is the audience more fickle, but you also have to adhere to third-party processes and procedures. Thorough quality assurance, crash and error reporting, load testing, and proactive production monitoring are essential to launching a successful mobile application.
Launch consideration: Testing native applications across mobile devices
Testing mobile applications is notoriously difficult due to the vast number of devices, but there are a few services that make it easier for engineers. One common strategy is to go to Amazon, buy the top twenty Android and iOS devices, and test your application across every device manually. Mobile device labs of this sort are quite expensive to set up and maintain and often require some level of automation to be productive. An alternative to setting up your own mobile lab is to use a mobile app testing platform like TheBetaFamily, which offers an easy way to test your native application across many different devices and audiences.
Launch consideration: Capacity planning and load testing
Capacity planning is key to the successful launch of any web application (or mobile backend). If you want to understand what can go wrong, look no further than the failed launch of healthcare.gov. Understanding your limits and potential capacity is a requirement for planning how much traffic you can handle. By making an educated assumption about potential growth, you can come up with a plan for how many concurrent users you might need to support.
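For example (purely illustrative numbers): if you expect 100,000 daily active users, and you assume 20% of them are active during your peak hour with an average session of 6 minutes, you would plan for roughly 100,000 × 0.20 × 6/60 ≈ 2,000 concurrent users, plus generous headroom for launch-day spikes and press coverage.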
Once you understand your maximum concurrent users you can test your backend infrastructure to be sure your mobile experience doesn't suffer. There are quite a few tools available to help you load test and evaluate the scalability of your backend platform. Apica, Soasta, and Blazemeter offer services that allow you to simulate your mobile application being used at high levels of concurrency.
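Those commercial tools run realistic user scenarios at scale, but even a deliberately naive smoke test can catch gross capacity problems before launch. A minimal sketch, where the endpoint URL and concurrency level are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class NaiveLoadTest {
    public static void main(String[] args) throws Exception {
        int concurrentUsers = 200; // hypothetical target concurrency
        URI endpoint = URI.create("https://staging.example.com/api/health"); // placeholder URL
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        ExecutorService pool = Executors.newFixedThreadPool(concurrentUsers);
        AtomicLong totalMillis = new AtomicLong();
        CountDownLatch done = new CountDownLatch(concurrentUsers);

        for (int i = 0; i < concurrentUsers; i++) {
            pool.submit(() -> {
                long start = System.nanoTime();
                try {
                    // Fire one request per simulated user and discard the body.
                    client.send(HttpRequest.newBuilder(endpoint).GET().build(),
                            HttpResponse.BodyHandlers.discarding());
                } catch (Exception ignored) {
                    // A real test would count failures rather than swallow them.
                } finally {
                    totalMillis.addAndGet((System.nanoTime() - start) / 1_000_000);
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        System.out.println("Average latency (ms): " + totalMillis.get() / concurrentUsers);
    }
}
```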
Launch consideration: Beta testing
Beta testing is the last quality assurance step before you make your app generally available. TestFlight, HockeyApp, and Ubertesters allow you to distribute your application for testing to a select group of users. When it comes to beta testing, the more users you can convince to give feedback and the wider the distribution of devices, the better. These beta testing and distribution tools make it easy to gather early feedback about what isn't working in your application and save you from the embarrassment of negative app store reviews due to obvious problems. A/B testing is also a great way to find out which flows work best as part of your beta testing experience. This is an essential step to a successful launch.
Launch consideration: Hard launch or soft launch?
Once you have beta tested and decided you have a great application that is battle-tested for production, you need to decide how to launch. The real question is: hard launch or soft launch? The traditional hard launch is straightforward: your app is approved in the app store and you go live. There are a few different strategies for soft launching a major application. The most common is to soft launch outside of your primary market. If you are planning to release in the USA, you can pick another region with similar characteristics, like Canada, Australia, or the United Kingdom. The benefit of soft launching in a secondary market is that you can validate assumptions earlier and effectively beta test with an audience that resembles your key demographic. Soft launching can validate product/market fit, app experience, usability, and app/game/social mechanics. As a result, your first encounter with your key demographic will be informed by the data you learned from your sample audience, and the end result will be a much more polished and proven app experience.
Launch consideration: App store submission process
The application submission process varies greatly depending on the app store. This is where you get to sell your application with a marketing description, search keywords, and screenshots of your app in action. You can specify pricing and what regions/markets you want your app to be available in.
With Apple, it is customary to wait up to two weeks for your application to be reviewed and approved for production. Apple routinely rejects applications for being low quality, using unsupported APIs, and not following design guidelines. Google, on the other hand, offers a streamlined release process that takes less than one hour, but it doesn't provide the first line of defense that Apple's review offers by keeping apps with obvious flaws out of the store.
Mobile insights with AppDynamics
With AppDynamics for Mobile, you get complete visibility into the end user experience across mobile and web with the end user experience dashboard. Get a better understanding of your audience and where to focus development efforts with analytics on devices, carriers, operating systems, and application versions:
Want to start monitoring your iOS or Android application today? Sign up for our beta program to get the full power of AppDynamics for your mobile apps.
In the next post in this series I will dive into monitoring a production mobile app and the various tools that are available. As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.
Take five minutes to get complete visibility into the performance of your production applications with AppDynamics Pro today.