TAG | integration
There has been a wealth of speculation about what might be wrong with the healthcare.gov website from some of the best and brightest technology has to offer. It seems like everyone has an opinion about what is wrong yet there are still very few facts available to the public. One of our Engineers (Evan Kyriazopoulos) at AppDynamics did some research using our End User Monitoring platform and has some interesting findings that I will share with you today.
What We Already Know
Healthcare.gov is an architecture of distributed services built by a multitude of contractors, and integrates with many legacy systems and insurance companies. CGI Federal (the main contractor) said that it was working with the government and other contractors “around the clock” to improve the system, which it called “complex, ambitious and unprecedented.” We’ve put together a logical architecture diagram based upon the information that is currently in the public domain. This logical diagram could represent hundreds or even thousands of application components and physical devices that work in coordination to service the website.
How We Tested
What we found after visiting multiple pages was that the problems with response time are pretty evenly divided between server response time issues and browser processing time issues. In the screenshot below you can see the average of all response times for the web pages we visited during this exploratory session. The “First Byte Time” part of the graphic is the measure of how long the server took to get its initial response to the end user browser. Everything shown in blue/green hues represents server side response times and everything in purple represents browser client processing and rendering. You can see really bad performance on both the client and server side on average.
Now let’s take a look at some individual page requests to see where their problems were. The image below shows an individual request of the registration page and the breakdown of response times across the server and client sides. This page took almost 71 seconds to load with 59 seconds attributed to the server side and almost 12 seconds attributed to the client side data loading and rendering. There is an important consideration with looking at this data. Without having access to either the server side to measure its response time, or to all network segments connecting the server to the client to measure latency we cannot fully determine why the client side is slow. We know there are js and css file optimizations that should be made but if the server side and/or network connections are slow then the impact will be seen and measured at the end user browser.
Moving past registration… The image below shows a “successful” load of a blank profile page. This is the type of frustrating web application behavior that drives end users crazy. It appears as though the page has loaded successfully (to the browser at least) but that is not really the case.
When we look at the list of our end user requests we see that there are AJAX errors associated with the myProfile web page (right below the line highlighted in blue).
So the myProfile page successfully loaded in ~3.4 seconds and it’s blank. Those AJAX requests are to blame since they are actually responsible for calling other service components that provide the profile details (SOA issues rearing their ugly head). Let’s take a look at one of the failed AJAX requests to see what happened.
Looking at the image below you can see there is an HTTP 503 status code as the response to the AJAX request. For those of you who don’t have all of the HTTP status codes memorized this is a “Service Unavailable” response. It means that the HTTP server was able to accept the request but can’t actually do anything with it because there are server side problems. Ugh… more server side problems.
Now that we have some good data about where the problems really lie, what can be done to fix them? The server side of the house must be addressed first as it will have an impact on all client side measurements. There are 2 key problems from the server side that need resolving immediately.
1) Functional Integration Errors
When 100s of services developed by different teams come together, they exchange data through APIs using XML (or similar) formats (note the AJAX failures shown above). Ensuring that all data exchange is accurate and errors are not occurring requires proper monitoring and extensive integration testing. Healthcare.gov was obviously launched without proper monitoring and testing and that’s a major reason why many user interactions are failing, throwing errors and insurance companies are getting incomplete data forms.
To be clear, this isn’t an unusual problem. We see this time and time again with many companies who eventually come to us looking for help solving their application performance issues.
To accelerate this process of finding and fixing these problems at least 5X faster, a product like AppDynamics is required. We identify the source of errors within and across services much faster in test or production environments so that Developers can fix them right away. The screenshots below show an example of an application making a failed service call to a third party and just how easy it is to identify with AppDynamics. In this case there was a timeout socket exception waiting for PayPal (i.e. the PayPal service never answered our web services call at all).
2) Scalability bottlenecks
The second big problem is that there are performance bottlenecks in the software that need to be identified and tuned quickly. Here here is some additional information that almost certainly apply to healthcare.gov …
a) Website workloads can be unpredictable, especially so with a brand new site like healtcare.gov. Bottlenecks will occur in some services as they get overwhelmed by requests or start to have resource conflicts. When you are dealing with many interconnected services you must use monitoring that tracks all of your transactions by tagging them with a unique identifier and following them across any application components or services that are touched. In this way you will be able to immediately see when, where, and why a request has slowed down.
b) Each web service is really a mini application of it’s own. There will be bottlenecks in the code running on every server and service. To find and remediate these code issues quickly you must have a monitoring tool that automatically finds the problematic code and shows it to you in an intuitive manner. We put together a list of the most common bottlenecks in an earlier blog post that you can access by clicking here.
Finding and fixing bottlenecks and errors in custom applications is why AppDynamics exists. Our customers typically resolve their application problems in minutes instead of taking days or weeks trying to use log files and infrastructure monitoring tools.
AppDynamics has offered to help fix the healthcare.gov website for free because we think the American people deserve a system that functions properly and doesn’t waste their time. Modern, service oriented application architectures require monitoring tools designed specifically to operate in those complex environments. AppDynamics is ready to answer the call to action.Link to this post:
Today AppDynamics announced integration with PagerDuty, a SaaS-based provider of IT alerting and incident management software that is changing the way IT teams are notified, and how they manage incidents in their mission-critical applications. By combining AppDynamics’ granular visibility of applications with PagerDuty’s reliable alerting capabilities, customers can make sure the right people are proactively notified when business impact occurs, so IT teams can get their apps back up and running as quickly as possible.
You’ll need a PagerDuty and AppDynamics license to get started – if you don’t already have one, you can sign up for free trials of PagerDuty and AppDynamics online. Once you complete this simple installation, you’ll start receiving incidents in PagerDuty created by AppDynamics out-of-the-box policies.
Once an incident is filed it will have the following list view:
When the ‘Details’ link is clicked, you’ll see the details for this particular incident including the Incident Log:
If you are interested in learning more about the event itself, simply click ‘View message’ and all of the AppDynamics event details are displayed showing which policy was breached, violation value, severity, etc. :
Let’s walk through some examples of how our customers are using this integration today.
Say Goodbye to Irrelevant Notifications
Is your work email address included in some sort of group email alias at work and you get several, maybe even dozens, of notifications a day that aren’t particularly relevant to your responsibilities or are intended for other people on your team? I know I do. Imagine a world where your team only receives messages when the notifications have to do with their individual role and only get sent to people that are actually on call. With AppDynamics & PagerDuty you can now build in alerting logic that routes specific alerts to specific teams and only sends messages to the people that are actually on-call. App response time way above the normal value? Send an alert to the app support engineer that is on call, not all of his colleagues. Not having to sift through a bunch of irrelevant alerts means that when one does come through you can be sure it requires YOUR attention right away.
If you are only sending a notification and assigning an incident to one person, what happens if that person is out of the office or doesn’t have access to the internet / phone to respond to the alert? Well, the good thing about the power of PagerDuty is that you can build in automatic escalations. So, if you have a trigger in AppDynamics to fire off a PagerDuty alert when a node is down, and the infrastructure manager isn’t available, you can automatically escalate and re-assign / alert a backup employee or admin.
The Sky is Falling! Oh Wait – We’re Just Conducting Maintenance…
Another potentially annoying situation for IT teams are all of the alerts that get fired off during a maintenance window. PagerDuty has the concept of a maintenance window so your team doesn’t get a bunch of doomsday messages during maintenance. You can even setup a maintenance window with one click if you prefer to go that route.
Either way, no new incidents will be created during this time period… meaning your team will be spared having to open, read, and file the alerts and update / close out the newly-created incidents in the system.
We’re confident this integration of the leading application performance management solution with the leading IT incident management solution will save your team time and make them more productive. Check out the AppDynamics and PagerDuty integration today!