Scaling our End User Monitoring Cloud

Why End User Monitoring?

In a previous post, my colleague Tom Levey explained the value of Monitoring the Real End User Experience. In this post, we will dive into how we built a service to scale to billions of users.

The “new normal” for enterprise web applications includes multiple application tiers communicating via a service-oriented architecture that interacts with several databases and third-party web services. The modern application has multiple clients, from browser-based desktops to native applications on mobile. At AppDynamics, we believe that application performance monitoring should cover all aspects of your application, from the client side to the server side, all the way back to the database. The goal of end user monitoring is to provide insight into client-side performance and capture errors from modern JavaScript-intensive applications. The challenge of building an end user monitoring service is that every single request needs to be instrumented. This means that for every request your application processes, we will process a beacon. With clients like FamilySearch, Fox News, BackCountry, ManPower, and Wowcher, we have to handle millions of concurrent requests.

AppDynamics End User Monitoring enables application owners to:

  • Monitor their global audience and track end user experience across the world to pinpoint which geo-locations may be impacted by poor application performance
  • Capture end-to-end performance metrics for all business transactions, including page rendering time in the browser, network time, and processing time in the application infrastructure
  • Identify bottlenecks anywhere in the end-to-end business transaction flow to help Operations and Development teams triage problems and troubleshoot quickly
  • Compare performance across all browser types and platforms, such as Internet Explorer, Firefox, Google Chrome, Safari, iOS and Android
  • Track JavaScript errors

“Fox News already depends upon AppDynamics for ease-of-use and rapid troubleshooting capability in our production environment,” said Ryan Jairam, Internet Operations Lead at Fox News. “What we’ve seen with AppDynamics’ End-User Monitoring release is an even greater ability to understand application performance, from what’s happening on the browser level to the network all the way down to the code in the application. Getting this level of insight and visibility for an application as complex and agile as ours has been a tremendous benefit, and we’re extremely happy with this powerful new addition to the AppDynamics Pro solution.”

EUM Cloud Service

The End User Monitoring cloud is our super-scalable platform for analyzing data and processing end user requests. In this post we will discuss the underlying architecture and some of the design challenges of building a cloud service capable of supporting billions of requests. Once End User Experience monitoring is enabled in the controller, your application’s requests are automatically instrumented with a very small piece of JavaScript that allows AppDynamics to capture critical performance metrics.

The JavaScript agent leverages the Web Episodes JavaScript timing library and the W3C Navigation Timing specification to capture the end user experience metrics. Once the metrics are collected, they are pushed to the End User Monitoring cloud via a beacon for processing.
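To make the timing data concrete, here is a minimal Python sketch of how page-level durations could be derived from the raw W3C Navigation Timing timestamps carried in a beacon. The attribute names (`navigationStart`, `responseStart`, `domContentLoadedEventEnd`, `loadEventEnd`) come from the specification; the function name, the metric names, and the dict-shaped beacon are assumptions for illustration, not the agent's actual wire format.

```python
def derive_page_metrics(timing):
    """Turn raw Navigation Timing timestamps (ms since epoch) into durations."""
    start = timing["navigationStart"]
    return {
        # Time until the first byte of the response arrived
        "first_byte_ms": timing["responseStart"] - start,
        # Time until DOMContentLoaded handlers finished
        "dom_ready_ms": timing["domContentLoadedEventEnd"] - start,
        # Time until the load event finished
        "page_load_ms": timing["loadEventEnd"] - start,
    }


# Hypothetical sample values, not real measurements
sample = {
    "navigationStart": 1374770000000,
    "responseStart": 1374770000180,
    "domContentLoadedEventEnd": 1374770000950,
    "loadEventEnd": 1374770001400,
}
metrics = derive_page_metrics(sample)
```

Every duration is measured from `navigationStart`, so the values are comparable across pages regardless of when the visit happened.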

The EUM (End User Monitoring) Cloud Service is our on-demand, cloud-based, multi-tenant SaaS infrastructure that acts as an aggregator for all EUM metrics traffic. All the EUM metrics from the end user browsers of different customers are reported to the EUM Cloud Service. The raw information received from the browser is verified, aggregated, and rolled up at the EUM Cloud Service. All the AppDynamics Controllers (SaaS or on-premise) connect to the EUM Cloud Service to download metrics every minute, for each application.
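The roll-up step can be pictured with a short Python sketch. It groups beacons into one-minute buckets per customer and application, which matches the once-a-minute download described above. The field names (`customer`, `app`, `timestamp_s`, `page_load_ms`) and the choice of count, average, and maximum as the rolled-up values are assumptions for illustration; the post does not specify the actual schema.

```python
from collections import defaultdict


def rollup(beacons):
    """Aggregate beacons into buckets keyed by (customer, app, minute)."""
    buckets = defaultdict(lambda: {"count": 0, "total_ms": 0, "max_ms": 0})
    for b in beacons:
        minute = b["timestamp_s"] // 60  # truncate the timestamp to its minute
        bucket = buckets[(b["customer"], b["app"], minute)]
        bucket["count"] += 1
        bucket["total_ms"] += b["page_load_ms"]
        bucket["max_ms"] = max(bucket["max_ms"], b["page_load_ms"])
    # Replace the running total with an average in the final result
    return {
        key: {
            "count": v["count"],
            "avg_ms": v["total_ms"] / v["count"],
            "max_ms": v["max_ms"],
        }
        for key, v in buckets.items()
    }


# Hypothetical beacons: two land in minute 2, one in minute 3
beacons = [
    {"customer": "acme", "app": "store", "timestamp_s": 120, "page_load_ms": 1000},
    {"customer": "acme", "app": "store", "timestamp_s": 150, "page_load_ms": 3000},
    {"customer": "acme", "app": "store", "timestamp_s": 185, "page_load_ms": 500},
]
rolled = rollup(beacons)
```

Keeping only a count, a total, and a maximum per bucket means the aggregate stays the same size no matter how many beacons arrive in that minute.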

Design Challenges

On-Demand highly available

End users access customer web applications from anywhere in the world, at any time of day, across different time zones. Whenever an AppDynamics-instrumented web page is accessed, the browser reports EUM metrics to the EUM Cloud Service. This requires a highly available, on-demand system that can be reached from many geographic locations and time zones.

Extremely Concurrent usage

All end users of all AppDynamics customers using the EUM solution continuously report browser information to the same EUM Cloud Service. The EUM Cloud Service processes all the reported browser information concurrently, generating metrics and collecting snapshot samples continuously.

High Scalability

The usage pattern differs from application to application throughout the day, so the number of records to be processed at the EUM Cloud varies by application and by time. The EUM Cloud Service automatically scales up to handle any surge in incoming records and scales back down when load drops.
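As a rough illustration of scaling with load, the Python sketch below computes how many instances would be needed for a given incoming record rate. The post does not describe the actual scaling policy, so the function, the per-instance capacity, and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_instances(records_per_sec, capacity_per_instance,
                      min_instances=2, max_instances=100):
    """Instances needed for the current load, clamped to a safe range."""
    needed = math.ceil(records_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical capacity of 1,000 records per second per instance
quiet = desired_instances(500, 1000)
busy = desired_instances(25_000, 1000)
capped = desired_instances(1_000_000, 1000)
```

The lower bound keeps some redundancy during quiet periods, and the upper bound guards against a runaway scale-up.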

Multi Tenancy support

The EUM Cloud Service processes EUM metrics reported from different applications for different customers, so it must be multi-tenant. The reported browser information is partitioned by customer and by application. The EUM Cloud Service provides a mechanism for each customer's controllers to download aggregated metrics and snapshots based on customer and application identification.
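The partitioning idea can be sketched in a few lines of Python. This is an in-memory stand-in, not the real storage layer: the class name `TenantStore`, its methods, and the tenant names are hypothetical, and the only point is that a download addressed by customer and application can never see another tenant's records.

```python
from collections import defaultdict


class TenantStore:
    """In-memory stand-in for storage partitioned by (customer, application)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, customer, app, record):
        """Append a record to the partition for this customer and application."""
        self._partitions[(customer, app)].append(record)

    def download(self, customer, app):
        """Return and clear the records of exactly one tenant partition."""
        return self._partitions.pop((customer, app), [])


store = TenantStore()
store.put("acme", "store", {"avg_ms": 1200})
store.put("globex", "portal", {"avg_ms": 800})
acme_records = store.download("acme", "store")
```

Clearing a partition on download is one possible design for a controller that polls every minute; a real service would also need authentication and retry handling, which this sketch leaves out.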

Cost

The EUM Cloud Service needs to scale dynamically with demand. The problem with supporting massive scale on fixed infrastructure is that we would have to pay for hardware upfront and over-provision to handle huge spikes. One of the motivating factors in choosing Amazon Web Services is that costs scale linearly with demand.

Architecture

The EUM Cloud Service is hosted on Amazon Web Services infrastructure for horizontal scaling. The service has two functional components: the collector and the aggregator. Multiple instances of these components work in parallel to collect and aggregate the EUM metrics received from the end user's browser or device. The transient metric data is stored in Amazon S3 buckets. All metadata related to applications and other configuration is stored in Amazon DynamoDB tables.

A single page load sends one or more beacons: one for the base page, one for every iframe onload, and one per Ajax request. JavaScript errors that occur after page load are also sent as error beacons.
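The beacon volume per page follows directly from that rule. The small Python sketch below counts beacons for one page load; the function name is hypothetical, and it assumes each post-load JavaScript error produces exactly one error beacon, which the post does not state explicitly.

```python
def beacons_for_page_load(iframes, ajax_requests, js_errors=0):
    """Beacons for one page load: base page + iframes + Ajax calls + errors."""
    return 1 + iframes + ajax_requests + js_errors


# A page with 2 iframes and 5 Ajax calls, without and with one error
clean_load = beacons_for_page_load(iframes=2, ajax_requests=5)
load_with_error = beacons_for_page_load(iframes=2, ajax_requests=5, js_errors=1)
```

For Ajax-heavy pages the beacon count can be several times the page view count, which is why the collector tier has to be sized for beacons rather than for page views.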

The nodes receive the metric data from the browser and process it for the controller:

  • Resolve the geo information (the country, region, and city the request came from) and add it to the metric using an in-process MaxMind geo-resolver.
  • Parse the User-Agent information and add browser, device, and OS information to the metrics.
  • Validate the incoming browser-reported metrics and discard invalid ones.
  • Mark the metrics and snapshots as SLOW or VERY SLOW, based on either a dynamic standard deviation algorithm or a static threshold.
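The validation and classification steps above can be sketched in Python as follows. This is not the production algorithm: the validity bounds, the 3 and 4 standard deviation cut-offs, and the function names are assumptions chosen only to show what a standard-deviation-based SLOW and VERY SLOW rule might look like.

```python
import statistics


def is_valid(beacon):
    """Reject beacons whose page load time is missing or out of range."""
    load = beacon.get("page_load_ms")
    # Assumed bounds: positive and under 10 minutes
    return isinstance(load, (int, float)) and 0 < load < 600_000


def classify(load_ms, history, slow_sigmas=3, very_slow_sigmas=4):
    """Label a timing relative to the mean and standard deviation of history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)  # population standard deviation
    if load_ms > mean + very_slow_sigmas * stdev:
        return "VERY_SLOW"
    if load_ms > mean + slow_sigmas * stdev:
        return "SLOW"
    return "NORMAL"


# Hypothetical recent page load times in milliseconds (mean 1000, stdev ~63)
history = [900, 1000, 1100, 1000, 1000]
labels = [classify(ms, history) for ms in (1050, 1200, 1300)]
```

Because the thresholds move with the observed mean and deviation, the same rule adapts to both fast and slow applications; a static threshold would replace the two comparisons with fixed millisecond limits.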

Load Testing

For maximum scalability, we leverage Amazon Web Services' global presence for optimal performance in every region (Virginia, Oregon, Ireland, Tokyo, Singapore, Sao Paulo). In our most recent load test, we tested the system as a whole at about 6.5 billion requests per day. The system is designed to scale out as needed to support growing load. We’ve tested the system running at many billions of requests per day without breaking a sweat.
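To put the daily figure in perspective, this short Python calculation converts it to an average per-second rate. The helper name is hypothetical, and the result is only an average: real traffic is uneven, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(requests_per_day):
    """Average requests per second implied by a daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


rate = average_rate(6.5e9)  # roughly 75,000 requests per second
```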

Check out your end user experience data in AppDynamics

Find out more about AppDynamics Pro and get started monitoring your application with a free 15-day trial.

As always, please feel free to comment if you think I have missed something or if you have a request for content in an upcoming post.

Copyright © 2014 AppDynamics. All rights Reserved.