Infrastructure failures can expose gaps in system monitoring and inspire a chaos engineering approach: essentially tossing the proverbial monkey wrench into a system to proactively find and fix problems before they lead to outages. If you’re running a Pivotal Cloud Foundry (PCF) environment on AWS, for example, you may be notified of failures via an email like this:
“EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance…Due to this degradation your instance could already be unreachable.”
The resulting loss of a PCF component could have a devastating impact on your PCF infrastructure and its ability to support applications. A proactive monitoring solution that can respond to unexpected chaos is the key to limiting the damage.
A Simple Chaos Engineering Experiment
Inspired by these not-so-infrequent emails from AWS, I designed a simple chaos engineering experiment to test the resiliency of AppDynamics’ monitoring on the PCF platform, and the apps that run on PCF. This monitoring is delivered in two ways:
- The platform monitoring is available via the AppDynamics Platform Monitoring for PCF tile, which packages the Nozzle, or metrics client, and the Dashboard app that generates out-of-the-box (OOTB) health rules and dashboards that implement the Pivotal monitoring guidance.
- The app monitoring is available via Cloud Foundry (CF) buildpack integration that instruments PCF apps running in containers with the AppDynamics APM agent.
Image source: Pivotal
To make my experiment more interesting, I decided to shut down a Diego Cell, which in a standard CF runtime is responsible for hosting app containers. The Diego Cells are the VMs (or EC2 instances in the case of AWS) in the following diagram, where the App Instances, or containers, are running.
My test foundation (a PCF term for a single PCF deployment) running on AWS only contains two Diego Cells, so shutting one down should produce some fireworks! Before we perform the shutdown, let’s review the current state reflected by the Single Foundation Dashboard provided OOTB by AppDynamics’ platform monitoring.
This dashboard shows the capacity, performance and VM health indicators based on Pivotal’s recommended metrics and thresholds. All indicators are operating within normal thresholds.
Running the Experiment
While I don’t have a Chaos Monkey at my disposal, I can go to the AWS console and shut down a VM to start the experiment.
My first experiment produced some completely unexpected results. As I was waiting for some indication in the Single Foundation Dashboard that the Diego Cell was down, I got an alert instead regarding Nozzle availability.
This alert indicated that the AppDynamics Nozzle, the metric client that pulls platform and container metrics from the Loggregator Firehose metric subsystem and publishes them to the AppDynamics Controller, was down. After a little investigation, I realized that I had deployed only a single instance of the Nozzle in my PCF foundation. The Nozzle is a Go application that, like any other PCF app, runs in a container. When the Diego Cell hosting the Nozzle was shut down, the remaining Diego Cell lacked the capacity to run it, so Diego could not reschedule the Nozzle (or a number of other PCF apps), and it was stuck in a crashed state.
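To make the failure mode concrete, here is a minimal sketch of why a displaced app can end up unplaced when the surviving cell is nearly full. The first-fit placement, the cell sizes, and the app name appd-nozzle are all assumptions made for illustration; Diego's real scheduler uses an auction and is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    memory_mb: int      # total memory the cell can hand out to containers
    used_mb: int = 0    # memory already allocated on this cell

    @property
    def free_mb(self) -> int:
        return self.memory_mb - self.used_mb


def reschedule(instances, surviving_cells):
    """Place each (app, memory_mb) pair on the first cell with enough room.

    Returns (placed, unplaced). First-fit is a simplification, not Diego's
    actual placement algorithm.
    """
    placed, unplaced = [], []
    for app, memory_mb in instances:
        target = next((c for c in surviving_cells if c.free_mb >= memory_mb), None)
        if target is None:
            unplaced.append(app)
        else:
            target.used_mb += memory_mb
            placed.append((app, target.name))
    return placed, unplaced


# Hypothetical numbers: the surviving cell has only 1 GB free.
survivor = Cell("diego_cell/1", memory_mb=16384, used_mb=15360)
displaced = [("orderservice", 1024), ("appd-nozzle", 512)]
placed, unplaced = reschedule(displaced, [survivor])
print(placed)    # apps that found room on the surviving cell
print(unplaced)  # apps left without a home, as the Nozzle was
```

In this toy run the first app consumes the last free gigabyte, so the second one has nowhere to go, which mirrors what happened to the single Nozzle instance.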
This was an important lesson: the Nozzle can be a single point of failure and should be scaled not only to handle large metric volumes but also to ensure availability. The good news is that the OOTB Nozzle Availability alert did recognize this situation. For the sake of this experiment, I intentionally shut down some apps to increase capacity, and scaled the Nozzle to test how the other components in my monitoring experiment would respond.
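Scaling an app to multiple instances is normally done with the cf CLI's scale command. The sketch below only builds that command line; the app name appd-nozzle is hypothetical, and depending on how the tile was installed, the instance count may need to be set in the tile configuration instead. Running more than one instance improves the odds of surviving a cell loss, but it does not by itself guarantee the instances land on different cells.

```python
import shlex


def cf_scale_command(app_name, instances):
    """Build the argv for `cf scale APP_NAME -i INSTANCES`."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return ["cf", "scale", app_name, "-i", str(instances)]


# "appd-nozzle" is a placeholder; use the name shown by `cf apps`.
cmd = cf_scale_command("appd-nozzle", 2)
print(shlex.join(cmd))

# To actually run it against a targeted foundation (requires the cf CLI):
#   import subprocess
#   subprocess.run(cmd, check=True)
```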
Re-running the Experiment
I made the necessary adjustments, and then restarted the Diego Cell to reach a steady state and shut it down again. Within a few minutes I could see the expected alert reflected in the dashboard widget.
Double-clicking the widget shows the BOSH VM health rules associated with the Diego Cell component, including the Diego Cell VM Health rule that is being violated.
If we view the summary of the violation conditions, we see it is the result of the “no data” condition that occurs when the Diego Cell is terminated, along with the BOSH agent that reports this metric.
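The “no data” condition is worth spelling out, because a terminated VM does not report an unhealthy value; it simply stops reporting. A rough sketch of that evaluation logic follows. The five-minute window, the status names, and the convention that a value of 1 means healthy are assumptions for illustration, not the actual AppDynamics health rule definition.

```python
def vm_health_status(samples, now, window_seconds=300):
    """Classify a VM from (timestamp, value) health samples.

    Assumes value 1 means healthy. If nothing arrived inside the
    evaluation window, the result is NO_DATA rather than HEALTHY.
    """
    recent = [value for ts, value in samples if now - ts <= window_seconds]
    if not recent:
        return "NO_DATA"
    return "HEALTHY" if all(value == 1 for value in recent) else "UNHEALTHY"


now = 1000
reporting_cell = [(800, 1), (900, 1)]
terminated_cell = [(100, 1), (200, 1), (300, 1)]  # last sample is too old

print(vm_health_status(reporting_cell, now))
print(vm_health_status(terminated_cell, now))
```

Treating missing data as its own violation state is what allows the rule to fire even though the BOSH agent that would have reported the problem went down with the VM.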
As with any health rule violation event in AppDynamics, we could attach the event to an action, such as opening a ticket, to address the problem before it impacts applications. However, in this case, we’ve lost 50% of the foundation’s capacity to host containers, so we’re not likely to escape unscathed. The dashboard updates quickly to reflect the used memory capacity at 92%.
And the associated health rule, based on a rolling 30-minute average, alerts as soon as the average drops below the recommended 35% threshold for remaining capacity.
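Because the rule evaluates a rolling average rather than the instantaneous value, the alert lags the drop in capacity. The sketch below illustrates that lag with one sample per minute. The 30-minute window and 35% threshold come from the text above; the 50% starting value is a hypothetical baseline, and 8% remaining corresponds to the 92% used memory shown on the dashboard.

```python
from collections import deque


def first_violation_minute(remaining_pct_by_minute, window=30, threshold=35.0):
    """Return the first minute at which the rolling average of remaining
    capacity falls below the threshold, or None if it never does."""
    recent = deque(maxlen=window)
    for minute, pct in enumerate(remaining_pct_by_minute):
        recent.append(pct)
        if sum(recent) / len(recent) < threshold:
            return minute
    return None


# Hypothetical series: 50% remaining for 30 minutes, then 8% after the shutdown.
series = [50.0] * 30 + [8.0] * 30
violation = first_violation_minute(series)
print(violation)  # minute index of the first violation
```

With these numbers the cell is lost at minute 30 and the rolling average crosses the threshold at minute 40, once 11 low samples are in the window. A lower starting capacity or a shorter window would trigger sooner.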
Shortly thereafter, we also see alerts for a spike in crashed app instances associated with the Diego BBS, which tracks the expected vs. actual number of running container instances.
The details regarding the violation reflect that this health rule is using standard deviations from a baseline as the threshold. But given the test nature of the foundation, there hasn’t been a significant amount of variation (none, in fact, since the standard deviation is 0), so any increase in crashed instances over the baseline exceeds the threshold.
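A simplified sketch of a deviation-from-baseline check shows why a flat history makes the rule so sensitive. The function name, the multiplier of 2 standard deviations, and the sample histories are assumptions; AppDynamics computes its baselines in its own way, and this only illustrates the arithmetic.

```python
from statistics import mean, pstdev


def violates_baseline(history, current, num_std_devs=2.0):
    """True if current exceeds the baseline mean by more than
    num_std_devs population standard deviations."""
    baseline = mean(history)
    spread = pstdev(history)
    return current > baseline + num_std_devs * spread


flat_history = [0, 0, 0, 0, 0, 0]    # no crashes ever recorded: std dev is 0
noisy_history = [0, 2, 1, 3, 0, 2]   # a foundation with routine crash noise

print(violates_baseline(flat_history, 1))   # a single crash on a flat baseline
print(violates_baseline(noisy_history, 3))  # three crashes on a noisy baseline
print(violates_baseline(noisy_history, 4))  # four crashes on a noisy baseline
```

On the flat history the threshold collapses onto the baseline itself, so a single crashed instance is a violation. On a busier foundation with natural variation, the same rule would tolerate a few crashes before alerting.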
Are the PCF Apps Experiencing Chaos?
We saw that the health rule for crashed instances has been violated. And we know that Diego will reschedule the crashed instances on the remaining Diego Cell (assuming there’s capacity!). This raises the question: Did the crashed and rescheduled application instances impact application performance? We can turn our attention to the orderservice and accountservice PCF apps, which were previously deployed and instrumented with the AppDynamics APM agent using the integrated CF buildpack support. The AppDynamics Application Dashboard shows the Application Flow Map during the period when the Diego Cell was shut down.
The response time did indeed spike at the point when the containers were being rescheduled, which interfered with the apps’ ability to maintain load.
This spike in response time was enough to trigger an OOTB business transaction health rule:
This health rule leverages the same deviation-from-baseline approach used in the Diego BBS Crashed App Instances alert. The configuration for the threshold is shown below.
In the custom dashboard (below) that integrates APM metrics from the AppDynamics APM agent and container metrics from the AppDynamics Nozzle and Loggregator Firehose, we see the same impact on response time and load/calls per minute. However, the container metrics don’t show a clear impact.
Monitoring Chaos from 10,000 Feet
So what did we learn? Our experiment produced useful and actionable events that revealed the infrastructure and application impact of suddenly reducing a PCF foundation’s capacity by 50%. We also learned that ensuring the availability of our platform-monitoring Nozzle client is important, too. A PCF operator has the additional challenge of monitoring this sort of chaos across several foundations, as is common in an enterprise PCF deployment. For this, AppDynamics offers a multi-foundation Dashboard. The example below shows two foundations, the second of which is the foundation that suffered the sudden loss in capacity from our experiment.