
Monitoring PCF at Scale: Drinking from the Loggregator Firehose



Summary
As a PCF admin, are you suffering from information overload? Here's how AppDynamics makes it easier to find performance issues in Pivotal environments.

AppDynamics monitoring for Pivotal Cloud Foundry (PCF) provides a simplified view into PCF infrastructure, enabling operators to proactively address issues that impact the performance of the platform and the apps running on it.

The Risk of Information Overload

Knowing what to monitor in a PCF deployment can be a challenge. PCF relies on a number of layers and components to run containers at scale, as indicated in this diagram of a single PCF foundation:

Source: Pivotal 

At the lowest level, BOSH-provisioned VMs run core Cloud Foundry (CF) components, such as the cloud controller and router, each of which performs an essential Pivotal Application Service (PAS) infrastructure function or job. The Diego Cell VMs host app containers, or app instances, where PCF apps run based on the supported buildpacks. A PCF deployment provides KPI metrics for each VM, component and container via the standard CF Loggregator and agent subsystem, as well as the BOSH metrics forwarder component that deploys a BOSH agent on each VM.

The end result? A large volume of metrics. Add the fact that a PCF deployment typically includes multiple foundations to ensure isolation across all layers of the virtual infrastructure, deployment environments and geographies, and an operator is facing information overload.

Source: Pivotal 

Leveraging Pivotal’s Monitoring Best Practices

Fortunately, Pivotal publishes monitoring guidelines based on best practices they’ve observed working with customers and by running Pivotal Web Services (PWS). These guidelines, or runbooks, greatly reduce the operator burden by filtering the full set of metrics available in a foundation, all the way down to a core set of performance and scaling indicators. Each indicator defines:

      • The metric or metric formula and why it should be monitored
      • The threshold that should generate an alert
      • The action to take if a threshold is breached

     

As an example, consider the indicator Crashed App Instances, which references the bbs.CrashedActualLRPs metric associated with the Diego BBS component.

The Description (above) explains why an increase in the specified metric could signify either a platform or application issue. The Recommended response explains the response or action that should be taken to resolve the issue. This is an example of a KPI associated with a particular PCF component.

There are also capacity scaling indicators, such as Diego Cell Memory Capacity, which focus on proactively determining when resources need to be scaled. BOSH system metrics, in turn, are indicators that track the health and resource usage of the BOSH-provisioned VMs. As an operator, it makes sense to plan your monitoring strategy around these indicators, with the assumption that you can tune them as your PCF deployment changes.

The AppDynamics Approach

In addition to providing core platform metrics via a standard CF Loggregator Firehose nozzle, AppDynamics delivers a set of out-of-the-box (OOTB) alerts/health rules and dashboards that implement the best practices provided by Pivotal, with support for monitoring multiple foundations. For alerting, we deploy more than 100 OOTB alerts or health rules with thresholds that implement the performance and scaling indicators defined by Pivotal. Operators can modify these alerts based on the behavior of their foundations, and attach them to actions to open tickets and automate resolution.

For example, the OOTB health rule corresponding to the Crashed App Instances indicator is shown here: 

Since AppDynamics baselines every metric, this health rule’s alert threshold can track when the metric CrashedActualLRPs deviates from a baseline, which represents the foundation’s expected rate of crashed app instances.
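
The idea of alerting on deviation from a baseline, rather than on a fixed number, can be sketched in a few lines of Python. This is a simplified illustration that assumes the baseline is a plain mean and standard deviation over recent samples; it is not the AppDynamics baselining algorithm, and the function name, sample values and the three-standard-deviation cutoff are all assumptions made for the example.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, num_stdevs=3.0):
    """Return True if `current` is more than `num_stdevs` sample standard
    deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + num_stdevs * spread


# Hypothetical recent samples of a crashed-instance count.
history = [2, 3, 2, 4, 3, 2, 3, 3]

print(deviates_from_baseline(history, 4))
print(deviates_from_baseline(history, 12))
```

A foundation that normally sees a handful of crashes will not alert on a value of 4, but a jump to 12 stands out against its own history, which is the behavior a static threshold cannot provide without per-foundation tuning.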

Other health rules, such as Diego Cell Memory Capacity, implement static thresholds and are based on a formula or metric expression that leverages multiple metrics.
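
As a rough illustration of a static-threshold rule built on a metric expression, the Python sketch below assumes the expression is the ratio of remaining to total Diego Cell memory (in a real foundation these values would likely come from metrics such as rep.CapacityRemainingMemory and rep.CapacityTotalMemory, which is an assumption here). The function names, the sample cell sizes and the 35%/30% thresholds are placeholders for illustration, not values taken from the OOTB health rule.

```python
def remaining_memory_pct(cells):
    """Percentage of Diego Cell memory still available across all cells.

    `cells` is a list of (remaining_mb, total_mb) pairs, one per cell.
    """
    total = sum(t for _, t in cells)
    if total == 0:
        raise ValueError("no cell capacity reported")
    remaining = sum(r for r, _ in cells)
    return 100.0 * remaining / total


def memory_capacity_status(pct, warning=35.0, critical=30.0):
    """Static thresholds (placeholders): lower remaining memory is worse."""
    if pct <= critical:
        return "critical"
    if pct <= warning:
        return "warning"
    return "ok"


# Hypothetical readings from three cells, in MB.
cells = [(8192, 32768), (12288, 32768), (10240, 32768)]
pct = remaining_memory_pct(cells)
print(pct, memory_capacity_status(pct))
```

Unlike the baseline approach, a capacity rule like this uses fixed cutoffs because the question is how much headroom remains, not whether behavior has changed.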

 

When a threshold is breached, a health rule violation is created and can trigger an action such as opening a ticket, based on the comprehensive alerting capabilities offered by the AppDynamics Controller. 

For dashboards, AppDynamics provides the OOTB Single Foundation Dashboard, which summarizes the status of these health rules. It gives you a single view of your foundation that covers what matters most and can be customized to suit your needs.

We also provide OOTB dashboards for monitoring multiple foundations; these dashboards provide an aggregated view across multiple foundations with drill-down capability, as well as the ability to filter the set of foundations shown on the dashboard.
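
Conceptually, an aggregated multi-foundation view rolls each foundation's health rule statuses up to a single worst-case status, with an optional filter on which foundations to show. The Python sketch below illustrates that roll-up only; the foundation names, function names and data layout are hypothetical and do not reflect how the AppDynamics dashboards are implemented.

```python
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def worst_status(statuses):
    """Return the most severe status, or "ok" if there are none."""
    return max(statuses, key=SEVERITY.__getitem__, default="ok")


def rollup(foundations, only=None):
    """Map each foundation name to its worst health rule status.

    `foundations` maps name -> {health rule name -> status}.
    `only` optionally restricts the result to a set of foundation names.
    """
    return {
        name: worst_status(rules.values())
        for name, rules in foundations.items()
        if only is None or name in only
    }


# Hypothetical foundations and health rule states.
foundations = {
    "prod-us": {"Crashed App Instances": "ok",
                "Diego Cell Memory Capacity": "warning"},
    "prod-eu": {"Crashed App Instances": "critical",
                "Diego Cell Memory Capacity": "ok"},
    "staging": {},
}

print(rollup(foundations))
print(rollup(foundations, only={"prod-eu"}))
```

Drilling down then amounts to looking up the per-rule statuses for whichever foundation shows a warning or critical roll-up.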


 

Other Monitoring Approaches

A few competing PCF monitoring solutions provide visibility into core platform metrics. But compared to AppDynamics, these offerings lack the means to monitor health and capacity through the lens of Pivotal’s best practices.

Prometheus and Grafana, for example, are great open source tools with a nozzle and OOTB dashboards that focus on different aspects of the PCF platform, such as capacity or VM health. But while they provide basic alerting, these tools lack a full implementation of the performance and scaling indicators defined by Pivotal. To fill this gap, you might be tempted to integrate alerts from Pivotal’s Healthwatch, which provides a REST API to query alert status, and build custom dashboards to reflect current status. However, if you’re monitoring multiple foundations, it’s a major burden to build integrated views, tune alerting to avoid information overload, and respond proactively to issues. Furthermore, when adopting these open source solutions, you’ll need to manage the additional tool infrastructure and silos they introduce.

Integrated APM and Platform Monitoring for PCF

AppDynamics provides an integrated PCF monitoring solution that reduces the burden of proactively monitoring PCF infrastructure and its impact on app performance (and vice versa). It eliminates the need to stand up additional tool infrastructure, or complicate your runbooks by jumping from one tool to another to resolve app and platform performance issues. Our solution gives app teams a view into APM, container and infrastructure performance—all to help answer the question of whether the app or platform is the root cause of performance issues.

Try AppDynamics Platform Monitoring for PCF today!