Scaling Kubernetes with Observability and Confidence

November 25, 2020

Kubernetes makes it easy to deploy your microservice applications. Here are some of the ways AppDynamics helps ensure the health of those apps at scale.


Kubernetes is evolving at breakneck speed, and that was more evident than ever for anyone who attended KubeCon + CloudNativeCon 2020. The event was packed full of excellent content and innovation stories. I was especially impressed by the focus on simplifying the developer experience, including Project Tye from Microsoft and Knative for running serverless containers on K8s.

In this blog, I’ll share a bit about what I learned — and also some of the ways AppDynamics helps with simplifying Kubernetes observability at scale. In particular, we’ll focus on a few of our new and existing features that allow you to proactively monitor large deployments and get the deepest level of observability for your apps and clusters in the shortest time possible.

Like our previous blog on Kubernetes monitoring best practices, this post assumes an advanced level of Kubernetes knowledge.

Challenge #1: Rolling Out Observability at Scale

Kubernetes makes it easy to deploy your microservice applications, and application performance monitoring (APM) tools like AppDynamics have proven essential for ensuring the health of those apps through OOTB tracing, code-level diagnostics, and baselining.

But unlocking the power of APM across microservice deployments that span multiple Kubernetes deployments, namespaces, and clusters becomes increasingly difficult at scale. And requiring that application teams change their code and/or images to include APM capabilities can be a showstopper. Kubernetes init containers help, as they can be used to install APM agents at deploy time, removing the need to manually alter each application image. But requiring changes to each app’s Kubernetes deployment spec can still be a huge challenge in larger organizations.

The AppDynamics Cluster Agent solves this problem through “auto-instrumentation,” which leverages the Kubernetes APIs to dynamically and automatically add an APM agent to a Kubernetes app (using init containers) without requiring any change to the application image.

This greatly simplifies day 1 and day 2 use cases — that is, rolling out APM instrumentation and managing upgrades as new APM agents become available. Consider the Cluster Agent configuration required to enable auto-instrumentation for a Java application:
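
What follows is a minimal sketch of what that configuration might look like. The apiVersion, field layout, and image tag here are illustrative and may differ across Cluster Agent versions, so check the AppDynamics documentation for the exact schema:

```yaml
apiVersion: cluster.appdynamics.com/v1alpha1
kind: Clusteragent
metadata:
  name: k8s-cluster-agent
  namespace: appdynamics
spec:
  appName: "ecommerce"                                    # illustrative application name
  controllerUrl: "https://example.saas.appdynamics.com"   # illustrative Controller URL
  account: "customer1"                                    # illustrative account name
  # Install the APM agent via an init container, with no application image changes
  instrumentationMethod: Env
  # Only namespaces matching this regex are targeted for auto-instrumentation
  nsToInstrumentRegex: dev|stage
  imageInfo:
    java:
      # The Java APM agent image from which the agent bits are copied
      image: "docker.io/appdynamics/java-agent:latest"
      agentMountPath: /opt/appdynamics
```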


We set the nsToInstrumentRegex property to determine which namespaces to target for auto-instrumentation, and the java.image property to specify the Java APM agent image from which the agent bits are copied. Then it’s merely a matter of applying this configuration. The Cluster Agent will take care of updating the applications in the specified namespaces, and Kubernetes will take care of performing a rolling update to restart them. When it comes time to upgrade the agent, all that’s required is an update to the java.image property and a re-apply of the Cluster Agent configuration. That’s quite a bit simpler than updating each application!
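
Assuming the configuration lives in a file such as cluster-agent.yaml (the file name is illustrative), rolling it out, or re-applying it after bumping java.image, is a single command:

```shell
# Apply (or re-apply) the Cluster Agent configuration
kubectl apply -f cluster-agent.yaml
```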

We’ve rolled out auto-instrumentation support for Java, .NET Core, and Node.js applications and will be looking to add additional language support in the future.

Challenge #2: Finding the Root Cause for Failing Apps (and Pods)

Our second challenge is Kubernetes apps that aren’t healthy and are repeatedly restarting. This is a particularly tricky problem because the app often isn’t up long enough to leave any trace of what went wrong, and operators of large-scale deployments don’t have time to dig through logs and various tools to find the root cause.

The Cluster Agent now includes a feature that automatically captures log events associated with failing applications. While the Cluster Agent supports APM correlation — that is, the ability to connect a pod/container to the additional perspectives that the APM agent provides (tracing, code-level diagnostics, etc.) — this new feature doesn’t depend on APM. Instead, the Cluster Agent uses the Kubernetes API to recognize an unhealthy app that is crashing and restarting, and automatically collects the logs associated with it.

Below is an example Cluster Agent Pod Dashboard that shows pods for the devops-offers-profile-service-v2 deployment that are experiencing restarts.

 

[Screenshot: AppDynamics Cluster Agent pod dashboard showing restarting pods]

 

If we drill down into the particular pod that shows 11 restarts, we get the detailed pod view, which includes a new “Error Log” section.

 

[Screenshot: pod detail view with the new Error Log section]

 

Drilling into the Actions lets us review the set of logs collected over a series of restart events and identify a memory issue as the root cause of the restarts.


Automatic log capture works for restarted pods as well as those experiencing CrashLoopBackOff events. See our documentation on Managing Logs for Pods for more info.
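
For a sense of what this automates, here is the manual equivalent using standard kubectl commands; the namespace and pod name are placeholders:

```shell
# Find pods that are restarting or in CrashLoopBackOff
kubectl get pods -n <namespace>

# Pull the logs of the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous
```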

Challenge #3: Cluster Information Overload (or Proactive Cluster Health Alerts)

Another common challenge among operators with large deployments is basic information overload. There are several options for visualizing the health of a single cluster, such as the AppDynamics Cluster Agent Dashboard, but if you have tens if not hundreds of clusters, you need proactive alerting to tell you what you should pay attention to. Let’s quickly walk through an example of how you can achieve this with the AppDynamics Cluster Agent and its OOTB baselining of key cluster metrics.

A common cluster KPI to track is a cluster’s capacity to deploy additional applications or scale out existing ones. Kubernetes provides the quota mechanism to cap the total CPU, memory, and storage capacity available to applications within a namespace, as well as a means to cap resources available per application (requests and limits).
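
As an illustration, a ResourceQuota along these lines caps the total CPU and memory that pods in a namespace can request or consume (the names and values here are illustrative, sized to match the example that follows):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ecommerce        # illustrative namespace
spec:
  hard:
    requests.cpu: "10"        # total CPU requests allowed in the namespace
    requests.memory: 20Gi     # total memory requests (the 20Gi used in the example below)
    limits.cpu: "20"
    limits.memory: 40Gi
```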

Let’s take memory capacity as an example. We can track it as the ratio between used capacity (the total memory requests/limits of deployed applications in a namespace) and available memory capacity in that namespace (set by the namespace quota). The Cluster Agent reports a “Request Used (%)” metric to track this ratio. The higher the number, the lower the available capacity.

It makes sense to monitor this ratio and trigger an alert whenever it either exceeds a threshold (say, 90%) or suddenly increases, which may represent an application deployment or scaling event we want to know about. In the example below, we’re looking at the memory Request Used (%) metric reported by the Cluster Agent for a namespace with a 20Gi memory-requests quota and 12Gi used, so the current value is 60%.

[Screenshot: memory Request Used (%) metric at 60%, with the baseline shown as a dotted line]

The dotted line above is the baseline that AppDynamics automatically calculates for each cluster metric, computed as the overall average over the last 30 days; here it is slightly higher, at 66%.


We’ll define a health rule to monitor when Request Used (%) exceeds a static limit or changes significantly. We choose the Custom Health Rule Type and select the cluster we wish to monitor.


We set up the Critical Criteria so that the first “Static Limit Exceeded” condition fires if Request Used (%) exceeds 90%. The second “Baseline Exceeded” condition fires if the same metric deviates from the baseline by 2 standard deviations, which would catch a sudden increase in capacity used.


Once the Evaluation Status shows as green, we know the health rule is active and will fire when either condition is met.


We’ll scale up the profile service app from 2 replicas to 20:
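
With kubectl, that looks something like this, using the deployment name from the dashboard example above (the namespace is a placeholder):

```shell
# Scale the deployment from 2 replicas to 20
kubectl scale deployment devops-offers-profile-service-v2 --replicas=20 -n <namespace>
```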


Within 5 minutes we see that the sudden spike in usage has caused the health rule to fire:

[Screenshot: health rule violation triggered by the usage spike]

While we’ll likely want to tune this to be less sensitive (that is, increase the number of standard deviations), we can see that the “Baseline Exceeded” condition has fired because the increase in usage is 2 standard deviations higher than the baseline of 60%.


If we wait a bit longer, we see the alert is updated and the second “Static Limit Exceeded” condition has also fired, as overall memory usage for the namespace has now exceeded 90%.


We can attach these health rule violations to actions such as opening a ticket to address the issue before it impacts applications. It’s also straightforward to leverage the AppDynamics configuration APIs to automatically deploy the health rules and thresholds that make sense for your deployments as clusters are provisioned.
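
As a sketch, provisioning a health rule through the Controller’s health rule API might look like the following. The host, application ID, token, and JSON payload file are placeholders, and the exact endpoint and payload schema should be confirmed against the AppDynamics API documentation for your Controller version:

```shell
# Create a health rule from a JSON definition (e.g., one exported from a reference environment)
curl -X POST \
  "https://<controller-host>/controller/alerting/rest/v1/applications/<application-id>/health-rules" \
  -H "Authorization: Bearer <api-client-token>" \
  -H "Content-Type: application/json" \
  -d @request-used-health-rule.json
```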

That’s a quick tour of some of our newest features that help simplify Kubernetes observability at scale. For more info and a quick demo, check out our on-demand webinar, Optimize Microservices Performance with AppDynamics Cluster Agent.

This blog may contain product roadmap information of AppDynamics LLC (“AppDynamics”). AppDynamics reserves the right to change any product roadmap information at any time, for any reason, and without notice. Any information provided is intended to outline AppDynamics’ general product direction, it is not a guarantee of future product features, and it should not be relied on in making a purchasing decision. The development, release, and timing of any features or functionality described for AppDynamics’ products remain at AppDynamics’ sole discretion. AppDynamics reserves the right to change any planned features at any time before making them generally available as well as never making them generally available.

Jeff Holmes
Jeff Holmes is a Sales Enablement Engineer with AppDynamics. He has over 20 years of experience in the software industry, and is particularly interested in the monitoring of cloud platforms like Pivotal Cloud Foundry. His background includes roles as software engineer and application and enterprise architect. In a previous role as a solution architect for Pivotal, Jeff worked with customers who were in the early stages of their PCF deployments.
