Product

To Restart or Not Restart? Managing Stateful Kubernetes Deployments with Operators and ConfigMaps

6 min read


Summary
Here is how configMaps can help minimize costly restarts of stateful custom resources in Kubernetes.

Stateless apps are easy; stateful apps are not. But what about the apps in between? Those that cache data for performance reasons but can tolerate losing it, because they can reobtain it from some other highly available entity whose uptime is someone else's headache.

Let’s consider the AppDynamics ClusterAgent, a monitoring solution for various Kubernetes distributions. The ClusterAgent is a Golang app that:

  • Observes Kubernetes resources such as pods, endpoints, events, and so on
  • Extracts and categorizes performance metrics that are critical for monitoring applications deployed to Kubernetes environments
  • Sends these metrics to the AppDynamics controller for further processing

The metrics are then automatically baselined and become the basis for health rules and alerts.

The ClusterAgent is implemented as a controller with shared informers for the respective entities that the agent observes. In a nutshell, each informer caches the data it’s responsible for—a list of pods, for example—and then watches for changes in the observed entities and processes only those that have changed. (You can learn more about informers here.) In addition to the Kubernetes resources, the ClusterAgent also queries some data from the AppDynamics controller (e.g., the account information, entitlement and AppDynamics component IDs) to tie them to the Kubernetes entities. This data is incrementally cached to minimize the load on the AppDynamics controller.

If the ClusterAgent is restarted for whatever reason, the first thing it does is rebuild its internal caches. The informers ask the Kubernetes API server for up-to-date information about the entities, including the events that have transpired in the meantime. The ClusterAgent also goes back to the AppDynamics controller to re-cache the required information. Depending on the monitoring scope and the size of the cluster, the time to rebuild the cache will vary. Eventually, the agent catches up and continues its normal operation. The only problem with rebuilding the internal cache upon each restart is that the process, if frequent, may apply an undue burden on the Kubernetes API server, etcd, and the AppDynamics controller—and reducing this burden was the main reason for caching in the first place.

The ClusterAgent has a number of configuration properties that control its behavior. For example, you can configure the agent to monitor specific namespaces, zoom into metric collection for certain deployments, or adopt new instrumentation rules. These changes are made by the monitoring Ops team ad hoc and quite frequently. If you were to restart the agent every time a property changed, you could overwhelm the data providers with frequent requests.

To strike a balance, the ClusterAgent was designed to pick up dynamic updates to its configuration without restart. The implementation boils down to listening to changes to the ClusterAgent configMap in real time.

Dynamic ConfigMaps

ConfigMaps in Kubernetes can be mounted in at least two different ways. In both cases, we need to declare a volume of the type configMap first. For example:

volumes:
  - configMap:
      name: cluster-agent-config
    name: agent-config

Then, the configMap can be mounted to the container by specifying the mountPath.

volumeMounts:
  - mountPath: /opt/appdynamics/config/
    name: agent-config

Or, the mountPath can be combined with a subPath:

volumeMounts:
  - name: ma-log-volume
    mountPath: /opt/appdynamics/conf/logging/log4j.xml
    subPath: log4j.xml

In the first instance, the entire contents of the directory identified by the mountPath are replaced with a symbolic link to the configMap.

$ ls -la /opt/appdynamics/config/
lrwxrwxrwx. 1 root root 32 May 21 23:09 cluster-agent-config.json -> ..data/cluster-agent-config.json

In the second, the configMap file is copied to the location specified by the mountPath and the subPath directives, without touching any pre-existing files.

We use the first approach, because all subsequent updates to the configMap then propagate to the consumer, the ClusterAgent, which has special logic to listen for changes in the configMap. The ClusterAgent uses fsnotify.Watcher to track updates made by the Kubernetes AtomicWriter. When updated, the map is actually replaced in its entirety: the old copy is deleted and the new copy is placed in the location pointed at by the symlink.

When a configMap update occurs, Kubernetes AtomicWriter creates a new directory and writes the updated ConfigMap contents into it. Once the write is complete, the original file symlink is removed and replaced with a new symlink pointing to the contents of the newly created directory. This is done to achieve atomic configMap updates. It also means we are not handling write events, but rather the file delete event.

tick := time.Tick(self.interval)

var lastWriteEvent *fsnotify.Event
for {
	select {
	case event := <-self.fsNotify.Events:
		if event.Op == fsnotify.Remove {
			// Since the symlink was removed, we must
			// re-register the file to be watched.
			self.fsNotify.Remove(event.Name)
			self.fsNotify.Add(event.Name)
			lastWriteEvent = &event
		}
		// If it was a write event, remember it as well.
		if event.Op == fsnotify.Write {
			lastWriteEvent = &event
		}
	case <-tick:
		// No events during this interval.
		if lastWriteEvent == nil {
			continue
		}
		// Execute the callback to notify the observer of changes
		// in the configMap. At this point the new version of the
		// configMap is loaded into memory.
		self.callback()
		// Reset the last event.
		lastWriteEvent = nil
	case <-self.done:
		goto Close
	}
}

The ClusterAgent watches for the Remove event on the configMap file. When the event fires, the ClusterAgent reloads the new file and updates its internal configuration. A working example can be found in the ClusterAgent repo on GitHub.

The AppDynamics Operator

The ClusterAgent is a custom resource and, as such, is managed by an operator. The AppDynamics Operator is a single-replica deployment that makes sure the ClusterAgent is up and running according to the spec.

Why do we need the operator? We could simply deploy the ClusterAgent directly by pushing the individual specs of resources it needs. However, when it comes to configuration changes, it seems more intuitive to edit the custom resource through declarative changes to its spec rather than tweaking the configMap. We also need to remember when to restart the agent, or when to let it run with the new settings from the updated configMap. The ClusterAgent operator takes care of this for us and makes updates to the agent spec much easier. It can differentiate between the breaking changes that require a restart, and benign changes that do not. It also takes care of deploying all the dependencies and ClusterAgent upgrades.

In addition to the ClusterAgent, the AppDynamics Operator also can manage the deployment of the AppDynamics Machine Agent, which provides server and network visibility. The ClusterAgent is complementary to the Machine Agent and requires a lower level of security access on the cluster, hence they are independent of each other. The AppDynamics Operator can deploy these agents individually or in tandem.

The AppDynamics Operator is available on GitHub and in the Red Hat Certified Container Catalog.