ClusterOps Challenges and How AIOps Can Help

By Ravi Lachhman | 3 min read

Today's resource managers and container orchestrators allow us to describe and deploy workloads in a consistent and repeatable fashion. So why are our workloads growing ever more complex?

The container revolution may be only a few years old, but it shows no signs of slowing. Organizations are placing an ever-widening variety of workloads in containers, and managing them with resource managers like Mesos and container orchestrators like Kubernetes. Cluster operations, or ClusterOps, is a discipline that evolved from middleware engineering. It helps enterprises deploy varied workloads to a resource manager or orchestrator, ensuring their systems are resilient, secure and available.

Not long ago, the middleware engineer was seen as the mystic gatekeeper between application infrastructure and scale. Tuning an application server or message broker required years of experience and institutional know-how. But as the number of purpose-built systems required to get an application to production began to rise, engineers who specialized in tuning specific application infrastructure components grew ever more scarce.

Today, middleware engineers are moving into ClusterOps roles. This is a logical evolution, as middleware experts have spent many years preparing for the container revolution. They’ve become the modern-day architects and administrators of platform-as-a-service (PaaS) offerings and container orchestrators, and so the gatekeepers of workloads. But despite their impressive credentials, ClusterOps engineers face the daunting challenge of having to support a growing number of complex platforms.

Forest for the Trees

Something I write about a good bit is the fog of development—essentially, the complexity of grasping the big picture when you’re mired in minutiae. If you’re a ClusterOps engineer, the fog settles in when you need to understand more and more of the stack, but have far less time to work on your projects. Unfortunately, the fog of development can lead to people pointing a finger—maybe at you—when their pet project goes astray.

As much as we strive for parity across our environments, the growing variety of workloads makes it challenging to monitor and maintain systems, and to repair problems, with yesterday’s tools. You can’t, for instance, truly simulate a production system on a laptop, even a powerful one. With technologies like Minikube (or, for the adventurous, kubeadm-dind-cluster) you can attempt to mimic higher environments. However, modern workloads, such as machine learning workloads at scale, require far too much firepower and can’t be simulated effectively on a PC.

ML and ClusterOps: This Might Hurt a Little

Machine learning workloads not only tax system infrastructure, but also the skills of ClusterOps engineers. Due to intense infrastructure demands and the sometimes short-lived nature of spikes in ML workloads, resource managers need to react quickly by spinning up and down much-needed resources to address surges in demand.
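
To make the idea of reacting to a surge concrete, here is a minimal sketch of a proportional scaling rule, similar in spirit to the one Kubernetes documents for its Horizontal Pod Autoscaler. The function name, the parameters and the defaults (`min_replicas`, `max_replicas`, `tolerance`) are assumptions chosen for illustration, not the API of any particular resource manager.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return a replica count that moves average per-replica load toward target.

    current_load and target_load are in the same unit, for example
    average CPU utilization per replica (hypothetical metric).
    """
    if current_replicas <= 0 or target_load <= 0:
        raise ValueError("current_replicas and target_load must be positive")

    ratio = current_load / target_load

    # Ignore small deviations so the cluster does not thrash on noise.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Scale proportionally, rounding up so a surge is never under-provisioned.
    wanted = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four replicas running at 90 percent against a 50 percent target would grow to eight, and would shrink again once the spike passes. A real resource manager layers cooldown periods, metric smoothing and scheduling constraints on top of a rule like this.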

Resource managers can be customized to support very specific workloads. There are purpose-built resource managers such as Apache Hadoop YARN for big data ecosystems, and general-purpose resource managers like Apache Mesos, which supports a wider variety of workloads. Of course, businesses may find themselves dealing with multiple distributed system technologies and, again, it’s hard for engineers to find time to master these new platforms.

What If We Could Learn Systematically?

We can’t master everything, nor should we try. “You don’t know what you don’t know” is very much a reality, and humans simply aren’t very good at identifying gaps in their knowledge. So imagine if the enterprise platform itself could teach you about the challenges it’s facing. With the rise of AIOps, it can. Disparate workloads of all types can benefit from a platform that provides instructive feedback and semi-to-fully autonomous action.

This isn’t a futuristic concept. Global companies and consumers are already working with automated systems every day. From the rapid growth in automobile automation to enterprise-level intrusion-detection picking out needle-in-a-haystack anomalies, automated systems are reducing the stress and strain on human experts.
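
As a rough illustration of what picking out a needle-in-a-haystack anomaly can mean, the sketch below flags samples that sit far from the mean of a metric series, using a plain z-score. This is a deliberately simple stand-in: the function name and the threshold of 3.0 are assumptions, and it is not a description of how AppDynamics or any intrusion-detection product works.

```python
from statistics import mean, stdev


def find_anomalies(samples, threshold=3.0):
    """Return the indices of samples whose z-score exceeds the threshold."""
    if len(samples) < 2:
        return []  # not enough data to estimate spread

    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers

    return [i for i, x in enumerate(samples)
            if abs(x - mu) / sigma > threshold]


# Hypothetical response times in milliseconds, with one spike injected.
latencies = [100, 101, 99, 102, 98] * 4
latencies[12] = 900
anomalous = find_anomalies(latencies)
```

Here `anomalous` contains only the index of the injected spike. Production systems tend to use rolling windows, seasonality-aware baselines or learned models, because a single large outlier inflates the mean and standard deviation that a plain z-score depends on.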

Automation can help with the technology-adoption curve, too. Organizations might be reluctant to take on new technologies because of the uncertain ROI. With AppDynamics AIOps, however, the insights gleaned from multiple platforms and systems can feed into the Central Nervous System, providing systematic improvements and enhancing team knowledge. At AppDynamics, we’re excited to be at the leading edge of the AIOps revolution.

Ravi Lachhman

Ravi Lachhman is an evangelist at AppDynamics focusing on the Cloud and DevOps spaces. Prior to AppDynamics, Ravi spent time at Mesosphere, Red Hat, and IBM helping enterprises and the federal sector design the next generation of distributed platforms. When not helping to further the technology communities, Ravi enjoys traveling the world, especially with his stomach.