Is Serverless the New Mainframe?

A few years ago at the DevNexus developers conference in Atlanta, I was chatting with a few former colleagues over dinner. As an avid fan of the HBO show Silicon Valley, as well as a technologist impressed with the rise of serverless technologies, I jokingly said we should “build the Box“—i.e., a serverless appliance not unlike the show’s Box by Pied Piper. But one ex-colleague said I was really late to the game. I gave him a confused look and he said, “Serverless isn’t new. Mainframes have been around since the 1950s and they’re certainly not a server. You could say they’re serverless, though.”

A debatable point, perhaps, but one thing is certain: Our technology journey has evolved far from the mainframes of the 1970s. One could argue that 2001 marked the introduction of the “modern” mainframe operating system—the IBM z/OS. That was nearly two decades ago, and it wouldn’t be surprising if today’s millennial software engineer knows far more about serverless than big iron.

For most of my academic and professional career, Java deployed on a UNIX-like operating system was the world I knew. But serverless is rapidly becoming the norm for modern applications. What’s interesting is that there are a lot of parallels between workloads that run on the mainframe and those that run on a serverless infrastructure like Lambda.

Kubernetes is King, What is this Mainframe Nonsense?

Big iron may be old school, but it remains a major force in the enterprise. As recently as 2017, approximately $3 trillion in daily commerce was flowing through COBOL systems, Reuters reports. My first interaction with a mainframe came fresh out of college. As part of a consulting team performing integration testing, I would receive the occasional error from a mysterious (to me) CICS gateway. (In case you’re wondering, the Customer Information Control Center, or CICS, is the prominent way applications interact with an IBM z/OS or z/VSE mainframe.)

The buzz these days, of course, centers on Kubernetes and other Cloud Native Computing Foundation (CNCF) projects. And with the big push into Cloud Native and Hybrid Cloud, a lot of modern infrastructure is designed to scale out. A typical mainframe, by comparison, is built to scale up.

A quick refresher on scale up vs scale out: To scale up is to add more resources to a node or, in modern parlance, migrate to a larger cloud instance. To scale out is to add more nodes to the pool or (in modern speak) to multiple cloud instances. These scaling paradigms certainly represent two different eras.

New Tools, Similar Workloads

The mainframe’s core attributes are reliability and scalability. Big iron measures uptime in years and can scale up by adding computing resources to its sizable frame. Today, we develop workloads to be widely distributed, adding new layers of complexity. The good news is that we have orchestrators and resource managers like Kubernetes, Yarn and Mesos to act as a facade for endpoints, federating the workload out. In the heyday of the mainframe, these modern tools weren’t available, even though mainframe workloads can be quite similar to their serverless counterparts.

Trigger, Work, Output

The three core activities of a serverless implementation are to trigger the function, execute it, and produce output. These actions are similar to mainframe activity involving batch and interactive (OLAP) workloads, as the mainframe is quite good at running code or a function over many iterations.

Similarly, it takes less time to deploy and execute a Lambda function than it does using traditional server, virtual machine, or even container-based approaches; as complex as big iron can seem, the specific target runtime for a function on a mainframe is very similar to that on a serverless implementation.

But Big Iron or Serverless Can Be a Black Box

If you’re a developer who submits code to execute externally—either on a mainframe or a serverless provider like AWS Lambda—it can feel like you’re working with a black box that masks its inner workings. In both cases you’re billed for duration, which for Lambda users is the time your code begins executing until it returns or terminates. But it’s hard to oversee duration if you’re not properly profiling the function.

When designing a traditional Java workload, for example, you can use tools that provide direct access to the Java virtual machine, or even the Java container, to gain deep insights into what’s going on. But that level of insight goes away with a Java Lambda because the serverless instance is typically short-lived. (Amazon is adding support with its X-Ray distributed tracing system, but not all serverless implementations have access to these tools.)

And the mainframe? With an IBM z/OS workload, you can monitoring a running job and view output with the System Display and Search Facility (SDSF), which is not unlike validating that a serverless function has run and its output has met expectations.

These big iron and serverless scenarios, unfortunately, can lead to inefficient execution and a higher costs, due either to the mainframe leveraging more logical partitions (LPARs), or more serverless instances or durations being used.

Modern Workload with a Modern Partner

More enterprises are embracing the strangler pattern for brownfield development,  incorporating mainframe, serverless, and a host of other architectures. But even with workloads and skill sets shifting from mainframe to serverless, a lack of insight—the black box—remains an issue. By partnering with an APM vendor like AppDynamics, you can gain insights across your entire end-to-end environment, including mainframe and serverless implementations.

Internet of Things: The 3 Whys of IoT

Businesses are embracing the Internet of Things (IoT), which enables them to get closer to their users by delivering services right to the edge—think sensors on a remote oil pipeline, or the smartwatch or fitness band on your wrist—for a better customer experience. The number of IoT-connected devices worldwide is expected to nearly triple in just six years, from 26.66 billion today to 75.44 billion in 2025.

This explosive growth will continue for the foreseeable future, too, as surging demand for connected-IoT hardware expands across consumer, business and industrial markets. According to ReportLinker, a French market research firm, Europe by 2020 will be the global leader in terms of IoT usage, accounting for 36% of all connections, followed by the Americas at 28%.

Why IoT is Complex

Although the Internet of Things delivers tangible benefits, the ROI, integration and management of this burgeoning ecosystem is not only complex, it also can be a barrier to market growth, services adoption and organisational change. Edge devices contain a wealth of information, but too often they’re seen as a black box and are underutilized. When performance problems occur, businesses lack diagnostic tools for real-time visibility. And while edge devices deliver a critical avenue of interaction with customers, they often require a large initial investment and a lengthy, uncertain ROI.

Why AppDynamics for IoT

Cisco and AppDynamics understand the IoT is a crucial strategy for many businesses. We seek to help drive your digital transformation efforts by demystifying IoT services. Digital transformation is about simplifying the customer experience, an effort involving significant investment and greater backend complexity and distribution, including more dependencies and an exponential impact on scale. The question therein lies: How can this complexity be understood?

Here’s where AppDynamics helps some of the world’s largest enterprises understand their customer interaction. The only constant is what customers do. Every touch or click sets off a transaction with the potential to initiate thousands of lines of code, and trigger hundreds of API calls that connect tens of hundreds of interconnected microservices, each residing in a different location and leveraging multiple container-clustering technologies.

With AppDynamics, you can unlock the value of IoT by following the Business Journey from start to finish, understanding the behaviour of embedded application devices, networks and users. By harnessing real-time visibility and end-to-end monitoring for edge devices and gateways with business correlation, you can better understand device performance, communication, and interaction—both in capturing data and diagnosing how business objectives are impacted.

End-to-end visibility, from IoT device to backend.

Typical IoT use cases include:

  • Edge Performance Monitoring: Monitor the health of devices with critical metrics such as errors, network requests and usage, whilst correlating this data with business performance indicators.
  • Device and User Segmentation: Gain insight into specific devices, platforms, data and users to ensure a consistent user experience across all segments.
  • Release Validation: Obtain empirical evidence of how a software release impacts not only performance, but also the business.
  • SLA Compliance: Ensure devices are compliant. Send the right data on time, consistently and to the correct endpoints.

At AppDynamics, we don’t just make the IoT comprehendible, we go one step further and correlate insights collected from IoT endpoints with business-relevant data to provide context. This helps organisations:

  • Drive down operational cost
  • Reduce downtime by offering remote diagnostics
  • Increase operational efficiency and insight by knowing what IoT devices are in use, and what data is being sent, how often, and when

Why IoT is Important Now

The accurate market perception of IoT is that it’s a necessity—a key component of digital transformation. Whether a business is looking to reimagine the ways it interacts with customers, or is moving from one business model to another, the IoT is usually a core part of that transformation. Whether a customer is an automobile manufacturer, a bank, a credit card processing intermediary, an industrial robot manufacturer, or a retailer with POS devices, the IoT is present.

Every business is looking at how the IoT can be leveraged, introducing new levels of complexity to an already intricate ecosystem. It’s estimated that 50 billion devices will be online by 2020, whereas only 0.06% of all devices are actually being leveraged as connected devices, according to Cisco. With the Internet of Things at the epicentre of most organisations’ digital transformation strategies, the IoT represents an enormous investment from the business, with equal levels of opportunity.

Bain predicts the IoT market will more than double between 2017 and 2021—from $235B to a staggering $520B. There are, of course, barriers that limit the adoption of IoT analytics solutions, including market competition. But putting that to one side, all vendors will experience similar limitations with IoT adoption, including security issues, integration with existing technology, and uncertain ROI, Bain forecasts.

AppDynamics offers many advantages here. Consider return on IoT investment: AppDynamics’ BusinessIQ makes this determination far easier by correlating the performance of IoT services with business performance indicators, carving a realistic route to ROI.

When it comes to IT/OT integration and security, Cisco provides best-in-class solutions. And as part of Cisco’s portfolio, AppDynamics can deliver a cohesive IoT solution to our joint customer base.

Learn how AppDynamics IoT monitoring can provide real-time visibility, diagnostics, and analytics for your connected device applications.

Top Takeaways from KubeCon + CloudNativeCon

The Emerald City played host to the North American edition of KubeCon + CloudNativeCon 2018, bringing together technologists eager to share their knowledge of Kubernetes and Cloud Native topics. The event, hosted by the Cloud Native Computing Foundation (CNCF), sold out over a month before it opened, a strong sign of growing interest in all things cloud.

The figurative 800-pound gorilla at the conference was, of course, Kubernetes (written shorthand as K8s). Perhaps to soften K8s’ reputation for complexity, Phippy the yellow giraffe and her animated friends were on hand to join the CNCF. Popularized by Matt Butcher’s and Karen Chu’s “The Illustrated Children’s Guide to Kubernetes,” Phippy was brought onboard to help developers explain resource management in distributed Cloud Native applications to friends, family and whoever else will listen.

The Omnipresent Cloud

Cloud computing continues to evolve at a rapid pace, and with the rise of IoT and edge computing, the cloud is showing up everywhere. Thanks to the hard work spearheaded by the CNCF, cloud-optimized platforms—everything from public cloud infrastructure to a secure private cloud in a datacenter—are now open to a broader range of workloads. You may be interacting with a Cloud Native workload right now, in fact. From your car acting as an edge cloud (or node)—making sure you arrive safely and timely to your destination—to the infrastructure powering this blog, the cloud is ubiquitous and evolving.

Projects Charging Forward

As noted by Aqua Security technology evangelist Liz Rice in her keynote, the CNCF has seen tremendous growth in recent years with the number of projects, with this year’s Seattle attendance more than doubling the 2017 turnout in Austin. Key projects are advancing, too. Envoy, which provides a scalable service mesh for communications between microservices and components, and Prometheus, an open-source systems monitoring and alerting toolkit, became graduated projects this year, joining Kubernetes at the CNCF’s highest maturity level. In addition, a year-over-year comparison of the CNCF Landscape (an interactive guide of Cloud Native technologies), shows more icons being squeezed onto a single webpage—a good indication of more CNCF projects and vendors being brought onboard.

If Kubernetes is Happy, Are You?

The Kubernetes Web UI can give you a quick view of cluster health. And if you don’t see any red on the dashboard, all’s right with the world…right? Not necessarily. Cluster health only tells part of the story. Kubernetes is designed to be generic for a wider swath of workload placement. Applications that hog cluster infrastructure by triggering an autoscaler can go unnoticed until the cluster limits are reached. Advancements like Operators can help make Kubernetes more application-specific, but too often the end-to-end picture of the user journey is incomplete. AppDynamics can add value in monitoring both the Kubernetes platform and workloads placed on Kubernetes clusters. The efficiency and health of your Kubernetes platform, and the complete picture of the user journey—which can transverse Kubernetes and non-Kubernetes workloads—can be monitored with AppDynamics.

Focus on Your Customers

With technology changing so rapidly, it’s easy to feel left behind if you’re not adopting the latest stack. At the conference, Aparna Sinha, Google’s Product Manager for Kubernetes, gave an excellent interview on trends and capabilities of Kubernetes and KNative. A lot of what is driving this rapid change is direct feedback from Google’s customers.

Organizations of all sizes are striving to strengthen the customer experience, but often it’s challenging to justify technology infrastructure to meet business or customer outcomes. AppDynamics is the premier platform for monitoring and enhancing the user journey across your Cloud Native applications. We’re continuing to expand our level of automation in orchestrators such as Kubernetes and Mesos, and we’re working to enable these platforms to take truly autonomous actions to enhance the user experience.

AIOps: Decision Center for Orchestrators

AIOps is one of those buzzwords that seems to be everywhere these days, and for good reason. The power that AIOps brings is its ability to guide orchestrators to act by analyzing and hypothesizing outcomes. Kubernetes and other orchestrators like Mesos and Nomad are very good at action (e.g., triggering a scaling event). Where they fall short, though, is in analyzation because, again, they are trying to capture a wide swatch of workloads. The worst case scenario is that an autoscaler keeps getting triggered within a cluster—perhaps due to run-of-the-mill memory pressure or network saturation—and the resource manager can’t place work anymore, leading to a queue of unfulfilled requests. Depending on the organization, a frantic Slack to an SRE would not be far behind.

With AIOps, this event could be avoided: the system could spot the trend and redeploy a version of the application and application infrastructure better suited to the load. Coupled with the power of Cisco’s stack, AIOps-infused AppDynamics will soon be more resilient without operator intervention.

Better Together

AppDynamics was excited to partner with Cisco at KubeCon + CloudNativeCon to run joint sessions on Cisco Cloud Center and the AppDynamics platform. We’re increasing our cross-pollination with Cisco Container Platform and Cisco Cloud to monitoring and manage the next generation of workloads, too.

Ravi Lachhman, in Cisco’s booth, gives a presentation on Cloud Native infrastructure.

If the pace of Cloud Native innovation keeps up with CNCF expansion, next year is certainly going to be exciting. AppDynamics and Cisco are the engine that helps combine technical decisions with business outcomes. Be sure to sign up for and watch Cisco’s Cloud Unfiltered Podcast, which includes interviews with the key technologists moving cloud workloads forward.

See You in SD!

We’re excited to see what next year has in store for the CNCF and Cloud Native workloads. KubeCon + CloudNativeCon 2019 will take place in sunny San Diego, and we hope to see you there. Don’t forget to fire up your favorite learning platforms such as Katacoda and Cisco DevNet to sharpen your Cloud Native skills. See you in San Diego!

Why Kubelet TLS Bootstrap in Kubernetes 1.12 is a Very Big Deal

Kubelet TLS Bootstrap, an exciting and highly-anticipated feature in Kubernetes 1.12, is graduating to general availability. As you know, the Kubernetes orchestration system provides such key benefits as service discovery, load balancing, rolling restarts, and the ability to maintain container counts by replacing failed containers. And by using Kubernetes-compliant extensions, you can seamlessly enhance system functionality. This is similar to how Istio (with Kubernetes) provides added benefits such as robust tracing/monitoring, traffic management, and so on.

Until now, however, Kubernetes did not provide similar automation features for security best practices, such as mutually-authenticated TLS connections (mutual-TLS or mTLS). These connections enable developers to use simple certificate directives that limit nodes to communicate with predetermined services—all without writing a single line of additional code. Even though the use of TLS 1.2 certificates for service-to-service communication is a known best-practice, very few companies use mutual-TLS to deploy their systems. This lack of adoption is due mostly to greater deployment difficulties in creating and managing public key infrastructures (PKI). This is why the new TLS Bootstrap module in Kubernetes 1.12 is so exciting: It provides features for adding authentication and authorization to each service at the application level.

The Power of mTLS

Mutual-TLS mandates that both the client and server must authenticate themselves by exchanging identities (certificates). mTLS is made possible by provisioning a TLS certificate to each Kubelet. The client and server use the TLS handshake protocol to negotiate and set up a secure encryption channel. As part of this negotiation, each party checks the validity of the other party’s certificate. Optionally, they can add more verification, such as authorization (the principle of least privilege). Hence, mTLS will provide added security to your application and data. Even if malicious software has taken over a container or host, it cannot connect to any service without providing a valid identity/authorization.

In addition, the Kubelet certificate rotation feature (currently in beta) has an automated way to get a signed certificate from the cluster API server. The Kubelet process accepts an argument, -rotate-certificates, which controls whether the kubelet will automatically request a new certificate as the current one nears expiration. The kube-controller-manager process accepts the argument –experimental-cluster-signing-duration, which controls the length of time each certificate will be in use.

When a kubelet starts up, it uses its initial certificate to connect to the Kubernetes API and issue a certificate-signing request. Upon approval (which can be automated with a few checks), the controller manager signs a certificate issued for a time period specified by the duration parameter. This certificate is then attached to the Certificate Signing Request. The kubelet uses an API call to retrieve the signed certificate, which it uses to connect to the Kubernetes API. As the current certificate nears expiration, the kubelet will use the same process described above to get a new certificate.

Since this process is fully automated, certificates can be created with a very short expiry time. For example, if the expiration time is one hour, even if a malicious agent gets hold of the certificate, the compromised certificate will still expire in an hour.

Robust Security and the Strength of APM

Mutual-TLS and automated certificate rotation give organizations robust security without having to spend heavily on firewalls or intrusion-detection services. mTLS is also the first step towards eliminating the distinction of trusted and non-trusted connections. In this new paradigm, connections coming from inside the firewall or corporate network are treated exactly the same as those from the internet. Every client must identify itself and receive authorization to access a resource, regardless of the originating host’s location. This approach safeguards resources, even if a host inside the corporate firewall is compromised.

AppDynamics fully supports mutually-authenticated TLS connections between its agents and the controller. Our agents running inside a container can communicate with the controller in much the same way as microservices connect to each other. In hybrid environments, where server authentication is available only for some agents and mutual authentication for others, it’s possible to set up and configure multiple HTTP listeners in Glassfish—one for server authentication only, another for both server and client authentication. The agent and controller connections can be configured to use the TLS 2 protocol as well.

See how AppDynamics can provide end-to-end, unified visibility into your Kubernetes environment!



Blue-Green Deployment Strategies for PCF Microservices

Blue-green deployment is a well-known pattern for updating software components by switching between simultaneously available environments or services. The context in which a blue-green deployment strategy is used can vary from switching between data centers, web servers in a single data center, or microservices in a Pivotal Cloud Foundry (PCF) deployment.

In a microservices architecture, it’s often challenging to monitor and measure the performance of a microservice updated via blue-green deployment, specifically when determining the impact on consumers of the service, the overall business process, and existing service level agreements (SLAs).

But there’s good news. PCF—with its built-in router and commands for easily managing app requests, and its sound orchestration of containerized apps—makes implementing a blue-green deployment trivial.

Our Example App

In this blog, we’ll focus on a simplified PCF Spring Boot microservice that implements a REST service to process orders, and a single orders endpoint for posting new orders. For simplicity’s sake, this service has a single “downstream” dependency on an account microservice, which it calls via the account service REST API.

You’ll find the example app here:

The order-service will use the account-service to create an account. In this scenario, an order is submitted and a user requests a new account created as part of the order submission.

Deploying the Green Version of the Account Service

We will target the account-service to perform a blue-green deployment of a new version of the account-service, which includes some bug fixes. We’ll perform the blue-green deployment using the CF CLI map-route and unmap-route commands.

When we push the account-service app, we’ll adopt an app-naming strategy that appends -blue or -green to the app name, and assume our deployment pipeline would automatically switch between the two prefixes from one deployment to the next.

So our initial deployment, based on the manifest.yml here, would be:

$ cf push account-service-blue -i 5

After pushing this app, we create the production route and map it to the blue version of the app using cf map-route.

$ cf map-route account-service-blue apps.<pcf-domain> –hostname

Then we create the user-provided service.

$ cf cups account-service -p ‘{ “route”:

And bind the order-service to this user-provided service by referencing it in the order-service’s manifest.yml.

When pushed, the order-service app consumes the account-service route from its environment variables, and uses this route to communicate with the account-service, regardless of whether it’s the blue or green version of the account-service.

The initial deployment shows both the order and blue account-service running with five instances.

Both services have started the AppDynamics APM agent integrated in the Java buildpack and are reporting to an AppDynamics controller. The flowmap for the order service/orders endpoint shows the requests flowing from the order-service to the account-service, and the response time averaging 77 milliseconds (ms). A majority of that time is being consumed by the account-service. The instances are represented as nodes on the flowmap:


Performing Blue/Green Deployment

Now we’re ready to push the updated “green” account-service that implements fixes to known issues. We’ll push it and change the app name to “account-service-green” so it’s deployed separately from account-service-blue.

$ cf push account-service-green -i 5

At this point, the Apps Manager shows both versions of the account-service app running, but only the blue version is receiving traffic.


We can validate this by referencing a monitoring dashboard that displays call per minute and response time for blue and green versions of the account-service. Below, the dashboard shows no activity for Account-service-green.


This dashboard distinguishes between blue and green versions by applying a filtering condition, which matches node names that start with “account-service-blue” or “account-service-green.”


This node-matching criteria will match the nodes’ names assigned by the AppDynamics integration included in the Java buildpack, which uses the pattern <pcf-app-name>:<instance id>. Below is a list of the node names reporting under the accountservice tier that shows this pattern.


To complete the blue-green deployment, we use the cf map-route command to change the routing from the blue to green version.

$ cf map-route account-service-green apps.<pcf-domain> –hostname prod-account-service

$ cf unmap-route account-service-blue apps.<pcf-domain> –hostname prod-account-service

This instructs the CF router to route all requests to prod-account-service.<pcf-domain> (used by the order-service app) to the green version. At this point, we want to evaluate as quickly as possible the performance impact of the green version on the order-service.


Our blue-green dashboard shows the traffic has indeed switched from the blue to the green nodes, but performance has degraded. Our blue version of the account-service was averaging well below 100 ms, but the green version is showing an uptick to around 150 ms (artificially introduced for the sake of example).

We see a proportional impact on the order service, which is taking an additional 100 ms to process requests. This could be a case where a rollback is necessary, which again is straightforward using cf map-route and unmap-route.

Baselines and Health Rules

Rather than relying strictly on dashboards to decide whether to rollback a deployment, we can establish thresholds in health rules that compare performance to baselines based on historical behavior.

For example, we could create health rules with thresholds based on the baseline or average performance of the order-service, and alert if the performance exceeds that baseline by two standard deviations.


When we deployed the green version, we were quickly alerted to a performance degradation of the order-service, as shown in our dashboard:


We also defined a health rule that focuses on the aggregate baseline of the account-service (including average performance of all blue and green versions) to determine if the latest deployment of the account-service is behaving poorly. Again, this produced an alert based on our poorly performing green version:


The ability to identify slower-than-normal transactions and capture detailed diagnostic data is also critical to finding the root case.


In the case of the slower account-service version, we can drill down within snapshots to the specific lines of code responsible for the slowness.


Monitoring Impact on a Business Process

In the previous dashboards, we tracked the impact of an updated microservice on clients and the overall application. If the microservice is part of a larger business process, it would also be important to compare the performance of the business process before and after the microservice was updated via blue-green deployment.

In the dashboard below, the order microservice, which depends on the account microservice, may be part of a business process that involves converting offers to orders. In this example, conversion rate is the key performance indicator we want to monitor, and a negative impact on conversion rates would be cause to consider a rollback.

The Power of APM on PCF

While PCF makes deploying microservices relatively simple, monitoring the impact of updates is complex. It’s critical to have a powerful monitoring solution like AppDynamics, which can quickly and automatically spot performance anomalies and identify impacts on service clients and overall business processes.

Monitoring Kubernetes and OpenShift with AppDynamics

Here at AppDynamics, we build applications for both external and internal consumption. We’re always innovating to make our development and deployment process more efficient. We refactor apps to get the benefits of a microservices architecture, to develop and test faster without stepping on each other, and to fully leverage containerization.

Like many other organizations, we are embracing Kubernetes as a deployment platform. We use both upstream Kubernetes and OpenShift, an enterprise Kubernetes distribution on steroids. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.

At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be OK initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters—both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.

It should come as no surprise that we use our own tools to monitor our apps, including those deployed to OpenShift clusters. Kubernetes is just another layer of infrastructure. Along with the server and network visibility data, we are now incorporating Kubernetes and OpenShift metrics into the bigger monitoring picture.

In this blog, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments. (For more hands-on advice, read my blog Deploying AppDynamics Agents to OpenShift Using Init Containers.)

OpenShift Cluster Monitoring

For OpenShift cluster monitoring, we use two plug-ins that can be deployed with our standalone machine agent. AppDynamics’ Kubernetes Events Extension, described in our blog on monitoring Kubernetes events, tracks every event in the cluster. Kubernetes Snapshot Extension captures attributes of various cluster resources and publishes them to the AppDynamics Events API. The snapshot extension collects data on all deployments, pods, replica sets, daemon sets and service endpoints. It captures the full extent of the available attributes, including metadata, spec details, metrics and state. Both extensions use the Kubernetes API to retrieve the data, and can be configured to run at desired intervals.

The data these plug-ins provide ends up in our analytics data repository and instantly becomes available for mining, reporting, baselining and visualization. The data retention period is at least 90 days, which offers ample time to go back and perform an exhaustive root cause analysis (RCA). It also allows you to reduce the retention interval of events in the cluster itself. (By default, this is set to one hour.)

We use the collected data to build dynamic baselines, set up health rules and create alerts. The health rules, baselines and aggregate data points can then be displayed on custom dashboards where operators can see the norms and easily spot any deviations.

An example of a customizable Kubernetes dashboard.

What We Monitor and Why

Cluster Nodes

At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where core Kubernetes components (api-server, controller-manager, kube-schedule, etc.) are deployed, as well as a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.

From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. In order to spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs or spikes in consumption of the API server and, consequently, the etcd. Ironically, it is often monitoring solutions that are known for bringing clusters down by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes. Another useful technique, especially with large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, the spikes in event creation and event retrieval for monitoring purposes won’t affect performance of the main etcd instances. This can be accomplished by setting the –etcd-servers-overrides flag of the api-server, for example:

–etcd-servers-overrides =/events#;https://etcd2.;https://etcd3.

From the cluster perspective we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of the pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed/evicted).

A dashboard widget with infrastructure and cluster metrics combined.

Why is this important? Kubelet, the component responsible for managing pods on a given node, has a setting, –max-pods, which determines the maximum number of pods that can be orchestrated. In Kubernetes the default is 110. In OpenShift it is 250. The value can be changed up or down depending on need. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and to prevent sudden overflows (which could mean an outage). Another data point we add there is the number of evicted pods per node.

Pod Evictions

Evictions are caused by space or memory starvation. We recently had an issue with the disk space on one of our worker nodes due to a runaway log. As a result, the kubelet produced massive evictions of pods from that node. Evictions are bad for many reasons. They will typically affect the quality of service or may even cause an outage. If the evicted pods have an exclusive affinity with the node experiencing disk pressure, and as a result cannot be re-orchestrated elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to the meltdown of the cluster.

Long after the incident where pods were evicted, we saw the evicted pods were still lingering. Why was that? Garbage collection of evictions is controlled by a setting in kube-controller-manager called –terminated-pod-gc-threshold.  The default value is set to 12,500, which means that garbage collection won’t occur until you have that many evicted pods. Even in a large implementation it may be a good idea to dial this threshold down to a smaller number.

If you experience a lot of evictions, you may also want to check if kube-scheduler has a custom –policy-config-file defined with no CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.

Following our recent incident, we set up a new dashboard widget that tracks a metric of any threats that may cause a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we’re now looking for warning events that tell us when a node is about to experience memory or disk pressure, or when a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).

We also look for daemon pod failures (FailedDaemonPod), as they are often associated with cluster health rather than issues with the daemon set app itself.

Pod Issues

Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs, for example, may happen by design through auto-scaling when the app load subsides. They may also be issued manually or in error, and can expose the application to an excessive load.

Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn’t. It may be good idea to track pods with a pending time that exceeds a certain, reasonable threshold—one minute, for example. In AppDynamics, we also have the luxury of baselining any metric and then tracking any configurable deviation from the baseline. If you catch a spike in pending state duration, the first thing to check is the size of your images and the speed of image download. One big image may clog the pipe and affect other containers. Kubelet has this flag, –serialize-image-pulls, which is set to “true” by default. It means that images will be loaded one at a time. Change the flag to “false” if you want to load images in parallel and avoid the potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker’s overlay2 storage driver to make this work. In newer Docker versions this setting is the default. In addition to the Kubelet setting, you may also need to tweak the max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.

Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The Kubelet flag –image-pull-progress-deadline determines the point in time when the image will be deemed “too long to pull or extract.” If you deal with big images, make sure you dial up the value of the flag to fit your needs.

User Errors

Many big issues in the cluster stem from small user errors (human mistakes). A typo in a spec—for example, in the image name—may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.

Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations because they do not trigger pod failures. They will, however, prevent new deployments from starting and existing deployments from scaling up.

Speaking of quota violations, are you setting resource limits in your deployment specs?

Policing the Cluster

On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.

Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale-up or redeployment of these pods will result in failures.

The health probes, readiness and liveness are not enforceable, but it is a best practice to have them defined in the specs. They are the primary mechanism for the pods to tell the kubelet whether the application is ready to accept traffic and is still functioning. If the readiness probe is not defined and the pods takes a long time to initialize (based on the kubelet’s default), the pod will be restarted. This loop may continue for some time, taking up cluster resources for no reason and effectively causing a poor user experience or outage.

The absence of the liveness probe may cause a similar effect if the application is performing a lengthy operation and the pod appears to Kubelet as unresponsive.

We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.

Routing and Endpoint Tracking

As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.

We also monitor bad endpoints pointing at old (deleted) pods, which effectively cause downtime. This issue may occur during rolling updates when the cluster is under increased load and API request-throttling is lower than it needs to be. To resolve the issue, you may need to increase the –kube-api-burst and –kube-api-qps config values of kube-controller-manager.

Every metric we expose on the dashboard can be viewed and analyzed in the list and further refined with ADQL, the AppDynamics query language. After spotting an anomaly on the dashboard, the operator can drill into the raw data to get to the root cause of the problem.

Application Monitoring

Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.

In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).

Leveraging K8s Metadata

Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.

Monitoring Kubernetes or OpenShift is just a part of what AppDynamics does for our internal needs and for our clients. AppDynamics covers the entire spectrum of end-to-end monitoring, from the foundational infrastructure to business intelligence. Inherently, AppDynamics is used by many different groups of operators who may have very different skills. For example, we look at the platform as a collaboration tool that helps translate the language of APM to the language of Kubernetes and vice versa.

By bringing these different datasets together under one umbrella, AppDynamics establishes a common ground for diverse groups of operators. On the one hand you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of APM or managers looking at business metrics, both of whom may not be intimately familiar with Kubernetes. These groups can now have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard.

Learn more about how AppDynamics can help you monitor your applications on Kubernetes and OpenShift.

How Top Investment Banks Accelerate Transaction Time and Avoid Performance Bottlenecks

A complex series of interactions must take place for an investment bank to process a single trade. From the moment it’s placed by a buyer, an order is received by front-office traders and passed through to middle- and back-office systems that conduct risk management checks, matchmaking, clearing and settlement. Then the buyer receives the securities and the seller the corresponding cash. Once complete, the trade is sent to regularity reporting, which insures the transaction was processed under the right regulatory requirements. One AppDynamics customer, a major financial firm, utilizes thousands of microservices to complete this highly complex task countless times throughout the day.

To expedite this process, banks have implemented straight-through processing (STP), an initiative that allows electronically entered information to move between parties in the settlement process without manual intervention. But one of the banks’ biggest concerns with STP is the difficulty of following trades in real-time. When trades get stuck, manual intervention is needed, often impacting service level agreements (SLAs) and even trade reconciliation processes. One investment firm, for instance, told AppDynamics that approximately 20% of its trades needed manual input to complete what should have been a fully automated process—a bottleneck that added significant overhead and resource requirements. And with trade volumes increasing 25% year over year, the company needed a fresh approach to help manage its rapid growth.

AppDynamics’ Business Transactions (BT) enabled the firm to track and follow trades in real time, end-to-end through its systems. The BT traces through of all the necessary systems and microservices—applications, databases, third-party APIs, web services, and so on—needed to process and respond to a request. In investment banking, a BT may include everything from placing an order, completing risk checks or calculations, booking and confirming different types of trades, and even post-trade actions such as clearing, settlement and regularity reporting.

The AppDynamics Business Journey takes this one step further by following a transaction across multiple BTs; for example, following an individual trade from order through capture and then to downstream reporting. The Business Journey provides true end-to-end, time-enabling tracking against SLAs, and traces the transaction across each step to monitor performance and ensure completion.

Once created, the Business Journey allows you to visualise key metrics with out-of-the-box dashboards.

Real-Time Tracking with Dashboards

Prior to AppDynamics, one investment bank struggled to track trades in real time. They were doing direct queries on the database to find out how many trades had made it downstream to the reporting database. This method was slow and inefficient, requiring employees to create and share small Excel dashboards, which lacked real-time trade information. AppDynamics APM dashboards, by comparison, enabled them to get a real-time, high-level overview of the health and performance of their system.

After installing AppDynamics, the investment bank instrumented a dashboard to show all the trades entering its post-trade system throughout the day. This capability proved hugely beneficial in helping the firm monitor trading spikes and ensure it was meeting its SLAs. And Business IQ performance monitoring made it possible to slice and dice massive volumes of incoming trades to gain real-time insights into where the transactions were coming from (i.e., which source system), their value, whether they met the SLAs, and which ones failed to process. Additionally, AppDynamics Experience Level Management provided the ability to report compliance against specific processing times.

Now the bank could automate complex processes and remove inefficient manual systems. Prior to AppDynamics, there was a team dedicated to overseeing more than 200 microservices. They had to determine why a particular trade failed, and then pass that information onto the relevant business teams for follow-up to avoid losing business. But too often a third-party source would send invalid data, or update its software and send a trade in an updated format unfamiliar to the bank’s backend system, creating a logistical mess too complex for one human to manage. With Business IQ, the bank was able to immediately spot and follow up on invalid trades.

Searching for Trades Across All Applications

Microservices offer many advantages but can bring added complexity as well. The investment bank had hundreds of microservices but lacked a fast and efficient way to search for an individual trade. In the event of a problem, they would take the trade ID and look into the log files of multiple microservices. On average, they had to open up some 40 different log files to locate a problem. And although the firm had an experienced support staff that knew the applications well, this manual process wasn’t sustainable as newer, inexperienced support people were brought onboard. Nor would this system scale as trade volume increased.

By using Business IQ to monitor every transaction across all microservices, the bank was able to easily monitor individual trade transactions throughout the lifecycle. And by capturing the trade ID, as well as supplementary data such as the source, client, value and currency, they could then go into AppDynamics Application Analytics and very quickly identify specific transactions. For example, they could enter the trade ID and see every transaction for the trade across the entire system.

This feature was particularly loved by the support staff, which now had immediate access to all of a trade’s interactions within a single screen, as well as the ability to easily drill down to find the find the root cause of a failed transaction.

Tracking Regulatory SLAs in Real Time

Prior to AppDynamics, our customer didn’t have an easy way to track the progress of a trade in real time. Rather, they were manually verifying that trades were successfully being sent to regulatory reporting systems, as well as ensuring that this was completed within the required timeframe. This was difficult to do in real time, meaning that when there was an issue, often it was not found until after the SLA had been breached. With AppDynamics they were able to set up a dashboard to visualise data in real time; the team then set up a health rule to indicate if trade reporting times were approaching the SLA. They also configured an alert that enabled them to proactively see and resolve any issues ahead of an SLA breach.

Proactively Tracking Performance after Code Releases

The bank periodically introduces new functionality to meet the latest business or regulatory requirements, in particular MiFID II, introduced to improve investor protection across Europe by harmonizing the rules for all firms with EU clients. Currently, new releases happen every week, but this rate will continue to increase. These new code releases introduce risk, as previous releases have either had a negative impact on system performance or have introduced new defects. In one two-month period, for instance, the time required to capture a trade increased by about 20%. If this continued, the bank would have had to scale out hugely—buying new hardware at significant cost—to avoid breaching its regulatory SLA.

The solution was to create a comparative dashboard in AppDynamics that showed critical Business Transactions and how they were being changed between releases (response times, errors, and so on). If any metric degraded from the previous version or deviated from a certain threshold, it would be highlighted on the dashboard in a different color, making it easier to decide whether to proceed with a rollout or determine which new feature or change had caused the deviation.

Preventing New Hardware Purchases

After refining its code based on AppDynamics’ insights, the bank saw a dramatic 6X performance improvement. This saved them from having to—in their words—“throw more hardware at the problem” by buying more CPU processing power to push through more trades.

By instrumenting their back office systems with AppDynamics, the bank gained deep insights that enabled them to refine their code. For instance, calls to third-party APIs were taking place unnecessarily and trades were being captured unintentionally within multiple different databases. Without AppDynamics, it’s unlikely this would have been discovered. The insight enabled the bank to make some very simple changes to fine-tune code, resulting in a significant performance improvement and enabling the bank to save money by scaling with their existing hardware profile.

Beneficial Business Outcomes

From the bank’s perspective, one of the greatest gains of going with AppDynamics was the ability to follow a trade through its many complex services, from the moment an order is placed, through to capture and down to regularity reporting. This enabled them to improve system performance, avoid expensive (and unnecessary) hardware upgrades, quickly search for trade IDs to locate and find the root cause of issues, and proactively manage SLAs.

See how AppDynamics can help your own business achieve positive outcomes.

How Anti-Patterns Can Stifle Microservices Adoption in the Enterprise

In my last article, Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension, we went over some deployment and communication patterns that help keep microservices manageable as you use more of them. I also promised that in my next post, I’d get into how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess. So let’s dig in.

First things first: Patterns are awesome.

They help formalize ideas into reusable chunks that you can distribute and communicate easily to your teams. Here are some useful things that patterns do for engineering departments of any size:

  • Make ideas distributable

  • Lower the barriers to success for junior members

  • Create building blocks

  • Create consistency in disparate/complex systems

Patterns are usually an intentional act. When we put patterns in place, we’re making a clear choice and acting on a decision to make things less painful. But not all patterns are helpful. In the case of the anti-pattern, it has the potential to create more trouble for the engineering team and the business.

The Anatomy of An Anti-Pattern

Like patterns, anti-patterns tend to be identifiable and repeatable.

Anti-patterns are rarely intentional, and usually you identify them long after their effects are visible. Individuals in your organization often make well-meaning (if poor) choices in the pursuit of faster delivery, rushed deadlines, and so on. These anti-patterns are often perpetuated by other employees who decide, “Well, this must be how it’s done here.”

In this way, anti-patterns can become a very dangerous norm.

For companies that want to migrate their architecture to microservices, anti-patterns are a serious obstacle to success. That’s why I’d like to share with you a few common anti-patterns I’ve seen repeated many times in companies making the switch to microservices. These moves eventually compromised their progress and created more of the problems they were trying to avoid.

Data Taffy

The first anti-pattern is the most common—and the most subtle—in the chaos and damage it causes.

The Problem

The data taffy anti-pattern can manifest in a few different ways, but the short explanation is that it occurs when all services have full access to all objects in the database.

That doesn’t sound so bad, right?

You want to be able to create complex queries and complex data-ingestion scenarios to go across many domains. So at first glance, it makes sense to have everything call what it needs directly from the database. But that’s a problem when you need to scale an individual domain in your application. Rarely does data grow uniformly across all domains, but rather does so in bursts on individual domains. Sometimes it’s very difficult to predict which domains will grow the fastest. The entangled data becomes a lot like taffy: difficult to pull apart. It stretches and gets stuck in the cogs of business.

In this scenario, companies will have lots of stored procedures, embedded complex queries across many services, and object relationship managers all accessing the database—each with its own understanding of how a domain is “supposed” to be used. This nearly always leads to data contamination and performance issues.

But there are even bigger challenges, most notably when you need to make structural changes to your database.

Here’s an example based on a real-life experience I had with a large, privately owned company, one that started small and expanded rapidly to service tens of thousands of clients. Say you have a table whose primary key is an int starting at 0. You have 2,147,483,647 objects before you’ll run out of keys—no big deal, right?

So you start building out services, and this table becomes a cornerstone object in your application that every other domain touches in some meaningful way. Before you know it, there are 125 applications calling into this table from queries or stored procedures, totaling some 13,000 references to the table. Because this is a core table it gets a ton of data. Soon you’re at 2,100,000,000 objects with 10,000,000 new records being added daily.

You have four days before things go bad—real bad.

You try adding negative values to buy time, not realizing that half the services have hard-coded rules that IDs must be greater than 0. So you bite the bullet and manually scrub through EVERY SERVICE to find every instance of every object that has been created that uses this data, and then update the type from an integer to a large integer. You then have to update several other tables, objects, and stored procedures with foreign key relationships. This becomes a hugely painful effort with all hands on deck frantically trying to keep the company’s flagship product from being DOA.

Clearly, not an ideal scenario for any company.

Now if this domain was contained behind a single service, you’d know exactly how the table is being used and could find creative solutions to maintain backwards compatibility. After all, you can do a lot with code that you simply can’t do by changing a data value in a database. For instance, in the example above, there are about 800 million available IDs that could be reclaimed and mapped for new adds, which would buy enough time for a long-term plan that doesn’t require a frantic, all-hands-on-deck approach. This could even be combined with a two-key system based on a secondary value used to partition the data effectively. In addition, there’s one partitionable field we could use to give us 10,000x more available integers, as well as a five-year window to create more permanent solves with no changes to any consuming services.

This is just one anecdote, but I have seen this problem consistently halt scale strategies for companies at crucial times of growth. So try to avoid this anti-pattern.

How to Solve

To solve the data taffy problem, you must isolate data to specific domains only accessible via services designed to service them. The data may start on the same database but use schema and access policy to limit access to a single service. This enables you to change databases, create partitions, or move to entirely new data storage systems without any other service or system having to know or care.

Dependency Disorder

Say you’ve switched to microservices, but deployments are taking longer than ever. You wish you had never tried to break down the monolith. If this sounds familiar, you may be suffering from dependency disorder.

The Problem

Dependency disorder is one of the easiest anti-patterns to detect. If you have to know the exact order that services must be deployed to keep them from failing, it’s a clear signal the dependencies have become nested in a way that won’t scale well. Dependency disorder generally comes from domains calling sideways from one domain’s stack to another (instead of down the stack from the UI to the gateway) and then to the services that enable the gateway. Another big problem resulting from dependency disorder: unknown execution paths that take arbitrarily long times to execute.

How to Solve

An APM solution is a great starting point for resolving dependency disorder problems. Try to utilize a solution that provides a complete topology of your service execution paths. By leveraging these maps, you can make precision cuts in the call chain and refocus gateways to make fan-out calls that execute asynchronously rather than doing sideways calls. For some examples of helpful patterns, check out part one of this series. Ideally, we want to avoid service-to-service calls that create a deep and unmanageable call stack and favor a wider set of calls from the gateway.


Microliths basically are well-meaning, clear-service paths that take dependency disorder to its maximum entropic state.

The Problem

Imagine having a really well-designed service, database and gateway implementation that you decide to isolate into a container—you feel great! You have a neat-and-tidy set of rules for how data gets stored, managed and scaled.

Believing you’ve reached microservice nirvana, you breathe a sigh of relief and wait for the accolades. Then you notice the releases gradually start taking longer and longer, and that data-coupling is happening in weird ways. You also find yourself deploying nearly the entire suite of services with every deployment, causing testing issues and delays. More and more trouble tickets are coming in each quarter, and before you know it, the organization is ready to scrap microservices altogether.

The promise of microservices is that there are no rules—you just put out whatever you want. The problem is that without a clear definition of how data flows down the stack, you’re basically creating a hybrid problem between data taffy and dependency disorder.

How to Solve

The mediation process here is effectively the same as with the dependency disorder. If you are working with a full-blown microlith, however, it will take some diligence to get back to stable footing. The best advice I can give is, try to get to a point where you can deploy a commit as soon as it’s in. If your automation and dependency orders are well-aligned, new service features should always be ready to roll out as soon as the developer commits to the code base. Don’t stand on formality. If this process is painful, do it more. Smooth out your automated testing and deployment so that you can reliably get commits deployed to production with no downtime.

Final Thoughts

I hope this gets the wheels spinning in your head about some of the microservices challenges you may be having now or setting yourself up for in the future. This information is all based on my personal experiences, as well as my daily conversations with others in the industry who think about these problems. I’d love to hear your input, too. Reach out via email,; Twitter,; or LinkedIn, If you’ve got some other thoughts, I’d love to hear from you!

Microservice Patterns That Help Large Enterprises Speed Development, Deployment and Extension

This is the first in a two-part series on microservice patterns and anti-patterns. In this article, we’ll focus on some useful patterns that, when leveraged, can speed up development, deployment, and extension. In the next article, we’ll focus on how microservice patterns can become toxic, create more overhead and unreliability, and become an unmanageable mess.

Microservice patterns assume a large-scale enterprise-style environment. If you’re working with only a small set of services (1 to 5) you won’t feel the positive impact as strongly as organizations with 10, 100, or 1000+ services. My biggest pattern for startups and smaller projects is to not overthink and add complexity for complexity’s sake. Patterns are meant to aid in solving problems—they are not a hammer with which to bludgeon the engineering team. So use them with this in mind.

Open for Extension, Closed for Modification

We’re going to start our patterns talk with a principle rather than a pattern. Software development teams working on microservices get more mileage out of this one principle than most any other pattern or principle. It’s a classic pattern from the SOLID principles of Robert C. Martin (Uncle Bob).

In short, being open for extension and closed for modification means leaving your code open to add new functionality via inheritance but closed for direct modifications. I take a bit of a looser definition that tends to be more pragmatic. My definition is, “Don’t break existing contracts.” It’s fine to add methods but don’t change the signature of existing methods. Nor should you change the functionality of an established method.

Why is This Pattern So Powerful in Microservices?

When you have disparate teams working on different services that have to interoperate with one another, you need a certain level of reliability. I, as a consumer of a service, need to be able to depend on the service in the future, even as new features are added.

How Do We Manifest this Pattern?

Easy. Don’t break your contracts. There’s never a good reason to break an existing production contract. However, there could be lots of good reasons to add to it, thereby making it “open for extension and closed for modification.” For example, if you have to start collecting new data as part of a service, add a new endpoint and set a timeline to depreciate the old service call, but don’t do both in a single step. Likewise with data management: If you need to rename a column of data, just add a new column and leave the old column empty for awhile. When you depreciate an old service, it’s a good time to do any clean-up that goes with that depreciation. If you can adhere to this principle, everyone in your organization with have a better development experience.

Pattern: Enterprise Services with SPA Gateways

When we start building out large applications and moving towards a microservice paradigm, the issue of manageability quickly rises to the surface. We have to address manageability at many layers, and also must consider dependency management. Microservices can quickly become a glued-together mess with tightly coupled dependencies. Instead of having a monolithic “big ball of mud,” we create a “mudslide.”

One way to address these problems is to introduce the notion of Enterprise Domain Services that are responsible for the tasks within different domains in your organization, and then combine that domain-specific logic into more meaningful activities (i.e., product features) at the Single Page Application (SPA) gateway layer. The SPA gateway serves to take some subset of the overall functionality of an application (i.e., a single page worth) and codify that functionality, delegating the “hard parts” (persistence, state management, third-party calls, etc.) off to the associative enterprise services. In this pattern, each enterprise service either owns its own data as a single database, a collection of databases, or as an owned schema as part of a larger enterprise database.

Pattern: SPA Services with Gateway and ETL

Now we are going to ramp up the complexity a bit. One of the big questions people get into when they start down the microservices path is, “How do I join complex data?” In the Enterprise Services with SPA gateways example above, you would just call into multiple services. This is fine when you’re combining two to three points of data, but what about when you need really in-depth questions answered? How do you find, for instance, all the demographic data for one region’s customers who had invoices for the green version of an item in the second quarter of the month?

This question isn’t incredibly difficult if you have all the data together in a single database. But then you might start violating single responsibility principles pretty fast. The goal here then is to delegate that responsibility to a service that’s really good at just joining data via ETL (Extract, Transform, Load). ETL is a pattern for data warehousing where you extract data from disparate data sources, transform the data into something meaningful to the business, and load the transformed data somewhere else for utilization. The team that owns the domain that will be asking these types of demographic questions will be responsible for the care and feeding of services that perform the ETL, the database or schema where the transformed data is stored, and the services(s) that provide access to it.

Why Not Just Make a Multi-Domain Call at the Database?

This is a fair question, and on a small project it may be reasonable to do so. But on large projects with lots of moving parts, each part must be able to move independently. If we are combining directly at the DB level, we’re pretty much guaranteeing that the data will only ever travel together on that single DB, which is no big deal with small volumes of data. However, once we start dealing with tens, hundreds or thousands of terabytes, this becomes more of a big deal as it greatly impacts the way we scale domains independently. Using ETLs and data warehousing strategies to provide an abstraction layer on the movement and combination of our data might require us to update our ETL if we move the data around. But this feat is much more manageable than trying to untangle thousands of nested, stored procedures across every domain.

Closing Thoughts

Remember, these are just some of the available patterns. Your goal here is to solve problems, not create more problems. If a particular pattern isn’t working well for your purpose, it’s okay to create your own, or mix and match.

One of the easiest ways to get a handle on large projects with lots of domains is to use tools like AppDynamics with automatic mapping functionality to get a better understanding of the dependency graph. This will help you sort out the tangled mess of wires.

Remember, the best way to eat an elephant is one bite at a time.

In my next blog, we’ll look at some common anti-patterns, which at first may seem like really good ideas. But anti-patterns can cause problems pretty quickly and make it difficult to scale and maintain your projects.

Advances In Mesh Technology Make It Easier for the Enterprise to Embrace Containers and Microservices

More enterprises are embracing containers and microservices, which bring along additional networking complexities. So it’s no surprise that service meshes are in the spotlight now. There have been substantial advances recently in service mesh technologies—including Istio’s 1.0, Hashi Corp’s Consul 1.2.1, and Buoyant merging Conduent into LinkerD—and for good reason.

Some background: service meshes are pieces of infrastructure that facilitate service-to-service communication—the backbone of all modern applications. A service mesh allows for codifying more complex networking rules and behaviors such as a circuit breaker pattern. AppDev teams can start to rely on service mesh facilities, and rest assured their applications will perform in a consistent, code-defined manner.

Endpoint Bloom

The more services and replicas you have, the more endpoints you have. And with the container and microservices boom, the number of endpoints is exploding. With the rise of Platform-as-a-Services and container orchestrators, new terms like ingress and egress are becoming part of the AppDev team vernacular. As you go through your containerization journey, multiple questions will arise around the topic of connectivity. Application owners will have to define how and where their services are exposed.

The days of providing the networking team with a context/VIP to add to web infrastructure—such as over port 443—are fading. Today, AppDev teams are more likely to hand over a Kubernetes YAML to add to the Ingress controller, and then describe a behavior. Example: the shopping cart Pod needs to talk to the shopping cart validation Pod, which can only be accessed by the shopping cart because the inventory is kept on another set of Reddis Pods, which can’t be exposed to the outside world.

You’re juggling all of this while navigating constraints set by defined and deployed Kubernetes networking. At this point, don’t be alarmed if you’re thinking, “Wow, I thought I was in AppDev—didn’t know I needed a CCNA to get my application deployed!”

The Rise of the Service Mesh

When navigating the “fog of system development,” it’s tricky to know all the moving pieces and connectivity options. With AppDev teams focusing mostly on feature development rather than connectivity, it’s very important to make sure all the services are discoverable to them. Investments in API management are the norm now, with teams registering and representing their services in an API gateway or documenting them in Swagger, for example.

But what about the underlying networking stack? Services might be discoverable, but are they available? Imagine a Venn diagram of AppDev vs. Sys Engineer vs. SRE: Who’s responsible for which task? And with multiple pieces of infrastructure to traverse, what would be a consistent way to describe networking patterns between services?

Service Mesh to the Rescue

Going back to the endpoint bloom, consistency and predictability are king. Over the past few years, service meshes have been maturing and gaining popularity. Here are some great places to learn more about them:

Service Mesh 101

In the Istio model, applications participate in a service mesh. Istio acts as the mesh, and then applications can participate in the mesh via a sidecar proxy—Envoy, in Istio’s case.

Your First Mesh

DZone has a very well-written article about standing up your first Java application in Kubernetes to participate in an Istio-powered service mesh. The article goes into detail about deploying Istio itself in Kubernetes (in this case, MinuKube). For an AppDev team, the new piece would be creating the all-important routing rules, which are deployed to Istio.

Which One of these Meshes?

The New Stack has a very good article comparing the pros and cons of the major service mesh providers. The post lays out the problem in granular format, and discusses which factors you should consider to determine if your organization is even ready for a service mesh.

Increasing Importance of AppDynamics

With the advent of the service mesh, barriers are falling and enabling services to communicate more consistently, especially in production environments.

If tweaks are needed on the routing rules—for example, a time out—it’s best to have the ability to pinpoint which remote calls would make the most sense for this task. AppDynamics has the ability to examine service endpoints, which can provide much-needed data for these tweaks.

For the service mesh itself, AppDynamcs in Kubernetes can even monitor the health of your applications deployed on a Kubernetes cluster.

With the rising velocity of new applications being created or broken into smaller pieces, AppDynamics can help make sure all of these components are humming at their optimal frequency.