Monitoring is complicated, with multiple tools in use and many more available to us for little or no cost. These tools provide many ways to be alerted when issues are detected, as well as valuable insight into solving those issues quickly. Given their flexibility, especially with today’s powerful and scalable log analytics solutions, they can be applied to a wide range of use cases. With their ability to ingest unstructured and structured data, tools like Splunk, Sumo Logic, or the open-source Elastic (ELK) Stack are the Swiss Army knives of analytics tools. I keep a copy of Splunk on my laptop and many Linux boxes for this reason. These tools are so flexible that users end up applying them to almost any use case, often without understanding the challenges in doing so. The obstacles usually become apparent later: these tools are not well suited to handling high-volume metrics, and they lack the troubleshooting workflows and out-of-the-box views that make troubleshooting more productive.
One use case that comes up occasionally is using these products for APM. Gartner has called this out, and it has been attempted many times with limited success. The core differentiator between log analytics tools and APM is that APM tools are designed specifically to instrument software (with an agent), measure transaction performance (typically with end-to-end tracing), capture front-end performance (synthetic and real user), and provide code-level visibility in a production environment.
The beauty of logs is that vendors and developers can write code that outputs useful (and sometimes useless) data that can be ingested and analyzed. An application developer or vendor may log each transaction and include a specific transaction ID for each request. As the request moves between components, that transaction ID can be persisted, which allows for transaction tracking use cases. It might look something like this example, which is using a combination of Zipkin (tracing) and the ELK Stack. If you’re able to persist this transaction ID down to other components, you could query for and pull up a trace in your log analysis tool of choice.
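To make the idea of persisting a transaction ID concrete, here is a minimal Python sketch using only the standard library. It is not the Zipkin or ELK setup mentioned above; the service names, the `txn_id` field, and the helper functions are all hypothetical, chosen only to show one ID being generated at the edge and logged by every component the request touches.

```python
import json
import logging
import uuid


def make_logger(stream):
    """Build a logger that writes one JSON document per line to `stream`."""
    logger = logging.getLogger("txn-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, component, txn_id, message, duration_ms):
    """Emit a structured log entry that always carries the transaction ID."""
    logger.info(json.dumps({
        "component": component,
        "txn_id": txn_id,
        "message": message,
        "duration_ms": duration_ms,
    }))


def inventory_service(logger, txn_id):
    # Downstream component: it reuses the ID it was handed.
    log_event(logger, "inventory", txn_id, "stock checked", 12)


def checkout_service(logger, txn_id):
    log_event(logger, "checkout", txn_id, "order received", 3)
    inventory_service(logger, txn_id)  # pass the same ID downstream
    log_event(logger, "checkout", txn_id, "order completed", 40)


def handle_request(logger):
    """Entry point: create the transaction ID once, at the edge."""
    txn_id = uuid.uuid4().hex
    checkout_service(logger, txn_id)
    return txn_id
```

Calling `handle_request(make_logger(sys.stderr))` (after `import sys`) would emit three JSON lines sharing one `txn_id`, which is the property a log analytics tool needs in order to group them into a single transaction.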
This requires either software agents to be built and modified for each language (I’m hoping OpenCensus solves this), or the developers of these software systems to manually instrument code so that the right data is logged through each component, which can be a challenge. This introduces scalability issues and isn’t really feasible in large organizations with many small teams, aside from the Facebook/Amazon/Netflix/Google-type companies that have the resources and culture to build platform services and invest a lot of money in people, custom software, and infrastructure to make this work at scale. If the right data is logged, a transaction can be followed, possibly with metrics or other data shown in the trace segments (hops). Even then, this requires knowing the transaction ID and running manual queries in the log analytics tool, or alternatively querying on a metric, finding a useful transaction, and then querying for that transaction ID to extract the right log entries.
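The two-step query workflow described here can be sketched in a few lines of Python. This is only an illustration over an in-memory list, not the query language of any real log analytics product; the field names (`ts`, `txn_id`, `component`, `duration_ms`) and the sample entries are made up for the example.

```python
# Hypothetical, hand-written log entries standing in for an indexed log store.
LOG_ENTRIES = [
    {"ts": 1, "txn_id": "a1", "component": "web", "duration_ms": 5},
    {"ts": 2, "txn_id": "b2", "component": "web", "duration_ms": 4},
    {"ts": 3, "txn_id": "a1", "component": "db", "duration_ms": 950},
    {"ts": 4, "txn_id": "b2", "component": "db", "duration_ms": 20},
    {"ts": 5, "txn_id": "a1", "component": "web", "duration_ms": 960},
]


def find_slow_transactions(entries, threshold_ms):
    """Step 1: query on a metric to find transaction IDs worth a look."""
    slow_ids = []
    for entry in entries:
        if entry["duration_ms"] >= threshold_ms and entry["txn_id"] not in slow_ids:
            slow_ids.append(entry["txn_id"])
    return slow_ids


def entries_for_transaction(entries, txn_id):
    """Step 2: query again by ID to extract that transaction's entries in order."""
    matching = [entry for entry in entries if entry["txn_id"] == txn_id]
    return sorted(matching, key=lambda entry: entry["ts"])


# Manual workflow: first the metric query, then one ID query per candidate.
for txn_id in find_slow_transactions(LOG_ENTRIES, threshold_ms=500):
    hops = [e["component"] for e in entries_for_transaction(LOG_ENTRIES, txn_id)]
    print(txn_id, hops)
```

Even in this toy form, the person doing the troubleshooting has to know which metric to search on and then issue a second query per transaction, which is the manual effort an APM tool’s trace view removes.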
APM does a lot more than just showing you traces and timing. Most APM tools start by instrumenting browser and mobile apps with runtime or embedded agents, which are typically automated. These agents collect detailed metrics and configuration data about the runtime, and they also trace from the front-end to the back-end. These traces contain not just timing, but also detailed diagnostic data, including code-level insights, queries, and other API calls. Some APM tools also monitor logical and physical infrastructure, including servers, containers, and network performance. Over time, we’ll see more extensible metric systems that allow correlation between a metric and the trace where it was measured. Today, a few APM tools automatically correlate log messages from the code with traces, which makes logs more useful in context.
Let’s look at the latest trend, known as “Observability”. It focuses on a much smaller subset of data and a broader set of use cases. Those include not only monitoring and diagnostics, but also business reporting. The same use cases can be addressed with APM tools and even logs, but the richness of the data collected and the modeling within an APM tool create much more meaning around the data.
In Observability or log analytics, there is likely no tracing of this type between services unless it is implemented by developers, who must code the “stitching” of some kind of transaction ID. If tracing is understood to be important, or is required to diagnose highly distributed architectures, developers must manually inject these IDs into the protocols used to communicate with other systems (or use libraries that do it for them). This is the OpenTracing approach, which will not work for most users of APM tools. It could work for completely custom systems (which is not the norm in the enterprise), but it requires diligence from every team to add this to their practices.
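As a rough sketch of what that manual “stitching” involves, the Python below passes a transaction ID between two services through a request header. It does not use the OpenTracing API; the header name `X-Transaction-Id`, the two service functions, and the plain dictionaries standing in for HTTP requests are all assumptions made for illustration.

```python
import uuid

# Hypothetical header name; every team would have to agree on the same one.
TXN_HEADER = "X-Transaction-Id"


def ensure_txn_id(incoming_headers):
    """Reuse the caller's transaction ID, or start a new one at the edge."""
    return incoming_headers.get(TXN_HEADER) or uuid.uuid4().hex


def outgoing_headers(txn_id, extra=None):
    """Build headers for a downstream call with the ID injected by hand."""
    headers = dict(extra or {})
    headers[TXN_HEADER] = txn_id
    return headers


def service_b(headers):
    # The downstream service must also remember to read the header.
    txn_id = ensure_txn_id(headers)
    return {"service": "b", "txn_id": txn_id}


def service_a(headers):
    txn_id = ensure_txn_id(headers)
    # Forgetting this injection on any single call breaks the trace at that hop.
    downstream = service_b(outgoing_headers(txn_id, {"Accept": "application/json"}))
    return {"service": "a", "txn_id": txn_id, "downstream": downstream}
```

The sketch is short, but the same three steps (read the ID, inject it, log it) have to be repeated for every protocol and every service, which is where the diligence between teams comes in.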
Standardization on what should be recorded, and where, is another challenge. The meaning of the recorded metrics is largely open to interpretation, which creates confusion as the software ages. This is why using auto-instrumentation whenever possible makes things easier.
When new capabilities or languages are introduced, the new agents working at runtime or compile time don’t need to be manually configured, as they already know how to collect and correlate data.
Aside from the code being instrumented, there are likely dependencies on other data stores. This could be a database, a big data back-end, an object-based filesystem, or many of each. Monitoring these dependencies doesn’t just require an understanding of the transactions (tracing and timing), but also operational views of these systems. These could be database monitoring, metrics, and events coming from your cloud-based systems. They’re a major part of monitoring an application, but are generally missing from Observability or tracing systems.
The fad of Observability will pass, as will the new vendors touting this terminology, as their technology gradually becomes part of larger vendors who can offer more thorough monitoring and visibility systems. This is how history works, and how it will continue to work. Log analytics will be around for the long haul, both for operational and security use cases where vendor and custom logs contain useful information. Since these flexible systems can fill so many use cases, they will sometimes be applied to solve the wrong problems, or be applied at great cost and complexity. It is essential to understand which use cases call for log systems and which call for APM, rather than trying to adapt log and Observability systems to APM use cases.