OpenTelemetry™ is a complete telemetry system for monitoring both modern, distributed architectures in the cloud and more traditional on-prem applications. Debuting in 2019, OTel is fast becoming the de facto standard for telemetry and observability, and AppDynamics supports OpenTelemetry-based application and infrastructure monitoring across all our offerings. This gives us many opportunities to optimize how we test our products that consume OTel data.
In this two-part blog, we’ll explore how we developed our OpenTelemetry data generator and how it provides a seamless product-validation experience across teams working in AppDynamics Cloud. We’ll also share details on the tool’s recent open source release, and show how to use it.
Before diving into details, let’s look at why the OpenTelemetry data generator was needed in the first place.
To test or demo any product, there needs to be some data available in the system. This matters even more for application monitoring platforms, since such systems are intensely data-driven. We could, of course, monitor an actual underlying infrastructure, but apart from the cost of maintaining that infrastructure, we wouldn’t have control over the telemetry data being generated or, more importantly, know the specific data points.
This presented a cross-team challenge for our squads working on the platform: we needed a cost-effective and performant solution. In short, we needed a tool to generate simulated monitoring data.
When we talk about application testing (other than operational testing such as disaster/fault-tolerance and chaos testing), we broadly categorize it into three groups, each with its own nuances and requirements:
- Functional testing: tuning the data at a high level of granularity to test various positive outcomes in the system.
- Negative testing: omitting data, or providing bad data, for certain fields in the input.
- Performance testing: focusing more on the scale/volume of the data rather than the inherent details.
Since our systems are intensely data-driven, there’s a tradeoff between the flexibility of the data generated for functional testing and the scale of that data. To be more specific, if we want more control over the data, our data generator needs to make more decisions and perform more transformations to generate each data point, which limits its usefulness as a performance-testing tool.
After analyzing in-house requirements and the needs of the OpenTelemetry community, we chose to favor the functional testing side, where we could offer greater customization of the data being generated. And while that didn’t completely eliminate our ability to use the same tool for load/performance testing, it did (to some extent) limit the scale of the data to be generated.
Our primary goal was to give users of the data generator tool nearly complete flexibility in how each data point is generated. This means giving them control over the details of the entities/resources and their metrics, events, logs and traces (MELT) data reported by the entities. For example, a Kubernetes (K8s) pod is an entity and can report metrics such as memory usage as well as send K8s events. Similarly, an application running on the pod can report logs and traces.
At a high level, this doesn’t sound all that complex. But if we dig into the OpenTelemetry spec, it’s apparent there are myriad fields for each type and many ways to combine these fields to tell the state of the entity. It soon became obvious that we would need to come up with our own specification for the tool to read and generate data. These definition files are written in the YAML format.
Let’s take a quick tour of the OpenTelemetry spec at a very high level and also look at examples of our corresponding definition files for data generation.
The OpenTelemetry Specification defines each MELT packet as having two sections: the resource, and the telemetry (MELT) data reported by that resource. The resource section consists solely of a list of key-value pairs, known as the resource’s attributes. Together, these must uniquely identify each resource. For example, each pod could have the following attributes:
- Pod IP
- Pod UID
- Pod name
- Pod cluster name
- Node UID
Adding a value doesn’t seem complex — until we realize we may need to simulate hundreds or even thousands of pods. So how do we assign unique values to the same attributes for each entity? In our entity definition, we define each entity type and then add details. For example, to simulate having 250 pods, we can write the definition for it as follows:
Expressions such as IPv4Sequence and UUIDFromStringCounter are methods implemented in our code, and they are executed using the Jakarta Expression Language (JEL). When the OpenTelemetry data generator tool generates entities, it evaluates these expressions afresh for each new entity. So the pods would be generated as:
Pod #1 ->
“k8s.pod.ip” : “18.104.22.168”
“k8s.pod.uid” : “9bcdea33-5c08-35dd-a2d2-0ba55f754321”
Pod #2 ->
“k8s.pod.ip” : “22.214.171.124”
“k8s.pod.uid” : “d1f672b0-1af7-3b1c-8a4b-72ecfc5f1bf2”
Pod #3 ->
“k8s.pod.ip” : “126.96.36.199”
“k8s.pod.uid” : “67a0967a-e8c5-3720-a1ee-ec9882ac9774”
and so on.
Besides being an intuitive way of writing unique attributes for each resource, this approach is also deterministic: every time the tool is executed with this definition, the same set of pods is generated. In addition to the expression methods shown above, there are a few more, and the list expands on demand. There are also fields for specifying things like parent-child relationships between entities (pod-node-cluster), copying attributes from parent to child types, and entity churn (pods crashing and restarting). We’ll link to detailed documentation at the end of part two of this blog.
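To make the per-entity evaluation concrete, here is a minimal Python sketch of how sequence-style expressions can yield unique yet repeatable attribute values. It is an illustration only, not the tool’s actual code: the helper names ipv4_sequence, uuid_from_string_counter and generate_pods, the starting address 10.0.0.1 and the "pod" prefix are all assumptions made for this example. The sample UIDs above carry the version digit 3, so the sketch uses name-based, MD5-derived UUIDs, which is likewise an assumption about how the values are derived.

```python
import hashlib
import ipaddress
import uuid


def ipv4_sequence(start):
    """Yield consecutive IPv4 addresses beginning at `start` (hypothetical helper)."""
    address = ipaddress.IPv4Address(start)
    while True:
        yield str(address)
        address += 1  # IPv4Address supports integer addition


def uuid_from_string_counter(prefix):
    """Yield version-3 style UUIDs derived from `prefix` plus a running counter."""
    counter = 0
    while True:
        counter += 1
        digest = hashlib.md5(f"{prefix}{counter}".encode("utf-8")).digest()
        # version=3 sets the version and variant bits on the MD5 digest
        yield str(uuid.UUID(bytes=digest, version=3))


def generate_pods(count):
    """Build `count` pod resources; each expression is evaluated once per pod."""
    ips = ipv4_sequence("10.0.0.1")        # assumed starting address
    uids = uuid_from_string_counter("pod")  # assumed name prefix
    return [
        {"k8s.pod.ip": next(ips), "k8s.pod.uid": next(uids)}
        for _ in range(count)
    ]


pods = generate_pods(250)
```

Because nothing here depends on a clock or a random source, calling generate_pods(250) again returns exactly the same 250 pods, which is the property that makes assertions against known data points possible.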
OpenTelemetry defines anywhere from three to five different types of metrics depending on the release version of the specification. As of the latest release (v0.19.0) there are five: sum, gauge, summary, histogram and exponential histogram. Each caters to a particular use case and must be interpreted correspondingly. The metrics data model specification addresses these in detail.
While the OpenTelemetry metric type can be provided using simple fields in the metric definitions, the most important thing is to specify the value of those metrics and how they change with each packet. We may want to continuously increase or decrease the value, have it hover around a range in a deterministic manner, or let it be completely random. This is also solved by JEL expression methods. Let’s say we have the memory-used metric represented as a gauge type, reported by pod entities, and we want it to hover around a certain range with some deviations while always producing the same values on every run. We can specify it as:
reportingEntities: [ pod ]
The absolute SineSequence expression takes the trigonometric sine of an angle, converts it to its absolute value (so it’s never negative) and then applies the arithmetic specified as the parameter. The angle in this case, measured in radians, is the counter for how many times this expression has been evaluated. The first three values would be generated as follows:
#1 -> absolute( sin(1) ) * 7000 = 5890.3
#2 -> absolute( sin(2) ) * 7000 = 6365.08
#3 -> absolute( sin(3) ) * 7000 = 987.84
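The arithmetic above is easy to reproduce. The following Python sketch recomputes the three values, assuming the parameter amounts to multiplying by 7000 and that the counter starts at 1; the function name absolute_sine_sequence is ours, chosen for illustration, and is not the tool’s implementation.

```python
import math


def absolute_sine_sequence(multiplier, count):
    """Return the first `count` values of |sin(n)| * multiplier for n = 1, 2, ...

    The angle n is in radians and doubles as the evaluation counter.
    """
    return [round(abs(math.sin(n)) * multiplier, 2) for n in range(1, count + 1)]


values = absolute_sine_sequence(7000, 3)
print(values)  # [5890.3, 6365.08, 987.84]
```

Since the only input is the evaluation counter, the series is bounded between 0 and the multiplier, and it is identical on every run.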
Just like the attribute value expression functions, we have quite a few metric value functions to cater to specific requirements. Also, there are fields to specify how frequently each metric is generated, the number of times it’s generated, and an option to add attribute expressions for each metric — the same expressions we have for entities, which are evaluated and can change every time the metric is generated.
Logs and events
OpenTelemetry recently added semantic support for events, which uses the existing log records data model. In short, events are log packets with two attributes added: event.domain and event.name. Just as the focus of a metric data point is on its value, for a log or event the focus is on its severity. However, unlike metrics, logs/events present two new requirements:
- If we have 250 pods, all will emit a metric, but not all will emit events at the same rate.
- There is scope for load testing with logs or events. With metrics, the number of packets generated in a given time window is limited by the frequency and the number of entities emitting the metrics. But any entity can emit an arbitrary number of events.
With this in mind, we can represent a basic log type as follows:
severityOrderFunction: 'severityDistributionCount(["ERROR", "WARN", "DEBUG"], [1, 1, 4])'
The severity function, again a JEL expression, assigns each packet a severity from the first array, repeating each severity as many times as the corresponding count in the second array. So the first packet gets ERROR, the second gets WARN and the next four get DEBUG, after which the cycle repeats. The data generator automatically sets a log4j2-formatted log message in the body of the packet based on the severity selected. The reporting entities count specifies which entities, and how many of them, will report this log. Finally, the copy count helps with the load-testing requirement: it copies the same packet the specified number of times (100 here) for each entity, so 100 packets are generated for each pod and container in every payload.
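The repeating severity cycle can be sketched in a few lines of Python. This is only our reading of the behavior described above, not the tool’s code; the function name severity_distribution_count mirrors the expression for readability, and its signature is an assumption.

```python
from itertools import cycle, islice


def severity_distribution_count(severities, counts):
    """Return an endless iterator over the repeating severity pattern.

    Each severity appears `count` times in a row before the next one starts,
    and the whole pattern restarts once it is exhausted.
    """
    pattern = [
        severity
        for severity, count in zip(severities, counts)
        for _ in range(count)
    ]
    return cycle(pattern)


generator = severity_distribution_count(["ERROR", "WARN", "DEBUG"], [1, 1, 4])
first_eight = list(islice(generator, 8))
print(first_eight)
# ['ERROR', 'WARN', 'DEBUG', 'DEBUG', 'DEBUG', 'DEBUG', 'ERROR', 'WARN']
```

With counts of 1, 1 and 4, one packet in six is an ERROR, so the mix of severities in the generated logs is known in advance.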
As with metrics, there are fields to set the payload frequency, payload counts and attributes for each log. We simply add the required attributes for the log to be processed as an event.
Traces and spans are where simulations can get complex. Why? Because each trace, representing the complete flow of a request, is composed of a tree of spans, each representing the request at a particular place in the infrastructure, such as an HTTP endpoint, database, messaging system, service and so on.
Further, there are quite a few details to consider for each span — such as its duration, whether it was successful, the impact on the complete request flow if one of the spans is in error, and so on. The trace definitions represent this complexity to some degree and have two sections: root spans and child spans. Let’s look at a simple example of representing a GET request:
- name: "getAccount"
childSpans: ["getAccountHTTPRequest", "getAccountProcessing"]
- name: "getAccountHTTPRequest"
- name: "getAccountProcessing"
childSpans: ["checkAccountCache", "getAccountQuery"]
- name: "checkAccountCache"
- name: "getAccountQuery"
The reasons for having a root span section are:
- We know where to start the tree, giving us the ability to put the same child span in multiple trace requests.
- There are configurations defined at the trace level, such as payload count and payload frequency, which are not shown here.
The flow of this request is:
getAccount -> getAccountHTTPRequest -> getAccountProcessing -> checkAccountCache -> getAccountQuery
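This ordering is what a depth-first, parent-before-children walk of the span tree produces. Here is a small Python sketch of that walk, using only the span names and childSpans relationships from the definition above; the CHILD_SPANS dictionary and the span_flow function are hypothetical names for this example and say nothing about how the tool itself builds traces.

```python
# Span name -> ordered list of child span names, taken from the definition above.
CHILD_SPANS = {
    "getAccount": ["getAccountHTTPRequest", "getAccountProcessing"],
    "getAccountHTTPRequest": [],
    "getAccountProcessing": ["checkAccountCache", "getAccountQuery"],
    "checkAccountCache": [],
    "getAccountQuery": [],
}


def span_flow(root, child_spans):
    """Return span names in depth-first order, visiting a parent before its children."""
    flow = [root]
    for child in child_spans.get(root, []):
        flow.extend(span_flow(child, child_spans))
    return flow


flow = span_flow("getAccount", CHILD_SPANS)
print(" -> ".join(flow))
```

Starting the walk from a named root is also what lets the same child span, such as getAccountQuery, be reused under several different root spans.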
Like entities, metrics and logs, spans can have attributes, as shown by the span getAccountHTTPRequest. The error frequency field specified for the getAccountQuery span indicates that every fourth time this span is generated, it will end in an error. Of course, the complexity doesn’t end here. There are further details to consider, such as splitting a trace into different payloads, and how copy counts behave for traces. But for the sake of brevity, we’ll wrap this up here and continue our effort in part two of this blog, which includes a link to detailed documentation at the end.