AppDynamics has been increasing its use of ElasticSearch to implement real-time analytics over a vast number of data points. Among other things, we record each visit made by an end user to every web site monitored by our product. Each such record contains a URL, a server response time, a page render time, browser type, approximate user location, etc. We performed a number of experiments with real and synthetic data to come up with guidelines for reducing disk usage. We concluded that keeping string values as short as possible helps. This is a good idea even if the cardinality of values in the column is low, because low cardinality does not reliably lead to low disk usage.
Long-term trend of storage costs
When measuring ElasticSearch (ES) storage usage, it is important to realize that the short-term trend does not represent the long-term average. In fact, the short-term per-record cost (writes of 1M or fewer records) can be as much as 3x higher than the long-term cost (10M+ records). Our experiments suggest that to get a good approximation of the long-term trend, one should look at writes of at least 10M-20M records.
This happens because every now and then (in our experiments, once every 1M records), ES goes through a major internal storage compaction phase. This has to do with the fact that ES partitions incoming data by time into so-called “segments”. New records and updates are first put into new, small segments. The most recently created segments are gradually merged and rewritten into larger segments, which allows ES to store them more efficiently. More information on segments can be found here and here.
On the graph below you can see the disk storage cost (the blue plot) in an experiment where we wrote 20M+ records, randomly picked from a sample of 100K browser records taken from a real customer's production data. Notice the large dips in storage size about once every 1M writes, and the nearly linear slopes between major dips. These slopes are about 3x steeper than the long-term trend.
The green plot indicates the self-reported ES storage size, available through the ES REST API under the /_stats endpoint. The blue plot indicates the disk cost obtained directly by running du -sk in the ES data directory on the machine where the experiment was performed. The slope of the blue plot, calculated by running a linear regression on the blue data, indicates that we are using around 1157 bytes per data point in a larger database.
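To make the difference between the two slopes concrete, here is a minimal sketch of the regression step. It does not use our measurements: `synthetic_disk_bytes` is a made-up sawtooth that assumes a merge every 1M records, a 3000 bytes/record slope between merges and a 1000 bytes/record long-term cost, chosen only to mimic the shape of the blue plot. The function names are illustrative, not part of any ES tooling.

```python
def least_squares_slope(samples):
    """Return the least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def synthetic_disk_bytes(records_written, merge_every=1_000_000,
                         long_term_cost=1000, short_term_cost=3000):
    """Made-up sawtooth: steep growth between merges, a dip at each merge."""
    since_merge = records_written % merge_every
    merged = records_written - since_merge
    return merged * long_term_cost + since_merge * short_term_cost


# One (records written, bytes on disk) sample every 100K records, up to 20M.
step = 100_000
samples = [(n, synthetic_disk_bytes(n)) for n in range(0, 20_000_000, step)]

short_term = least_squares_slope(samples[:10])  # first 1M records only
long_term = least_squares_slope(samples)        # the whole 20M run

print(f"short-term slope: {short_term:.0f} bytes/record")
print(f"long-term slope:  {long_term:.0f} bytes/record")
```

With real data, `samples` would instead hold pairs of record count and the byte size derived from du -sk (which reports kilobytes) taken at regular intervals during the run.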
Guessing long-term average per-record costs without writing too many records
Writing the 20M+ records needed to establish average long-term storage costs can take hours. Our optimized code for inserting records (in a pool of 25 parallel threads) could write only up to 3-4M records per hour, so obtaining a single data point involving 20M records could take 5-6 hours. To work around this, it is possible to force ES to merge all data in the index into one segment. This can be triggered by hitting a special URL:
See more details here.
For the remaining experiments, we assumed that writing about 100K records and then forcing all segments to merge would be good enough. When we run many similar experiments, this should at least allow us to observe the trends that lead to more efficient disk usage.
We started by performing a number of somewhat narrow experiments, where we manipulated only one experimental parameter at a time, such as the number of properties (columns) per document, and observed the impact of that parameter on the average document size. Unfortunately, that always left us with blind spots. We could not be sure whether a trend depended on the other parameters. For example, is the trend describing average record cost the same for low-cardinality and high-cardinality columns? We felt that we were not moving closer to a general formula that would cover all corner cases.
So we decided to take a different approach. To see the big picture, we used an experiment generator to run hundreds of similar experiments, each trying a random combination of multiple experimental parameters at once, rather than varying just one parameter while the others stay fixed.
More specifically, in each experiment we created a new ES index (database) and wrote 100K records into it. Each record has a number of string properties, controlled by the parameter numProperties (from 1 to 50). Each string is a random sequence of letters. The length of every string is equal to the experiment parameter strLen (from 1 to 500). Lastly, the number of unique values for each property (column) is limited by the parameter cardinality (from 2 to 200). For example, cardinality=2 means that the property value is chosen from one of 2 possibilities. Each property has its own set of possible values, but all sets are of the same size. Moreover, all string properties are defined as not_analyzed in the ElasticSearch schema, since we typically don’t use the natural language search features of ElasticSearch.
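The setup above can be sketched roughly as follows. This is not our actual generator, only a minimal illustration under stated assumptions: the property names (`prop0`, `prop1`, ...), the use of both upper- and lower-case ASCII letters, and the function names are all hypothetical, and the step that writes records to ES is left out entirely.

```python
import random
import string


def random_parameters(rng):
    """Pick one random combination of the three experiment parameters."""
    return {
        "numProperties": rng.randint(1, 50),
        "strLen": rng.randint(1, 500),
        "cardinality": rng.randint(2, 200),
    }


def make_value_pools(params, rng):
    """Build one pool of candidate strings per property (column).

    Random strings can collide when strLen is very small, so cardinality
    is an upper bound on the number of unique values, not a guarantee.
    """
    def random_string():
        return "".join(rng.choice(string.ascii_letters)
                       for _ in range(params["strLen"]))

    return [[random_string() for _ in range(params["cardinality"])]
            for _ in range(params["numProperties"])]


def make_record(pools, rng):
    """Draw one value per property; 'prop<i>' names are hypothetical."""
    return {f"prop{i}": rng.choice(pool) for i, pool in enumerate(pools)}


rng = random.Random(42)  # fixed seed so a run can be repeated
params = random_parameters(rng)
pools = make_value_pools(params, rng)
# 1,000 records keep the sketch fast; the experiments wrote 100K per index.
records = [make_record(pools, rng) for _ in range(1000)]
print(params, len(records))
```

Giving each property its own pool matches the description that every property has its own set of possible values while all sets share the same size.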
Here’s a graph representing the average document size as a function of all 3 parameters. Color indicates cardinality: warmer colors mean higher cardinality, colder colors mean lower cardinality.
We can guess that document size is roughly proportional to the total length of all string values – the product of strLen and numProperties. Let’s test that hypothesis:
Average cost of storing strings, bytes per character
To test our guess about the relation between document size and the choice of parameters, we can plot this ratio:
actual document size / predicted document size
In this case, the predicted document size is strLen*numProperties. In other words, we are looking at how many bytes on average it takes to represent one character. If the ratio is around 1, then our guess was good.
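As a small illustration of the ratio, here is a sketch in Python. The function name is ours, and the 6000-byte document in the usage line is a made-up figure, not one of our measurements.

```python
def bytes_per_character(actual_doc_bytes, str_len, num_properties):
    """Ratio of the measured document size to the naive prediction.

    The naive prediction is strLen * numProperties, i.e. one byte per
    character, so the ratio reads as average bytes per character.
    """
    predicted_doc_bytes = str_len * num_properties
    return actual_doc_bytes / predicted_doc_bytes


# Hypothetical measurement: 6000 bytes per document, strLen=100, numProperties=50.
ratio = bytes_per_character(6000, str_len=100, num_properties=50)
print(f"{ratio:.2f} bytes per character")
```

In practice, `actual_doc_bytes` would be the index size on disk after the forced merge divided by the number of records written.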
Looks like our guess was good, at least for most of the domain. The majority of data points land between ratios of about 1 and 1.5. The approximation turns out to be especially good for string lengths greater than about 100 and more than about 10 properties.
Where does the approximation not work? Firstly, for a small number of properties (about 10-20 or fewer), ES is sometimes able to save a significant amount of space for low-cardinality properties (those with a small number of unique values). See the blue dots below ratio 1 on the bottom right graph. In some cases we need much less than 0.5 bytes per character. Interestingly, the cost per character does not seem to go down with strLen (see the blue dots on the bottom left graph). This is a somewhat unexpected result, since the values of low-cardinality properties can be encoded in constant space (as enum integers). We suspect this result has to do with the fact that we did not disable the _source field for the documents.
Secondly, the per-character cost goes up significantly (even as high as 7 bytes per character) if the strings are short. This can be attributed to some constant per-value cost, independent of the length of the string. If we assume about 8 bytes of constant cost per string value, we get a much better approximation. Here’s the graph when we use
predicted document size = (strLen + 8) * numProperties [bytes]
The approximation that takes the constant cost into account is clearly better than the previous one. The ratio is still centered around 1. The actual cost is no more than about 1.6x the predicted cost, versus almost 8x with the previous approximation.
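To see why a constant per-value cost explains the behavior of short strings, the revised formula can be turned into an implied per-character cost of (strLen + 8) / strLen, or 1 + 8/strLen. The sketch below only evaluates that formula; the function names are ours, and the 8-byte constant is the assumed value from the text, not a documented ES figure.

```python
CONSTANT_COST_BYTES = 8  # assumed fixed overhead per string value


def predicted_doc_bytes(str_len, num_properties,
                        constant_cost=CONSTANT_COST_BYTES):
    """Revised prediction: (strLen + 8) * numProperties, in bytes."""
    return (str_len + constant_cost) * num_properties


def model_bytes_per_character(str_len, constant_cost=CONSTANT_COST_BYTES):
    """Per-character cost implied by the model: 1 + constant_cost / strLen."""
    return (str_len + constant_cost) / str_len


for str_len in (1, 2, 10, 100, 500):
    cost = model_bytes_per_character(str_len)
    print(f"strLen={str_len:>3}: {cost:.2f} bytes per character")
```

Under this model the overhead dominates for very short strings and fades to a few percent once strings reach about 100 characters, which is consistent with where the naive approximation already worked well.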
Based on these experiments, we concluded that the best chance of saving ES storage space lies in making all strings as short as possible, even for properties (columns) with a small number of unique values (low cardinality). We can count on savings, in bytes, roughly proportional to the number of characters eliminated from string values.
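As a back-of-the-envelope illustration of that rule of thumb, the sketch below estimates the savings from abbreviating one property value. The browser strings and the 20M document count are hypothetical, and the estimate assumes roughly one byte saved per character removed, which our measurements support only approximately.

```python
def estimated_savings_bytes(num_docs, old_value, new_value):
    """Rough savings from replacing old_value with new_value in every
    document, assuming about one byte per character eliminated."""
    chars_removed = len(old_value) - len(new_value)
    return num_docs * chars_removed


# Hypothetical example: abbreviating a browser name across 20M documents.
saved = estimated_savings_bytes(20_000_000, "Internet Explorer", "IE")
print(f"about {saved / 1_000_000:.0f} MB saved")
```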