Metricbeat and Elastic Agent ship multiple metrics with the same dimensions in a single document to Elasticsearch (example system memory doc). Having multiple metrics in a single document was the preferred approach for Elasticsearch when Metricbeat was created, and it is also nice for users to see many related metrics in a single document. But having many metrics in a single document only works as long as we are in control of the collection.
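For reference, here is a heavily abbreviated sketch of what such a system memory document looks like (field names quoted from memory; the linked example doc is authoritative). One document carries many related memory metrics that all share the same dimensions (the host):

```json
{
  "event": { "module": "system", "dataset": "system.memory" },
  "system": {
    "memory": {
      "total": 17179869184,
      "free": 266240000,
      "used": { "bytes": 16913629184, "pct": 0.9845 },
      "actual": { "free": 5310251008, "used": { "bytes": 11869618176, "pct": 0.6909 } }
    }
  }
}
```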
Many time series databases treat a single metric with labels as a time series. For example, Prometheus defines it as follows:
> Every time series is uniquely identified by its metric name and optional key-value pairs called labels.
During the collection of Prometheus and OTel metrics, Metricbeat tries to group together all the metrics with the same labels so it does not create one document per metric. In some cases this works well, in others not so much. In our data model, labels are mostly used for metadata; in Prometheus, labels are used as the dimensions of metrics, where we instead use the metric names for that purpose.
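To make the grouping idea concrete, here is a minimal sketch in Go (not the actual Metricbeat implementation; the type and function names are made up for illustration): samples that share the exact same label set are merged into one document, everything else becomes its own document.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one scraped metric: name, labels, and value.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// Doc is the resulting "document": the shared labels plus all metrics
// that carry exactly those labels.
type Doc struct {
	Labels  map[string]string
	Metrics map[string]float64
}

// labelsKey builds a deterministic key from a label set so that samples
// with identical labels end up in the same group.
func labelsKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%q;", k, labels[k])
	}
	return b.String()
}

// groupByLabels merges all samples with the same label set into one Doc.
func groupByLabels(samples []Sample) []Doc {
	byKey := map[string]*Doc{}
	for _, s := range samples {
		key := labelsKey(s.Labels)
		doc, ok := byKey[key]
		if !ok {
			doc = &Doc{Labels: s.Labels, Metrics: map[string]float64{}}
			byKey[key] = doc
		}
		doc.Metrics[s.Name] = s.Value
	}
	docs := make([]Doc, 0, len(byKey))
	for _, d := range byKey {
		docs = append(docs, *d)
	}
	return docs
}

func main() {
	// In the node_scrape_collector_duration_seconds example below, every
	// sample has a different "collector" label value, so every sample ends
	// up as its own document with a single metric in it.
	samples := []Sample{
		{Name: "node_scrape_collector_duration_seconds", Labels: map[string]string{"collector": "boottime"}, Value: 3.007e-05},
		{Name: "node_scrape_collector_duration_seconds", Labels: map[string]string{"collector": "cpu"}, Value: 0.001108799},
	}
	fmt.Println(len(groupByLabels(samples))) // 2 documents, one metric each
}
```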
Looking at the following Prometheus example, `{collector=*}` is a label for the metric `node_scrape_collector_duration_seconds`. Each value of the label is unique to the metric, so our conversion creates one document for each entry.
```
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="boottime"} 3.007e-05
node_scrape_collector_duration_seconds{collector="cpu"} 0.001108799
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001491901
node_scrape_collector_duration_seconds{collector="filesystem"} 0.001846135
node_scrape_collector_duration_seconds{collector="loadavg"} 0.001142997
node_scrape_collector_duration_seconds{collector="meminfo"} 0.000129408
node_scrape_collector_duration_seconds{collector="netdev"} 0.004051806
node_scrape_collector_duration_seconds{collector="os"} 0.00528108
node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.001323725
node_scrape_collector_duration_seconds{collector="textfile"} 0.001216545
node_scrape_collector_duration_seconds{collector="thermal"} 0.00207716
node_scrape_collector_duration_seconds{collector="time"} 0.001271639
node_scrape_collector_duration_seconds{collector="uname"} 0.000994093
```

This would result in documents similar to:
"prometheus": { "labels": { "instance": "localhost:9100", "job": "prometheus", "collector: "boottime" }, "metrics": { "node_scrape_collector_duration_seconds": 3.007e-05 } }In our metrics format it would be more something like:
"metrics": { "labels": { "instance": "localhost:9100", "job": "prometheus", }, "node_scrape_collector_bootime_duration_seconds": 3.007e-05 "node_scrape_collector_cpu_duration_seconds": 0.001108799 ... }As you see above, the label made it into the metric name but not necessarily in a predicable place. This becomes more obvious when there are many labels:
```
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="//ruflin@111.111.111.111/ruflin",fstype="smbfs",mountpoint="/Volumes/.timemachine/111.111.111.111/212-1FFF-51FF-84F2/ruflin"} 3.952268886016e+12
node_filesystem_avail_bytes{device="/dev/disk1s1s1",fstype="apfs",mountpoint="/"} 6.62352285696e+11
node_filesystem_avail_bytes{device="/dev/disk1s2",fstype="apfs",mountpoint="/System/Volumes/Data"} 6.62352285696e+11
node_filesystem_avail_bytes{device="/dev/disk1s3",fstype="apfs",mountpoint="/System/Volumes/Preboot"} 6.64204152832e+11
...
```

How should we group this magically into a single document? Even in our own scenario we send one document per filesystem, but we group together multiple metrics for the same filesystem.
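For comparison, this is roughly the shape of what we produce today for filesystems (field names approximate, not copied from the mapping): the dimensions (device, mount point, filesystem type) identify the document, and all metrics for that filesystem are grouped inside it:

```json
{
  "system": {
    "filesystem": {
      "device_name": "/dev/disk1s1s1",
      "mount_point": "/",
      "type": "apfs",
      "total": 994662584320,
      "available": 662352285696,
      "free": 662352285696,
      "used": { "bytes": 332310298624, "pct": 0.334 }
    }
  }
}
```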
## Push metrics
In more and more scenarios, like OTel and Prometheus remote write, we don't collect the metrics; instead, metrics are pushed to us. In these scenarios we lose control over which metrics we receive and in which order. Some caching would still allow us to group together some of the metrics but, as described above, not all of them. The number of documents increases further.
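A minimal sketch of that caching idea (an assumption, not existing Beats code; `Add` and `Flush` are hypothetical names): pushed samples are buffered per label set and flushed periodically, so samples for the same dimensions that arrive close together still end up in one document. It only helps when the related samples arrive within the flush window.

```go
package main

import (
	"fmt"
	"sync"
)

type pushedSample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

type groupCache struct {
	mu      sync.Mutex
	metrics map[string]map[string]float64 // label key -> metric name -> value
	labels  map[string]map[string]string  // label key -> label set
}

func newGroupCache() *groupCache {
	return &groupCache{
		metrics: map[string]map[string]float64{},
		labels:  map[string]map[string]string{},
	}
}

// Add buffers a pushed sample under its label set. fmt.Sprint works as a
// grouping key here because Go prints map keys in sorted order (Go 1.12+).
func (c *groupCache) Add(s pushedSample) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := fmt.Sprint(s.Labels)
	if _, ok := c.metrics[key]; !ok {
		c.metrics[key] = map[string]float64{}
		c.labels[key] = s.Labels
	}
	c.metrics[key][s.Name] = s.Value
}

// Flush emits one document per buffered label set and resets the cache.
// In a real pipeline this would run on a timer, e.g. every few seconds.
func (c *groupCache) Flush(emit func(labels map[string]string, metrics map[string]float64)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for key, m := range c.metrics {
		emit(c.labels[key], m)
	}
	c.metrics = map[string]map[string]float64{}
	c.labels = map[string]map[string]string{}
}

func main() {
	cache := newGroupCache()
	cache.Add(pushedSample{Name: "node_memory_MemFree_bytes", Labels: map[string]string{"instance": "localhost:9100"}, Value: 2.6624e+08})
	cache.Add(pushedSample{Name: "node_memory_MemTotal_bytes", Labels: map[string]string{"instance": "localhost:9100"}, Value: 1.7179869184e+10})

	cache.Flush(func(labels map[string]string, metrics map[string]float64) {
		fmt.Println(labels, metrics) // one document with two metrics
	})
}
```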
## Too many docs
The number of metric documents keeps increasing, 10x to 100x compared to Metricbeat, even though we combine metrics, and the number of metrics per document keeps decreasing. In some scenarios we have reached one metric per document. But historically, Elasticsearch did not like the overhead of single-metric docs.
I'm creating this issue to bring awareness to the Elasticsearch team around this problem and to discuss potential solutions.