
Conversation

@JonasKunz (Contributor) commented Aug 6, 2025

Adds a new field type and mapper for OpenTelemetry exponential histograms.

The implementation was heavily inspired by (aka partly copy-pasted from) the existing HistogramFieldMapper.
Aggregations are not (yet) supported, but documents with this field type can be stored and loaded. I also did some manual testing with Discover in Kibana and things look fine there.

The new type looks like this:

{ "scale": 12, "zero": { "threshold": 0.123456, "count": 42 }, "positive": { "indices": [-1000000, -10, 25, 26, 99999999], "counts": [1, 2, 3, 4, 5] }, "negative": { "indices": [-1000, -999, 123456], "counts": [10, 20, 11] } } 

Every sub-field except for scale is optional.

The indices array makes this not the most readable format, but this sparse representation is required because a dense representation might not be feasible, e.g. for aggregation results.

However, if desired we could support a dense representation as additional input format:

{ "scale": 12, "zero": { "threshold": 0.123456, "count": 42 }, "positive": { "offset": -12345, "counts": [1, 2, 3, 0, 0, 0, 0, 4, 5] }, "negative": { "offset": 6789, "counts": [10, 0, 20, 0, 0, 0, 11] } } 

This would be exactly the OpenTelemetry representation. However, it adds some complexity:

  • External consumers would need to understand both notations
  • We'd need some heuristic on when to choose which format for synthetic source / query results. E.g. if at least 80% of buckets are empty, use sparse. Otherwise use dense.

I've not implemented this in this PR, as it is not strictly required. However, it would be great if you could give some feedback on whether we want to support this notation or not.
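For illustration only, here is a minimal sketch (not part of the PR; all class and method names are hypothetical) of how the dense OpenTelemetry buckets could be converted to the sparse indices/counts notation, and how the 80%-empty heuristic mentioned above could decide which notation to emit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not PR code: converts OTel-style dense buckets (offset + counts)
// into the sparse (indices, counts) notation and applies the 80%-empty heuristic.
final class BucketNotation {

    record SparseBuckets(long[] indices, long[] counts) {}

    static SparseBuckets denseToSparse(long offset, long[] denseCounts) {
        List<Long> indices = new ArrayList<>();
        List<Long> counts = new ArrayList<>();
        for (int i = 0; i < denseCounts.length; i++) {
            if (denseCounts[i] != 0) {            // drop empty buckets
                indices.add(offset + i);
                counts.add(denseCounts[i]);
            }
        }
        return new SparseBuckets(
            indices.stream().mapToLong(Long::longValue).toArray(),
            counts.stream().mapToLong(Long::longValue).toArray()
        );
    }

    // Example heuristic: prefer the sparse notation if at least 80% of the covered
    // index range consists of empty buckets, otherwise prefer the dense notation.
    static boolean preferSparse(long[] sparseIndices) {
        if (sparseIndices.length == 0) {
            return true;
        }
        long denseLength = sparseIndices[sparseIndices.length - 1] - sparseIndices[0] + 1;
        double emptyRatio = 1.0 - (double) sparseIndices.length / denseLength;
        return emptyRatio >= 0.8;
    }
}
```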

@elasticsearchmachine added the v9.2.0 and external-contributor labels Aug 6, 2025
@JonasKunz marked this pull request as ready for review August 8, 2025 12:14
@JonasKunz requested a review from not-napoleon August 8, 2025 12:14
@elasticsearchmachine added the needs:triage label Aug 8, 2025
@felixbarny added the >non-issue and :StorageEngine/Mapping labels Aug 8, 2025
@elasticsearchmachine added the Team:StorageEngine label and removed the needs:triage label Aug 8, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@JonasKunz removed the request for review from martijnvg August 14, 2025 08:56
Merge commit with conflicts in:
  • benchmarks/src/main/java/org/elasticsearch/benchmark/exponentialhistogram/ExponentialHistogramMergeBench.java
  • libs/exponential-histogram/src/main/java/org/elasticsearch/exponentialhistogram/ZeroBucket.java
  • libs/exponential-histogram/src/test/java/org/elasticsearch/exponentialhistogram/ExponentialHistogramMergerTests.java
@JonasKunz requested a review from kkrik-es August 15, 2025 12:56
@martijnvg (Member) left a comment:

I looked at the encoding and how it gets stored and left a question around that.

```java
CompressedExponentialHistogram.writeHistogramBytes(histogramBytesOutput, scale, negativeBuckets, positiveBuckets);
BytesRef histoBytes = histogramBytesOutput.bytes().toBytesRef();

Field histoField = new BinaryDocValuesField(fullPath(), histoBytes);
```
Member commented:

Using binary doc values today means this would never be compressed (across multiple documents). I suspect the storage cost would be high for exponential histograms? Did we consider alternatives?

Looks like the encoding logic essentially stores a number of IndexWithCount entries. Did we look into encoding each index and count pair using sorted numeric doc values? We could have two sorted numeric doc values fields, one for the indices and one for the counts. Then we could have another sorted numeric doc values field that keeps track of which count belongs to which index. Maybe this compresses better? We would need a similar structure for the negative buckets.
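A rough sketch of this suggestion (not code from the PR; the field names are hypothetical), using one multi-valued numeric doc values field per bucket component:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;

// Sketch of the suggested alternative encoding: one multi-valued numeric doc values
// field for the bucket indices and one for the counts (field names are made up).
final class SortedNumericBucketEncoding {

    static void addPositiveBuckets(Document doc, long[] bucketIndices, long[] bucketCounts) {
        for (int i = 0; i < bucketIndices.length; i++) {
            doc.add(new SortedNumericDocValuesField("histo.positive.index", bucketIndices[i]));
            doc.add(new SortedNumericDocValuesField("histo.positive.count", bucketCounts[i]));
        }
        // Open question raised below: SortedNumericDocValues sorts a document's values,
        // so a further field would be needed to recover which count belongs to which index.
    }
}
```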

@JonasKunz (Contributor, author) replied:

I took inspiration from HistogramFieldMapper, which also stores (value, count) tuples as encoded binary doc values, with the difference that value is a double there. For that reason I assumed that this is the best option for storing tuples.

I don't quite understand how your suggested approach would work.

Let's say we have the following (index, count) tuples to store: [(1, 10), (2, 30), (3, 20)].
The SortedNumericDocValuesField for the indices would then store 1, 2, 3 and the one for the counts would store 10, 20, 30.

How would we store the relation between the two? We'd need a "position" array storing, for each index, the position of its count within the sorted counts, which in this example would be 0, 2, 1. I assumed there is no way of storing this, because SortedNumericDocValues would reorder it as well?
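As a worked version of this example (hypothetical helper, not part of the PR), the position array can be derived by locating each original count within the sorted counts; duplicate counts would need extra care:

```java
import java.util.Arrays;

// Hypothetical helper illustrating the example above: derive the "position" array
// needed to re-associate each bucket index with its count after the counts have been
// sorted by SortedNumericDocValues. Assumes distinct counts for simplicity.
final class PositionArrayExample {

    static int[] positionsAfterSorting(long[] countsInIndexOrder) {
        long[] sorted = countsInIndexOrder.clone();
        Arrays.sort(sorted);                              // what the doc values field stores
        int[] positions = new int[countsInIndexOrder.length];
        for (int i = 0; i < countsInIndexOrder.length; i++) {
            positions[i] = Arrays.binarySearch(sorted, countsInIndexOrder[i]);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Counts in bucket-index order for the tuples [(1, 10), (2, 30), (3, 20)]:
        System.out.println(Arrays.toString(positionsAfterSorting(new long[] { 10, 30, 20 })));
        // prints [0, 2, 1], matching the example in the comment
    }
}
```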

@martijnvg (Member) replied Aug 18, 2025:

Yes, sorted numeric doc values re-orders the numbers at the document level (lowest to highest). We would need an extra doc values field to keep track of the original order. We could maybe store it as bytes in a SortedDocValues field.

Something similar was done to keep track of the original array order for synthetic source when we synthesize from doc values. See FieldArrayContext and how it gets used.

I think this could reduce storage required for exponential histograms, but we would need to test it. The current format is much simpler.

Member replied:

Maybe in order to test it we could do a small experiment: generate some semi-realistic exponential histograms, store them directly into a Lucene index with both formats, and then compare the index sizes?
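A minimal sketch of what such an experiment could look like for the binary-doc-values variant (class and field names are assumptions; the encoded bytes would come from the histogram encoding under discussion). A second run would index the same histograms with the sorted-numeric layout and compare the resulting directory sizes:

```java
import java.nio.file.Files;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Rough sketch of the suggested size experiment for one of the candidate encodings.
final class EncodingSizeExperiment {

    static long indexAndMeasure(Iterable<BytesRef> encodedHistograms) throws Exception {
        try (Directory dir = FSDirectory.open(Files.createTempDirectory("histo-size-test"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            for (BytesRef histoBytes : encodedHistograms) {
                Document doc = new Document();
                doc.add(new BinaryDocValuesField("histogram", histoBytes));
                writer.addDocument(doc);
            }
            writer.forceMerge(1);   // single segment, so sizes are comparable
            writer.commit();
            long totalBytes = 0;
            for (String file : dir.listAll()) {
                totalBytes += dir.fileLength(file);
            }
            return totalBytes;
        }
    }
}
```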

Member replied:

Given that this adds the new field mapper behind a feature flag, we can also explore this in a follow-up. But I'd like to look into this before the feature flag gets removed.

@JonasKunz (Contributor, author) replied:

Here are the experiment results:

  • The custom encoding stored in a single BinaryDocValuesField yields 1.59 bytes per populated bucket
  • Storing the indices and counts each in SortedNumericDocValues with FieldArrayContext gives 3.34 bytes per bucket
  • The custom encoding stored in a single SortedDocValuesField yields 1.2 bytes per bucket, presumably due to LZ4 compression of the terms dictionary

So I suggest we go with the custom encoding, but stored in a SortedDocValuesField for now?
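Concretely, this proposal would amount to keeping the custom byte encoding and only swapping the doc values field type; a sketch, assuming the same histoBytes as in the snippet quoted above:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

// Sketch: the same custom histogram byte encoding as before, but stored in a
// SortedDocValuesField so the terms dictionary can benefit from LZ4 compression.
final class SortedDocValuesVariant {

    static Field histogramField(String fieldName, BytesRef histoBytes) {
        return new SortedDocValuesField(fieldName, histoBytes);
    }
}
```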

@JonasKunz (Contributor, author) replied:

I tried doing this (branch), but it causes a weird test failure, which seems to be fixed by this Lucene change.

So I guess we should do the switch to SortedDocValuesField later, after upgrading to Lucene 10.3?

Member replied:

Using a terms dictionary for this kinda feels wrong. It adds more ingest overhead to create and de-duplicate the dictionary and adds another layer of indirection. But the compression does look compelling. As commented above, I think it would be interesting to try compressing values with (double) delta encoding to see if it brings benefits. Also, maybe there's a way to compress binary doc values with lz4 or zstd somehow, without having to create a terms dictionary?

```java
private int negativeBucketsLength;
private int positiveBucketsLength;

void load(BytesRef data) throws IOException {
```
Contributor commented:

Nit: decode?

Contributor commented:

Should this be a factory method, to avoid mutations?

@JonasKunz (Contributor, author) replied:

This was done intentionally to avoid allocation when iterating over stored histograms. The same pattern is used in HistogramFieldMapper.

If this were a factory method, it would also give a wrong sense of lifetime/safety IINM: the byte array backing the binaryValue() returned from BinaryDocValues is reused internally when advancing to the next document. So if we wanted to avoid mutation side effects here, we'd need to copy the bytes, which we want to avoid for performance reasons.
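A sketch of the reuse pattern being described (the holder type's no-arg constructor and its usage are assumed here; only the Lucene iteration is standard API):

```java
import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSetIterator;

// A single mutable holder is re-populated per document instead of allocating a new
// object, because the BytesRef returned by binaryValue() is only valid until the
// iterator advances to the next document.
final class HistogramDocValuesReader {

    static void readAll(LeafReader reader, String field) throws IOException {
        BinaryDocValues docValues = DocValues.getBinary(reader, field);
        EncodedHistogramData holder = new EncodedHistogramData(); // constructor assumed
        for (int doc = docValues.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docValues.nextDoc()) {
            holder.load(docValues.binaryValue()); // mutates the shared holder in place
            // ... consume `holder` before advancing; its contents are overwritten next iteration
        }
    }
}
```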

JonasKunz and others added 6 commits August 19, 2025 11:47
  • …lasticsearch/xpack/exponentialhistogram/EncodedHistogramData.java (Co-authored-by: Kostas Krikellas <131142368+kkrik-es@users.noreply.github.com>)
  • …lasticsearch/xpack/exponentialhistogram/EncodedHistogramData.java (Co-authored-by: Kostas Krikellas <131142368+kkrik-es@users.noreply.github.com>)
  • …-mapper (merge; conflicts in x-pack/plugin/mapper-exponential-histogram/src/main/java/org/elasticsearch/xpack/exponentialhistogram/EncodedHistogramData.java)
```java
}
long minIndex = buckets.getFirst().index();
out.writeZLong(minIndex);
long prevIndex = minIndex - 1;
```
Contributor commented:

Why do we need - 1 here and below?

@JonasKunz (Contributor, author) replied:

Here it is to avoid special-casing the write of the first bucket. But I can see how this can be confusing. I'll instead write out the count of the first bucket explicitly and start the for loop below at the second list element. That should make this clearer.

Below, I thought it should be self-explanatory: if you have a bucket at index 42 and the next bucket is at index 44, the indexDelta is 44 - 42 = 2. The number of empty buckets in between is 2 - 1 = 1.
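A sketch of the revised write path described in this reply (not the exact PR code; `IndexWithCount` stands in for the type mentioned earlier, and `out` is assumed to be an org.elasticsearch.common.io.stream.StreamOutput, whose writeVInt/writeVLong/writeZLong are the standard varint and zigzag writers):

```java
import java.io.IOException;
import java.util.List;
import org.elasticsearch.common.io.stream.StreamOutput;

// Sketch of the described encoding: the first bucket is written explicitly, every
// further bucket as a delta to its predecessor, so runs of empty buckets only cost
// a small varint.
final class BucketDeltaEncoding {

    record IndexWithCount(long index, long count) {}

    static void writeBuckets(StreamOutput out, List<IndexWithCount> buckets) throws IOException {
        out.writeVInt(buckets.size());
        if (buckets.isEmpty()) {
            return;
        }
        IndexWithCount first = buckets.get(0);
        out.writeZLong(first.index());   // absolute (possibly negative) index of the first bucket
        out.writeVLong(first.count());
        long prevIndex = first.index();
        for (int i = 1; i < buckets.size(); i++) {
            IndexWithCount bucket = buckets.get(i);
            // e.g. previous index 42, current index 44 -> delta 2, i.e. one empty bucket in between
            out.writeVLong(bucket.index() - prevIndex);
            out.writeVLong(bucket.count());
            prevIndex = bucket.index();
        }
    }
}
```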

@kkrik-es (Contributor) left a comment:

Nice.


JonasKunz and others added 2 commits August 20, 2025 10:38
Co-authored-by: Felix Barnsteiner <felixbarny@users.noreply.github.com>
@JonasKunz merged commit 23ee3b6 into elastic:main Aug 20, 2025 (34 checks passed)
@JonasKunz deleted the exp-histo-mapper branch August 20, 2025 10:58
szybia added a commit to szybia/elasticsearch that referenced this pull request Aug 20, 2025

Labels

external-contributor, >non-issue, :StorageEngine/Mapping, Team:StorageEngine, v9.2.0

5 participants