Store high-cardinality keyword fields in binary doc values #138548

jordan-powers · 2025-11-25T00:33:53Z

This PR adds a mapping parameter, doc_values.cardinality, to keyword fields. When the parameter is set to low (the default), keyword fields use sorted set doc values as before. When it is set to high, keyword fields use binary doc values instead.

This is an optimization that removes the overhead of looking up keyword values by ordinal when the keyword field has high cardinality.

elasticsearchmachine · 2025-11-25T00:34:41Z

Hi @jordan-powers, I've created a changelog YAML for you.

parkertimmins · 2025-11-26T17:12:02Z

server/src/main/java/org/elasticsearch/script/SortedBinaryDocValuesStringFieldScript.java

+ if (hasValue) {
+ for (int i = 0; i < sortedBinaryDocValues.docValueCount(); i++) {
+ BytesRef bytesRef = sortedBinaryDocValues.nextValue();
+ emit(bytesRef.utf8ToString());


I found that the conversion from UTF-8 to UTF-16 and then back here is a significant overhead on wildcard/regex queries.

It's worth removing this round trip, though that probably belongs in a follow-up. Here's a hacky approach I made to fix this: parkertimmins@fa13b3b#diff-cf9d201e04fb4fd754a3981f450cda5e68c551392e781a78aaa4ef8ccc48bccd

Another idea is that maybe you could use BinaryDvConfirmedQuery. This operates on BytesRefs directly so probably does not have this round trip. (Though I'm not really sure if it makes sense to use)

I'll look into that, thanks! Although if it requires more than a couple of lines to fix, I think it'll be best left as a follow-up. This PR is getting long enough as-is.
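To make the round trip concrete, here is a minimal sketch (not code from this PR) that contrasts a prefix check that decodes every stored value into a String with one that compares the raw UTF-8 bytes. The class and method names are hypothetical, and a real wildcard/regex query would run an automaton over the bytes rather than a simple prefix comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; none of these names come from Elasticsearch or Lucene.
final class Utf8PrefixMatch {

    // Decodes the stored UTF-8 bytes into a UTF-16 String for every value before matching.
    static boolean startsWithViaString(byte[] utf8Value, String prefix) {
        String decoded = new String(utf8Value, StandardCharsets.UTF_8);
        return decoded.startsWith(prefix);
    }

    // Compares raw bytes; the prefix is encoded once by the caller and no value is decoded.
    static boolean startsWithOnBytes(byte[] utf8Value, byte[] utf8Prefix) {
        if (utf8Value.length < utf8Prefix.length) {
            return false;
        }
        return Arrays.equals(utf8Value, 0, utf8Prefix.length, utf8Prefix, 0, utf8Prefix.length);
    }

    public static void main(String[] args) {
        byte[] value = "caf\u00e9-menu".getBytes(StandardCharsets.UTF_8);
        byte[] prefix = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithViaString(value, "caf\u00e9"));
        System.out.println(startsWithOnBytes(value, prefix));
    }
}
```

Both checks agree for valid UTF-8 input, but only the first one allocates a String per value.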

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

jordan-powers · 2025-12-03T16:29:24Z

The reason for the separate <name>._dv field is to work around the Lucene limitation that binary doc values only support a single value per document. To get around this, we use a custom binary format to combine all the values in the document into a single binary blob that we store in the binary doc values. However, we don't want to index this combined blob; we want to index the individual values.

That requires two distinct FieldTypes. The first (named <field_name>) has docValuesType=NONE and indexOptions=DOCS, and is used to index the individual values. The second (named <field_name>._dv) has docValuesType=BINARY and indexOptions=NONE, and is used to store the combined blob.
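As an illustration of combining several values into one blob, here is a minimal sketch. The layout is an assumption made for the example (a value count followed by length-prefixed values, each number written as a variable-length int); the thread does not spell out the exact format, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical encoder; the layout below is assumed, not taken from the PR.
final class MultiValueBlobEncoder {

    // Writes a non-negative int with 7 payload bits per byte, low bits first.
    // The high bit of each byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Assumed layout: [count][len1][bytes1][len2][bytes2]...
    static byte[] encode(List<String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, values.size());
        for (String value : values) {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            writeVInt(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] blob = encode(List.of("a", "bc"));
        System.out.println(blob.length);
    }
}
```

The whole blob is what would go into the single binary doc value for the document, while the individual strings are what would be indexed.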

jordan-powers · 2025-12-08T21:30:24Z

Switching from >feature to >non-issue because the new parameter is hidden behind a feature flag.

Kubik42 · 2025-12-08T21:59:47Z

server/src/main/java/org/elasticsearch/index/fielddata/MultiValuedSortedBinaryDocValues.java

+ * Wrapper around {@link BinaryDocValues} to decode the typical multivalued encoding used by
+ * {@link org.elasticsearch.index.mapper.BinaryFieldMapper.CustomBinaryDocValuesField}.
+ */
+public class MultiValuedSortedBinaryDocValues extends SortedBinaryDocValues {


Are these actually sorted? Extending SortedBinaryDocValues is somewhat misleading if they're not, but I also get it: I extended it too, for the convenient methods it provides. If these are not sorted, let's change the name of the class?

I think that if natural sort ordering in MultiValuedBinaryDocValuesField is used, then the values within each document are sorted?

Yeah, that's true. But why do we need them in natural sort order? Wouldn't that mix up the order of elements in the synthesized document?

I think it is consistent with how sorted set doc values work?

Right, I figured that since we're removing duplicates anyway, we should sort so that we're consistent with SortedSet doc values. This way the synthetic source is consistent between the two implementations. Also, when we add offset tracking for synthetic_source_keep: arrays for binary doc values, we can reuse much of the same logic as offset tracking for sorted set doc values.
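A small sketch of what "consistent with sorted set doc values" means for a single document's values: duplicates are dropped and the remaining values come back in unsigned byte order, regardless of the order in which they were supplied. The helper name is hypothetical and this is not the PR's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper showing sorted-set semantics for one document's values.
final class SortedSetSemantics {

    static List<String> normalize(List<String> values) {
        // Order by unsigned byte comparison of the UTF-8 encoding.
        Comparator<byte[]> byteOrder = Arrays::compareUnsigned;
        TreeSet<byte[]> sorted = new TreeSet<>(byteOrder);
        for (String value : values) {
            // The set silently drops values that compare equal to an existing entry.
            sorted.add(value.getBytes(StandardCharsets.UTF_8));
        }
        List<String> result = new ArrayList<>();
        for (byte[] utf8 : sorted) {
            result.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("b", "a", "b")));
    }
}
```

With this behavior, a document indexed with ["b", "a", "b"] would be synthesized as ["a", "b"], which is the trade-off raised in the question above.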

Kubik42 · 2025-12-08T22:01:49Z

server/src/main/java/org/elasticsearch/index/fielddata/MultiValuedSortedBinaryDocValues.java

+ scratch.length = in.readVInt();
+ scratch.offset = in.getPosition();
+ in.setPosition(scratch.offset + scratch.length);
+ return scratch;


Can't you just call in.readBytesRef()? The input stream should already know how to read the next value.

I don't think that is possible? Given that this method assumes a zero offset? Also, then we can't reuse a scratch instance.

readBytesRef() also uses a zero offset internally:

    public BytesRef readBytesRef() throws IOException {
        int length = readArraySize();
        return readBytesRef(length);
    }

    public BytesRef readBytesRef(int length) throws IOException {
        if (length == 0) {
            return new BytesRef();
        }
        byte[] bytes = new byte[length];
        readBytes(bytes, 0, length);
        return new BytesRef(bytes, 0, length);
    }

But also, why do we need to preallocate a scratch object when we're returning one value at a time? Doesn't that consume unnecessary space? If we're returning one value at a time, readBytesRef() should be able to handle that for us: it assumes each value is prefixed by its length, so it reads the length as a VInt and then calls readBytes.

In any case, it's hard to tell with these byte arrays without actually seeing what they look like.

I just pulled this class out from an anonymous implementation in AbstractBinaryDVLeafFieldData. Unless there's a compelling reason, I'm reluctant to modify it too much since it's been running in production for a while so we know it works as-is.

I think the idea with the scratch object is that the same instance is returned for each call to nextValue(), just with the offset/length modified. I think this is an optimization to reduce the number of allocations/garbage collections.
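The scratch-reuse idea can be sketched as follows. This is a simplified stand-in, not the class from the PR: Slice is a hypothetical substitute for Lucene's BytesRef, and the blob layout (a count followed by length-prefixed values, all lengths as variable-length ints) is assumed for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical decoder that returns one reusable slice instead of copying each value.
final class ScratchDecoder {

    // Minimal stand-in for a byte slice; similar in spirit to Lucene's BytesRef.
    static final class Slice {
        byte[] bytes;
        int offset;
        int length;

        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    private final byte[] blob;
    private final Slice scratch = new Slice();
    private final int count;
    private int position;

    ScratchDecoder(byte[] blob) {
        this.blob = blob;
        this.scratch.bytes = blob; // the slice always points into the original blob
        this.count = readVInt();   // assumed layout: the value count comes first
    }

    int valueCount() {
        return count;
    }

    // Reads a variable-length int: 7 payload bits per byte, high bit means "more bytes".
    private int readVInt() {
        int result = 0;
        int shift = 0;
        byte b;
        do {
            b = blob[position++];
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    // Returns the same Slice instance on every call; only offset and length change.
    Slice nextValue() {
        scratch.length = readVInt();
        scratch.offset = position;
        position += scratch.length;
        return scratch;
    }

    public static void main(String[] args) {
        ScratchDecoder decoder = new ScratchDecoder(new byte[] {2, 1, 'a', 2, 'b', 'c'});
        for (int i = 0; i < decoder.valueCount(); i++) {
            System.out.println(decoder.nextValue().utf8ToString());
        }
    }
}
```

No byte array or slice is allocated per value, at the cost that a caller who wants to keep a value must copy it before calling nextValue() again.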

Kubik42 · 2025-12-08T22:04:48Z

...r/src/main/java/org/elasticsearch/index/mapper/BinaryDocValuesSyntheticFieldLoaderLayer.java

+
+import java.io.IOException;
+
+public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer {


Let's use my implementation from here. It includes nice comments explaining how binary doc values are encoded. If this implementation and mine don't quite align, let's at least copy over the comments.

I think functionally both implementations align? Only this implementation relies on SortedBinaryDocValues, whereas yours doesn't. Could the implementation in the mentioned PR rely on SortedBinaryDocValues?

Yeah, I can change that.

One of my goals on this PR was to reuse MultiValuedSortedBinaryDocValues everywhere so that I'm not reimplementing the decoding logic.

Kubik42 · 2025-12-08T22:10:53Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+ setValue(Values.DISABLED);
+ }
+ } else if (value instanceof String) {
+ if (value.equals("true")) {


Should we use equalsIgnoreCase() here?

IMO, it's also a good idea to flip the check around in case value is null.

I don't think we should ignore case here; that would make parsing lenient. But flipping the check is a good idea.
Or maybe we can use XContentMapValues.nodeBooleanValue(...), which is consistent with how the doc_values mapping attribute gets parsed today?

Oh, I didn't know about XContentMapValues.nodeBooleanValue(...). I refactored this in c77b08d to use it.
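For illustration, a strict, null-safe parse along the lines discussed could look like the sketch below. It is a hypothetical helper written for this example and does not reproduce XContentMapValues.nodeBooleanValue(...) or the refactored DocValuesParameter#parse(); the error message wording is also made up.

```java
// Hypothetical helper; not the Elasticsearch implementation.
final class StrictBooleanParse {

    static boolean parse(Object value) {
        if (value instanceof Boolean flag) {
            return flag;
        }
        // Putting the constant first makes the check null-safe:
        // "true".equals(null) is simply false instead of throwing a NullPointerException.
        if ("true".equals(value)) {
            return true;
        }
        if ("false".equals(value)) {
            return false;
        }
        // No equalsIgnoreCase: "TRUE" or "yes" is rejected rather than silently accepted.
        throw new IllegalArgumentException("cannot parse [" + value + "] as a boolean; expected [true] or [false]");
    }

    public static void main(String[] args) {
        System.out.println(parse("true"));
        System.out.println(parse(Boolean.FALSE));
    }
}
```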

Kubik42 · 2025-12-08T22:13:46Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+ if (value.enabled == false) {
+ builder.field(name, false);
+ } else if (value.equals(getDefaultValue())) {
+ builder.field(name, true);


Should we pass in getDefaultValue() instead of true? The default might change in the future.

Good point. Fixed in 0b2dee5

Kubik42 · 2025-12-08T22:27:11Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

 assertScriptDocValues(mapper, "foo", equalTo(List.of("foo")));
 }

+ public void testDocValuesLowCardinality() throws IOException {


Let's add another test case where cardinality is an invalid string.

martijnvg

Thanks Jordan! I left a few comments, but this looks good.

server/src/main/java/org/elasticsearch/index/fielddata/plain/BytesBinaryIndexFieldData.java

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

martijnvg · 2025-12-09T09:10:07Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+ @SuppressWarnings("unchecked")
+ Map<String, Object> valueMap = (Map<String, Object>) value;


I think this can be removed, and instead this is sufficient at line 1462:

} else if (value instanceof Map<?, ?> valueMap) {

?
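The suggestion relies on pattern matching for instanceof (available since Java 16). A minimal sketch of the idea, using a hypothetical describe method and a "cardinality" key chosen only for the example:

```java
import java.util.Map;

// Hypothetical illustration of binding a wildcard-typed Map in the instanceof check.
final class PatternMatchDemo {

    static String describe(Object value) {
        if (value instanceof Boolean enabled) {
            return "boolean:" + enabled;
        } else if (value instanceof Map<?, ?> valueMap) {
            // valueMap is bound as Map<?, ?>, so there is no unchecked cast to
            // Map<String, Object> and no need for @SuppressWarnings("unchecked").
            Object cardinality = valueMap.get("cardinality");
            return "map:" + cardinality;
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        System.out.println(describe(Map.of("cardinality", "high")));
        System.out.println(describe(true));
    }
}
```

Reading entries from a Map<?, ?> yields Object, which is enough when each value is validated individually afterwards.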

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

martijnvg · 2025-12-09T10:39:51Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+ }
+
+ public String binaryDocValuesName() {
+ return name() + "._dv";


I was chatting with @romseygeek about using a different field name just for binary doc values.

We should be able to reuse the same field name. Lucene should be able to handle that, and it will merge the field types of MultiValuedBinaryDocValuesField and KeywordField. I quickly checked, and the YAML tests in your PR still passed after removing the ._dv suffix here.

If it does work out, then this binaryDocValuesName() can be removed and name() can be used instead.

Oh, I didn't know that would work. I just assumed Lucene would complain about the conflicting field infos.

I've done as you suggest in 70c1d4b.

martijnvg

Thanks Jordan! LGTM.

jordan-powers · 2025-12-11T20:51:30Z

Remaining failing release tests are due to unrelated issue #139401.
Bypassing and merging.

jordan-powers added 5 commits November 23, 2025 17:21

Add doc_values.cardinality paramter to keyword fields

7f1eda4

Add new cardinality parameter to randomized testing

43c980e

Store high-cardinality keyword fields in binary doc values

64b78f3

Fix queries

5d5c63a

Fix blockloaders

36ffba4

jordan-powers self-assigned this Nov 25, 2025

jordan-powers added >feature :StorageEngine/Mapping The storage related side of mappings labels Nov 25, 2025

elasticsearchmachine added the v9.3.0 label Nov 25, 2025

jordan-powers and others added 7 commits November 24, 2025 16:34

Update docs/changelog/138548.yaml

0db280f

Fix parsing of doc values parameter

10f07cd

Consolidate BinaryDVLeafFieldData

9f60db7

Gate binary doc values search tests behind node feature

c5123a0

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

410e77f

…ality-param-2

Add javadoc

6c3b14b

Fix compile error

c1965a1

parkertimmins reviewed Nov 26, 2025

jordan-powers added 3 commits November 26, 2025 12:02

Fix BinaryDocValuesSyntheticFieldLoaderLayer#valueCount

1111dce

Fix ClassCastException in blockloader tests

05a2dce

Fix multivalued fields

91ea2ad

martijnvg reviewed Dec 1, 2025

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

martijnvg reviewed Dec 1, 2025

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

31edd0d

…ality-param-2

jordan-powers added 4 commits December 3, 2025 14:11

Fix syntheticSourceSupport logic

8a706b2

Fix BinaryDocValuesSyntheticFieldLoaderLayer#valueCount

70e2306

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

bab9834

…ality-param-2

Disable doc values skippers when doc values are disabled

7eeb977

jordan-powers added the test-release Trigger CI checks against release build label Dec 6, 2025

jordan-powers added the >non-issue label Dec 8, 2025

Delete docs/changelog/138548.yaml

5efd9ae

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

23ca35f

…ality-param-2

Kubik42 reviewed Dec 8, 2025

martijnvg reviewed Dec 9, 2025

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

martijnvg reviewed Dec 9, 2025

jordan-powers added 10 commits December 9, 2025 09:07

Reset count in MultiValuedSortedBinaryDocValues when no value

cb7b36c

Refactor DocValuesParameter#parse()

c77b08d

Only serialize to boolean when FF is disabled

0b2dee5

Use same field name for binary doc values and indexed values

70c1d4b

Disable supportsBlockLoaderConfig for binary doc values

af6dcb3

Update comment explaining skipping indexing the keyword field

ae4eed6

Use KeywordFieldType#storedInBinaryDocValues() instead of DocValuesPa…

60ddb16

…rameters#cardinality()

Add test for invalid cardinality parameter value

b5b8ad9

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

8dff50f

…ality-param-2

Add TODO for sorts on binary fields

3e21b68

martijnvg approved these changes Dec 10, 2025

jordan-powers added 8 commits December 10, 2025 08:52

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

398e8a3

…ality-param-2

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

bb92694

…ality-param-2

Respect mapper-level synthetic_source_keep

9c03edd

Disable delegating to keyword multi-field when doc values are binary

387af57

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

788b0ae

…ality-param-2

Revert "Disable delegating to keyword multi-field when doc values are…

cd4366a

… binary" This reverts commit 387af57.

Remove invalid assertion

7c68e9f

Merge remote-tracking branch 'upstream/main' into keyword-high-cardin…

8b1df50

…ality-param-2

jordan-powers merged commit 7cabe1a into elastic:main Dec 11, 2025
33 of 36 checks passed

jordan-powers deleted the keyword-high-cardinality-param-2 branch December 13, 2025 00:15

