- Notifications
You must be signed in to change notification settings - Fork 25.7k
Store high-cardinality keyword fields in binary doc values #138548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store high-cardinality keyword fields in binary doc values #138548
Conversation
| Hi @jordan-powers, I've created a changelog YAML for you. |
| if (hasValue) { | ||
| for (int i = 0; i < sortedBinaryDocValues.docValueCount(); i++) { | ||
| BytesRef bytesRef = sortedBinaryDocValues.nextValue(); | ||
| emit(bytesRef.utf8ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that the conversion from ut8 to utf16 and then back here is a significant overhead on wildcard/regex queries.
It worth removing this roundtrip, though probably belongs in a follow-up. Here's a hacky approach I made to fix this: parkertimmins@fa13b3b#diff-cf9d201e04fb4fd754a3981f450cda5e68c551392e781a78aaa4ef8ccc48bccd
Another idea is that maybe you could use BinaryDvConfirmedQuery. This operates on BytesRefs directly so probably does not have this round trip. (Though I'm not really sure if it makes sense to use)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into that, thanks! Although if it requires more than a couple of lines to fix, I think it'll be best left as a follow-up. This PR is getting long enough as-is
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Show resolved Hide resolved
| The reason for the separate To solve this, we need two distinct FieldTypes. The first (named |
| Switching from >feature to >non-issue because the new parameter is hidden behind a feature flag. |
| * Wrapper around {@link BinaryDocValues} to decode the typical multivalued encoding used by | ||
| * {@link org.elasticsearch.index.mapper.BinaryFieldMapper.CustomBinaryDocValuesField}. | ||
| */ | ||
| public class MultiValuedSortedBinaryDocValues extends SortedBinaryDocValues { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these actually sorted? Extending SortedBinaryDocValues is somewhat misleading if they're not, but also I get it - I also extended it due to the convenient methods it provides. If these are not sorted, lets change the name of the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if natural sorting ordering in MultiValuedBinaryDocValuesField is used then each per document value is sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah thats true. But, why do we need them in natural sort order? Wouldn't that mix up the order of elements in the synthesized document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is consistent with how sorted set doc values work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I figured that since we're removing duplicates anyway, we should sort so that we're consistent with SortedSet doc values. This way the synthetic_source is consistent between the two implementations, and also when we add offset tracking for synthetic_source_keep: arrays for binary doc values, we can reuse much of the same logic as offset tracking for sorted set doc values.
| scratch.length = in.readVInt(); | ||
| scratch.offset = in.getPosition(); | ||
| in.setPosition(scratch.offset + scratch.length); | ||
| return scratch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just call in.readBytesRef()? The input stream should already know how to read the next value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think is possible? Given that this method assumes a zero offset? Also then we can't reuse a scratch instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readBytesRef() also uses a zero offset internally:
public BytesRef readBytesRef() throws IOException { int length = readArraySize(); return readBytesRef(length); } public BytesRef readBytesRef(int length) throws IOException { if (length == 0) { return new BytesRef(); } byte[] bytes = new byte[length]; readBytes(bytes, 0, length); return new BytesRef(bytes, 0, length); } There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also, why do we need to preallocate a scratch object, when we're returning one value at a time? Doesn't that consume unnecessary space? If we're returning one value at a time, readBytesRef() should be able to handle that for us - it assumes each value is delimited by its length, so it'll call VInt followed by readBytes.
In any case, its hard to tell with these byte arrays without actually seeing what they look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pulled this class out from an anonymous implementation in AbstractBinaryDVLeafFieldData. Unless there's a compelling reason, I'm reluctant to modify it too much since it's been running in production for a while so we know it works as-is.
I think the idea with the scratch object is that the same instance is returned for each call to nextValue(), just with the offset/length modified. I think this is an optimization to reduce the number of allocations/garbage collections.
| | ||
| import java.io.IOException; | ||
| | ||
| public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets use my implementation from here. It includes nice comments explaining how binary doc values are encoded. If this implementation and mine don't quite align, lets at least copy over the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think functionally both implementations allign? Only this implementaton relies on SortedBinaryDocValues whereas in your implementation that isn't the case. Could the implementation in the mentioned PR rely on SortedBinaryDocValues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I can change that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of my goals on this PR was to reuse MultiValuedSortedBinaryDocValues everywhere so that I'm not reimplementing the decoding logic.
| setValue(Values.DISABLED); | ||
| } | ||
| } else if (value instanceof String) { | ||
| if (value.equals("true")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use equalsIgnoreCase() here?
Imo, its also a good idea to flip the check around in case value is null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ignore case here. It will allow for leniency. But flipping the check is a good idea.
Or maybe we can use XContentMapValues.nodeBooleanValue(...) which is consistent with how doc_values mapping attribute get parsed today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't know about XContentMapValues.nodeBooleanValue(...). I refactored this in c77b08d to use it.
| if (value.enabled == false) { | ||
| builder.field(name, false); | ||
| } else if (value.equals(getDefaultValue())) { | ||
| builder.field(name, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we pass in getDefaultValue() instead of true? Suppose the default changes in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed in 0b2dee5
| assertScriptDocValues(mapper, "foo", equalTo(List.of("foo"))); | ||
| } | ||
| | ||
| public void testDocValuesLowCardinality() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add another test case where cardinality is an invalid string
martijnvg left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jordan! I left a few comments, but this looks good.
server/src/main/java/org/elasticsearch/index/fielddata/plain/BytesBinaryIndexFieldData.java Show resolved Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Show resolved Hide resolved
| @SuppressWarnings("unchecked") | ||
| Map<String, Object> valueMap = (Map<String, Object>) value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think can be removed and instead this is sufficient at line 1462:
} else if (value instanceof Map<?, ?> valueMap) { ?
| setValue(Values.DISABLED); | ||
| } | ||
| } else if (value instanceof String) { | ||
| if (value.equals("true")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ignore case here. It will allow for leniency. But flipping the check is a good idea.
Or maybe we can use XContentMapValues.nodeBooleanValue(...) which is consistent with how doc_values mapping attribute get parsed today?
| * Wrapper around {@link BinaryDocValues} to decode the typical multivalued encoding used by | ||
| * {@link org.elasticsearch.index.mapper.BinaryFieldMapper.CustomBinaryDocValuesField}. | ||
| */ | ||
| public class MultiValuedSortedBinaryDocValues extends SortedBinaryDocValues { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if natural sorting ordering in MultiValuedBinaryDocValuesField is used then each per document value is sorted?
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Show resolved Hide resolved
| scratch.length = in.readVInt(); | ||
| scratch.offset = in.getPosition(); | ||
| in.setPosition(scratch.offset + scratch.length); | ||
| return scratch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think is possible? Given that this method assumes a zero offset? Also then we can't reuse a scratch instance.
| | ||
| import java.io.IOException; | ||
| | ||
| public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think functionally both implementations allign? Only this implementaton relies on SortedBinaryDocValues whereas in your implementation that isn't the case. Could the implementation in the mentioned PR rely on SortedBinaryDocValues?
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java Show resolved Hide resolved
| } | ||
| | ||
| public String binaryDocValuesName() { | ||
| return name() + "._dv"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting with @romseygeek about using a different field name just for binary doc values.
We should be able to reuse the same field name. Lucene should be able to handle that and it will merge the field types of MultiValuedBinaryDocValuesField and KeywordField. I quickly checked and the yaml tests in your PR did pass by removing the ._dv suffix here.
If it does work out, then this binaryDocValuesName() can be removed and name() can be used instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't know that would work. I just assumed lucene would complain about the conflicting field infos.
I've done as you suggest in 70c1d4b.
…rameters#cardinality()
martijnvg left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jordan! LGTM.
… binary" This reverts commit 387af57.
| Remaining failing release tests are due to unrelated issue #139401. |
This PR adds a mapping parameter to keyword fields
doc_values.cardinality. When this parameter is set tolow(the default), keyword fields will use sorted set doc values as normal. However, when this parameter is set tohigh, keyword fields will instead use binary doc values.This is an optimization to remove the overhead of looking up keyword values by ordinal when the keyword field has high-cardinality.