
Conversation


@martijnvg martijnvg commented Sep 30, 2024

The keyword doc values field gets an extra sorted doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. It is stored as an offset-to-ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, take the following string array for a keyword field: ["c", "b", "a", "c"].
The sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be [2, 1, 0, 2].

Null values are also supported. For example, ["c", "b", null, "c"] results in sorted set doc values ["b", "c"] with ordinals 0 and 1. The offset array will be [1, 0, -1, 1].

Empty arrays are also supported by encoding a zigzag vint array of zero elements.
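The offset-to-ordinal mapping described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Elasticsearch implementation; the class and method names are hypothetical. It derives the offset array (with -1 for null) and shows the zigzag transform that keeps -1 small when vint encoded:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the encoding idea from the PR description.
public class OffsetSketch {

    // Map each original value to the ordinal of its position in the
    // sorted, de-duplicated value set; nulls are encoded as -1.
    static int[] toOffsets(String[] values) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String v : values) {
            if (v != null) {
                sorted.add(v);
            }
        }
        List<String> ordinals = new ArrayList<>(sorted);
        int[] offsets = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = values[i] == null ? -1 : ordinals.indexOf(values[i]);
        }
        return offsets;
    }

    // Standard zigzag transform: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    // so small negative offsets stay small when vint encoded.
    static int zigzag(int v) {
        return (v << 1) ^ (v >> 31);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toOffsets(new String[] {"c", "b", "a", "c"})));
        // [2, 1, 0, 2]
        System.out.println(Arrays.toString(toOffsets(new String[] {"c", "b", null, "c"})));
        // [1, 0, -1, 1]
        System.out.println(zigzag(-1));
        // 1
    }
}
```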

Limitations:

  • currently, arrays are only supported via doc values, and only for the keyword field mapper.
  • multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
  • arrays are always synthesized as one type. In the case of a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.

With this PR, keyword field arrays are no longer stored in ignored source; instead, array offsets are tracked in an adjacent sorted doc values field. This only applies if index.mapping.synthetic_source_keep is set to arrays (the default for logsdb).

The keyword doc values field gets an extra binary doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. It is stored as an offset-to-ordinal array that gets vint encoded into the binary doc values field. The additional storage required for this will likely be minimized with elastic#112416 (zstd compression for binary doc values).

For example, take the following string array for a keyword field: ["c", "b", "a", "c"]. The sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be [2, 1, 0, 2].

Limitations:

  • only support for the keyword field mapper.
  • multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
  • empty arrays ([]) are not recorded
  • arrays are always synthesized as one type. In the case of a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.
@martijnvg martijnvg added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Sep 30, 2024

public int getArrayValueCount(String field) {
    if (numValuesByField.containsKey(field)) {
        return numValuesByField.get(field) + 1;
Contributor

Why +1?

Member Author

numValuesByField returns the last offset; in order to get the count, that returned value needs to be incremented by one.

numValuesByField should be named something else.

}

public void recordOffset(String fieldName, String value) {
    int count = numValuesByField.compute(fieldName, (s, integer) -> integer == null ? 0 : ++integer);
Contributor

Related to comment above, maybe 1 here instead of 0?

Member Author

count is the wrong name here. It is more like the next offset to be used.
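The relationship discussed in this thread (the map holds the last offset handed out per field, so the value count is that value plus one) can be sketched like this. This is a simplified, hypothetical stand-in for the actual class, reusing the names from the snippet above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of the offset-tracking discussed in the review.
public class OffsetTracker {
    // Per field, the last offset handed out (0-based), not the count.
    private final Map<String, Integer> numValuesByField = new HashMap<>();

    // Returns the next offset for this field: 0 for the first value, then 1, 2, ...
    public int recordOffset(String fieldName) {
        return numValuesByField.compute(fieldName, (k, last) -> last == null ? 0 : last + 1);
    }

    // The stored value is the last offset, so the count is last + 1.
    public int getArrayValueCount(String field) {
        Integer last = numValuesByField.get(field);
        return last == null ? 0 : last + 1;
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        for (String v : new String[] {"c", "b", "a", "c"}) {
            System.out.println(v + " -> offset " + tracker.recordOffset("field"));
        }
        System.out.println("count = " + tracker.getArrayValueCount("field"));
        // count = 4
    }
}
```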

ord++;
}

logger.info("values=" + values);
Contributor

maybe debug?

ords[i] = dv.nextOrd();
}

logger.info("ords=" + Arrays.toString(ords));
Contributor

I guess you are using info just for debugging while in draft.

Member Author

Yes, this was for debugging purposes. This will be removed.

Contributor

@lkts lkts left a comment

This is a very powerful idea. Some thoughts:

  • When we generalize this, we need to think about how to fit it into existing code nicely (leave poor DocumentParser alone). I was thinking lately that maybe DocumentParser could produce events like "parsing array", "parsing object", "parsing value", and then we could subscribe to such events and do our thing. I didn't dive too deep into this though.
  • I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".
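The second suggestion could look roughly like the following. This is purely hypothetical (flag names, bit layout, and the class are invented here just to illustrate the idea of a leading info byte carrying metadata such as "empty array"):

```java
// Hypothetical sketch of a leading metadata byte for the offsets encoding.
public class ArrayMeta {
    static final int EMPTY_ARRAY = 1;         // bit 0: the array had zero elements
    static final int SINGLE_ELEMENT = 1 << 1; // bit 1: the array had exactly one element

    // Build the info byte for an array with the given element count.
    static byte header(int elementCount) {
        int flags = 0;
        if (elementCount == 0) {
            flags |= EMPTY_ARRAY;
        }
        if (elementCount == 1) {
            flags |= SINGLE_ELEMENT;
        }
        return (byte) flags;
    }

    public static void main(String[] args) {
        System.out.println(header(0)); // 1 (EMPTY_ARRAY set)
        System.out.println(header(1)); // 2 (SINGLE_ELEMENT set)
        System.out.println(header(4)); // 0 (no flags)
    }
}
```

With a flag byte like this, an empty array would need no offset payload at all, and a single-element array could skip the per-value offsets.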
}
}

public void processOffsets(DocumentParserContext context) throws IOException {
Contributor

Should this be called from postParse? It is possible that a field is indexed multiple times in one document with object arrays.

Contributor

I am surprised randomized tests don't complain about this.

Member Author

👍 I think postParse is a better place to invoke this process offsets logic.

@martijnvg
Member Author

martijnvg commented Oct 4, 2024

Thanks for taking a look @lkts!

I was thinking lately that maybe DocumentParser could produce events like "parsing array", "parsing object", "parsing value", and then we could subscribe to such events and do our thing. I didn't dive too deep into this though.

I think we already have something like this via the FieldMapper#parsesArrayValue() flag. I initially tried using this, because I don't like introducing more complexity in DocumentParser. However, it didn't work out in all cases. I recall that tests using copy_to failed. For some reason, fields that override that method and return true are not taken into account with copy_to. Maybe we should first make field mappers that choose to override FieldMapper#parsesArrayValue() work correctly with copy_to.

I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

Good point, I think we can have an info byte where this kind of information can be encoded. I was thinking of something similar earlier, but left it out of this draft PR in order to keep it simple for demonstration purposes.

@martijnvg
Member Author

@salvatore-campagna and @lkts I made a few changes, and the keyword field mapper now overrides parsesArrayValue(), so that it can parse arrays. This allows minimizing the changes in DocumentParser.

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
… source Backporting elastic#113757 to 8.x branch.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine pushed a commit that referenced this pull request Feb 20, 2025
… source (#122997) Backporting #113757 to 8.x branch.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit that referenced this pull request Feb 25, 2025
…22999) Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 25, 2025
Backporting elastic#122999 to 8.x branch. Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine pushed a commit that referenced this pull request Feb 25, 2025
…ce (#123405) * [8.x] Store arrays offsets for ip fields natively with synthetic source Backporting #122999 to 8.x branch. Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source. * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
jordan-powers added a commit that referenced this pull request Mar 20, 2025
…4594) This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…stic#124594) This patch builds on the work in elastic#122999 and elastic#113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
elasticsearchmachine pushed a commit that referenced this pull request Mar 21, 2025
#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)
  This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`. (cherry picked from commit 376abfe; conflicts: server/src/main/java/org/elasticsearch/index/IndexVersions.java, server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java)

* Fix ignores malformed testcase (#125337)
  Fix and unmute testSynthesizeArrayRandomIgnoresMalformed. (cherry picked from commit 2ff03ac; conflicts: muted-tests.yml)

* Fix offsets not recording duplicate values (#125354)
  Previously, when calculating the offsets, the values were compared as-is, without any loss of precision. However, when the values were saved into doc values and loaded in the doc values loader, they could have lost precision. This meant that values that were not duplicates when calculating the offsets could be duplicates in the doc values loader. This interfered with the de-duplication logic, causing incorrect values to be returned. The fix is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates. (cherry picked from commit db73175)
jordan-powers added a commit that referenced this pull request Mar 25, 2025
#125529) This patch builds on the work in #113757, #122999, and #124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`.
elasticsearchmachine pushed a commit that referenced this pull request Mar 25, 2025
#125529) (#125596) This patch builds on the work in #113757, #122999, and #124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`. (cherry picked from commit af1f145) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java #	server/src/main/java/org/elasticsearch/index/mapper/BooleanFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 26, 2025
… source (#125709) This patch builds on the work in #113757, #122999, #124594, and #125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 27, 2025
… source (#125709) (#125746) This patch builds on the work in #113757, #122999, #124594, and #125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays. (cherry picked from commit 689eaf2) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java #	x-pack/plugin/mapper-unsigned-long/src/main/java/org/elasticsearch/xpack/unsignedlong/UnsignedLongFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 28, 2025
…source (#125793) This patch builds on the work in #113757, #122999, #124594, #125529, and #125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…stic#124594) This patch builds on the work in elastic#122999 and elastic#113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
elastic#125529) This patch builds on the work in elastic#113757, elastic#122999, and elastic#124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
… source (elastic#125709) This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, and elastic#125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…source (elastic#125793) This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, elastic#125529, and elastic#125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 28, 2025
…source (#125793) (#125891) This patch builds on the work in #113757, #122999, #124594, #125529, and #125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays. (cherry picked from commit 71e74bd) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java

Labels

auto-backport Automatically create backport pull requests when merged >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine test-full-bwc Trigger full BWC version matrix tests v8.19.0 v9.1.0

5 participants