
Conversation


@martijnvg martijnvg commented Sep 30, 2024

The keyword doc values field gets an extra sorted doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. It is stored as an offset-to-ordinal array that gets zigzag vint encoded into a sorted doc values field.

For example, take the following string array for a keyword field: ["c", "b", "a", "c"].
The sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be [2, 1, 0, 2].

Null values are also supported. For example, ["c", "b", null, "c"] results in sorted set doc values ["b", "c"] with ordinals 0 and 1. The offset array will be [1, 0, -1, 1].

Empty arrays are also supported by encoding a zigzag vint array of zero elements.
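The offset-to-ordinal mapping described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Elasticsearch implementation; the class and method names are hypothetical. It derives the offset array (with -1 for null) and shows the zigzag transform that keeps -1 small when vint encoded:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the encoding idea from the PR description.
public class OffsetSketch {

    // Map each original value to the ordinal of its position in the
    // sorted, de-duplicated value set; nulls are encoded as -1.
    static int[] toOffsets(String[] values) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String v : values) {
            if (v != null) {
                sorted.add(v);
            }
        }
        List<String> ordinals = new ArrayList<>(sorted);
        int[] offsets = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = values[i] == null ? -1 : ordinals.indexOf(values[i]);
        }
        return offsets;
    }

    // Standard zigzag transform: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    // so small negative offsets stay small when vint encoded.
    static int zigzag(int v) {
        return (v << 1) ^ (v >> 31);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toOffsets(new String[] {"c", "b", "a", "c"})));
        // [2, 1, 0, 2]
        System.out.println(Arrays.toString(toOffsets(new String[] {"c", "b", null, "c"})));
        // [1, 0, -1, 1]
        System.out.println(zigzag(-1));
        // 1
    }
}
```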

Limitations:

  • currently, arrays are only supported via doc values, and only for the keyword field mapper.
  • multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
  • arrays are always synthesized as one type. In the case of a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.

With this PR, keyword field arrays are no longer stored in ignored source; instead, array offsets are tracked in an adjacent sorted doc values field. This only applies if index.mapping.synthetic_source_keep is set to arrays (the default for logsdb).

The keyword doc values field gets an extra binary doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. It is stored as an offset-to-ordinal array that gets vint encoded into the binary doc values field. The additional storage required for this will likely be minimized with elastic#112416 (zstd compression for binary doc values).

For example, take the following string array for a keyword field: ["c", "b", "a", "c"]. The sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2. The offset array will be [2, 1, 0, 2].

Limitations:

  • only support for the keyword field mapper.
  • multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
  • empty arrays ([]) are not recorded
  • arrays are always synthesized as one type. In the case of a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.
@martijnvg martijnvg added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Sep 30, 2024

public int getArrayValueCount(String field) {
    if (numValuesByField.containsKey(field)) {
        return numValuesByField.get(field) + 1;
Contributor

Why +1?

Member Author

numValuesByField returns the last offset; in order to get the count, that returned value needs to be incremented by one.

numValuesByField should be named something else.

}

public void recordOffset(String fieldName, String value) {
    int count = numValuesByField.compute(fieldName, (s, integer) -> integer == null ? 0 : ++integer);
Contributor

Related to comment above, maybe 1 here instead of 0?

Member Author

count is the wrong name here. It is more like the next offset to be used.
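The relationship discussed in this thread (the map holds the last offset handed out per field, so the value count is that value plus one) can be sketched like this. This is a simplified, hypothetical stand-in for the actual class, reusing the names from the snippet above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of the offset-tracking discussed in the review.
public class OffsetTracker {
    // Per field, the last offset handed out (0-based), not the count.
    private final Map<String, Integer> numValuesByField = new HashMap<>();

    // Returns the next offset for this field: 0 for the first value, then 1, 2, ...
    public int recordOffset(String fieldName) {
        return numValuesByField.compute(fieldName, (k, last) -> last == null ? 0 : last + 1);
    }

    // The stored value is the last offset, so the count is last + 1.
    public int getArrayValueCount(String field) {
        Integer last = numValuesByField.get(field);
        return last == null ? 0 : last + 1;
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        for (String v : new String[] {"c", "b", "a", "c"}) {
            System.out.println(v + " -> offset " + tracker.recordOffset("field"));
        }
        System.out.println("count = " + tracker.getArrayValueCount("field"));
        // count = 4
    }
}
```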

ord++;
}

logger.info("values=" + values);
Contributor

maybe debug?

ords[i] = dv.nextOrd();
}

logger.info("ords=" + Arrays.toString(ords));
Contributor

I guess you are using info just for debugging while in draft.

Member Author

Yes, this was for debugging purposes. This will be removed.

Contributor

@lkts lkts left a comment

This is a very powerful idea. Some thoughts:

  • When we generalize this, we need to think about how to fit it into existing code nicely (leave poor DocumentParser alone). I was thinking lately that maybe DocumentParser could produce events like "parsing array", "parsing object", "parsing value", and then we could subscribe to such events and do our thing. I didn't dive too deep into this though.
  • I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".
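The second suggestion could look roughly like the following. This is purely hypothetical (flag names, bit layout, and the class are invented here just to illustrate the idea of a leading info byte carrying metadata such as "empty array"):

```java
// Hypothetical sketch of a leading metadata byte for the offsets encoding.
public class ArrayMeta {
    static final int EMPTY_ARRAY = 1;         // bit 0: the array had zero elements
    static final int SINGLE_ELEMENT = 1 << 1; // bit 1: the array had exactly one element

    // Build the info byte for an array with the given element count.
    static byte header(int elementCount) {
        int flags = 0;
        if (elementCount == 0) {
            flags |= EMPTY_ARRAY;
        }
        if (elementCount == 1) {
            flags |= SINGLE_ELEMENT;
        }
        return (byte) flags;
    }

    public static void main(String[] args) {
        System.out.println(header(0)); // 1 (EMPTY_ARRAY set)
        System.out.println(header(1)); // 2 (SINGLE_ELEMENT set)
        System.out.println(header(4)); // 0 (no flags)
    }
}
```

With a flag byte like this, an empty array would need no offset payload at all, and a single-element array could skip the per-value offsets.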
}
}

public void processOffsets(DocumentParserContext context) throws IOException {
Contributor

Should this be called from postParse? It is possible that a field is indexed multiple times in one document with object arrays.

Contributor

I am surprised randomized tests don't complain about this.

Member Author

👍 I think postParse is a better place to invoke this process offsets logic.

@martijnvg
Member Author

martijnvg commented Oct 4, 2024

Thanks for taking a look @lkts!

I was thinking lately that maybe DocumentParser could produce events like "parsing array", "parsing object", "parsing value", and then we could subscribe to such events and do our thing. I didn't dive too deep into this though.

I think we already have something like this via the FieldMapper#parsesArrayValue() flag. I initially tried using this, because I don't like introducing more complexity in DocumentParser. However, it didn't work out in all cases. I recall that tests using copy_to failed. For some reason, fields that override that method and return true are not taken into account with copy_to. Maybe we should first make field mappers that choose to override FieldMapper#parsesArrayValue() work correctly with copy_to.

I wonder if we can have a byte or two in the beginning of such encoding that can carry meta information. An example would be an "empty array" flag or "single element array".

Good point, I think we can have an info byte where this kind of information can be encoded. I was thinking of something similar earlier, but left it out of this draft PR in order to keep it simple for demonstration purposes.

@martijnvg
Member Author

@salvatore-campagna and @lkts I made a few changes, and the keyword field mapper now overrides parsesArrayValue(), so that it can parse arrays. This allows minimizing the changes in DocumentParser.

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
… source Backporting elastic#113757 to 8.x branch.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 20, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine pushed a commit that referenced this pull request Feb 20, 2025
… source (#122997) Backporting #113757 to 8.x branch.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 21, 2025
Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit that referenced this pull request Feb 25, 2025
…22999) Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 25, 2025
Backporting elastic#122999 to 8.x branch. Follow up of elastic#113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source.
elasticsearchmachine pushed a commit that referenced this pull request Feb 25, 2025
…ce (#123405) * [8.x] Store arrays offsets for ip fields natively with synthetic source Backporting #122999 to 8.x branch. Follow up of #113757 and adds support to natively store array offsets for ip fields instead of falling back to ignored source. * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
jordan-powers added a commit that referenced this pull request Mar 20, 2025
…4594) This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025
…stic#124594) This patch builds on the work in elastic#122999 and elastic#113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
elasticsearchmachine pushed a commit that referenced this pull request Mar 21, 2025
#124594) | Fix ignores malformed testcase (#125337) | Fix offsets not recording duplicate values (#125354) (#125440)

* Natively store synthetic source array offsets for numeric fields (#124594)
  This patch builds on the work in #122999 and #113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`. (cherry picked from commit 376abfe; conflicts: server/src/main/java/org/elasticsearch/index/IndexVersions.java, server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java)

* Fix ignores malformed testcase (#125337)
  Fix and unmute testSynthesizeArrayRandomIgnoresMalformed. (cherry picked from commit 2ff03ac; conflicts: muted-tests.yml)

* Fix offsets not recording duplicate values (#125354)
  Previously, when calculating the offsets, the values were compared as-is, without any loss of precision. However, when the values were saved into doc values and loaded in the doc values loader, they could have lost precision. This meant that values that were not duplicates when calculating the offsets could be duplicates in the doc values loader. This interfered with the de-duplication logic, causing incorrect values to be returned. The fix is to apply the precision loss before calculating the offsets, so that both the offsets calculation and the SortedNumericDocValues de-duplication see the same values as duplicates. (cherry picked from commit db73175)
jordan-powers added a commit that referenced this pull request Mar 25, 2025
#125529) This patch builds on the work in #113757, #122999, and #124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`.
elasticsearchmachine pushed a commit that referenced this pull request Mar 25, 2025
#125529) (#125596) This patch builds on the work in #113757, #122999, and #124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`. (cherry picked from commit af1f145) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java #	server/src/main/java/org/elasticsearch/index/mapper/BooleanFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 26, 2025
… source (#125709) This patch builds on the work in #113757, #122999, #124594, and #125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 27, 2025
… source (#125709) (#125746) This patch builds on the work in #113757, #122999, #124594, and #125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays. (cherry picked from commit 689eaf2) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java #	x-pack/plugin/mapper-unsigned-long/src/main/java/org/elasticsearch/xpack/unsignedlong/UnsignedLongFieldMapper.java
jordan-powers added a commit that referenced this pull request Mar 28, 2025
…source (#125793) This patch builds on the work in #113757, #122999, #124594, #125529, and #125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…stic#124594) This patch builds on the work in elastic#122999 and elastic#113757 to natively store array offsets for numeric fields instead of falling back to ignored source when `source_keep_mode: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
elastic#125529) This patch builds on the work in elastic#113757, elastic#122999, and elastic#124594 to natively store array offsets for boolean fields instead of falling back to ignored source when `synthetic_source_keep: arrays`.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
… source (elastic#125709) This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, and elastic#125529 to natively store array offsets for unsigned long fields instead of falling back to ignored source when synthetic_source_keep: arrays.
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…source (elastic#125793) This patch builds on the work in elastic#113757, elastic#122999, elastic#124594, elastic#125529, and elastic#125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays.
elasticsearchmachine pushed a commit that referenced this pull request Mar 28, 2025
…source (#125793) (#125891) This patch builds on the work in #113757, #122999, #124594, #125529, and #125709 to natively store array offsets for scaled float fields instead of falling back to ignored source when synthetic_source_keep: arrays. (cherry picked from commit 71e74bd) # Conflicts: #	server/src/main/java/org/elasticsearch/index/IndexVersions.java

Labels

auto-backport Automatically create backport pull requests when merged >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine test-full-bwc Trigger full BWC version matrix tests v8.19.0 v9.1.0

5 participants