parkertimmins
Contributor

Strings parsed via the optimized UTF-8 path have their length calculated during parsing. This length must equal the length obtained when the same string is parsed via the non-optimized path. Specifically, each code point outside the Basic Multilingual Plane (4 bytes in UTF-8) requires 2 chars in the UTF-16 encoding, not 1.
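
For illustration, the rule described above can be sketched as follows. This is a minimal, standalone example and not the code changed in this PR: the class name `Utf16LengthSketch` and the helper `utf16Length` are hypothetical, and the sketch assumes the input is already valid UTF-8. It counts one UTF-16 char for every 1-, 2-, or 3-byte sequence and two chars (a surrogate pair) for every 4-byte sequence, then compares the result with `String.length()`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the Elasticsearch implementation.
class Utf16LengthSketch {

    // Counts the UTF-16 chars needed for the given bytes.
    // Assumption: the input is well-formed UTF-8 (no validation is done here).
    static int utf16Length(byte[] utf8) {
        int chars = 0;
        int i = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xFF;
            if (lead < 0x80) {          // 1-byte sequence (ASCII)
                i += 1;
                chars += 1;
            } else if (lead < 0xE0) {   // 2-byte sequence
                i += 2;
                chars += 1;
            } else if (lead < 0xF0) {   // 3-byte sequence (rest of the BMP)
                i += 3;
                chars += 1;
            } else {                    // 4-byte sequence: outside the BMP
                i += 4;
                chars += 2;             // surrogate pair in UTF-16
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf16Length(utf8) + " == " + s.length());
    }
}
```

Counting one char per code point would give 4 for this input, while `String.length()` reports 5, which is the kind of mismatch the description refers to.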

@parkertimmins parkertimmins requested a review from a team as a code owner August 8, 2025 16:52
@elasticsearchmachine elasticsearchmachine added the needs:triage and v9.2.0 labels Aug 8, 2025
@parkertimmins parkertimmins added the auto-backport, v9.1.0, v8.19.1, and :StorageEngine/Mapping labels and removed the needs:triage label Aug 8, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

Contributor

@jordan-powers jordan-powers left a comment

Good catch, thanks for fixing this! Glad to see the randomized testing is finding bugs

@parkertimmins parkertimmins changed the title from "Strings outside BMP have 2 chars per code points" to "Code points with 4 bytes in UTF-8 will use 2 chars in UTF-16" Aug 8, 2025
@parkertimmins parkertimmins changed the title from "Code points with 4 bytes in UTF-8 will use 2 chars in UTF-16" to "Calculate text string length correctly for code points outside BMP" Aug 8, 2025
@parkertimmins parkertimmins merged commit fa6e905 into elastic:main Aug 8, 2025
33 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.1 Commit could not be cherrypicked due to conflicts
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 132593

parkertimmins added a commit to parkertimmins/elasticsearch that referenced this pull request Aug 8, 2025
…lastic#132593) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905)
@parkertimmins
Contributor Author

💚 All backports created successfully

Status Branch Result
9.1
8.19

Questions ?

Please refer to the Backport tool documentation

parkertimmins added a commit to parkertimmins/elasticsearch that referenced this pull request Aug 8, 2025
…lastic#132593) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905) # Conflicts: #	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Aug 8, 2025
…132593) (#132598) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905)
elasticsearchmachine pushed a commit that referenced this pull request Aug 8, 2025
…132593) (#132599) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905) # Conflicts: #	muted-tests.yml
Member

@martijnvg martijnvg left a comment

LGTM2 - good catch Parker!

Labels

auto-backport, >bug, :StorageEngine/Mapping, Team:StorageEngine, v8.19.1, v9.1.0, v9.2.0

4 participants