Calculate text string length correctly for code points outside BMP #132593

parkertimmins · 2025-08-08T16:52:31Z

Strings parsed through the optimized UTF-8 path have their length calculated during parsing. This length must match the length produced when the same string is parsed through the non-optimized path. Specifically, each code point outside the Basic Multilingual Plane (BMP) requires 2 chars in the UTF-16 encoding.
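To illustrate the rule described above, here is a minimal sketch that derives the UTF-16 length directly from UTF-8 bytes. It is not the code changed in this PR: the class `Utf16LengthSketch` and the method `utf16Length` are hypothetical names, and the sketch assumes well-formed UTF-8 input. The only point it shows is that a 4-byte UTF-8 sequence encodes a code point outside the BMP and therefore counts as 2 chars, matching `String.length()`.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LengthSketch {

    // Hypothetical helper for illustration only; not the Elasticsearch implementation.
    // Assumes the input is well-formed UTF-8 and inspects only the lead byte of each sequence.
    static int utf16Length(byte[] utf8) {
        int length = 0;
        int i = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xFF;
            if (lead < 0x80) {
                // 1-byte sequence (ASCII): 1 char in UTF-16
                i += 1;
                length += 1;
            } else if ((lead & 0xE0) == 0xC0) {
                // 2-byte sequence: still inside the BMP, 1 char
                i += 2;
                length += 1;
            } else if ((lead & 0xF0) == 0xE0) {
                // 3-byte sequence: still inside the BMP, 1 char
                i += 3;
                length += 1;
            } else if ((lead & 0xF8) == 0xF0) {
                // 4-byte sequence: outside the BMP, needs a surrogate pair, 2 chars
                i += 4;
                length += 2;
            } else {
                throw new IllegalArgumentException("invalid UTF-8 lead byte at index " + i);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes, outside the BMP)
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Both values should agree, since String.length() counts UTF-16 chars.
        System.out.println(utf16Length(utf8) + " " + s.length());
    }
}
```

Counting one char per code point would report 4 for the sample string, while `String.length()` reports 5; that kind of mismatch between the two parsing paths is what the description refers to.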

elasticsearchmachine · 2025-08-08T16:54:55Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-08-08T16:54:56Z

Hi @parkertimmins, I've created a changelog YAML for you.

jordan-powers

Good catch, thanks for fixing this! Glad to see the randomized testing is finding bugs.

elasticsearchmachine · 2025-08-08T18:14:21Z

💔 Backport failed

Status	Branch	Result
❌	9.1	Commit could not be cherrypicked due to conflicts
❌	8.19	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 132593`

…lastic#132593) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905)

parkertimmins · 2025-08-08T18:27:02Z

💚 All backports created successfully

Status	Branch	Result
✅	9.1
✅	8.19

Questions?

Please refer to the Backport tool documentation.

…lastic#132593) Strings parsed with the optimized UTF8 parsing have their length calculated during parsing. This length should be the same as the length if the string is parsed with the non-optimized path. Specifically, characters outside the basic multilingual plane require 2 chars per code point in the UTF16 encoding. (cherry picked from commit fa6e905) # Conflicts: # muted-tests.yml

martijnvg

LGTM2 - good catch Parker!

Strings outside BMP have 2 chars per code points

424b41d

parkertimmins requested review from jordan-powers and martijnvg August 8, 2025 16:52

parkertimmins added the >bug label Aug 8, 2025

parkertimmins requested a review from a team as a code owner August 8, 2025 16:52

elasticsearchmachine added the needs:triage and v9.2.0 labels Aug 8, 2025

parkertimmins added the auto-backport, v9.1.0, v8.19.1, and :StorageEngine/Mapping labels and removed the needs:triage label Aug 8, 2025

elasticsearchmachine added the Team:StorageEngine label Aug 8, 2025

Update docs/changelog/132593.yaml

26bf4e9

jordan-powers approved these changes Aug 8, 2025

parkertimmins changed the title ~~Strings outside BMP have 2 chars per code points~~ Code points with 4 bytes in UTF-8 will use 2 chars in UTF-16 Aug 8, 2025

parkertimmins changed the title ~~Code points with 4 bytes in UTF-8 will use 2 chars in UTF-16~~ Calculate text string length correctly for code points outside BMP Aug 8, 2025

parkertimmins merged commit fa6e905 into elastic:main Aug 8, 2025
33 checks passed

This was referenced Aug 8, 2025

[CI] LogsDbVersusReindexedLogsDbChallengeRestIT testRandomQueries failing #132376

Closed

[CI] LogsDbVersusLogsDbReindexedIntoStandardModeChallengeRestIT testRandomQueries failing #132377

Closed

elasticsearchmachine added the backport pending label Aug 8, 2025

This was referenced Aug 8, 2025

[CI] LogsDbVersusReindexedIntoStoredSourceChallengeRestIT testRandomQueries failing #132378

Closed

[9.1] Calculate text string length correctly for code points outside BMP (#132593) #132598

Merged

parkertimmins mentioned this pull request Aug 8, 2025

[8.19] Calculate text string length correctly for code points outside BMP (#132593) #132599

Merged

martijnvg reviewed Aug 11, 2025

martijnvg removed the backport pending label Aug 11, 2025
