
Conversation

@davidkyle
Member

Certain input strings can be tokenised in such a way that a token is an accent that is later removed when strip accents is enabled. This results in an empty string that the WordPiece algorithm skips over, but the code expects at least one token to be produced.

The fix is to have the basic token filter proceed to the next token if the token is an empty string.
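A minimal Python sketch of the failure mode and the fix, assuming the usual BERT-style accent stripping (Unicode NFD decomposition followed by dropping combining marks); this is illustrative only, not the Elasticsearch Java code, and `strip_accents`/`basic_tokenize` are hypothetical names:

```python
import unicodedata

def strip_accents(token: str) -> str:
    # Decompose to NFD, then drop combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def basic_tokenize(text: str) -> list[str]:
    tokens = []
    for raw in text.split():
        stripped = strip_accents(raw)
        if not stripped:
            # The fix: a token that is purely an accent (e.g. U+0300,
            # COMBINING GRAVE ACCENT) strips to "", so proceed to the
            # next token instead of emitting an empty string.
            continue
        tokens.append(stripped)
    return tokens

print(strip_accents("\u0300"))                  # -> "" (a bare accent strips to nothing)
print(basic_tokenize("caf\u00e9 \u0300 test"))  # -> ['cafe', 'test']
```

Without the `continue`, the empty string would be passed to WordPiece, which skips it, leaving fewer outputs than the caller expects.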

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 4, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Member

@benwtrent benwtrent left a comment


I think this is good.

My main concern is that we test the edge cases: the entire input being accents (if possible), and the accents to strip falling at the end of the input.

@davidkyle
Member Author

Thanks for the review, I added tests to cover those edge cases:

  • When the input ends in an accent, the accent is truncated
  • When the input is only accents, the result is an empty tokenisation containing only the CLS and SEP tokens
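The two edge cases above can be sketched in Python as follows; the `[CLS]`/`[SEP]` marker strings and the `tokenize_with_markers` helper are assumptions for illustration, not the actual test code from the PR:

```python
import unicodedata

CLS, SEP = "[CLS]", "[SEP]"  # assumed marker names, for illustration only

def strip_accents(token: str) -> str:
    # NFD-decompose, then drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def tokenize_with_markers(text: str) -> list[str]:
    # Drop tokens that strip to the empty string, then wrap the
    # surviving tokens in the CLS/SEP markers.
    body = [t for t in (strip_accents(raw) for raw in text.split()) if t]
    return [CLS] + body + [SEP]

# Edge case 1: a trailing bare accent is truncated from the output.
print(tokenize_with_markers("hello \u0301"))   # -> ['[CLS]', 'hello', '[SEP]']
# Edge case 2: an all-accent input yields only the markers.
print(tokenize_with_markers("\u0300 \u0301"))  # -> ['[CLS]', '[SEP]']
```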
@davidkyle davidkyle merged commit 40bb2dd into elastic:main Jul 4, 2023
@davidkyle davidkyle deleted the delimiter-fix branch July 4, 2023 15:00
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Jul 4, 2023
… empty string (elastic#97354) Stripping accents sometimes results in an empty string which is then skipped by the WordPiece function causing an error. The basic tokenizer now consumes tokens until a non empty string is found
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch 8.9: backport successful
elasticsearchmachine pushed a commit that referenced this pull request Jul 4, 2023
… empty string (#97354) (#97367) Stripping accents sometimes results in an empty string which is then skipped by the WordPiece function causing an error. The basic tokenizer now consumes tokens until a non empty string is found

Labels

>bug :ml Machine learning Team:ML Meta label for the ML team v8.9.0 v8.10.0

3 participants