
Conversation

@davidkyle
Member

Certain input strings can be tokenised in such a way that a token is an accent that is later removed when strip accents is enabled. This results in an empty string that the WordPiece algorithm skips over, but the code expects at least one token to be produced.

The fix is to have the basic token filter proceed to the next token if the token is an empty string.
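A minimal Python sketch of the failure mode and the fix, assuming the usual BERT-style accent stripping (Unicode NFD decomposition followed by dropping combining marks); this is illustrative only, not the Elasticsearch Java code, and `strip_accents`/`basic_tokenize` are hypothetical names:

```python
import unicodedata

def strip_accents(token: str) -> str:
    # Decompose to NFD, then drop combining marks (Unicode category "Mn").
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def basic_tokenize(text: str) -> list[str]:
    tokens = []
    for raw in text.split():
        stripped = strip_accents(raw)
        if not stripped:
            # The fix: a token that is purely an accent (e.g. U+0300,
            # COMBINING GRAVE ACCENT) strips to "", so proceed to the
            # next token instead of emitting an empty string.
            continue
        tokens.append(stripped)
    return tokens

print(strip_accents("\u0300"))                  # -> "" (a bare accent strips to nothing)
print(basic_tokenize("caf\u00e9 \u0300 test"))  # -> ['cafe', 'test']
```

Without the `continue`, the empty string would be passed to WordPiece, which skips it, leaving fewer outputs than the caller expects.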

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 4, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

Member

@benwtrent benwtrent left a comment


I think this is good.

My main concern is that we test the edge cases: the entire input being accents (if possible), and the accents to strip falling at the end of the input.

@davidkyle
Member Author

Thanks for the review, I added tests to cover those edge cases:

  • When the input ends in an accent, the accent is truncated
  • When the input is only accents, the result is an empty tokenisation containing only the CLS and SEP tokens
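The two edge cases above can be sketched in Python as follows; the `[CLS]`/`[SEP]` marker strings and the `tokenize_with_markers` helper are assumptions for illustration, not the actual test code from the PR:

```python
import unicodedata

CLS, SEP = "[CLS]", "[SEP]"  # assumed marker names, for illustration only

def strip_accents(token: str) -> str:
    # NFD-decompose, then drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def tokenize_with_markers(text: str) -> list[str]:
    # Drop tokens that strip to the empty string, then wrap the
    # surviving tokens in the CLS/SEP markers.
    body = [t for t in (strip_accents(raw) for raw in text.split()) if t]
    return [CLS] + body + [SEP]

# Edge case 1: a trailing bare accent is truncated from the output.
print(tokenize_with_markers("hello \u0301"))   # -> ['[CLS]', 'hello', '[SEP]']
# Edge case 2: an all-accent input yields only the markers.
print(tokenize_with_markers("\u0300 \u0301"))  # -> ['[CLS]', '[SEP]']
```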
@davidkyle davidkyle merged commit 40bb2dd into elastic:main Jul 4, 2023
@davidkyle davidkyle deleted the delimiter-fix branch July 4, 2023 15:00
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Jul 4, 2023
… empty string (elastic#97354) Stripping accents sometimes results in an empty string which is then skipped by the WordPiece function causing an error. The basic tokenizer now consumes tokens until a non empty string is found
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch 8.9: backport successful
elasticsearchmachine pushed a commit that referenced this pull request Jul 4, 2023
… empty string (#97354) (#97367) Stripping accents sometimes results in an empty string which is then skipped by the WordPiece function causing an error. The basic tokenizer now consumes tokens until a non empty string is found

Labels

>bug :ml Machine learning Team:ML Meta label for the ML team v8.9.0 v8.10.0

3 participants