Conversation

@benwtrent
Member

Many multilingual and newer models use a tokenization scheme similar to SentencePiece. This PR adds support for one of those schemes, XLM-RoBERTa.

The main changes are:

  • Support for the xlm_roberta tokenization configuration
  • Adding scores to the stored vocabulary document, requiring that scores be the same size as the vocabulary
  • Adding a new flat text file to resources that holds the SentencePiece (spm) char normalizer
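
To illustrate why each vocabulary entry needs a score of its own, here is a minimal sketch of score-based (unigram-style) segmentation, the general approach SentencePiece models such as XLM-RoBERTa's rely on. It is not the code from this PR: the class name `UnigramSketch`, the toy vocabulary, and the scores are all made up for illustration, `_` stands in for the SentencePiece word-boundary marker, and unknown-token handling and normalization are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Elasticsearch implementation.
class UnigramSketch {

    /** Builds a token -> score lookup. The two lists are parallel, hence the equal-size requirement. */
    static Map<String, Double> index(List<String> vocabulary, List<Double> scores) {
        if (vocabulary.size() != scores.size()) {
            throw new IllegalArgumentException("scores must be the same size as vocabulary");
        }
        Map<String, Double> scoreByToken = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            scoreByToken.put(vocabulary.get(i), scores.get(i));
        }
        return scoreByToken;
    }

    /** Viterbi search for the segmentation of text with the highest total score. */
    static List<String> tokenize(String text, Map<String, Double> scoreByToken) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best total score for text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start offset of the last token ending at i
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Double score = scoreByToken.get(text.substring(start, end));
                if (score != null && best[start] + score > best[end]) {
                    best[end] = best[start] + score;
                    prev[end] = start;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) {
            throw new IllegalArgumentException("text cannot be segmented with this vocabulary");
        }
        // Walk the back-pointers from the end of the text, then reverse into reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Double> scoreByToken = index(
            List.of("_hello", "_", "hell", "o"),
            List.of(-1.0, -2.0, -3.0, -2.5)
        );
        System.out.println(tokenize("_hello", scoreByToken));
    }
}
```

With these toy scores, the single token `_hello` (total -1.0) beats the split `_` + `hell` + `o` (total -7.5). A vocabulary entry without a score could not take part in that comparison, which is why the two lists need to line up.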
@benwtrent benwtrent added >feature :ml Machine learning v8.8.0 labels Feb 23, 2023
@github-actions
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Feb 23, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent
Member Author

@elasticmachine update branch

@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@benwtrent
Member Author

@elasticmachine update branch

@davidkyle davidkyle added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label May 25, 2023
@davidkyle
Member

@elasticmachine update branch

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent requested review from davidkyle and szabosteve June 6, 2023 18:25
Contributor

@szabosteve szabosteve left a comment

Thanks for writing the docs for these changes! LGTM!
I've added xlm_roberta to the list of tokenization values via f0748b8, hope you don't mind.

Member

@davidkyle davidkyle left a comment

LGTM

Can you look at the transport version number, please?

if (vocabulary.isEmpty()) {
    validationException = addValidationError("[vocabulary] must not be empty", validationException);
} else {
    if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
Member

Suggested change
-    if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
+    if (scores.size() != vocabulary.size()) {
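
The difference between the condition under review and the suggested one comes down to how an empty scores list is treated. The sketch below is a standalone illustration, not the code from this PR: `VocabularyValidationSketch`, both method names, and the `[scores]` error message are hypothetical, and a plain list of messages stands in for the `addValidationError` call shown in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the validation under review; names and the
// [scores] message are assumptions made for this illustration.
class VocabularyValidationSketch {

    /** Condition as written in the diff: an empty scores list is accepted. */
    static List<String> validateLenient(List<String> vocabulary, List<Double> scores) {
        List<String> errors = new ArrayList<>();
        if (vocabulary.isEmpty()) {
            errors.add("[vocabulary] must not be empty");
        } else if (scores.isEmpty() == false && scores.size() != vocabulary.size()) {
            errors.add("[scores] must be the same size as [vocabulary]");
        }
        return errors;
    }

    /** Condition after the suggested change: scores must always match the vocabulary size. */
    static List<String> validateStrict(List<String> vocabulary, List<Double> scores) {
        List<String> errors = new ArrayList<>();
        if (vocabulary.isEmpty()) {
            errors.add("[vocabulary] must not be empty");
        } else if (scores.size() != vocabulary.size()) {
            errors.add("[scores] must be the same size as [vocabulary]");
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("a", "b");
        List<Double> noScores = List.of();
        System.out.println(validateLenient(vocabulary, noScores)); // no errors reported
        System.out.println(validateStrict(vocabulary, noScores));  // one error reported
    }
}
```

Under the lenient form a vocabulary can be stored without any scores. Under the strict form every vocabulary must carry one score per entry, so which form is appropriate depends on whether tokenizers that do not use scores share this validation path.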
@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 13, 2023
@elasticsearchmachine elasticsearchmachine merged commit 14ca8fe into elastic:main Jun 13, 2023
@benwtrent benwtrent deleted the feature/ml-add-xlm-roberta-support branch June 13, 2023 12:41
benwtrent added a commit to elastic/eland that referenced this pull request Jun 14, 2023
This allows XLMRoberta models to be uploaded to Elasticsearch. blocked by: elastic/elasticsearch#94089
picandocodigo pushed a commit to elastic/eland that referenced this pull request Jul 11, 2023
This allows XLMRoberta models to be uploaded to Elasticsearch. blocked by: elastic/elasticsearch#94089
Labels

  • auto-merge-without-approval (Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!))
  • cloud-deploy (Publish cloud docker image for Cloud-First-Testing)
  • >feature
  • :ml (Machine learning)
  • Team:ML (Meta label for the ML team)
  • v8.9.0

6 participants