
Conversation

@DrownFish19 (Collaborator)

PR types

New features

PR changes

APIs

Description

Add Fast Tokenizer.

  • Use the tokenizers library as the backend of the new fast tokenizers.
  • Keep the new fast tokenizers compatible with the current (slow) tokenizers.
  • LLaMA 3 and LLaMA 3.1 can use PretrainedTokenizerFast for better performance; LLaMA 1 and LLaMA 2 can likewise use LlamaTokenizerFast to speed up tokenization (see the sketch below).
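
A minimal usage sketch of the fast path described above (not from the PR itself). The AutoTokenizer import is existing PaddleNLP API; the use_fast flag and the model name are assumptions, not verified against the merged code.

```python
# A minimal sketch based on the PR description; `use_fast` and the exact
# model name are assumptions, not verified against the merged code.
from paddlenlp.transformers import AutoTokenizer

# Assumed: use_fast=True selects the tokenizers-backed fast tokenizer
# (PretrainedTokenizerFast) when one is available for the model.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", use_fast=True
)

encoded = tokenizer("Fast tokenization with a tokenizers backend.")
print(encoded["input_ids"])
```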
@paddle-bot

paddle-bot bot commented Jul 30, 2024

Thanks for your contribution!

@DrownFish19 force-pushed the dev_add_tokenizer_fast branch 3 times, most recently from 5b8dc52 to 5355615, on August 2, 2024, 12:06.
@codecov

codecov bot commented Aug 2, 2024

Codecov Report

Attention: Patch coverage is 49.03537% with 317 lines in your changes missing coverage. Please review.

Project coverage is 54.81%. Comparing base (e0d2809) to head (e63092e).
Report is 225 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddlenlp/transformers/convert_slow_tokenizer.py | 21.73% | 126 Missing ⚠️ |
| paddlenlp/transformers/tokenizer_utils_fast.py | 60.93% | 125 Missing ⚠️ |
| paddlenlp/transformers/llama/tokenizer_fast.py | 43.58% | 44 Missing ⚠️ |
| paddlenlp/transformers/tokenizer_utils_base.py | 58.69% | 19 Missing ⚠️ |
| paddlenlp/transformers/tokenizer_utils.py | 70.00% | 3 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           develop    #8832      +/-   ##
===========================================
+ Coverage    54.79%   54.81%   +0.01%
===========================================
  Files          636      639       +3
  Lines        99876   100475     +599
===========================================
+ Hits         54732    55079     +347
- Misses       45144    45396     +252
```

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ZHUI merged commit d2d4d92 into PaddlePaddle:develop on August 19, 2024.
@DrownFish19 deleted the dev_add_tokenizer_fast branch on August 19, 2024, 03:12.
Mangodadada pushed a commit to Mangodadada/PaddleNLP referencing this pull request on September 10, 2024:

  • add fast tokenizer
  • add convert slow tokenizer method (a hedged sketch follows)
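
As an illustration of that commit message: the diff adds paddlenlp/transformers/convert_slow_tokenizer.py, so a slow-to-fast conversion entry point presumably lives there. The function name convert_slow_tokenizer and its return type are assumptions modeled on the equivalent Hugging Face helper, not taken from this PR.

```python
# Hypothetical sketch of the slow-to-fast conversion path. The module
# paddlenlp/transformers/convert_slow_tokenizer.py exists in the diff;
# the function name `convert_slow_tokenizer` and its assumed return value
# (a tokenizers.Tokenizer) are modeled on the Hugging Face helper of the
# same name and are not verified against this PR.
from paddlenlp.transformers import LlamaTokenizer
from paddlenlp.transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = LlamaTokenizer.from_pretrained("facebook/llama-7b")  # a slow tokenizer
backend = convert_slow_tokenizer(slow)  # assumed: returns tokenizers.Tokenizer
print(backend.encode("Hello, fast tokenizer!").tokens)
```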
