
Conversation

DrownFish19 (Collaborator) commented Aug 23, 2024

PR types

Bug fixes

PR changes

Others

Description

The new tokenizer_config.json now includes added_tokens_decoder, and we load it in PretrainedTokenizer._pre_init.

  1. Fixes the inability to add tokens to the llama, gemma, and mamba tokenizers.
  2. Newly added tokens and the original added_tokens_decoder entries are all saved back into the added_tokens_decoder dict, so they can be reloaded next time with their ids unchanged.
  3. The current added_tokens_decoder can be loaded by from_pretrained, keeping the ids in tokenizer_config.json stable; see the round-trip sketch below.
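
A minimal sketch of the save/load round-trip this enables (the paths and token strings are placeholders, and the exact PaddleNLP behavior may differ in detail):

```python
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/base_tokenizer")  # placeholder path
tokenizer.add_tokens(["<custom_1>", "<custom_2>"])  # previously failed for llama/gemma/mamba
tokenizer.save_pretrained("./my_tokenizer")  # added_tokens_decoder is written to tokenizer_config.json

# _pre_init restores added_tokens_decoder, so ids survive the round-trip.
reloaded = AutoTokenizer.from_pretrained("./my_tokenizer")
assert len(reloaded) == len(tokenizer)
```
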
paddle-bot bot commented Aug 23, 2024

Thanks for your contribution!

DrownFish19 changed the title [tokenizer] fix added_tokens_decoder load → [Tokenizer] fix added_tokens_decoder load Aug 23, 2024
codecov bot commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.

Project coverage is 53.89%. Comparing base (9f6b486) to head (d6f2f38).
Report is 239 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| paddlenlp/transformers/gemma/tokenizer.py | 81.81% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff             @@
##           develop    #8997      +/-   ##
===========================================
- Coverage    54.51%   53.89%    -0.63%
===========================================
  Files          648      652        +4
  Lines       103473   104388      +915
===========================================
- Hits         56406    56255      -151
- Misses       47067    48133     +1066
```


"""
return len(self.encoder)

def __len__(self):
DrownFish19 (Collaborator, Author):

The mamba tokenizer's added_tokens_decoder contains two tokens at ids [0, 1] that duplicate entries already in the base vocab; the previous computation counted these two tokens twice.
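
A minimal, self-contained sketch of the deduplicated count (toy numbers; the function and its arguments are illustrative, not the actual PaddleNLP code):

```python
def full_vocab_len(vocab_size: int, added_tokens_decoder: dict) -> int:
    """Count every token id exactly once: added-token ids that already
    sit inside the base vocab (like mamba's ids 0 and 1) are skipped."""
    extra = sum(1 for tok_id in added_tokens_decoder if tok_id >= vocab_size)
    return vocab_size + extra

# Toy base vocab of 10 ids; ids 0 and 1 duplicate base entries, id 10 is new.
assert full_vocab_len(10, {0: "<bos>", 1: "<eos>", 10: "<new>"}) == 11
```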

"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):
DrownFish19 (Collaborator, Author):

Fixes the inability to add tokens.
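
One plausible shape of the fix, shown as a sketch (it assumes the HF-style convention that add_tokens() assigns each new token the id len(self); the actual PaddleNLP change may differ):

```python
    def __len__(self):
        # Returning only the sentencepiece vocab size froze the length,
        # so new tokens were assigned ids that collided with the base
        # vocab. Counting added tokens lets new ids grow past it.
        return self.vocab_size + len(self.added_tokens_decoder)
```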

"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):
DrownFish19 (Collaborator, Author):

Fixes the inability to add tokens (same fix as above).
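
A quick illustrative check of the behavior after both fixes (the path and the token string are placeholders):

```python
from paddlenlp.transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama_or_gemma_tokenizer")  # placeholder
n = len(tok)
num_added = tok.add_tokens(["<my_special>"])  # silently ineffective before this PR
assert num_added == 1 and len(tok) == n + 1
```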

DrownFish19 changed the title [Tokenizer] fix added_tokens_decoder load → [Tokenizer] support added_tokens_decoder load Aug 28, 2024
DrownFish19 changed the title [Tokenizer] support added_tokens_decoder load → [Tokenizer] Support for loading added_tokens_decoder Aug 28, 2024
JunnYu (Member) left a comment:

Mamba OK

DrownFish19 merged commit 3e7c5ca into PaddlePaddle:develop Aug 28, 2024
DrownFish19 deleted the dev_20240823_fix_added_tokens_decoder_load branch August 28, 2024 12:38
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024
* fix added_tokens_decoder load
* fix decode
* fix saving and loading added_token_decoder
* fix mamba
* fix special_tokens_map_file load
* fix gemma tokenizer
* fix llama tokenzier
* revert llama tokenizer
* fix _decode