
Conversation

@wj-Mcat (Contributor) commented Jan 8, 2024

PR types

Bug fixes

PR changes

Tokenizer

Description

Update the tokenizers of chatglm2 and chatglm3.
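
For context, the kind of fix this PR targets is that role tokens such as `<|user|>` need to be kept whole and mapped to dedicated ids rather than being split into SentencePiece pieces. Below is a minimal, hypothetical sketch of that general technique only; the `SpecialTokenMapper` class and its layout are invented for illustration and are not the code added by this PR.

```python
# Hypothetical sketch: map special/role tokens to fixed ids placed right
# after a SentencePiece vocabulary. Not the PaddleNLP implementation.
import sentencepiece as spm


class SpecialTokenMapper:
    def __init__(self, model_file: str, special_tokens: list):
        self.sp = spm.SentencePieceProcessor(model_file=model_file)
        # Special tokens get ids directly after the SentencePiece vocabulary,
        # so the base vocabulary itself is left untouched.
        base = self.sp.get_piece_size()
        self.special_to_id = {tok: base + i for i, tok in enumerate(special_tokens)}

    def tokenize(self, text: str):
        # Simplification: if the whole input is a special token, keep it whole
        # instead of letting SentencePiece split it into pieces.
        if text in self.special_to_id:
            return [text]
        return self.sp.encode(text, out_type=str)

    def convert_tokens_to_ids(self, tokens):
        return [self.special_to_id.get(t, self.sp.piece_to_id(t)) for t in tokens]
```

Placing the special ids after the SentencePiece vocabulary appears consistent with the 647xx ids the chatglm2/3 checkpoints report in the log further down, though the exact layout used by the PR is defined in `paddlenlp/transformers/chatglm_v2/tokenizer.py`.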

paddle-bot (bot) commented Jan 8, 2024

Thanks for your contribution!

@wj-Mcat (Contributor, Author) commented Jan 9, 2024

Testing the chatglm tokenizers with the current code:

```python
from paddlenlp.transformers import AutoTokenizer


def print_special_tokens(tokenizer):
    tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"]
    role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
    tokens = tokens + role_special_tokens
    for token in tokens:
        print("============================================================")
        print("token ->", token)
        tokens = tokenizer.tokenize(token)
        print("tokens->", tokens)
        ids = tokenizer.convert_tokens_to_ids([token])
        print("ids ->", ids)


model_names = ["THUDM/chatglm-6b-v1.1", "THUDM/chatglm2-6b", "THUDM/chatglm3-6b"]
for model_name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print_special_tokens(tokenizer)
```

Log:

```
/root/paddlejob/workspace/envs_paddle/wjj/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2024-01-10 15:07:05,338] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,339] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm.tokenizer.ChatGLMTokenizer'> to load 'THUDM/chatglm-6b-v1.1'.
[2024-01-10 15:07:05,339] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/ice_text.model
[2024-01-10 15:07:05,339] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,388] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json> not exist
[2024-01-10 15:07:05,389] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,425] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json> not exist
[2024-01-10 15:07:05,425] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,425] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [130000]
============================================================
token -> [gMASK]
tokens-> ['▁[', 'g', 'MASK', ']']
ids -> [130001]
============================================================
token -> [sMASK]
tokens-> ['▁[', 's', 'MASK', ']']
ids -> [130002]
============================================================
token -> sop
tokens-> ['▁so', 'p']
ids -> [0]
============================================================
token -> eop
tokens-> ['▁e', 'op']
ids -> [0]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'assistant', '|', '>']
ids -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'observation', '|', '>']
ids -> [0]
[2024-01-10 15:07:05,610] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,613] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm2-6b'.
[2024-01-10 15:07:05,614] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer.model
[2024-01-10 15:07:05,614] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,649] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,649] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,693] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,694] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,694] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids -> [64793]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'ass', 'istant', '|', '>']
ids -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'ob', 'serv', 'ation', '|', '>']
ids -> [0]
[2024-01-10 15:07:05,729] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,729] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm3-6b'.
[2024-01-10 15:07:05,730] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer.model
[2024-01-10 15:07:05,730] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,770] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,771] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,806] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,806] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,806] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids -> [64793]
============================================================
token -> <|system|>
tokens-> ['<|system|>']
ids -> [64794]
============================================================
token -> <|user|>
tokens-> ['<|user|>']
ids -> [64795]
============================================================
token -> <|assistant|>
tokens-> ['<|assistant|>']
ids -> [64796]
============================================================
token -> <|observation|>
tokens-> ['<|observation|>']
ids -> [64797]
```
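
The last block of that log is the key result: with the updated ChatGLMv2Tokenizer, chatglm3 now keeps each role token whole and maps it to its own id. A quick sanity check is sketched below, using the same PaddleNLP API as the test script above; the expected values are copied straight from the log, and the snippet itself is illustrative rather than part of the PR.

```python
from paddlenlp.transformers import AutoTokenizer

# Expected values taken from the chatglm3-6b section of the log above.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b")

assert tokenizer.tokenize("<|user|>") == ["<|user|>"]
assert tokenizer.convert_tokens_to_ids(["<|user|>"]) == [64795]
assert tokenizer.convert_tokens_to_ids(["<|assistant|>"]) == [64796]
assert tokenizer.convert_tokens_to_ids(["[gMASK]", "sop"]) == [64790, 64792]
```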
codecov (bot) commented Jan 10, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (17acf22) 57.42% compared to head (ebe8c6d) 56.96%.
Report is 19 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| paddlenlp/transformers/chatglm_v2/tokenizer.py | 90.00% | 2 Missing ⚠️ |
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop    #7797      +/-   ##
===========================================
- Coverage    57.42%   56.96%   -0.46%
===========================================
  Files          585      587       +2
  Lines        87976    88647     +671
===========================================
- Hits         50517    50498      -19
- Misses       37459    38149     +690
```

☔ View full report in Codecov by Sentry.

@wj-Mcat marked this pull request as ready for review on January 12, 2024 07:35
@JunnYu (Member) left a comment


LGTM

@JunnYu merged commit b44f888 into PaddlePaddle:develop on Jan 12, 2024
JunnYu pushed a commit that referenced this pull request Jan 12, 2024
* update chatglm tokenizer
* update chatglm2 tokenizer
* update chatglm2 tokenizer
* update max & src slider
* add chatglm2 tokenizer
