You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Fix][Tokenizer] Fix failure in decoding tokens for ByteLevel BPE (#2649)
This PR fixes the issue where the tokenizer would fail in decoding tokens for ByteLevel BPE when the token is not recognized by ByteLevel. E.g. in decoding, ``` "hello" -> "hello" (recognized by ByteLevel) "Ġthere" -> " there" (recognized by ByteLevel) "\n" -> not recognized by ByteLevel "\u203c" -> not recognized by ByteLevel ``` This PR adds the logic that in decoding, when the token is not recognized by ByteLevel, the original token will be returned. Then ``` "hello" -> "hello" (recognized by ByteLevel) "Ġthere" -> " there" (recognized by ByteLevel) "\n" -> "\n" (not recognized by ByteLevel) "\u203c" -> "\u203c" (not recognized by ByteLevel) ``` This behavior is align to huggingface tokenizers.
0 commit comments