# Tokenizers
## NeMo 1.0 (Previous Release)
In NeMo 1.0, tokenizers were configured in the `tokenizer` section of the YAML configuration file.
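For reference, a representative sketch of such a section is shown below. The exact fields vary by model config; the field names here follow the common Megatron GPT YAML configs, and the values are placeholders:

```yaml
# Sketch of a NeMo 1.0 tokenizer section (illustrative; fields vary by config).
tokenizer:
  library: megatron
  type: GPT2BPETokenizer
  model: null
  vocab_file: /path/to/vocab
  merge_file: /path/to/merges
```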
## NeMo 2.0 (New Release)
In NeMo 2.0, tokenizers can be initialized directly in Python. `get_nmt_tokenizer` is a utility function used throughout NeMo to instantiate many of the common tokenizers used for LLM and multimodal training. For example, the following code constructs a GPT2BPETokenizer:
```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",
    merges_file="/path/to/merges",
)
```

The following code constructs a SentencePiece tokenizer:
```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)
```

The following code constructs a Hugging Face tokenizer:
```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="huggingface",
    model_name="nvidia/Minitron-4B-Base",
    use_fast=True,
)
```

Refer to the `get_nmt_tokenizer` code for a full list of supported arguments.
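Whichever library backs it, the returned tokenizer exposes NeMo's common `TokenizerSpec` interface. As a minimal sketch, reusing the Hugging Face example above (and assuming the model files are available locally or downloadable), a round trip looks like:

```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="huggingface",
    model_name="nvidia/Minitron-4B-Base",
    use_fast=True,
)

# Encode text to token IDs and decode back; these methods come from the
# shared TokenizerSpec interface that NeMo tokenizers implement.
ids = tokenizer.text_to_ids("Hello, NeMo!")
text = tokenizer.ids_to_text(ids)
print(ids, text)
```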
To set up the tokenizer using `nemo_run`, use the following code:
```python
import nemo_run as run

from nemo.collections.common.tokenizers import SentencePieceTokenizer
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

# Set up a SentencePiece tokenizer
tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")

# Set up a Hugging Face tokenizer
tokenizer = run.Config(AutoTokenizer, pretrained_model_name="/path/to/tokenizer/model")
```
Refer to the `SentencePieceTokenizer` or `AutoTokenizer` code for a full list of supported arguments.
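These `run.Config` objects are declarative: they describe how to build the tokenizer rather than building it immediately, so they can be attached to other configs. A minimal sketch, reusing the `llama3_8b` recipe pattern from the next example, that swaps in a SentencePiece tokenizer wholesale:

```python
from functools import partial

import nemo_run as run
from nemo.collections import llm
from nemo.collections.common.tokenizers import SentencePieceTokenizer

# Describe the tokenizer declaratively; nothing is instantiated yet.
tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")

# Attach it to the recipe's data module, replacing the predefined tokenizer.
recipe = partial(llm.llama3_8b)()
recipe.data.tokenizer = tokenizer
```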
To change the tokenizer path for a model recipe, use the following code:
```python
from functools import partial

from nemo.collections import llm

recipe = partial(llm.llama3_8b)()

# Change the path for a Hugging Face tokenizer
recipe.data.tokenizer.pretrained_model_name = "/path/to/tokenizer/model"

# Change the path for a SentencePiece tokenizer
recipe.data.tokenizer.model_path = "/path/to/tokenizer.model"
```

Basic NeMo 2.0 recipes can contain predefined tokenizers. Visit this page to see an example of setting up the tokenizer in the recipe.
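Since recipes ship with a tokenizer already configured, it can help to inspect it before overriding anything, so you know which attributes (e.g. `pretrained_model_name` or `model_path`) apply. A quick sketch:

```python
from functools import partial

from nemo.collections import llm

recipe = partial(llm.llama3_8b)()

# Print the predefined tokenizer config to see its type and current settings.
print(recipe.data.tokenizer)
```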