Tokenizers#

NeMo 1.0 (Previous Release)#

In NeMo 1.0, tokenizers were configured in the tokenizer section of the YAML configuration file.
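For reference, a NeMo 1.0 tokenizer section looked roughly like the following. The field names below reflect the common megatron GPT2 BPE case and are illustrative; exact keys varied between model configs:

```yaml
tokenizer:
  library: megatron
  type: GPT2BPETokenizer
  model: null
  vocab_file: /path/to/vocab
  merge_file: /path/to/merges
```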

NeMo 2.0 (New Release)#

In NeMo 2.0, tokenizers can be initialized directly in Python. get_nmt_tokenizer is a utility function used in NeMo to instantiate many of the common tokenizers used for LLM and multimodal training. For example, the following code will construct a GPT2BPETokenizer:

```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="megatron",
    model_name="GPT2BPETokenizer",
    vocab_file="/path/to/vocab",
    merges_file="/path/to/merges",
)
```

The following will construct a SentencePiece tokenizer:

```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="sentencepiece",
    tokenizer_model="/path/to/sentencepiece/model",
)
```

The following will construct a Hugging Face tokenizer:

```python
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer

tokenizer = get_nmt_tokenizer(
    library="huggingface",
    model_name="nvidia/Minitron-4B-Base",
    use_fast=True,
)
```

Refer to the get_nmt_tokenizer code for a full list of supported arguments.
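Tokenizers constructed this way expose a common interface for converting between text and token IDs (for example, text_to_ids and ids_to_text). The round trip can be sketched with a toy whitespace tokenizer standing in for a real one; the class below is hypothetical and for illustration only, while real tokenizers come from get_nmt_tokenizer:

```python
# Toy stand-in illustrating the tokenizer round trip; not a NeMo class.
class ToyWhitespaceTokenizer:
    def __init__(self, corpus):
        # Build a fixed vocabulary from a small corpus.
        words = sorted(set(corpus.split()))
        self._word_to_id = {w: i for i, w in enumerate(words)}
        self._id_to_word = {i: w for w, i in self._word_to_id.items()}

    def text_to_ids(self, text):
        # Map each whitespace-separated token to its vocabulary ID.
        return [self._word_to_id[w] for w in text.split()]

    def ids_to_text(self, ids):
        # Invert the mapping to recover the original text.
        return " ".join(self._id_to_word[i] for i in ids)


tokenizer = ToyWhitespaceTokenizer("hello world from nemo")
ids = tokenizer.text_to_ids("hello nemo")
print(ids)                         # → [1, 2]
print(tokenizer.ids_to_text(ids))  # → hello nemo
```

Real subword tokenizers split text into pieces rather than whole words, but the encode/decode round trip works the same way.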

To set up the tokenizer using nemo_run, use the following code:

```python
import nemo_run as run

from nemo.collections.common.tokenizers import SentencePieceTokenizer
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

# Set up a SentencePiece tokenizer
tokenizer = run.Config(SentencePieceTokenizer, model_path="/path/to/tokenizer.model")

# Set up a Hugging Face tokenizer
tokenizer = run.Config(AutoTokenizer, pretrained_model_name="/path/to/tokenizer/model")
```

Refer to the SentencePieceTokenizer or AutoTokenizer code for a full list of supported arguments.
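Note that run.Config does not instantiate the tokenizer immediately; it records the class and its keyword arguments so the object can be built later when the job runs. Conceptually, it behaves like the toy deferred constructor below (a hypothetical sketch, not the nemo_run implementation):

```python
# Hypothetical sketch of deferred construction; nemo_run's run.Config is richer.
class DeferredConfig:
    def __init__(self, cls, **kwargs):
        self.cls = cls        # class to instantiate later
        self.kwargs = kwargs  # constructor arguments, recorded now

    def build(self):
        # Instantiate only when actually needed.
        return self.cls(**self.kwargs)


class FakeTokenizer:
    """Stand-in for a real tokenizer class."""

    def __init__(self, model_path):
        self.model_path = model_path


cfg = DeferredConfig(FakeTokenizer, model_path="/path/to/tokenizer.model")
tok = cfg.build()
print(tok.model_path)  # → /path/to/tokenizer.model
```

Deferring construction this way lets the configuration be serialized, inspected, or overridden before any heavyweight object is created.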

To change the tokenizer path for a model recipe, use the following code:

```python
from functools import partial

from nemo.collections import llm

recipe = partial(llm.llama3_8b)()

# Change the path for a Hugging Face tokenizer
recipe.data.tokenizer.pretrained_model_name = "/path/to/tokenizer/model"

# Change the path for a SentencePiece tokenizer
recipe.data.tokenizer.model_path = "/path/to/tokenizer.model"
```

Basic NeMo 2.0 recipes can contain predefined tokenizers. Visit this page to see an example of setting up the tokenizer in the recipe.