How to split text by tokens
This guide assumes familiarity with text splitters.
Language models have a token limit, which you should not exceed. When you split text into chunks, it is therefore a good idea to count the number of tokens in each chunk. There are many tokenizers, and they produce different counts for the same text, so count tokens with the same tokenizer that the language model uses.
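To illustrate why the choice of tokenizer matters, here is a minimal sketch that counts tokens for the same text with two different tokenizers. Both tokenizers (wordTokenizer and charTokenizer) and both helper functions are hypothetical stand-ins invented for this example; real models use learned vocabularies such as BPE, so treat the numbers as illustrative only.

```typescript
// Hypothetical stand-in tokenizers, for illustration only.
// Real models use learned vocabularies (e.g. BPE), not these rules.
type Tokenizer = (text: string) => number[];

// Treats each whitespace-separated word as one token (IDs are just positions).
const wordTokenizer: Tokenizer = (text) =>
  text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((_word, index) => index);

// Treats each Unicode code point as one token.
const charTokenizer: Tokenizer = (text) =>
  Array.from(text, (ch) => ch.codePointAt(0) ?? 0);

// Counts tokens with whichever tokenizer is supplied.
function countTokens(text: string, tokenize: Tokenizer): number {
  return tokenize(text).length;
}

// Checks a text against a token limit using the supplied tokenizer.
function fitsWithinLimit(
  text: string,
  tokenize: Tokenizer,
  limit: number
): boolean {
  return countTokens(text, tokenize) <= limit;
}

const sample = "Madam Speaker, Madam Vice President";
console.log(countTokens(sample, wordTokenizer)); // 5
console.log(countTokens(sample, charTokenizer)); // 35
console.log(fitsWithinLimit(sample, wordTokenizer, 10)); // true
console.log(fitsWithinLimit(sample, charTokenizer, 10)); // false
```

The same text fits a 10-token limit under one tokenizer and exceeds it under the other, which is why the count should come from the tokenizer the model actually uses.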
js-tiktoken
js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.
We can use js-tiktoken to estimate the number of tokens used. Because it is tuned to OpenAI models, its counts are most accurate for those models.
- How the text is split: at token boundaries, after the text is encoded into tokens.
- How the chunk size is measured: by the js-tiktoken tokenizer.
You can use the TokenTextSplitter like this:
import { TokenTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

// Load an example document (readFileSync is synchronous, so no await is needed)
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();

// Split into chunks of at most 10 tokens, with no overlap between chunks
const textSplitter = new TokenTextSplitter({
  chunkSize: 10,
  chunkOverlap: 0,
});

const texts = await textSplitter.splitText(stateOfTheUnion);

// Print the first chunk
console.log(texts[0]);
Madam Speaker, Madam Vice President, our
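Conceptually, splitting by tokens means encoding the text, slicing the token sequence into windows of chunkSize tokens (stepping back by chunkOverlap tokens between windows), and decoding each window. The sketch below shows that idea only; it is not the library's implementation. The splitByTokens function is a hypothetical name, and it assumes a whitespace word stands in for a token so that the example runs without any dependencies.

```typescript
// Illustrative sketch of token-window splitting.
// Assumption: one whitespace-separated word stands in for one token.
function splitByTokens(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in the range [0, chunkSize)");
  }

  // "Encode": turn the text into a token sequence.
  const tokens = text.split(/\s+/).filter((word) => word.length > 0);

  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < tokens.length; start += step) {
    // "Decode": turn each window of tokens back into text.
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end of the token sequence.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

const demoChunks = splitByTokens(
  "Madam Speaker, Madam Vice President, our",
  4,
  1
);
console.log(demoChunks);
// [ 'Madam Speaker, Madam Vice', 'Vice President, our' ]
```

With a chunkOverlap of 1, the last token of the first chunk ("Vice") is repeated at the start of the second chunk.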
Note: Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters.
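The following sketch reproduces this failure mode in isolation and shows one possible way to avoid it. It makes a simplifying assumption: each UTF-8 byte is treated as one token, which is a rough model of how byte-level tokenizers can spread a single character across several tokens. The functions naiveSplit and safeSplit are hypothetical names for this example and are not part of any library.

```typescript
const encoder = new TextEncoder();
// By default, TextDecoder replaces invalid byte sequences with U+FFFD.
const decoder = new TextDecoder("utf-8");

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
const isContinuationByte = (byte: number | undefined): boolean =>
  byte !== undefined && (byte & 0xc0) === 0x80;

// Assumption: one UTF-8 byte stands in for one token.
// Splits at fixed positions, even if that cuts a character in half.
function naiveSplit(text: string, chunkSize: number): string[] {
  const bytes = encoder.encode(text);
  const chunks: string[] = [];
  for (let start = 0; start < bytes.length; start += chunkSize) {
    chunks.push(decoder.decode(bytes.slice(start, start + chunkSize)));
  }
  return chunks;
}

// Moves each boundary so that it never falls inside a character.
function safeSplit(text: string, chunkSize: number): string[] {
  const bytes = encoder.encode(text);
  const chunks: string[] = [];
  let start = 0;
  while (start < bytes.length) {
    let end = Math.min(start + chunkSize, bytes.length);
    // Back up while the next chunk would begin with a continuation byte.
    while (end > start && end < bytes.length && isContinuationByte(bytes[end])) {
      end -= 1;
    }
    // If chunkSize is smaller than one character, take the whole character.
    if (end === start) {
      end = start + 1;
      while (end < bytes.length && isContinuationByte(bytes[end])) {
        end += 1;
      }
    }
    chunks.push(decoder.decode(bytes.slice(start, end)));
    start = end;
  }
  return chunks;
}

const naive = naiveSplit("你好", 2);
console.log(naive.some((chunk) => chunk.includes("\uFFFD"))); // true
console.log(safeSplit("你好", 2)); // [ '你', '好' ]
```

Note the trade-off in safeSplit: a chunk can exceed chunkSize when a single character needs more tokens than the limit allows. With a real tokenizer, a comparable check would be to look for U+FFFD in each decoded chunk and adjust the boundary when it appears.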
Next steps
You've now learned a method for splitting text based on token count.
Next, check out the full tutorial on retrieval-augmented generation.