Tokenisation is the process of splitting text into discrete units—tokens—that a language model can process. These tokens might be words, subwords, characters, or bytes. The choice of tokenisation scheme affects vocabulary size, sequence length, and how well the model handles rare words, misspellings, and multiple languages.
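The trade-off between these granularities can be seen directly. The following toy sketch (not any production tokeniser) splits the same string at word, character, and byte level:

```python
# A toy illustration: the same string split at three granularities.
text = "unbelievable"

word_tokens = text.split()                 # word level: 1 token
char_tokens = list(text)                   # character level: 12 tokens
byte_tokens = list(text.encode("utf-8"))   # byte level: 12 tokens (pure ASCII)

print(word_tokens)   # one opaque unit; a misspelling would be out-of-vocabulary
print(char_tokens)   # tiny vocabulary, but 12x the sequence length
print(byte_tokens)   # integer byte values 0-255; any text is representable
```

Subword schemes sit between the first two rows: "unbelievable" might become pieces like "un", "believ", "able", keeping both the vocabulary and the sequence length moderate.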
Word-level tokenisation is intuitive but has problems: vocabularies must be large to cover all words, rare words become out-of-vocabulary, and morphological variants like "run/running/ran" are treated as unrelated. Character-level tokenisation avoids these problems but produces very long sequences. Subword tokenisation offers the best of both: common words become single tokens while rare ones are broken into subword pieces. Byte-Pair Encoding (BPE), used by GPT-2 and GPT-3, iteratively merges the most frequent adjacent token pairs. WordPiece, used by BERT, follows a similar merging procedure but selects merges that maximise the likelihood of the training data rather than raw pair frequency. SentencePiece operates directly on raw text without pre-tokenisation, handling any language uniformly.
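The BPE merge loop described above can be sketched in a few lines. This is a simplified illustration over a toy corpus (the word frequencies are invented for the example, and the string-based merge ignores edge cases that a real implementation must handle):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(corpus, pair):
    """Rewrite every word, fusing the chosen pair into a single symbol.

    Simplification: a plain substring replace is enough for this toy
    corpus, but real implementations match whole symbols only.
    """
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: each word is a space-separated symbol sequence -> frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):  # learn 4 merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # ('e', 's') is the most frequent pair here, so it is merged first
```

Each learned merge becomes a vocabulary entry; at inference time the same merges are replayed in order to tokenise new text, so frequent strings like "est" end up as single tokens.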
Modern LLMs typically use vocabularies of 30,000 to 200,000 subword tokens. Byte-level BPE (used in GPT-4, LLaMA) operates on UTF-8 bytes, guaranteeing that any text can be represented. Tokenisation choices have subtle but significant consequences: they affect multilingual performance, handling of code, numerical reasoning, and even the quality of generated text. The "tokeniser tax" can make some languages two or three times more expensive to process than others. Understanding tokenisation is essential for anyone working with modern language models.
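Both points, the coverage guarantee of byte-level schemes and the tokeniser tax, follow from how UTF-8 works, as this small sketch shows (the example strings are arbitrary):

```python
# Byte-level tokenisers build on the 256 byte values, so any Unicode text
# round-trips losslessly -- nothing is ever out-of-vocabulary. But UTF-8
# spends more bytes per character on non-Latin scripts, one source of the
# "tokeniser tax".
english = "hello"   # 5 chars, 1 byte each
greek = "γειά"      # 4 chars, 2 bytes each
emoji = "🙂"         # 1 char, 4 bytes

for text in (english, greek, emoji):
    encoded = text.encode("utf-8")
    assert encoded.decode("utf-8") == text  # lossless round trip
    print(f"{text!r}: {len(text)} chars -> {len(encoded)} bytes")
```

A subword vocabulary trained mostly on English compounds this effect: English text merges into long tokens while other scripts fall back toward raw bytes, inflating sequence length and therefore cost.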
Related terms: Language Model, Large Language Model
Discussed in:
- Chapter 12: Sequence Models — Language Models
Also defined in: Textbook of AI