12.5 Tokenisation
We have so far spoken loosely of "words". A real text-processing pipeline must actually decide what counts as a token. The naive choice, splitting on whitespace and applying some normalisation, fails for several reasons. It produces a vocabulary of unbounded size as new corpora are added (every typo and proper name becomes a new word). It cannot handle out-of-vocabulary words at test time. It treats morphologically related forms as unrelated. And in many languages (Chinese, Japanese, Thai) there is no whitespace between words at all.
Modern systems use subword tokenisation: split text into pieces shorter than words but longer than characters, with a fixed vocabulary size (typically 32k to 200k pieces). Three algorithms dominate.
12.5.1 Byte-pair encoding (BPE)
Byte-pair encoding was originally a data-compression algorithm (Gage 1994) and was adapted by Sennrich et al. (2016) for neural machine translation. The training procedure is (a code sketch of the full loop follows the list):
1. Initialise the vocabulary with all characters appearing in the corpus.
2. Represent each word as a sequence of characters, followed by an end-of-word marker (commonly written </w> or a trailing space).
3. Count all adjacent symbol pairs in the corpus.
4. Merge the most frequent pair into a new symbol.
5. Add the new symbol to the vocabulary; rewrite the corpus accordingly.
6. Repeat from step 3 until the vocabulary reaches the desired size.
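The whole procedure is short enough to write out. Here is a minimal sketch, assuming the corpus is supplied as a dictionary mapping each word to its frequency (the function names are ours, not from any particular library):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Step 3: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

def train_bpe(word_freqs, num_merges, eow="</w>"):
    """Learn `num_merges` BPE merges from a {word: frequency} dictionary."""
    # Steps 1-2: character vocabulary; each word becomes a character sequence plus </w>.
    corpus = {tuple(word) + (eow,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):                 # step 6: repeat until the merge budget is used
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # step 4: most frequent pair
        merges.append(best)                     # step 5: extend vocabulary, rewrite corpus
        corpus = merge_pair(corpus, best)
    return merges
```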
Worked example. Suppose the corpus consists of the words "low", "lowest", "newer", "wider", with frequencies $5$, $2$, $6$, $3$ respectively. Initialise with the character vocabulary $\{\text{l}, \text{o}, \text{w}, \text{e}, \text{s}, \text{t}, \text{n}, \text{r}, \text{i}, \text{d}, \text{</w>}\}$. The corpus, expanded character-by-character with </w>, has the following adjacent pair counts:
- (l, o): $5 + 2 = 7$
- (o, w): $5 + 2 = 7$
- (w, </w>): $5$
- (e, s): $2$, (s, t): $2$, (t, </w>): $2$
- (n, e): $6$, (e, w): $6$, (w, e): $2 + 6 = 8$, (e, r): $6 + 3 = 9$, (r, </w>): $6 + 3 = 9$
- (w, i): $3$, (i, d): $3$, (d, e): $3$
The most frequent pair is (e, r) with count 9 (tied with (r, </w>); ties are broken arbitrarily). We merge it: the new symbol er is added to the vocabulary, and every adjacent (e, r) becomes er, so the corpus becomes "l o w </w>", "l o w e s t </w>", "n e w er </w>", "w i d er </w>" (with frequencies as before). We re-count pairs and merge the next most frequent. After a few iterations a typical run produces merges like (e, r), (er, </w>), (l, o), (lo, w), (n, e), (ne, w); the system is discovering common subword patterns automatically.
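Running the sketch above on this toy corpus reproduces exactly that sequence of merges; since (e, r) and (r, </w>) are tied at 9, the first merge depends on how an implementation breaks ties (here, the first-seen pair wins):

```python
word_freqs = {"low": 5, "lowest": 2, "newer": 6, "wider": 3}
print(train_bpe(word_freqs, num_merges=6))
# [('e', 'r'), ('er', '</w>'), ('l', 'o'), ('lo', 'w'), ('n', 'e'), ('ne', 'w')]
```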
At test time, each word is segmented by replaying the learned merges in the order they were learned. Out-of-vocabulary words decompose into known subwords and, in the worst case, individual characters. A vocabulary of around 32k merges suffices for most European-language tasks; multilingual and code-trained models use larger vocabularies. GPT-2 used a 50k BPE vocabulary; GPT-4 expanded to roughly 100k tokens (cl100k_base); Llama 3 uses 128k, Qwen 2.5 uses 152k, and GPT-4o's o200k_base reaches around 200k.
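These vocabularies can be inspected directly. As a quick check, assuming the tiktoken package is installed, the two OpenAI encodings named above can be loaded and compared (the sample sentence is arbitrary):

```python
import tiktoken

sentence = "Subword tokenisation handles out-of-vocabulary words gracefully."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(sentence)
    print(name, "vocab size:", enc.n_vocab, "tokens:", len(ids))
    print("  pieces:", [enc.decode([t]) for t in ids])
```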
12.5.2 WordPiece
WordPiece (Schuster and Nakajima 2012; popularised by Devlin et al. 2019 in BERT) is similar in spirit to BPE but chooses each merge to maximise the likelihood of the training data under a unigram language model over subwords. Concretely, instead of merging the most frequent pair, WordPiece merges the pair $(x, y)$ that maximises
$$\text{score}(x, y) = \frac{\text{count}(xy)}{\text{count}(x) \cdot \text{count}(y)},$$
which favours pairs whose joint frequency is high relative to what would be expected if the symbols were independent. WordPiece typically uses ## to mark continuation pieces: "tokenisation" might be split as token, ##isation. Empirically, BPE and WordPiece produce very similar segmentations.
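A toy comparison makes the difference concrete; the counts below are invented purely for illustration:

```python
def wordpiece_score(count_xy, count_x, count_y):
    """WordPiece merge score: joint count normalised by the counts of the parts."""
    return count_xy / (count_x * count_y)

# ("t", "h"): frequent together, but both symbols are common everywhere.
# ("q", "u"): rarer together, yet almost every "q" is followed by "u".
print(wordpiece_score(500, 4000, 5000))  # 2.5e-05
print(wordpiece_score(80, 90, 3000))     # ~2.96e-04
# BPE (raw pair frequency) would merge ("t", "h") first; WordPiece prefers ("q", "u").
```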
12.5.3 Unigram and SentencePiece
The Unigram language model (Kudo 2018) takes a different approach: start with a large initial vocabulary (often via BPE) and iteratively remove tokens to maximise corpus likelihood. It maintains a probabilistic segmentation rather than a deterministic one: a word can be tokenised in multiple ways, and at test time the most probable segmentation is chosen via Viterbi.
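The test-time step can be written as a standard Viterbi pass over all ways of splitting a word. Here is a minimal sketch; the toy log-probabilities are invented, and real implementations also handle characters missing from the vocabulary:

```python
import math

def viterbi_segment(word, logp):
    """Most probable segmentation of `word` under unigram piece log-probabilities `logp`."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)             # back[i]: start index of the piece ending at i
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Walk the backpointers to recover the best segmentation.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces))

# Toy vocabulary with invented log-probabilities:
logp = {"token": -2.0, "is": -3.0, "ation": -3.5, "tokenisation": -10.0}
print(viterbi_segment("tokenisation", logp))  # ['token', 'is', 'ation'] under these scores
```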
SentencePiece (Kudo and Richardson 2018) is a software package that implements both BPE and Unigram, with the additional design choice of treating the input as a raw stream of Unicode characters (with an optional byte fallback) and no language-specific preprocessing. In particular, it represents the space character explicitly (▁) and treats word boundaries as part of the tokenisation, making it suitable for whitespace-free languages. Many widely used models (T5, mT5, ALBERT, XLNet) use SentencePiece.
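As a usage sketch, training and applying a Unigram model with the SentencePiece Python bindings looks roughly like this (the corpus path and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a Unigram model on a plain-text corpus, one sentence per line
# ("corpus.txt" is a placeholder path; vocab_size is task-dependent).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="uni8k",
    vocab_size=8000,
    model_type="unigram",   # "bpe" selects the BPE trainer instead
)

sp = spm.SentencePieceProcessor(model_file="uni8k.model")
# out_type=str returns the pieces themselves; spaces appear as the ▁ marker.
print(sp.encode("Subword tokenisation without whitespace assumptions.", out_type=str))
```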
12.5.4 Why subword tokenisation matters
The choice of tokeniser is consequential. It determines the effective vocabulary, the average sequence length per document (which dominates compute), the model's behaviour on rare words, and its multilingual coverage. Errors in tokenisation propagate downstream: a model that sees "1 9 8 9" as four separate tokens has no reason to know that this is a date, and indeed early GPT models notoriously struggled with arithmetic for exactly this reason. Modern tokenisers carve out digits, punctuation, and contractions in deliberate ways to mitigate these effects.
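How a particular tokeniser treats digits, punctuation, and contractions is easy to inspect directly rather than guess; for instance, with tiktoken (the strings are arbitrary, and the resulting splits depend entirely on the vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["1989", " 1989", "1,000,000", "don't"]:
    pieces = [enc.decode([t]) for t in enc.encode(s)]
    print(repr(s), "->", pieces)
```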
A second, more recent concern: tokenisation is the primary source of unfairness across languages. Vocabularies trained predominantly on English produce many more tokens per word in low-resource and morphologically rich languages, which both inflates inference cost and reduces effective context length for those languages. This "tokeniser tax" can be substantial: encoding the same content in Burmese rather than English with the same multilingual model can cost a factor of five or more in tokens. Recent multilingual models (Llama 3, Gemma) have moved towards larger vocabularies (128k+) chosen to balance token budgets across the world's major languages.
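The size of the effect for a given vocabulary and language pair can be measured rather than assumed; a small sketch (the helper name is ours, and the parallel sentences must be supplied by the reader):

```python
import tiktoken

def token_ratio(enc, text_a, text_b):
    """Token-count ratio for two parallel (same-meaning) texts under one encoding."""
    return len(enc.encode(text_a)) / len(enc.encode(text_b))

enc = tiktoken.get_encoding("o200k_base")
# Supply parallel sentences, e.g. a Burmese translation of an English sentence:
# print(token_ratio(enc, burmese_sentence, english_sentence))
```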