Glossary

Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a subword tokenisation algorithm originally proposed for data compression by Gage (1994) and adapted to neural machine translation by Sennrich, Haddow & Birch (2016). It is now the default tokeniser for GPT, RoBERTa, Llama, and many other large language models.

Motivation. Word-level tokenisation suffers from a fixed vocabulary that cannot represent rare or out-of-vocabulary words; character-level tokenisation produces overly long sequences. BPE compromises by representing frequent words as single tokens and rare words as concatenations of subword units, so any string is representable by some sequence of vocabulary tokens.

Algorithm. Given a training corpus, BPE proceeds:

  1. Initialise the vocabulary as the set of individual characters (or bytes).
  2. Pre-tokenise text into words and represent each word as a sequence of symbols, e.g. low becomes $(l, o, w)$.
  3. Count the frequency of every adjacent symbol pair across the corpus.
  4. Find the pair $(a, b)$ with the highest frequency and merge it into a new symbol $ab$, adding $ab$ to the vocabulary.
  5. Replace every occurrence of $(a, b)$ in the corpus with $ab$.
  6. Repeat steps 3–5 until the vocabulary reaches a target size $V$ (typically $30{,}000$ to $50{,}000$).

In pseudocode:

vocab = set(chars(corpus))
while |vocab| < V:
    pairs = count_adjacent_pairs(corpus)
    (a, b) = argmax(pairs)
    corpus = merge_pair(corpus, a, b)   # replace every (a, b) with ab
    vocab.add(a + b)

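For concreteness, here is a minimal runnable Python sketch of the same loop (the names train_bpe and merge_word are illustrative, not taken from any particular library):

from collections import Counter

def merge_word(symbols, a, b):
    # Replace every adjacent occurrence of (a, b) in one word with the merged symbol ab.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, V):
    # words: pre-tokenised training words; each becomes a list of single-character symbols.
    corpus = [list(w) for w in words]
    vocab = {s for w in corpus for s in w}
    merges = []                              # the ordered merge table
    while len(vocab) < V:
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)     # most frequent adjacent pair (ties broken arbitrarily)
        corpus = [merge_word(w, a, b) for w in corpus]
        merges.append((a, b))
        vocab.add(a + b)
    return vocab, merges

A production implementation would weight each distinct word by its corpus frequency rather than storing repeated copies, but the logic is the same.
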
Encoding new text. At inference, the same merge operations are applied in the order they were learned. The result is deterministic given the merge table.
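
A matching sketch of the encoder, reusing merge_word and the merges list from the training sketch above:

def encode(word, merges):
    # Apply the learned merges to a new word, in the order they were learned.
    symbols = list(word)
    for a, b in merges:
        symbols = merge_word(symbols, a, b)
    return symbols

Real encoders skip merges whose pair does not occur in the word (for example via a priority queue over candidate pairs) rather than scanning the whole table, but the resulting tokens are the same.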

Byte-level BPE. GPT-2 introduced byte-level BPE: instead of starting from Unicode characters, it starts from the 256 raw bytes. This guarantees that any string can be encoded without an unknown token while keeping the base vocabulary small. Later GPT models and Llama 3 follow this convention (earlier Llama releases use a SentencePiece BPE tokeniser with byte fallback instead).
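
A small illustration of the byte-level starting point (GPT-2 additionally maps each byte to a printable Unicode character before learning merges, which is omitted here):

text = "naïve"
base_symbols = list(text.encode("utf-8"))
print(base_symbols)   # [110, 97, 195, 175, 118, 101]: six base symbols for five characters
# Every possible input maps onto the 256 byte values, so no <unk> token is ever needed.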

Properties.

  • Vocabulary size is a hyperparameter: a larger vocabulary shortens token sequences but enlarges the embedding and output layers, so it is chosen as a memory–sequence-length trade-off.
  • Frequent words such as the end up as single tokens; rare words decompose into several subwords, e.g. unhappiness $\to$ un, happiness or finer pieces such as un, happi, ness, depending on the learned merges (see the toy example after this list).
  • The merge table is roughly $O(V)$ in size, and with a priority queue the encoder runs in roughly $O(n \log n)$ per word of length $n$.
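
A toy end-to-end run using the train_bpe and encode sketches above (the corpus and vocabulary size are arbitrary, and the exact splits depend on how ties are broken):

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 3 + ["wider"]
vocab, merges = train_bpe(words, V=20)
print(encode("low", merges))      # frequent word: typically a single token
print(encode("lowest", merges))   # unseen word: still encodable from learned subwords
print(encode("widest", merges))   # rare word: splits into more, smaller pieces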

Limitations. BPE is purely frequency-driven and ignores linguistic structure: it can split morphologically meaningful units inconsistently, and the number of tokens needed per word varies widely across languages, which contributes to the well-documented English bias in token efficiency of multilingual models.

Related terms: WordPiece, SentencePiece, Language Model, GPT
