SentencePiece is an open-source, language-independent subword tokeniser introduced by Kudo and Richardson (2018) at Google. It powers the tokenisers of T5, mT5, ALBERT, XLNet, Llama 1 and 2 (Llama 3 switched to a tiktoken-based tokeniser), and many multilingual models.
Design goals. Earlier subword tokenisers like BPE and WordPiece assume the text has been pre-tokenised into words by whitespace. This assumption is fragile for languages without word boundaries (Japanese, Chinese, Thai) and complicates round-tripping through the tokeniser. SentencePiece treats the input as a raw stream of Unicode characters or bytes, including spaces, and learns a vocabulary directly. Whitespace is explicitly encoded with the meta-symbol ▁ (U+2581) so that detokenisation is exactly the inverse of tokenisation.
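As a concrete illustration of the ▁ convention, here is a minimal sketch using the `sentencepiece` Python package; the model file name `spm.model` is a placeholder for any trained model, and the exact pieces depend on its vocabulary.

```python
import sentencepiece as spm

# Hypothetical model file; any trained SentencePiece model will do.
sp = spm.SentencePieceProcessor(model_file="spm.model")

pieces = sp.encode("Hello world", out_type=str)
print(pieces)             # e.g. ['▁Hello', '▁world'] — spaces carried as ▁
print(sp.decode(pieces))  # 'Hello world' — lossless round trip
```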
Two segmentation modes.
BPE mode. Identical to byte-pair encoding but applied to the entire stream including whitespace. The merge table is learned by iteratively combining the most frequent adjacent symbol pair.
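The sketch below is a toy version of what BPE-mode training does, applied to a raw character stream with spaces replaced by ▁. It is a simplification for illustration, not the library's implementation.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    # Replace spaces with the meta-symbol so whitespace is learned like any other symbol.
    symbols = list(text.replace(" ", "\u2581"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Greedily merge every occurrence of the most frequent adjacent pair.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

print(train_bpe("low lower lowest", 5))
```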
Unigram language model mode (Kudo, 2018). Posits a unigram distribution $p(t)$ over a candidate vocabulary $V$ and chooses the segmentation $\mathbf{t} = (t_1, \ldots, t_n)$ of input $x$ that maximises
$$p(\mathbf{t}) = \prod_{i=1}^{n} p(t_i)$$
subject to $\mathrm{concat}(\mathbf{t}) = x$. The vocabulary is initialised large (a superset of likely subwords) and pruned with the EM algorithm: in the E-step compute expected token counts via dynamic programming over all valid segmentations; in the M-step re-estimate $p(t)$; periodically remove the tokens whose removal causes the smallest drop in marginal likelihood, until $|V|$ reaches the target.
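The maximising segmentation can be found with a Viterbi-style dynamic programme over all vocabulary pieces matching the input. The sketch below assumes a small hypothetical vocabulary with log-probabilities; it illustrates the decoding step only, not the EM training loop or the library's implementation.

```python
import math

def viterbi_segment(x: str, logp: dict[str, float]) -> list[str]:
    """Best segmentation of x into vocabulary pieces under a unigram LM.
    best[i] holds the max log-probability of any segmentation of x[:i]."""
    n = len(x)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)                    # back[i]: start index of the piece ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - 16), i):  # cap candidate piece length at 16 characters
            piece = x[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Recover the pieces by walking the backpointers from the end of the string.
    pieces, i = [], n
    while i > 0:
        pieces.append(x[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Tiny hypothetical vocabulary with log-probabilities.
vocab = {"\u2581new": -2.0, "\u2581": -3.0, "new": -3.5,
         "york": -2.5, "\u2581york": -2.6}
print(viterbi_segment("\u2581new\u2581york", vocab))  # ['▁new', '▁york']
```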
Subword regularisation. A unique feature of the unigram mode is that it can sample alternative segmentations of the same input while the downstream model is being trained. Given input $x$, sample $\mathbf{t} \sim p(\mathbf{t} \mid x)^{1/\tau}$ where $\tau$ is a temperature. This data augmentation improves robustness on noisy text and low-resource languages.
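In the `sentencepiece` Python package, sampled segmentations are exposed through the sampling arguments of `encode` (here `alpha` plays the role of the inverse temperature, and `nbest_size=-1` samples from the full lattice); the model path below is a placeholder.

```python
import sentencepiece as spm

# Assumes "spm.model" is an existing unigram-mode model file (hypothetical path).
sp = spm.SentencePieceProcessor(model_file="spm.model")

# Deterministic (Viterbi) segmentation:
print(sp.encode("New York", out_type=str))

# Sampled segmentations for subword regularisation:
for _ in range(3):
    print(sp.encode("New York", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```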
Detokenisation. Because whitespace is encoded as ▁, the detokeniser is simply
$$\text{detok}(\mathbf{t}) = \text{replace}(\text{concat}(\mathbf{t}),\; \text{`▁'},\; \text{` '}).$$
No language-specific rules are required.
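In code, the detokeniser is the one-liner sketched below. Stripping the leading space reflects the common convention of prefixing a dummy space before encoding (the trainer's default behaviour); this is an illustrative sketch, not the library call.

```python
def detokenise(tokens: list[str]) -> str:
    # Concatenate the pieces, then map the ▁ meta-symbol back to spaces.
    text = "".join(tokens).replace("\u2581", " ")
    # Drop the leading space introduced by the dummy-prefix convention.
    return text.lstrip(" ")

print(detokenise(["\u2581Hello", "\u2581world", "!"]))  # -> "Hello world!"
```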
Practical features.
- Trained from raw text files, no pre-tokenisation step.
- Reproducible: the same model file maps text to identical token IDs forever.
- Configurable character coverage (e.g. $0.9995$ for languages with large character sets such as Japanese or Chinese, $1.0$ for languages with small alphabets such as English).
- Byte fallback: characters outside the vocabulary are decomposed into UTF-8 bytes and encoded as `<0xFF>`-style tokens, guaranteeing total coverage as in byte-level BPE (see the training sketch after this list).
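A hedged training sketch tying these options together; the corpus path and parameter values are placeholders, not recommendations.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # raw text file, no pre-tokenisation needed
    model_prefix="spm",          # writes spm.model and spm.vocab
    vocab_size=32000,
    model_type="unigram",        # or "bpe"
    character_coverage=0.9995,   # lower for corpora with large character sets
    byte_fallback=True,          # encode out-of-vocabulary bytes as <0x..> tokens
)
```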
SentencePiece's combination of language-agnostic input handling and unigram-LM segmentation has made it the default choice for new multilingual and instruction-tuned models.
Related terms: Byte-Pair Encoding, WordPiece, Language Model
Discussed in:
- Chapter 6: ML Fundamentals, Tokenisation