WordPiece is a subword tokenisation algorithm introduced by Schuster and Nakajima (2012) for Japanese and Korean voice search and adopted as the tokeniser for BERT, DistilBERT, and ELECTRA. Like BPE, it builds a fixed vocabulary of subword units, but the criterion for choosing each merge is different.
Likelihood-based merge criterion. Suppose the current corpus is segmented into a sequence of tokens with empirical unigram probabilities $p(t)$. The likelihood of the corpus under this segmentation is
$$\mathcal{L} = \sum_{t \in \text{corpus}} \log p(t).$$
For a candidate pair $(a, b)$, merging into a new token $ab$ changes the likelihood by approximately
$$\Delta \mathcal{L}(a, b) \;\approx\; \log \frac{p(ab)}{p(a)\, p(b)},$$
which is the pointwise mutual information between $a$ and $b$. WordPiece selects the pair maximising this quantity at each step, then merges it and re-estimates the unigram counts. By contrast, BPE picks the pair with the highest raw co-occurrence count $c(a, b)$.
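To make the difference concrete, here is a minimal Python sketch that scores candidate pairs both ways. The counts are invented toy numbers and the `delta_loglik` helper is purely illustrative; it is not taken from the original papers.

```python
import math
from collections import Counter

# Toy counts, invented for illustration only.
unit_counts = Counter({"h": 500, "u": 400, "g": 50, "s": 60})
pair_counts = Counter({("h", "u"): 120, ("u", "g"): 45, ("g", "s"): 40})

n_units = sum(unit_counts.values())
n_pairs = sum(pair_counts.values())

def delta_loglik(pair):
    """Approximate likelihood gain of merging `pair`:
    log p(ab) - log p(a) - log p(b)."""
    a, b = pair
    return (math.log(pair_counts[pair] / n_pairs)
            - math.log(unit_counts[a] / n_units)
            - math.log(unit_counts[b] / n_units))

bpe_choice = max(pair_counts, key=pair_counts.get)   # highest raw pair count
wp_choice = max(pair_counts, key=delta_loglik)       # highest likelihood gain
print(bpe_choice, wp_choice)
# BPE picks ('h', 'u'); WordPiece picks ('g', 's'), whose parts are rare,
# so the pair is surprisingly frequent relative to p(a) p(b).
```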
Algorithm sketch.
- Initialise the vocabulary with all characters and a special ## continuation marker.
- Pre-tokenise into words; represent every word as a character sequence with ## prefixes on continuations, e.g. playing $\to$ p, ##l, ##a, ##y, ##i, ##n, ##g.
- Compute the merge that maximises $\Delta \mathcal{L}$.
- Apply the merge throughout the corpus, add the new token to the vocabulary.
- Repeat until the vocabulary reaches the target size.
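A compact training-loop sketch of these steps, assuming the corpus has already been pre-tokenised into a word-frequency dictionary. It uses the count-ratio form of the criterion (a monotone proxy for the likelihood gain above) and omits special tokens, tie-breaking, and the efficiency tricks real implementations rely on.

```python
from collections import Counter

def wordpiece_train(word_counts, vocab_size):
    """Minimal WordPiece training sketch.

    word_counts: dict mapping pre-tokenised words to corpus frequencies."""
    # Represent each word as characters, with ## on continuation pieces.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_counts}
    vocab = {piece for pieces in splits.values() for piece in pieces}

    while len(vocab) < vocab_size:
        unit_freq, pair_freq = Counter(), Counter()
        for word, pieces in splits.items():
            n = word_counts[word]
            for piece in pieces:
                unit_freq[piece] += n
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += n
        if not pair_freq:
            break
        # WordPiece criterion: count(ab) / (count(a) * count(b)),
        # i.e. prefer pairs that are frequent relative to their parts.
        best = max(pair_freq,
                   key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))
        merged = best[0] + best[1].lstrip("#")
        vocab.add(merged)
        # Apply the merge throughout the (segmented) corpus.
        for word, pieces in splits.items():
            i, new = 0, []
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    new.append(merged)
                    i += 2
                else:
                    new.append(pieces[i])
                    i += 1
            splits[word] = new
    return vocab
```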
Encoding new text. WordPiece encodes a word greedily from left to right: at each position, take the longest prefix that exists in the vocabulary, emit it, and continue from the remainder. If no prefix matches, the word is replaced by [UNK]. The continuation marker ## distinguishes inside-word pieces (##ing) from word-initial pieces (ing).
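A minimal sketch of this greedy longest-prefix-match procedure; the `wordpiece_encode` helper and the toy vocabulary are illustrative, not a reproduction of any particular library's implementation.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedily encode one pre-tokenised word left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # inside-word pieces carry the marker
            if candidate in vocab:
                piece = candidate  # longest prefix in the vocabulary
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary:
vocab = {"play", "##ing", "##ed", "p", "##l"}
print(wordpiece_encode("playing", vocab))  # ['play', '##ing']
print(wordpiece_encode("xyz", vocab))      # ['[UNK]']
```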
BERT specifics. BERT's English vocabulary contains $30{,}522$ pieces. Punctuation is split off, the input is lowercased (for bert-base-uncased), and the [CLS] and [SEP] special tokens are prepended/appended. Multilingual BERT uses $119{,}547$ pieces shared across $104$ languages.
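This behaviour can be inspected directly with the Hugging Face transformers library. The example sentence is arbitrary, and the exact subword splits depend on the 30,522-piece vocabulary, so the commented outputs are indicative rather than exact.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("Tokenisation is unavoidable!"))
# Lowercased, punctuation split off, rare words broken into ## pieces,
# e.g. something like ['token', '##isation', 'is', 'una', '##void', '##able', '!'].

enc = tok("Tokenisation is unavoidable!")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', ..., '[SEP]'] -- special tokens are added automatically.
```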
Comparison with BPE.
| Property | WordPiece | BPE |
|---|---|---|
| Selection | Maximum likelihood gain | Highest pair frequency |
| Continuation marking | ## prefix | None (space markers in some variants) |
| Implementation | Slightly more expensive (likelihood estimate) | Pure frequency counts |
| Used in | BERT family | GPT, RoBERTa, Llama |
In practice the two produce qualitatively similar vocabularies; the choice is largely historical.
Related terms: Byte-Pair Encoding, SentencePiece, BERT, Language Model
Discussed in:
- Chapter 6: ML Fundamentals, Tokenisation