WordPiece is a subword tokenisation algorithm introduced by Schuster and Nakajima (2012) for Japanese and Korean voice search and adopted as the tokeniser for BERT, DistilBERT, and ELECTRA. Like BPE, it builds a fixed vocabulary of subword units, but the criterion for choosing each merge is different.
Likelihood-based merge criterion. Suppose the current corpus is segmented into a sequence of tokens with empirical unigram probabilities $p(t)$. The likelihood of the corpus under this segmentation is
$$\mathcal{L} = \sum_{t \in \text{corpus}} \log p(t).$$
For a candidate pair $(a, b)$, merging into a new token $ab$ changes the likelihood by approximately
$$\Delta \mathcal{L}(a, b) \;\approx\; \log \frac{p(ab)}{p(a)\, p(b)},$$
which is the pointwise mutual information between $a$ and $b$. WordPiece selects the pair maximising this quantity at each step, then merges it and re-estimates the unigram counts. By contrast, BPE picks the pair with the highest raw co-occurrence count $c(a, b)$.
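To make the difference concrete, here is a minimal Python sketch that scores candidate pairs both ways. The counts are invented toy numbers and the `delta_loglik` helper is purely illustrative; it is not taken from the original papers.

```python
import math
from collections import Counter

# Toy counts, invented for illustration only.
unit_counts = Counter({"h": 500, "u": 400, "g": 50, "s": 60})
pair_counts = Counter({("h", "u"): 120, ("u", "g"): 45, ("g", "s"): 40})

n_units = sum(unit_counts.values())
n_pairs = sum(pair_counts.values())

def delta_loglik(pair):
    """Approximate likelihood gain of merging `pair`:
    log p(ab) - log p(a) - log p(b)."""
    a, b = pair
    return (math.log(pair_counts[pair] / n_pairs)
            - math.log(unit_counts[a] / n_units)
            - math.log(unit_counts[b] / n_units))

bpe_choice = max(pair_counts, key=pair_counts.get)   # highest raw pair count
wp_choice = max(pair_counts, key=delta_loglik)       # highest likelihood gain
print(bpe_choice, wp_choice)
# BPE picks ('h', 'u'); WordPiece picks ('g', 's'), whose parts are rare,
# so the pair is surprisingly frequent relative to p(a) p(b).
```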
Algorithm sketch.
- Initialise the vocabulary with all characters and a special ## continuation marker.
- Pre-tokenise into words; represent every word as a character sequence with ## prefixes on continuations, e.g. playing $\to$ p, ##l, ##a, ##y, ##i, ##n, ##g.
- Compute the merge that maximises $\Delta \mathcal{L}$.
- Apply the merge throughout the corpus, add the new token to the vocabulary.
- Repeat until the vocabulary reaches the target size.
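A compact training-loop sketch of these steps, assuming the corpus has already been pre-tokenised into a word-frequency dictionary. It uses the count-ratio form of the criterion (a monotone proxy for the likelihood gain above) and omits special tokens, tie-breaking, and the efficiency tricks real implementations rely on.

```python
from collections import Counter

def wordpiece_train(word_counts, vocab_size):
    """Minimal WordPiece training sketch.

    word_counts: dict mapping pre-tokenised words to corpus frequencies."""
    # Represent each word as characters, with ## on continuation pieces.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_counts}
    vocab = {piece for pieces in splits.values() for piece in pieces}

    while len(vocab) < vocab_size:
        unit_freq, pair_freq = Counter(), Counter()
        for word, pieces in splits.items():
            n = word_counts[word]
            for piece in pieces:
                unit_freq[piece] += n
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += n
        if not pair_freq:
            break
        # WordPiece criterion: count(ab) / (count(a) * count(b)),
        # i.e. prefer pairs that are frequent relative to their parts.
        best = max(pair_freq,
                   key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))
        merged = best[0] + best[1].lstrip("#")
        vocab.add(merged)
        # Apply the merge throughout the (segmented) corpus.
        for word, pieces in splits.items():
            i, new = 0, []
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    new.append(merged)
                    i += 2
                else:
                    new.append(pieces[i])
                    i += 1
            splits[word] = new
    return vocab
```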
Encoding new text. WordPiece encodes a word greedily from left to right: at each position, take the longest prefix that exists in the vocabulary, emit it, and continue from the remainder. If no prefix matches, the word is replaced by [UNK]. The continuation marker ## distinguishes inside-word pieces (##ing) from word-initial pieces (ing).
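A minimal sketch of this greedy longest-prefix-match procedure; the `wordpiece_encode` helper and the toy vocabulary are illustrative, not a reproduction of any particular library's implementation.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedily encode one pre-tokenised word left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # inside-word pieces carry the marker
            if candidate in vocab:
                piece = candidate  # longest prefix in the vocabulary
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary:
vocab = {"play", "##ing", "##ed", "p", "##l"}
print(wordpiece_encode("playing", vocab))  # ['play', '##ing']
print(wordpiece_encode("xyz", vocab))      # ['[UNK]']
```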
BERT specifics. BERT's English vocabulary contains $30{,}522$ pieces. Punctuation is split off, the input is lowercased (for bert-base-uncased), and the [CLS] and [SEP] special tokens are prepended/appended. Multilingual BERT uses $119{,}547$ pieces shared across $104$ languages.
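This behaviour can be inspected directly with the Hugging Face transformers library. The example sentence is arbitrary, and the exact subword splits depend on the 30,522-piece vocabulary, so the commented outputs are indicative rather than exact.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("Tokenisation is unavoidable!"))
# Lowercased, punctuation split off, rare words broken into ## pieces,
# e.g. something like ['token', '##isation', 'is', 'una', '##void', '##able', '!'].

enc = tok("Tokenisation is unavoidable!")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', ..., '[SEP]'] -- special tokens are added automatically.
```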
Comparison with BPE.
| Property | WordPiece | BPE |
|---|---|---|
| Selection | Maximum likelihood gain | Highest pair frequency |
| Continuation marking | ## prefix | None (space markers in some variants) |
| Implementation | Slightly more expensive (likelihood estimate) | Pure frequency counts |
| Used in | BERT family | GPT, RoBERTa, Llama |
In practice the two produce qualitatively similar vocabularies; the choice is largely historical.
Related terms: Byte-Pair Encoding, SentencePiece, BERT, Language Model
Discussed in:
- Chapter 6: ML Fundamentals, Tokenisation