Perplexity is the standard intrinsic evaluation metric for language models. For a model $p_\theta$ and a held-out test set $x_{1:T}$:
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^T \log p_\theta(x_t | x_{\lt t})\right) = \prod_{t=1}^T p_\theta(x_t | x_{\lt t})^{-1/T}$$
Equivalently: perplexity is the exponential of cross-entropy (in nats per token), or the geometric mean of the inverse predicted probabilities. Lower is better.
Intuition: a model with perplexity $K$ is, on average, "as confused" as a model that uniformly distributes probability over $K$ alternatives at each prediction. A perfect model has perplexity 1 (always assigns probability 1 to the correct token). A model that always predicts uniform over a vocabulary of $V$ has perplexity $V$.
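The definition and the two boundary cases above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names and the per-token probabilities are illustrative assumptions, and the log-probabilities are taken to be natural logs of the probability the model assigned to each correct token.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities log p(x_t | x_<t)."""
    if not log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(log_probs) / len(log_probs)  # cross-entropy in nats per token
    return math.exp(avg_nll)

def perplexity_geometric(probs):
    """Same quantity, written as the geometric mean of inverse probabilities."""
    T = len(probs)
    return math.prod(p ** (-1.0 / T) for p in probs)

# Hypothetical probabilities assigned to the correct token at each step.
probs = [0.5, 0.25, 0.125, 0.5]
ppl = perplexity([math.log(p) for p in probs])

# Uniform model over a vocabulary of V tokens: every token gets 1/V.
V = 50_000
uniform_ppl = perplexity([math.log(1.0 / V)] * 10)

# Perfect model: probability 1 on every correct token.
perfect_ppl = perplexity([math.log(1.0)] * 10)

print(ppl, perplexity_geometric(probs), uniform_ppl, perfect_ppl)
```

Working in log space, as `perplexity` does, avoids the underflow that the product form runs into on long sequences.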
Modern values (a rough guide):
- Random uniform over 50k-vocab: 50,000.
- Trigram model on news text: ~150.
- Original Transformer (2017) on WikiText-103: ~30.
- GPT-2 small (2019) on WikiText-103: ~37 zero-shot (the 345M-parameter model reaches ~26).
- GPT-3 175B on filtered web: ~6-8.
- Frontier 2024+ models: lower still, although figures are rarely published; instruction tuning targets helpfulness rather than held-out perplexity and does not by itself lower it.
Cautions:
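Perplexity depends on tokenisation. A model with finer tokens (e.g. BPE with a smaller vocabulary) splits the same text into more tokens, each carrying less information, so its per-token perplexity is typically lower even when its predictions are no better. Models with different tokenisers should therefore be compared at the bits-per-character level. Always specify the tokeniser.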
Perplexity measures predictive log-likelihood on the test set; it does not measure usefulness, helpfulness, factuality, reasoning, or many other capabilities relevant for downstream tasks. A model with lower perplexity on web text is not necessarily better at answering questions or writing code.
Perplexity is sensitive to domain mismatch: a model trained on news will have high perplexity on Twitter or on code, even if it's a "good" general model.
Bits-per-character (BPC) and bits-per-byte (BPB) are tokeniser-invariant alternatives, particularly useful for comparing across vocabulary choices:
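$$\mathrm{BPC} = -\frac{1}{C} \sum_{t=1}^T \log_2 p_\theta(x_t | x_{\lt t})$$
where the sum runs over the $T$ tokens of the test set and $C$ is its total number of characters. BPB is defined in the same way, with $C$ replaced by the total number of bytes.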
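To illustrate why normalising by characters or bytes removes the dependence on the tokeniser, here is a small sketch. Everything in it is an assumption made for illustration: the helper names, the example string, and the two hypothetical tokenisations, which are constructed to assign the same total likelihood (12 bits) to the text while splitting it into different numbers of tokens.

```python
import math

LN2 = math.log(2)

def perplexity(log_probs):
    """Per-token perplexity from natural-log token probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def bits_per_unit(log_probs, n_units):
    """Total negative log2-likelihood divided by a count of characters or bytes."""
    total_bits = -sum(log_probs) / LN2
    return total_bits / n_units

text = "na\u00efve caf\u00e9"          # 10 characters, 12 bytes in UTF-8
n_chars = len(text)                    # Python counts Unicode code points
n_bytes = len(text.encode("utf-8"))

# Hypothetical tokenisations of the same text, same total of 12 bits.
coarse = [-6.0 * LN2] * 2              # 2 tokens at 6 bits each
fine = [-2.4 * LN2] * 5                # 5 tokens at 2.4 bits each

ppl_coarse = perplexity(coarse)
ppl_fine = perplexity(fine)
bpc_coarse = bits_per_unit(coarse, n_chars)
bpc_fine = bits_per_unit(fine, n_chars)
bpb = bits_per_unit(coarse, n_bytes)

print(ppl_coarse, ppl_fine, bpc_coarse, bpc_fine, bpb)
```

The per-token perplexities differ by more than a factor of ten, while BPC is identical for both tokenisations. BPC and BPB coincide only for pure-ASCII text, so a reported figure should state which unit was used.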
Bits per token: $\log_2 \mathrm{PPL}$ is the average number of bits per token, useful for information-theoretic comparison. It equals bits per word only when the tokens are whole words.
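As a worked conversion, assume a hypothetical model with perplexity 20 and a tokeniser that averages 4 characters per token, so that $T/C = 1/4$ (both numbers are illustrative, not measurements):
$$\log_2 20 \approx 4.32 \ \text{bits per token}, \qquad \mathrm{BPC} = \frac{T}{C} \log_2 \mathrm{PPL} \approx \frac{4.32}{4} = 1.08$$
The second identity follows from dividing the total negative log-likelihood by $C$ rather than by $T$.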
Held-out test set requirement: perplexity must be evaluated on data the model has not seen. Because web-scale models are trained on a large fraction of publicly available text, benchmark test sets may already appear in the training data (contamination), so ensuring true held-out evaluation is increasingly difficult and is a current methodological concern.
Related terms: Cross-Entropy Loss, Language Model, Shannon Entropy
Discussed in:
- Chapter 12: Sequence Models, Sequence Models