Perplexity, Glossary, Textbook of AI

Perplexity is the standard intrinsic evaluation metric for language models. For a model $p_\theta$ and a held-out test set $x_{1:T}$:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^T \log p_\theta(x_t | x_{\lt t})\right) = \prod_{t=1}^T p_\theta(x_t | x_{\lt t})^{-1/T}$$

Equivalently: perplexity is the exponential of cross-entropy (in nats per token), or the geometric mean of the inverse predicted probabilities. Lower is better.
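The two forms of the definition can be checked numerically. The following is a minimal sketch, assuming the per-token probabilities $p_\theta(x_t | x_{\lt t})$ have already been obtained from some model; the function names and the example probabilities are illustrative, not from any particular library.

```python
import math

def perplexity(token_probs):
    """Perplexity as exp of the mean negative log-likelihood (nats per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Cross-entropy in nats per token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def perplexity_geometric(token_probs):
    """Same quantity, written as the geometric mean of inverse probabilities."""
    T = len(token_probs)
    return math.prod(p ** (-1.0 / T) for p in token_probs)

# Hypothetical probabilities a model assigned to four observed tokens.
probs = [0.5, 0.25, 0.125, 0.5]
ppl_a = perplexity(probs)
ppl_b = perplexity_geometric(probs)
print(ppl_a, ppl_b)
```

Both calls should agree up to floating-point error. In practice the log-probability form is preferred, since multiplying many small probabilities directly can underflow on long sequences.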

Intuition: a model with perplexity $K$ is, on average, "as confused" as a model that spreads probability uniformly over $K$ alternatives at each prediction. A perfect model has perplexity 1 (it always assigns probability 1 to the correct token). A model that always predicts a uniform distribution over a vocabulary of size $V$ has perplexity $V$.
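The two boundary cases can be verified with a short sketch. The vocabulary size and sequence length below are assumed values chosen for illustration, and `perplexity_from_log_probs` is a hypothetical helper.

```python
import math

def perplexity_from_log_probs(log_probs):
    """Perplexity from natural-log probabilities of the observed tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

V = 50_000  # assumed vocabulary size
T = 10      # assumed number of test tokens

# A uniform model assigns probability 1/V to every token, whatever the context.
uniform_ppl = perplexity_from_log_probs([math.log(1.0 / V)] * T)

# A perfect model assigns probability 1 to every observed token.
perfect_ppl = perplexity_from_log_probs([math.log(1.0)] * T)

print(uniform_ppl, perfect_ppl)
```

The uniform case recovers $V$ (up to floating-point error) regardless of $T$, and the perfect case gives 1.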

Modern values (a rough guide):

Random uniform over 50k-vocab: 50,000.
Trigram model on news text: ~150.
Early Transformer language models (2018-2019) on WikiText-103: ~18-30.
GPT-2 small (2019) on WikiText-103, zero-shot: ~37.
GPT-3 175B on filtered web: ~6-8.
Frontier 2024+ models: lower still, mainly from scale and data quality. Instruction tuning generally does not lower perplexity on raw text and can raise it.

Cautions:

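Perplexity depends on tokenisation. A model with finer tokens (e.g. BPE with a smaller vocabulary) splits the same text into more tokens, each of which is easier to predict, so its per-token perplexity is typically lower even when its total likelihood for the text is no better. For a fair cross-tokeniser comparison, normalise by characters or bytes rather than by tokens. Always specify the tokeniser.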

Perplexity measures predictive log-likelihood on the test set; it does not measure usefulness, helpfulness, factuality, reasoning, or many other capabilities relevant to downstream tasks. A model with lower perplexity on web text is not necessarily better at answering questions or writing code.

Perplexity is sensitive to domain mismatch: a model trained on news will have high perplexity on Twitter or on code, even if it's a "good" general model.

Bits-per-character (BPC) and bits-per-byte (BPB) are tokeniser-invariant alternatives, particularly useful for comparing across vocabulary choices:

$$\mathrm{BPC} = -\frac{1}{C} \sum_t \log_2 p_\theta(x_t | x_{\lt t})$$

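where the sum runs over the tokens of the test set and $C$ is the total number of characters in that same text. BPB is defined identically with $C$ counted in bytes.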
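A minimal sketch of the BPC calculation, and of why it is preferred for cross-tokeniser comparison, follows. The two token sequences are hypothetical: they stand for the same 12-character text scored under a coarse and a fine tokenisation, with per-token probabilities chosen so that the total likelihood of the text is identical.

```python
import math

def perplexity(token_log_probs):
    """Per-token perplexity from natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def bits_per_character(token_log_probs, num_chars):
    """Total negative log-likelihood in bits, divided by the character count C."""
    total_nats = -sum(token_log_probs)
    return total_nats / (math.log(2) * num_chars)

num_chars = 12  # assumed length of the text in characters

# Hypothetical scores for the same text under two tokenisations.
coarse = [math.log(1 / 64)] * 2  # 2 tokens, 6 bits each, 12 bits in total
fine = [math.log(1 / 8)] * 4     # 4 tokens, 3 bits each, 12 bits in total

coarse_ppl, fine_ppl = perplexity(coarse), perplexity(fine)
coarse_bpc = bits_per_character(coarse, num_chars)
fine_bpc = bits_per_character(fine, num_chars)
print(coarse_ppl, fine_ppl, coarse_bpc, fine_bpc)
```

The per-token perplexities differ by a factor of eight, while the BPC values coincide, because BPC normalises by a quantity that does not change with the tokeniser.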

Bits per token: $\log_2 \mathrm{PPL}$ gives the cross-entropy in bits per token, which equals bits per word only when the model's tokens are words. This form is useful for information-theoretic comparison.
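As a worked instance of the $\log_2 \mathrm{PPL}$ conversion, take an illustrative perplexity of 32 (not a measured value):

$$\log_2 32 = 5 \text{ bits}, \qquad \ln 32 \approx 3.466 \text{ nats}, \qquad \frac{3.466}{\ln 2} \approx 5 \text{ bits}$$

So a perplexity of 32 corresponds to 5 bits per predicted unit, and halving the perplexity saves exactly one bit.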

Held-out test set requirement: perplexity must be evaluated on data the model has not seen. With web-scale models trained on all available data, ensuring true held-out evaluation is increasingly difficult and is a current methodological concern.

Related terms: Cross-Entropy Loss, Language Model, Shannon Entropy

Discussed in:

Chapter 12: Sequence Models
