Perplexity is the standard intrinsic evaluation metric for language models. It is defined as the exponential of the average negative log-likelihood per token over a test set of $N$ tokens: $\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid \text{context})\right)$. Intuitively, perplexity measures how "surprised" the model is by the test data: a perplexity of 50 means the model is, on average, as uncertain as if it had to choose among 50 equally likely words at each position.
Lower perplexity is better. A perfect model that predicts each correct token with probability 1 has perplexity 1; a uniform model over a vocabulary of $V$ tokens has perplexity $V$. The progression of language model perplexity on the Penn Treebank benchmark over the decades tells a dramatic story: over 140 for well-tuned trigram models, below 60 for LSTMs with careful regularisation, and around 20 for large transformer-based models.
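The definition and these limiting cases can be checked with a short sketch; the token probabilities below are made-up illustrations, not output of a real model:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability p(w_i | context) of each test token."""
    # Average negative log-likelihood per token, in nats.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Four tokens with made-up model confidences.
print(perplexity([0.5, 0.1, 0.25, 0.05]))   # ~6.32

# Limiting cases: a perfect model has perplexity 1,
# a uniform model over V tokens has perplexity V.
V = 10_000
print(perplexity([1.0] * 4))       # 1.0
print(perplexity([1.0 / V] * 4))   # ~10000
```

Note that perplexity is the reciprocal of the geometric mean of the assigned probabilities, which is why a single near-zero probability can dominate the score.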
Perplexity has limitations as a metric. Two models with the same perplexity may produce qualitatively different text—one fluent and boring, one erratic but creative. Perplexity also depends sensitively on the tokeniser: byte-pair encoding produces different per-token perplexities than word-level tokenisation, making comparisons across tokenisers meaningless. Nonetheless, within a fixed tokenisation scheme, perplexity remains the gold standard for measuring how well a language model captures the statistical structure of text. It is the exponential of the per-token cross-entropy loss, so minimising cross-entropy, maximising test-set log-likelihood, and minimising perplexity are all the same objective, linking information theory, statistics, and language modelling.
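The correspondence to cross-entropy can be seen directly: exponentiating the average cross-entropy loss (in nats) recovers the perplexity. A minimal sketch with made-up logits and target indices:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# One row of (made-up) logits per position, and the index of the true next token.
logits = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
targets = [0, 2]

# Average cross-entropy loss per token, in nats.
loss = -sum(math.log(softmax(row)[t])
            for row, t in zip(logits, targets)) / len(targets)

# Perplexity is just the exponential of that loss.
ppl = math.exp(loss)
print(loss, ppl)
```

This is why deep-learning frameworks that report a per-token cross-entropy training loss let you read off perplexity as `exp(loss)` with no extra computation.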
Related terms: Language Model, Cross-Entropy, Large Language Model
Discussed in:
- Chapter 12: Sequence Models — Language Models
Also defined in: Textbook of AI