KL Divergence, Glossary, Textbook of AI

Also known as: relative entropy, KL distance

The Kullback–Leibler (KL) divergence $D_\mathrm{KL}(p \| q)$ measures how much a probability distribution $q$ differs from a reference distribution $p$. For discrete distributions:

$$D_\mathrm{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}.$$

For continuous distributions, the sum is replaced by an integral over densities. KL divergence is non-negative ($D_\mathrm{KL}(p \| q) \geq 0$), zero exactly when $p = q$ almost everywhere, and asymmetric ($D_\mathrm{KL}(p \| q) \neq D_\mathrm{KL}(q \| p)$ in general). It is finite only if $q$ assigns non-zero probability wherever $p$ does. Because it is asymmetric and does not satisfy the triangle inequality, it is a divergence rather than a distance metric.
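
As a minimal sketch of the discrete definition above, the following Python computes $D_\mathrm{KL}(p \| q)$ in nats and evaluates both directions for one pair of distributions. The function name `kl_divergence` and the example values of `p` and `q` are illustrative assumptions, not part of any library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as sequences."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q_i) = 0
        if q_i == 0.0:
            return math.inf  # p has mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Illustrative distributions over three outcomes (assumed values).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)  # D_KL(p || q)
kl_qp = kl_divergence(q, p)  # D_KL(q || p)
print(f"D_KL(p || q) = {kl_pq:.4f}")
print(f"D_KL(q || p) = {kl_qp:.4f}")
```

Both values are non-negative and they differ from each other, which illustrates the asymmetry; swapping in `q = p` gives zero.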

KL divergence relates to other quantities: cross-entropy is $H(p, q) = H(p) + D_\mathrm{KL}(p \| q)$, so minimising cross-entropy with $p$ fixed minimises KL divergence. Mutual information is $I(X; Y) = D_\mathrm{KL}(P_{XY} \| P_X P_Y)$.
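
The two identities above can be checked numerically. This is a small sketch with assumed example values: the helper names, the distributions `p` and `q`, and the 2×2 joint table are all hypothetical, and the helpers assume $q_i > 0$ wherever $p_i > 0$.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Identity 1: H(p, q) = H(p) + D_KL(p || q).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p, q) = {lhs:.6f}, H(p) + D_KL(p || q) = {rhs:.6f}")

# Identity 2: I(X; Y) = D_KL(P_XY || P_X P_Y) for an assumed 2x2 joint table.
joint = [[0.3, 0.1],
         [0.1, 0.5]]
p_x = [sum(row) for row in joint]        # marginal of X (row sums)
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
flat_joint = [joint[i][j] for i in range(2) for j in range(2)]
flat_product = [p_x[i] * p_y[j] for i in range(2) for j in range(2)]
mutual_information = kl_divergence(flat_joint, flat_product)
print(f"I(X; Y) = {mutual_information:.6f}")
```

Since $H(p)$ does not depend on $q$, the first check also shows why minimising cross-entropy over $q$ is the same as minimising $D_\mathrm{KL}(p \| q)$.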

KL divergence appears throughout modern AI. Variational inference minimises $D_\mathrm{KL}(q \| p)$ so that an approximate posterior $q$ matches the true posterior $p$. Variational autoencoders include $-D_\mathrm{KL}(q_\phi(z|x) \| p(z))$ in the ELBO as a regulariser that pulls the encoder posterior towards the prior. Reinforcement learning from human feedback (RLHF) regularises the policy by adding $-\beta D_\mathrm{KL}(\pi_\theta(\cdot|x) \| \pi_\mathrm{ref}(\cdot|x))$ to the objective, which keeps the policy from drifting too far from the supervised fine-tuned (SFT) reference. Knowledge distillation trains a student model by minimising $D_\mathrm{KL}(p_\mathrm{teacher} \| p_\mathrm{student})$, so that the student matches the teacher's full output distribution rather than just hard labels.
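
To make the variational autoencoder term concrete, here is a hedged sketch under a common but assumed setup: the encoder posterior is a diagonal Gaussian $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and the prior is a standard normal. In that case the KL term has the closed form $\tfrac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The function name and the example inputs are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Assumes a diagonal-Gaussian posterior and a standard-normal prior;
    other posterior or prior families need a different formula.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# A posterior equal to the prior incurs no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# A shifted, narrower posterior (assumed values) is penalised.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -0.5], [0.0, math.log(0.25)])

print(f"KL at prior: {kl_at_prior:.4f}")
print(f"KL shifted:  {kl_shifted:.4f}")
```

In the ELBO this quantity enters with a minus sign, so maximising the ELBO pushes each per-dimension mean towards 0 and each variance towards 1 unless the reconstruction term justifies the departure.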

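With $p$ the target distribution and $q$ the approximation being fitted, the two directions $D_\mathrm{KL}(p \| q)$ (M-projection, mode-covering) and $D_\mathrm{KL}(q \| p)$ (I-projection, mode-seeking) behave in qualitatively different ways. The first heavily penalises $q$ for assigning little mass where $p$ has mass, so $q$ spreads out to cover every mode of $p$. The second heavily penalises $q$ for assigning mass where $p$ has little, so $q$ tends to concentrate on a single mode. The choice between them is a recurring modelling decision.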
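
The difference between the two directions can be seen in a toy experiment. The sketch below fits a single Gaussian $q$ to a bimodal target $p$ by brute-force search over a small grid of means and standard deviations, once for each direction. Everything here is an assumption chosen for illustration: the target mixture (weights 0.6 and 0.4, modes at -3 and 3), the discretisation grid, the candidate grid, and the helper names.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions on a shared grid."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q_i) = 0
        if q_i == 0.0:
            return math.inf  # p has mass where q has none
        total += p_i * (math.log(p_i) - math.log(q_i))
    return total

# Discretise the real line on [-10, 10] with step 0.1 (assumed grid).
xs = [-10.0 + 0.1 * k for k in range(201)]

# Assumed bimodal target: two narrow Gaussians at -3 and +3.
p = normalise([
    0.6 * normal_pdf(x, -3.0, 0.5) + 0.4 * normal_pdf(x, 3.0, 0.5)
    for x in xs
])

# Candidate single Gaussians: means in [-4, 4], stds in [0.25, 4].
candidates = [(-4.0 + 0.5 * i, 0.25 * j)
              for i in range(17) for j in range(1, 17)]

def fit(direction):
    """Return (score, mean, std) of the best candidate for one direction."""
    best = None
    for mean, std in candidates:
        q = normalise([normal_pdf(x, mean, std) for x in xs])
        if direction == "forward":
            score = kl_divergence(p, q)  # D_KL(p || q), M-projection
        else:
            score = kl_divergence(q, p)  # D_KL(q || p), I-projection
        if best is None or score < best[0]:
            best = (score, mean, std)
    return best

_, fwd_mean, fwd_std = fit("forward")
_, rev_mean, rev_std = fit("reverse")
print(f"min D_KL(p || q): mean={fwd_mean}, std={fwd_std}")
print(f"min D_KL(q || p): mean={rev_mean}, std={rev_std}")
```

Under these assumed settings, the forward fit should land between the two modes with a large standard deviation (covering both), while the reverse fit should sit on the heavier mode with a small standard deviation (seeking one).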

Discussed in:

Chapter 4: Probability, Information Theory