Glossary

KL Divergence

Also known as: relative entropy, KL distance

The Kullback–Leibler (KL) divergence, $D_{KL}(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ (or the corresponding integral for continuous distributions), measures how much an approximating distribution $q$ diverges from a reference distribution $p$. It is always non-negative (Gibbs' inequality) and equals zero if and only if $p = q$. It is not a true metric: it is asymmetric ($D_{KL}(p \parallel q) \neq D_{KL}(q \parallel p)$ in general) and does not satisfy the triangle inequality. Nonetheless, it serves as the foundational measure of distributional discrepancy in machine learning.
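The discrete formula can be computed directly. A minimal sketch in plain Python (natural logarithm; the distributions are illustrative), which also demonstrates the asymmetry:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) for discrete distributions given as probability lists.
    # Terms with p(x) = 0 contribute nothing, by the convention 0 * log 0 = 0;
    # a q(x) = 0 where p(x) > 0 would make the divergence infinite.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))   # non-negative (Gibbs' inequality)
print(kl_divergence(q, p))   # a different value: KL is asymmetric
print(kl_divergence(p, p))   # zero exactly when the distributions match
```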

KL divergence has an operational interpretation from coding theory: it is the average number of additional bits (with base-2 logarithms; nats for the natural logarithm) required to encode samples from $p$ when using a code optimised for $q$ rather than for $p$. This makes it the natural objective when we want to approximate an unknown true distribution with a model distribution.
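This coding view can be made concrete: the extra cost per symbol is exactly the cross-entropy minus the entropy. A short sketch, using base-2 logarithms so the units are bits (the two distributions are illustrative):

```python
import math

def entropy_bits(p):
    # Average code length (bits) of the optimal code for p.
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    # Average code length (bits) when encoding samples from p
    # with a code optimised for q instead.
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.25, 0.75]   # true source distribution
q = [0.5, 0.5]     # distribution the code was built for

# The overhead equals D_KL(p || q) in bits.
extra_bits = cross_entropy_bits(p, q) - entropy_bits(p)
print(extra_bits)
```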

In AI, KL divergence appears in countless places. Variational inference minimises the KL divergence between a tractable approximate posterior and the true, intractable posterior. Maximising the evidence lower bound (ELBO) in variational autoencoders is equivalent to minimising a KL divergence. Reinforcement learning from human feedback (RLHF) adds a KL penalty to keep the policy from drifting too far from the supervised fine-tuning distribution. Policy-gradient methods and knowledge distillation likewise rely on KL-based losses. Understanding KL divergence is essential for navigating modern probabilistic machine learning.
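As one concrete instance, the KL penalty in RLHF-style training is commonly estimated from per-token log-probabilities. A schematic sketch, assuming a function name, inputs, and coefficient `beta` that are purely illustrative (not any specific library's API):

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    # Schematic per-sequence penalty: beta * sum_t (log pi(a_t) - log pi_ref(a_t)).
    # Over tokens sampled from the policy, this is a Monte Carlo estimate
    # of beta * D_KL(pi || pi_ref) for the sequence.
    return beta * sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))

# If the policy has not moved from the reference, the penalty is zero.
same = [-1.2, -0.7, -2.3]
print(kl_penalty(same, same))
print(kl_penalty([-1.0, -0.5], [-1.3, -0.9]))
```

Subtracting this penalty from the reward discourages the policy from paying too much KL "cost" for extra reward, which is how RLHF keeps generations close to the fine-tuning distribution.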

Related terms: Entropy, Cross-Entropy, Information Theory

Also defined in: Textbook of AI