4.11 Information theory
Information theory is the mathematics of surprise. When something certain happens (the sun rises, a coin you have already seen lands the way you saw it land), you learn nothing new. When something unexpected happens (a fair coin lands heads ten times in a row, a routine blood test comes back wildly abnormal), you learn a great deal. Information theory turns this everyday intuition into numbers. It gives us a way to measure how much uncertainty lives inside a probability distribution, how different two distributions are from each other, and how much knowing one variable tells you about another.
This matters in artificial intelligence because almost every loss function you will meet, and a surprising number of training tricks, are built from these three quantities. The cross-entropy loss that trains every classifier in this textbook is information theory. The Kullback-Leibler penalty that keeps a reinforcement-learning policy from drifting too far from its starting point is information theory. The mutual-information objectives behind self-supervised representation learning, the variational lower bounds that make latent-variable models trainable, the temperature schedules in knowledge distillation: all information theory. Once you internalise entropy, KL divergence, and mutual information, large parts of deep learning stop looking like a bag of tricks and start looking like one consistent story.
We build directly on §4.5 (distributions), §4.6 (joint and conditional distributions), and §4.7 (expectations). The new ingredient is that we now compute functions of distributions, single numbers that summarise an entire PMF or PDF, and then use those numbers as objectives to optimise.
Entropy
For a discrete random variable $X$ with PMF $P$, the Shannon entropy is $$ H(X) = -\sum_x P(x) \log P(x) = \mathbb{E}[-\log P(X)]. $$ Read this as the average surprise of $X$. The surprise of a single outcome $x$ is $-\log P(x)$: rare outcomes have large surprise, certain outcomes have zero surprise. Entropy is the long-run average of that surprise when we draw repeatedly from $P$. The choice of logarithm is conventional. Use $\log_2$ and the units are bits (one bit is one yes/no question's worth of uncertainty). Use $\ln$ and the units are nats. Conversion is just a constant factor: one nat is $1/\ln 2 \approx 1.443$ bits.
A fair coin gives the cleanest example. With $P(H) = P(T) = 0.5$, $$ H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}. $$ One bit is exactly right: a single yes/no question ("did it land heads?") resolves the uncertainty. Now bias the coin so $P(H) = 0.9$: $$ H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 = 0.137 + 0.332 = 0.469 \text{ bits}. $$ Less than half a bit. The biased coin is much more predictable, so on average each outcome carries less news. If you guessed "heads" every time without even looking, you would be right ninety per cent of the time, so most outcomes confirm what you already expected. Push the bias further to $P(H) = 0.99$ and the entropy drops to roughly 0.081 bits; push it all the way to $P(H) = 1$ and the entropy is exactly zero, since the outcome is now certain.
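These numbers are easy to reproduce. Below is a minimal sketch in NumPy (the helper name entropy_bits is ours, not a library routine); it implements the convention $0 \log 0 = 0$ by dropping zero-probability outcomes.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete PMF, in bits."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero-probability outcomes: 0 log 0 = 0
    return float(-np.sum(nz * np.log2(nz)))

print(entropy_bits([0.5, 0.5]))    # 1.0 bit: the fair coin
print(entropy_bits([0.9, 0.1]))    # ~0.469 bits
print(entropy_bits([0.99, 0.01]))  # ~0.081 bits
print(entropy_bits([1.0, 0.0]))    # 0.0 bits: a certain outcome
```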
Two bounds frame everything. The maximum entropy of a distribution over $K$ outcomes is $\log_2 K$, achieved by the uniform distribution: when you have no reason to prefer any outcome, every outcome is maximally surprising on average. The minimum entropy is zero, achieved by a deterministic distribution (a point mass): if you already know the outcome, observing it tells you nothing. Everything between those two extremes is a smooth function of how concentrated $P$ is.
For continuous variables we replace the sum with an integral and call the result differential entropy: $$ H(X) = -\int p(x) \log p(x)\, dx. $$ A subtlety: differential entropy can be negative. A Gaussian $\mathcal{N}(0, \sigma^2)$ has $H = \tfrac{1}{2}\log(2\pi e \sigma^2)$, which goes negative once $\sigma$ is small enough. That is fine, differential entropy is best thought of as a relative quantity rather than an absolute count of bits, and most useful information-theoretic identities (KL divergence, mutual information) are well-defined even when the individual entropies are not. A useful fact: among all continuous distributions on the real line with a given mean and variance, the Gaussian has the largest differential entropy. This is one reason Gaussians appear so often as default choices: subject only to fixing the first two moments, they are the least committal distribution you can write down.
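The Gaussian closed form is easy to sanity-check against SciPy, which reports differential entropy in nats; a quick sketch, showing how the value crosses zero as $\sigma$ shrinks:

```python
import numpy as np
from scipy.stats import norm

# Differential entropy of N(0, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2).
for sigma in [2.0, 1.0, 0.1]:
    closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    print(sigma, closed_form, norm(scale=sigma).entropy())
# sigma = 0.1 gives a negative value: differential entropy is a relative
# quantity, not an absolute count of bits.
```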
Joint and conditional entropy
When you have two random variables together, two natural extensions appear. The joint entropy of $X$ and $Y$ is the entropy of the pair as one combined object: $$ H(X, Y) = -\sum_{x, y} P(x, y) \log P(x, y). $$ The conditional entropy is the average remaining uncertainty in $Y$ once you know $X$: $$ H(Y \mid X) = -\sum_{x, y} P(x, y) \log P(y \mid x). $$ These two quantities are bound together by the chain rule of entropy: $$ H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y). $$ The total uncertainty in the pair equals the uncertainty in one variable plus the remaining uncertainty in the other once you know the first. This mirrors the chain rule for probabilities, $P(x, y) = P(x) P(y \mid x)$, which is no coincidence: taking $-\log$ of that identity and then averaging over $P(x, y)$ is exactly what gives the entropy chain rule.
Two consequences are worth remembering. First, conditioning never increases entropy on average: $H(Y \mid X) \leq H(Y)$. Knowing more cannot make you more uncertain about $Y$ in expectation, although observing a particular value of $X$ can certainly increase your uncertainty. Second, if $X$ and $Y$ are independent, knowing $X$ tells you nothing about $Y$, so $H(Y \mid X) = H(Y)$ and the chain rule collapses to $H(X, Y) = H(X) + H(Y)$. Independence means entropies simply add.
A small example fixes these ideas. Suppose $X$ is the suit of a card drawn at random from a standard pack and $Y$ is its colour (red or black). Then $H(X) = \log_2 4 = 2$ bits and $H(Y) = 1$ bit. Once you know $X$ (say, hearts), the colour is fully determined, so $H(Y \mid X) = 0$ and the chain rule gives $H(X, Y) = H(X) + 0 = 2$ bits, the same as $H(X)$ on its own because $Y$ is a deterministic function of $X$. Going the other way, knowing $Y = \text{red}$ leaves two equally likely suits, so $H(X \mid Y) = 1$ bit and again $H(X, Y) = H(Y) + 1 = 2$ bits. Both decompositions agree, as they must.
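The card example is small enough to compute in full. A minimal sketch (the helper H is ours), representing the joint PMF as a 4-by-2 table over (suit, colour) and using the chain rule to extract the conditional entropies:

```python
import numpy as np

def H(p):
    """Entropy in bits of a PMF given as an array; zero entries are allowed."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Rows: hearts, diamonds, clubs, spades. Columns: red, black.
joint = np.array([[0.25, 0.00],
                  [0.25, 0.00],
                  [0.00, 0.25],
                  [0.00, 0.25]])

H_X, H_Y, H_XY = H(joint.sum(axis=1)), H(joint.sum(axis=0)), H(joint)
print(H_X, H_Y, H_XY)           # 2.0, 1.0, 2.0 bits
print("H(Y|X) =", H_XY - H_X)   # 0.0: colour is determined by suit
print("H(X|Y) =", H_XY - H_Y)   # 1.0: two equally likely suits per colour
```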
Mutual information
The conditional entropy gives us a way to ask: how much did learning $X$ reduce my uncertainty about $Y$? That reduction is the mutual information: $$ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y). $$ All three forms are equivalent, pick whichever is easiest to compute. The first reads "uncertainty in $X$ minus the leftover uncertainty in $X$ once you know $Y$"; the third uses the chain rule to rewrite that as "individual entropies minus joint entropy".
Mutual information is symmetric: $I(X; Y) = I(Y; X)$. The amount $X$ tells you about $Y$ equals the amount $Y$ tells you about $X$, even though the two conditional entropies $H(X \mid Y)$ and $H(Y \mid X)$ differ in general. It is also non-negative, with $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent. Unlike Pearson correlation, which catches only linear relationships, mutual information detects any statistical dependence, including curved or many-to-one ones: a sine relationship with zero correlation can still have large $I$.
For the bivariate Gaussian with correlation $\rho$, mutual information has a clean closed form: $I(X; Y) = -\tfrac{1}{2}\log(1 - \rho^2)$. With $\rho = 0.9$ this is about 0.83 nats or 1.20 bits; with $\rho = 0$ it is exactly zero, as it must be since uncorrelated Gaussians are independent.
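A quick numerical check of that closed form (the helper gaussian_mi is ours):

```python
import numpy as np

def gaussian_mi(rho):
    """I(X; Y) in nats for a bivariate Gaussian with correlation rho."""
    return -0.5 * np.log(1.0 - rho**2)

for rho in [0.0, 0.5, 0.9]:
    nats = gaussian_mi(rho)
    print(f"rho = {rho}: {nats:.3f} nats = {nats / np.log(2):.3f} bits")
# rho = 0.9 gives ~0.830 nats ~ 1.198 bits; rho = 0 gives exactly 0.
```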
In AI you will meet mutual information in three main settings. Feature selection ranks candidate features by $I(\text{feature}; \text{label})$, picking those most informative about the target. Self-supervised learning, especially the InfoNCE objective behind contrastive methods like SimCLR and CLIP, trains representations to maximise a lower bound on the mutual information between two views of the same example. The information bottleneck formulation of representation learning argues that a good representation $T$ of input $X$ for predicting label $Y$ should minimise $I(X; T)$ (compress) while maximising $I(T; Y)$ (preserve task-relevant signal), a clean theoretical lens on what a deep network's hidden layers ought to be doing.
KL divergence
The Kullback-Leibler divergence measures how different two distributions over the same space are: $$ D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]. $$ Read it as the expected log-ratio between $p$ and $q$, taken under $p$. An equivalent and useful interpretation: KL is the extra cost (in nats or bits) of describing samples from $p$ using a code optimised for $q$ instead of one optimised for $p$.
KL divergence has three properties you should commit to memory. It is non-negative for any pair of distributions, a consequence of Gibbs' inequality (which itself follows from Jensen's inequality applied to the concave $\log$). It is zero if and only if $p$ equals $q$ almost everywhere. And it is not symmetric: $D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p)$ in general, so KL is not a metric (there is no triangle inequality), and "the KL between $p$ and $q$" is ambiguous unless you say which way round.
The asymmetry is not a quirk; it has real consequences for what kind of approximate distribution $q$ you end up with when you minimise KL.
Forward KL, $D_{\mathrm{KL}}(p \| q)$, is mean-seeking or mode-covering. The integrand contains $\log q(x)$ weighted by $p(x)$, so wherever $p$ has appreciable mass, $q$ must also have mass; otherwise $\log q(x) \to -\infty$ and the divergence blows up. The result is a $q$ that spreads out to cover every mode of $p$, even at the cost of putting mass between modes where $p$ has little or none. Maximum-likelihood training minimises $D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)$, a forward KL, which is why MLE-trained generative models can produce blurry "averages" rather than sharp samples.
Reverse KL, $D_{\mathrm{KL}}(q \| p)$, is mode-seeking. Now the integrand has $\log p(x)$ weighted by $q(x)$, so wherever $p$ is small, $q$ had better also be small: putting mass where $p \approx 0$ is heavily penalised. The result is a $q$ that locks onto a single mode and ignores the others. Variational inference uses reverse KL, which is why variational posteriors famously underestimate uncertainty.
A small worked example. Let $p = (0.5, 0.5)$ and $q = (0.9, 0.1)$, in nats: $$ D_{\mathrm{KL}}(p \| q) = 0.5 \ln \frac{0.5}{0.9} + 0.5 \ln \frac{0.5}{0.1} = -0.294 + 0.804 = 0.510 \text{ nats}. $$ The first term is negative because $p(x) < q(x)$ at $x = 1$, but the second term, where $q$ assigns very little mass to an outcome that $p$ thinks is just as likely as the other, dominates, and the total is comfortably positive. If you swap the roles you get a different number, illustrating the asymmetry concretely: $D_{\mathrm{KL}}(q \| p) = 0.9 \ln(0.9/0.5) + 0.1 \ln(0.1/0.5) = 0.529 + (-0.161) = 0.368$ nats.
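Both directions take only a few lines to compute. A minimal sketch (the helper kl is ours); the same function reproduces the Auckland weather example that follows.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete PMFs on the same support.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(kl(p, q))  # ~0.51 nats, the forward direction
print(kl(q, p))  # ~0.368 nats, the reverse: the asymmetry in numbers
```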
KL divergence is the workhorse of modern AI. Variational inference trains a tractable family $q_\phi$ to approximate an intractable posterior by minimising reverse KL. Reinforcement learning from human feedback (RLHF), used to align LLMs, adds a KL penalty $\beta D_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$ to the reward so the policy cannot drift too far from the supervised reference model; this penalty is at the heart of PPO and GRPO objectives. Knowledge distillation trains a small student network by minimising $D_{\mathrm{KL}}(p_{\text{teacher}} \| p_{\text{student}})$ over softened softmax distributions. The original GAN analysis shows that the discriminator-optimal generator loss reduces to a Jensen-Shannon divergence (a symmetrised cousin of KL).
A second worked example reinforces the asymmetry through a more concrete lens. Imagine the true distribution of weather on a given day in Auckland is $p = (0.6, 0.4)$ over (rain, no rain), and a forecaster's predicted distribution is $q = (0.3, 0.7)$. Forward KL is $0.6 \ln(0.6/0.3) + 0.4 \ln(0.4/0.7) = 0.416 - 0.224 = 0.192$ nats, modest but non-zero, picking up that the forecaster underweights rain. Reverse KL is $0.3 \ln(0.3/0.6) + 0.7 \ln(0.7/0.4) = -0.208 + 0.392 = 0.184$ nats. Different number, same direction of disagreement.
Cross-entropy
Cross-entropy between distributions $p$ and $q$ is $$ H(p, q) = -\sum_x p(x) \log q(x). $$ The bridge to KL is one line of algebra: $$ H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q). $$ When $p$ is fixed (it usually is: $p$ comes from data), $H(p)$ is a constant, so minimising cross-entropy with respect to $q$ is exactly the same as minimising forward KL. Cross-entropy is therefore not a different idea but a convenient computational form: you do not need to know or estimate $H(p)$ to optimise $q$.
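That one line of algebra is easy to verify numerically. A quick sketch, reusing the weather-example distributions from above:

```python
import numpy as np

p = np.array([0.6, 0.4])   # the weather example's true distribution
q = np.array([0.3, 0.7])   # the forecaster's prediction

H_p = -np.sum(p * np.log(p))     # entropy of p
ce  = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
kl  = np.sum(p * np.log(p / q))  # forward KL

print(ce, H_p + kl)  # both ~0.865 nats: H(p, q) = H(p) + D_KL(p || q)
```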
In a classification problem the empirical $p$ for one training example is a one-hot vector, all mass on the true class, and $q$ is the model's softmax output. Suppose the true class is class 1 out of three, so $p = (1, 0, 0)$, and the model predicts $q = (0.7, 0.2, 0.1)$. Then
$$
H(p, q) = -1 \cdot \ln 0.7 - 0 \cdot \ln 0.2 - 0 \cdot \ln 0.1 = -\ln 0.7 \approx 0.357 \text{ nats}.
$$
Two of the three terms vanish because $p$ only has mass on the correct class. The standard per-example cross-entropy loss collapses to $-\log q_y$ where $y$ is the true label, exactly the negative log-likelihood. Average this over a mini-batch and you have the loss every classifier in this book minimises. Every time you call nn.CrossEntropyLoss in PyTorch or tf.keras.losses.CategoricalCrossentropy in TensorFlow, you are minimising forward KL between the empirical label distribution and your model's softmax.
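A minimal PyTorch sketch of this example, with one caveat: F.cross_entropy expects raw logits and applies log-softmax internally, so we feed it $\log q$, which softmax maps back to $q$.

```python
import torch
import torch.nn.functional as F

q = torch.tensor([[0.7, 0.2, 0.1]])  # model's softmax output
y = torch.tensor([0])                # true class

# The per-example loss collapses to -log q_y...
print(-torch.log(q[0, y[0]]))            # ~0.357 nats

# ...which F.cross_entropy reproduces: it applies log-softmax to its
# input, and softmax(log q) = q because q already sums to one.
print(F.cross_entropy(torch.log(q), y))  # ~0.357 nats
```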
A useful corollary: the cross-entropy loss is bounded below by $H(p)$, the entropy of the labels themselves. For one-hot labels $H(p) = 0$, so a perfect classifier really can drive the loss to zero in principle. For soft labels, for example after label smoothing, where each one-hot vector is mixed with a small uniform component, $H(p) > 0$, and the loss will plateau at $H(p)$ even for a perfect model. This is why label smoothing is sometimes phrased as "preventing the model from becoming overconfident": you have built a positive floor into the loss that it cannot squeeze through by sharpening predictions.
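The floor is easy to exhibit numerically. A sketch, assuming a smoothing weight $\epsilon = 0.1$ over three classes (the helper name cross_entropy is ours):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) in nats; terms with p(x) = 0 are dropped."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

eps = 0.1
smoothed = (1 - eps) * np.array([1.0, 0.0, 0.0]) + eps / 3

# A "perfect" model predicts the smoothed target exactly, yet the loss
# stops at H(p) rather than zero: the floor is built into the labels.
print(cross_entropy(smoothed, smoothed))  # ~0.291 nats = H(p), the floor
```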
Other divergences
KL is the most important divergence in AI, but it is not the only one; its asymmetry, and the fact that it becomes infinite when the supports of $p$ and $q$ do not overlap, motivate alternatives.
The Jensen-Shannon divergence is a symmetrised, smoothed version of KL: $\mathrm{JSD}(p, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \| m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \| m)$ where $m = \tfrac{1}{2}(p + q)$ is the average of the two distributions. JSD is bounded above by $\log 2$, is symmetric, and its square root is a proper metric. The original GAN paper showed that the optimal discriminator turns the generator objective into minimising JSD against the data distribution.
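A minimal sketch of JSD, reusing a forward-KL helper; note that the value saturates at $\ln 2$ even for distributions with disjoint supports. (SciPy's scipy.spatial.distance.jensenshannon returns the square root of this quantity, i.e. the metric.)

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence in nats; bounded above by ln 2 ~ 0.693."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(jsd(p, q))  # ln 2 ~ 0.693: finite even with disjoint supports
print(jsd(p, p))  # 0.0
```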
Wasserstein distance, also called earth mover's distance, asks how much "work" it takes to transform one distribution into the other under an optimal transport plan. Unlike KL, Wasserstein remains finite and well-behaved even when $p$ and $q$ have non-overlapping supports, exactly the situation where KL explodes. Wasserstein-1 is the metric used in WGAN, where it gives smoother gradients and more stable training than the original JSD-based objective.
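SciPy ships a one-dimensional version, scipy.stats.wasserstein_distance, which makes the contrast easy to see: for two disjoint point masses, KL is infinite and JSD is stuck at $\ln 2$ no matter how far apart they are, but Wasserstein-1 scales with how far the mass must move.

```python
from scipy.stats import wasserstein_distance

# Point mass at 0 versus point mass at `gap` (passed as single samples).
for gap in [0.1, 1.0, 10.0]:
    print(gap, wasserstein_distance([0.0], [gap]))
# 0.1 -> 0.1, 1.0 -> 1.0, 10.0 -> 10.0: the distance tracks the gap,
# giving useful gradient signal where KL and JSD are flat or infinite.
```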
The f-divergences are a unifying family: for any convex $f$ with $f(1) = 0$, $$ D_f(p \| q) = \sum_x q(x) f\!\left(\frac{p(x)}{q(x)}\right). $$ KL corresponds to $f(u) = u \log u$. Other choices give total variation distance, Hellinger distance, $\chi^2$ divergence, and reverse KL. f-GAN exploits this family directly to build GANs around any divergence in the class.
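The family is one function away from code. A minimal sketch (the helper f_divergence is ours, assuming strictly positive PMFs so the ratio is well-defined):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)), for convex f with f(1) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])

print(f_divergence(p, q, lambda u: u * np.log(u)))       # forward KL, ~0.51 nats
print(f_divergence(p, q, lambda u: 0.5 * np.abs(u - 1)))  # total variation, 0.4
```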
What you should take away
Entropy measures average uncertainty. It is bounded between zero (deterministic) and $\log K$ (uniform over $K$ outcomes), and the units depend only on which logarithm you use.
Mutual information measures statistical dependence, not just linear correlation, and is symmetric and non-negative, zero exactly when the two variables are independent.
KL divergence is asymmetric and non-negative, and the choice of forward vs reverse KL determines whether your approximation will be mode-covering or mode-seeking.
Cross-entropy is forward KL plus a constant, which is why minimising the standard classification loss is equivalent to fitting a model distribution to the empirical label distribution.
These quantities show up everywhere in modern AI: cross-entropy in every classifier, KL penalties in RLHF, mutual information in self-supervised learning, JSD and Wasserstein in GANs, distillation as a teacher-student KL. Once you see the pattern, the loss landscape of deep learning becomes one coherent topic rather than a list of tricks.