Entropy, Glossary, Textbook of AI

Entropy, introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication", quantifies the average surprise or uncertainty in a probability distribution. It is the foundation of information theory and one of the most important quantities in modern AI: the cross-entropy loss used to train most classification models is a form of entropy.

Definition

For a discrete random variable $X$ taking values in $\mathcal{X}$ with probability mass function $p$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}_{x \sim p}[-\log p(x)].$$

With $\log$ base 2, entropy is measured in bits; with base $e$, in nats; with base 10, in dits. The convention $0 \log 0 = 0$ handles zero-probability outcomes.
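The definition translates almost directly into code. The following is a minimal sketch in pure Python; the function name `entropy` and the example distributions are illustrative choices, not part of any library.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a sequence of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skipping zero-probability outcomes implements the convention 0 log 0 = 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]      # illustrative distributions
biased_coin = [0.9, 0.1]

print(entropy(fair_coin))           # in bits (base 2)
print(entropy(fair_coin, math.e))   # same distribution, in nats
print(entropy(biased_coin))         # in bits; lower than the fair coin
print(entropy([1.0, 0.0]))          # relies on the 0 log 0 = 0 convention
```

Changing `base` only rescales the result by a constant factor, which is why the choice of unit is a matter of convention.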

For a continuous random variable with density $p(x)$, the differential entropy is

$$h(X) = -\int p(x) \log p(x) \, dx.$$

Differential entropy behaves differently from discrete entropy: it can be negative, and it is not invariant under a change of variables. Many operational properties nevertheless carry over.
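As a worked illustration (a standard textbook example, chosen here for simplicity), take $X$ uniform on $[0, a]$, so that $p(x) = 1/a$ on that interval:

$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, dx = \log a.$$

This is negative whenever $a < 1$. Rescaling to $Y = cX$ with $c > 0$ gives a uniform distribution on $[0, ca]$, so

$$h(Y) = \log(ca) = h(X) + \log c,$$

which shows that a simple change of units shifts the differential entropy.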

Limit cases

Maximum entropy. A uniform distribution over $K$ outcomes has entropy $\log K$, the maximum possible for a given support size.
Zero entropy. A distribution concentrated on a single outcome has entropy zero: there is no uncertainty.
Maximum entropy under constraints. Among all continuous distributions with a given mean and variance $\sigma^2$, the Gaussian has the maximum differential entropy, $\frac{1}{2} \log(2\pi e \sigma^2)$. This is one of several principled justifications for Gaussian assumptions.
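The Gaussian maximum-entropy property can be checked numerically against other families. The sketch below compares closed-form differential entropies (in nats) of a Gaussian, a Laplace and a uniform distribution that share the same variance. The choice of comparison families and the function names are assumptions made for illustration.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of a Gaussian: 0.5 * log(2 * pi * e * sigma^2)
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    # A Laplace distribution with scale b has variance 2 * b^2 and entropy 1 + log(2b)
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

def uniform_entropy(sigma):
    # A uniform distribution of width w has variance w^2 / 12 and entropy log(w)
    w = sigma * math.sqrt(12)
    return math.log(w)

for sigma in (0.5, 1.0, 3.0):
    print(sigma, gaussian_entropy(sigma), laplace_entropy(sigma), uniform_entropy(sigma))
```

For every standard deviation the Gaussian value is the largest of the three, and all three become negative once $\sigma$ is small enough.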

Operational interpretation: optimal coding

Shannon's source coding theorem gives entropy a concrete operational meaning: $H(X)$, measured in bits, is the minimum average number of bits per symbol needed to encode messages from $p$ with a lossless code. Practical codes approach this bound: a Huffman code comes within one bit per symbol, and arithmetic coding gets arbitrarily close on long messages. This connects information theory to data compression: the entropy of English text, estimated at around 1.0-1.5 bits per character and far below the 8 bits of a byte representation, measures how compressible it is.
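The link between entropy and code length can be seen by building a Huffman code and comparing its average codeword length with $H(X)$. This is a minimal sketch that tracks only codeword lengths, not the codewords themselves. The two example distributions are made up for illustration.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the codeword length a binary Huffman code assigns to each symbol."""
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def average_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))

dyadic = [0.5, 0.25, 0.125, 0.125]   # probabilities are powers of 1/2
skewed = [0.7, 0.15, 0.1, 0.05]

for probs in (dyadic, skewed):
    lengths = huffman_code_lengths(probs)
    print(lengths, average_length(probs, lengths), entropy_bits(probs))
```

When every probability is a power of 1/2 the Huffman code meets the entropy exactly; otherwise its average length exceeds the entropy by less than one bit per symbol.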

Cross-entropy and KL divergence

Two quantities derived from entropy dominate machine learning:

The cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ measures the cost of encoding data drawn from $p$ using a code optimised for $q$. Minimising cross-entropy between the empirical distribution and the model is the standard loss for classification, equivalent to maximum-likelihood estimation under a multinomial likelihood. The "log-loss" or "logistic loss" is just binary cross-entropy.

The Kullback-Leibler divergence $D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ measures the inefficiency of using $q$ instead of $p$: the expected extra code length incurred by encoding data from $p$ with a code optimised for $q$. The relations

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q), \qquad D_{\mathrm{KL}}(p \parallel q) \geq 0$$

show that, because $H(p)$ does not depend on $q$, minimising cross-entropy over $q$ with $p$ fixed is equivalent to minimising $D_{\mathrm{KL}}(p \parallel q)$.
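These relations are easy to verify numerically. The sketch below works in nats and assumes $q(x) > 0$ wherever $p(x) > 0$. The function names and the two distributions are illustrative.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return sum(-px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(-px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution (made-up numbers)
q = [0.4, 0.4, 0.2]   # model distribution (made-up numbers)

# Identity: H(p, q) = H(p) + D_KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# KL is non-negative, zero when q = p, and not symmetric in its arguments
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))

# With a one-hot p, cross-entropy reduces to the log-loss -log q(label)
one_hot = [0.0, 1.0, 0.0]
print(cross_entropy(one_hot, q), -math.log(q[1]))
```

The last line is the case that arises in classification: with a one-hot label, $H(p) = 0$, so the cross-entropy loss and the KL divergence coincide.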

Mutual information

The mutual information $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$ measures the reduction in uncertainty about $X$ from observing $Y$. It is symmetric, non-negative, and zero iff $X$ and $Y$ are independent. Mutual information underpins the information bottleneck theory of representation learning and the InfoNCE contrastive loss.
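The definition $I(X; Y) = H(X) - H(X \mid Y)$ can be computed directly from a joint probability table. This is a minimal sketch in bits, where `joint[i][j]` is assumed to hold $P(X = i, Y = j)$. The example tables are invented for illustration.

```python
import math

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X | Y) in bits, where joint[i][j] = P(X = i, Y = j)."""
    total = 0.0
    for column in zip(*joint):            # one column per value of Y
        py = sum(column)
        if py > 0:
            # Weight the entropy of P(X | Y = y) by P(Y = y)
            total += py * entropy_bits([pxy / py for pxy in column])
    return total

def mutual_information(joint):
    px = [sum(row) for row in joint]      # marginal distribution of X
    return entropy_bits(px) - conditional_entropy(joint)

def transpose(joint):
    return [list(column) for column in zip(*joint)]

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent fair bits
copy = [[0.5, 0.0], [0.0, 0.5]]              # Y is an exact copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]             # Y agrees with X 80% of the time

for table in (independent, copy, noisy):
    print(mutual_information(table), mutual_information(transpose(table)))
```

Transposing the table swaps the roles of $X$ and $Y$, so the two printed values in each row illustrate the symmetry of mutual information.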

Entropy across AI

Entropy appears throughout AI:

Classification loss. Cross-entropy between the empirical label distribution and the model's softmax output.
Decision trees. Information gain = parent entropy minus weighted child entropy; ID3 splits on the feature that maximises this, while C4.5 uses the gain ratio, which normalises information gain by the entropy of the split itself.
Reinforcement learning. Adding an entropy bonus $-\sum_a \pi(a \mid s) \log \pi(a \mid s)$ to the policy-gradient objective encourages exploration; maximum-entropy RL makes this bonus central, as in Soft Actor-Critic (SAC; Haarnoja et al., 2018).
Variational autoencoders. The ELBO contains the entropy of the variational posterior as part of its KL term, which regularises the posterior towards the prior.
Token-level uncertainty. The entropy of an LLM's next-token distribution is a calibration signal and a building block of selective generation and conformal prediction.
Compression-based learning. Solomonoff induction, MDL, and the Hutter prize all measure model fit by compressed code length, whose optimal expected value is the entropy.
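Of the uses listed above, information gain is the easiest to show end to end. The sketch below computes parent entropy minus size-weighted child entropy for each candidate feature and picks the best one, as ID3 would at a single node. The toy dataset and the function names are invented for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of the children after splitting."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    children = sum(len(g) / n * label_entropy(g) for g in groups.values())
    return label_entropy(labels) - children

# Toy data: "outlook" determines the label, "windy" is uninformative
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["yes", "yes", "no", "no"]

for feature in rows[0]:
    print(feature, information_gain(rows, labels, feature))

best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
print("split on:", best)
```

A full tree learner would apply this selection recursively to each child subset; only the single-node split is shown here.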

Entropy unifies probability, coding, statistics and machine learning under a single quantitative framework, and it is arguably Shannon's most enduring contribution.

Discussed in:

Chapter 4: Probability, Mathematical Foundations
Chapter 5: Statistics, Probability and Information
