Entropy, introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication", quantifies the average surprise or uncertainty in a probability distribution. It is the foundation of information theory and one of the most important quantities in modern AI: most classification models trained today minimise a cross-entropy loss.
Definition
For a discrete random variable $X$ taking values in $\mathcal{X}$ with probability mass function $p$:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}_{x \sim p}[-\log p(x)].$$
With $\log$ base 2, entropy is measured in bits; with base $e$, in nats; with base 10, in dits. The convention $0 \log 0 = 0$ handles zero-probability outcomes.
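The definition translates almost directly into code. The following is a minimal sketch, assuming the distribution is supplied as a plain list of probabilities; the function name `entropy` and the example coins are illustrative choices, not part of any library.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skipping p == 0 implements the convention 0 log 0 = 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]      # illustrative distributions
biased_coin = [0.9, 0.1]

print(entropy(fair_coin))            # entropy in bits (base 2)
print(entropy(fair_coin, math.e))    # same distribution, in nats
print(entropy(biased_coin))          # lower than the fair coin: less uncertainty
```

Changing `base` only rescales the result, so the choice between bits, nats and dits is a choice of unit rather than of quantity.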
For a continuous random variable with density $p(x)$, the differential entropy is
$$h(X) = -\int p(x) \log p(x) \, dx.$$
Differential entropy can be negative and behaves differently from discrete entropy: it is not invariant under a change of variables. Most operational properties nevertheless carry over.
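A short worked example illustrates both points. Take $X$ uniform on $[0, a]$, so that $p(x) = 1/a$ on that interval:
$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, dx = \log a.$$
For $a = 1/2$ and base-2 logarithms this gives $h(X) = -1$ bit, a negative value. Rescaling to $Y = cX$ with $c \neq 0$ gives
$$h(Y) = h(X) + \log |c|,$$
so the value depends on the units in which $X$ is measured, whereas discrete entropy is unchanged by any relabelling of outcomes.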
Limit cases
- Maximum entropy. A uniform distribution over $K$ outcomes has entropy $\log K$, the largest value possible for a support of size $K$.
- Zero entropy. A distribution concentrated on a single outcome has entropy zero: there is no uncertainty.
- Maximum entropy under constraints. Among all continuous distributions with a given mean and variance, the Gaussian has maximum differential entropy $\frac{1}{2} \log(2\pi e \sigma^2)$. This is one of several principled justifications for Gaussian assumptions.
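The three limit cases above can be checked numerically. This is a rough sketch using only the standard library; the helper names, the choice $K = 8$, and the midpoint Riemann sum used to approximate the integral are all assumptions made for illustration.

```python
import math

def entropy_nats(probs):
    # Discrete entropy in nats, with 0 log 0 treated as 0.
    return sum(-p * math.log(p) for p in probs if p > 0)

K = 8
uniform = [1.0 / K] * K
point_mass = [1.0] + [0.0] * (K - 1)

def gaussian_entropy_closed_form(sigma):
    # (1/2) log(2 pi e sigma^2), in nats.
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def gaussian_entropy_numeric(sigma, n=20_000, width=12.0):
    """Midpoint Riemann sum of -p(x) log p(x) over [-width*sigma, width*sigma]."""
    lo = -width * sigma
    dx = 2 * width * sigma / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        if p > 0:
            total -= p * math.log(p) * dx
    return total

def uniform_entropy_same_variance(sigma):
    # A uniform density with variance sigma^2 has width sqrt(12) * sigma
    # and differential entropy log(width).
    return math.log(math.sqrt(12) * sigma)

print(entropy_nats(uniform), math.log(K))   # the two values should agree
print(entropy_nats(point_mass))             # no uncertainty
print(gaussian_entropy_numeric(2.0), gaussian_entropy_closed_form(2.0))
print(uniform_entropy_same_variance(2.0))   # smaller than the Gaussian value
```

The last line compares the Gaussian with one other density of equal variance; it illustrates the maximum-entropy property but does not prove it.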
Operational interpretation: optimal coding
Shannon's source coding theorem gives entropy a concrete operational meaning: $H(X)$ is the minimum average number of bits per symbol needed to encode messages from $p$ with a lossless code. Practical codes such as Huffman and arithmetic coding approach this bound. This connects information theory to data compression: the entropy of English text in bits per character (around 1.0-1.5 bits, far below the 8-bit byte representation) measures how compressible it is.
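To make the coding bound concrete, the sketch below builds binary Huffman code lengths and compares the average length with the entropy. It is a compact illustration rather than a production encoder: it tracks only code lengths, not the codewords themselves, and the function names and the two toy alphabets are assumptions.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a binary Huffman code."""
    # Heap entries: (probability, unique tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs.values() if p > 0)

def average_length(probs):
    lengths = huffman_code_lengths(probs)
    return sum(probs[s] * lengths[s] for s in probs)

dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.6, "b": 0.3, "c": 0.1}

print(entropy_bits(dyadic), average_length(dyadic))   # equal: 1.75 bits each
print(entropy_bits(skewed), average_length(skewed))   # Huffman is above the entropy here
```

When every probability is a power of 1/2 the Huffman code meets the entropy exactly; otherwise its average length lies between $H(X)$ and $H(X) + 1$ bits per symbol.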
Cross-entropy and KL divergence
Two quantities derived from entropy dominate machine learning:
The cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ measures the cost of encoding data drawn from $p$ using a code optimised for $q$. Minimising cross-entropy between the empirical distribution and the model is the standard loss for classification, equivalent to maximum-likelihood estimation under a multinomial likelihood. The "log-loss" or "logistic loss" is just binary cross-entropy.
The Kullback-Leibler divergence $D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ measures the inefficiency of using $q$ instead of $p$. The relations
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q), \qquad D_{\mathrm{KL}}(p \parallel q) \geq 0$$
show that minimising cross-entropy with $p$ fixed is equivalent to minimising KL divergence to $p$.
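The decomposition can be verified on a small example. The sketch below assumes two hand-picked distributions over three outcomes, with $q(x) > 0$ wherever $p(x) > 0$; the function names are illustrative.

```python
import math

def entropy(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Assumes q[i] > 0 wherever p[i] > 0; otherwise the value is infinite.
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]   # model distribution (illustrative)

print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))   # matches the line above up to rounding
print(kl_divergence(p, p))                # zero when the model equals p

# With a one-hot label, cross-entropy reduces to -log q(correct class),
# which is the per-example classification loss.
one_hot = [0.0, 1.0, 0.0]
print(cross_entropy(one_hot, q), -math.log2(q[1]))
```

Because $H(p)$ does not depend on the model, any change in $q$ that lowers the cross-entropy lowers the KL divergence by the same amount.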
Mutual information
The mutual information $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$ measures the reduction in uncertainty about $X$ from observing $Y$. It is symmetric, non-negative, and zero if and only if $X$ and $Y$ are independent. Mutual information underpins the information bottleneck theory of representation learning and InfoNCE contrastive losses.
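For a small joint distribution, mutual information can be computed directly. The sketch below uses the equivalent form $I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$ and assumes the joint pmf is given as a nested list `joint[x][y]`; the function name and the two example tables are illustrative.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as rows joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # 0 log 0 = 0 convention
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25],
               [0.25, 0.25]]
copy_channel = [[0.5, 0.0],
                [0.0, 0.5]]   # Y is an exact copy of a fair bit X

print(mutual_information(independent))    # 0.0: observing Y tells nothing about X
print(mutual_information(copy_channel))   # 1.0: Y reveals the full bit of X
```

Transposing the table swaps the roles of $X$ and $Y$ and leaves the result unchanged, which reflects the symmetry stated above.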
Entropy across AI
Entropy appears throughout AI:
- Classification loss. Cross-entropy between the empirical label distribution and the model's softmax output.
- Decision trees. Information gain = parent entropy minus weighted child entropy; ID3 and C4.5 split on the feature that maximises this.
- Reinforcement learning. Adding an entropy bonus $-\sum_a \pi(a) \log \pi(a)$ to the policy-gradient objective encourages exploration; maximum-entropy RL (Haarnoja et al., 2018) makes this central in SAC.
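- Variational autoencoders. The ELBO contains a KL term between the variational posterior and the prior; that term splits into the negative entropy of the posterior plus its cross-entropy with the prior, so maximising the ELBO rewards posterior entropy while pulling the posterior towards the prior.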
- Token-level uncertainty. The entropy of an LLM's next-token distribution is a calibration signal and a building block of selective generation and conformal prediction.
- Compression-based learning. Solomonoff induction, MDL, and the Hutter prize all rest on code length, and hence entropy, as a measure of model fit.
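As one concrete instance from the list above, the decision-tree criterion can be sketched in a few lines. This is a toy illustration of information gain for categorical features, not the full ID3 or C4.5 procedure; the function names and the four-row dataset are invented for the example.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    weighted_child = sum(len(g) / n * label_entropy(g) for g in groups.values())
    return label_entropy(labels) - weighted_child

# Hypothetical toy data: one feature predicts the label, the other does not.
labels   = ["yes", "yes", "no", "no"]
outlook  = ["sunny", "sunny", "rain", "rain"]
humidity = ["high", "low", "high", "low"]

print(information_gain(outlook, labels))    # 1.0 bit: the split removes all uncertainty
print(information_gain(humidity, labels))   # 0.0 bits: the split is uninformative
```

A splitter in the style described above would evaluate this quantity for every candidate feature and choose the largest, here `outlook`.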
Entropy unifies probability, coding, statistics and machine learning under a single quantitative framework. It is Shannon's most enduring contribution.
Related terms: Cross-Entropy Loss, KL Divergence, Mutual Information, Information Theory, Softmax
Discussed in:
- Chapter 4: Probability, Mathematical Foundations
- Chapter 5: Statistics, Probability and Information