Shannon entropy is the central quantity of information theory, introduced by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication*, published in the *Bell System Technical Journal*. For a discrete random variable $X$ taking values $x$ with probability $p(x)$, the entropy is
$$H(X) = -\sum_x p(x) \log_2 p(x),$$
measured in bits when the logarithm is base 2 (in nats with the natural logarithm, or in hartleys/dits with base 10). Equivalently, $H(X) = \mathbb{E}[-\log_2 p(X)]$: the expected surprise (negative log-probability) on observing $X$.
Examples
- A fair coin: $H = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$ bit.
- A biased coin with $p = (0.9, 0.1)$: $H \approx 0.469$ bits.
- A deterministic outcome: $H = 0$ (no surprise).
- A uniform distribution over $n$ outcomes: $H = \log_2 n$ (the maximum).
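A minimal NumPy sketch reproducing these numbers (the function name `entropy_bits` is ours, purely illustrative):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0 by convention
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # 1.0 bit (fair coin)
print(entropy_bits([0.9, 0.1]))       # ~0.469 bits (biased coin)
print(entropy_bits([1.0]))            # 0.0 (deterministic outcome)
print(entropy_bits([0.25] * 4))       # 2.0 bits = log2(4) (uniform over 4 outcomes)
```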
Entropy is non-negative, concave in $p$, maximised by the uniform distribution for a fixed support, and additive for independent variables: $H(X, Y) = H(X) + H(Y)$ when $X \perp Y$.
Source coding
Shannon's source-coding theorem establishes the operational meaning of $H$: it is the minimum average number of bits per symbol needed to losslessly encode an i.i.d. sequence drawn from $p$. No compression scheme can do better in expectation; Huffman codes, arithmetic codes and modern entropy coders (range coders, ANS) approach this bound. The Kraft inequality $\sum_x 2^{-\ell(x)} \leq 1$ constrains the codeword lengths $\ell(x)$ of any uniquely decodable code, formalising the trade-off between making some codewords short and others long.
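As a sketch of the bound, the Shannon code lengths $\ell(x) = \lceil -\log_2 p(x) \rceil$ always satisfy the Kraft inequality and give an average length within one bit of $H$. The snippet below (our own, illustrative) checks this for a dyadic distribution, where the bound is met exactly:

```python
import numpy as np

def shannon_code_lengths(p):
    """Codeword lengths ceil(-log2 p(x)); such lengths always satisfy the Kraft inequality."""
    p = np.asarray(p, dtype=float)
    return np.ceil(-np.log2(p)).astype(int)

p = np.array([0.5, 0.25, 0.125, 0.125])
lengths = shannon_code_lengths(p)        # [1, 2, 3, 3]
kraft_sum = np.sum(2.0 ** -lengths)      # 1.0 <= 1, so a prefix code with these lengths exists
avg_len = np.sum(p * lengths)            # 1.75 bits/symbol
H = -np.sum(p * np.log2(p))              # 1.75 bits: entropy equals average length here
print(kraft_sum, avg_len, H)
```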
Information-theoretic quantities in machine learning
- Cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ is the standard training loss for classification models and language models. Minimising cross-entropy with respect to $q$ drives $q$ toward $p$; when $p$ is the empirical data distribution, cross-entropy is the average negative log-likelihood, so minimising it is maximum-likelihood estimation (a numeric check of these quantities follows this list).
- KL divergence $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log\frac{p(x)}{q(x)}$ measures how far $q$ is from $p$. It is non-negative, zero iff $p = q$, and asymmetric. It underlies variational autoencoders (the ELBO), generative adversarial training (the JSD), and reinforcement-learning policy regularisation (PPO, TRPO).
- Mutual information $I(X; Y) = H(X) - H(X \mid Y) = D_{\mathrm{KL}}(p_{XY} \,\|\, p_X p_Y)$ measures how much one variable tells you about another; it underlies decision-tree splits (information gain), the information bottleneck framework (Tishby et al., 1999) and many feature-selection methods.
- Conditional entropy $H(X \mid Y) = H(X, Y) - H(Y)$ measures residual uncertainty in $X$ given $Y$.
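A small numeric check of these identities on toy distributions (all values illustrative):

```python
import numpy as np

log2 = np.log2

# Two distributions over the same three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * log2(p))              # entropy H(p)
H_pq = -np.sum(p * log2(q))              # cross-entropy H(p, q)
KL   =  np.sum(p * log2(p / q))          # D_KL(p || q)
print(np.isclose(KL, H_pq - H_p))        # True: KL divergence = cross-entropy - entropy

# Mutual information from a toy joint distribution p(x, y).
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = np.sum(pxy * log2(pxy / np.outer(px, py)))   # D_KL(p_XY || p_X p_Y)
H_x, H_y = -np.sum(px * log2(px)), -np.sum(py * log2(py))
H_xy = -np.sum(pxy * log2(pxy))
print(np.isclose(I, H_x + H_y - H_xy))   # True: I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
```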
Differential entropy
For continuous variables with density $p(x)$, the differential entropy is $h(X) = -\int p(x) \log p(x)\,dx$. It can be negative (a tight Gaussian has negative differential entropy) and is not invariant under change of coordinates, but it plays an analogous role: a Gaussian $\mathcal{N}(\mu, \sigma^2)$ has $h = \tfrac{1}{2} \log(2 \pi e \sigma^2)$ and maximises differential entropy among all distributions with a given variance; this is the maximum-entropy justification for Gaussian assumptions.
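A short sketch (assumed toy values, arbitrary seed) comparing the closed form with a Monte Carlo estimate of $\mathbb{E}[-\log p(X)]$, using a $\sigma$ small enough that $h < 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.1                  # a tight Gaussian: 2*pi*e*sigma^2 < 1, so h is negative

# Closed form, in nats: h = 0.5 * log(2*pi*e*sigma^2) ~ -0.88
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of E[-log p(X)] from samples of the same Gaussian.
x = rng.normal(mu, sigma, size=200_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
h_mc = -log_p.mean()

print(h_closed, h_mc)                 # both approximately -0.88 nats
```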
Modern relevance
The training objective of every modern language model, including GPT-4, Claude and Gemini, is to minimise the cross-entropy between the model's predicted next-token distribution and the empirical distribution of the training text; the model's reported "loss" is an estimate of this cross-entropy per token. Perplexity is the exponentiated entropy: $2^H$ when $H$ is measured in bits, $e^H$ in nats. Neural scaling laws (Kaplan et al., 2020; the Chinchilla study, Hoffmann et al., 2022) are stated in terms of cross-entropy loss as a function of parameters, data and compute.
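A minimal sketch of how such a loss and perplexity can be computed from raw logits, assuming a tiny vocabulary and NumPy only (the function name is ours; real frameworks provide equivalent routines):

```python
import numpy as np

def cross_entropy_and_perplexity(logits, targets):
    """Average next-token cross-entropy (in nats) and perplexity from raw logits.

    logits: (num_tokens, vocab_size) array; targets: (num_tokens,) int array of true token ids.
    """
    # Log-softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # per-token negative log-likelihood
    loss = nll.mean()                                     # the reported "loss", in nats/token
    return loss, np.exp(loss)                             # perplexity = e^loss when loss is in nats

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
targets = np.array([0, 1])
print(cross_entropy_and_perplexity(logits, targets))
```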
Related terms: Cross-Entropy Loss, KL Divergence, Mutual Information, Information Theory, Perplexity
Discussed in:
- Chapter 5: Statistics, Information Theory