Shannon entropy is the central quantity of information theory, introduced by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication*, published in the *Bell System Technical Journal*. For a discrete random variable $X$ taking values $x$ with probability $p(x)$, the entropy is
$$H(X) = -\sum_x p(x) \log_2 p(x),$$
measured in bits when the logarithm is base 2 (in nats with the natural logarithm, or in hartleys/dits with base 10). Equivalently, $H(X) = \mathbb{E}[-\log_2 p(X)]$: the expected surprise (negative log-probability) on observing $X$.
Examples
- A fair coin: $H = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$ bit.
- A biased coin with $p = (0.9, 0.1)$: $H \approx 0.469$ bits.
- A deterministic outcome: $H = 0$ (no surprise).
- A uniform distribution over $n$ outcomes: $H = \log_2 n$ (the maximum).
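A minimal NumPy sketch reproducing these numbers (the function name `entropy_bits` is ours, purely illustrative):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0 by convention
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # 1.0 bit (fair coin)
print(entropy_bits([0.9, 0.1]))       # ~0.469 bits (biased coin)
print(entropy_bits([1.0]))            # 0.0 (deterministic outcome)
print(entropy_bits([0.25] * 4))       # 2.0 bits = log2(4) (uniform over 4 outcomes)
```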
Entropy is non-negative, concave in $p$, maximised by the uniform distribution for a fixed support, and additive for independent variables: $H(X, Y) = H(X) + H(Y)$ when $X \perp Y$.
Source coding
Shannon's source-coding theorem establishes the operational meaning of $H$: it is the minimum average number of bits per symbol needed to losslessly encode an i.i.d. sequence drawn from $p$. No compression scheme can do better in expectation; Huffman codes, arithmetic codes and modern entropy coders (range coders, ANS) approach this bound. The Kraft inequality $\sum_x 2^{-\ell(x)} \leq 1$ constrains the codeword lengths $\ell(x)$ of any uniquely decodable code, formalising the trade-off between making some codewords short and others long.
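As a sketch of the bound, the Shannon code lengths $\ell(x) = \lceil -\log_2 p(x) \rceil$ always satisfy the Kraft inequality and give an average length within one bit of $H$. The snippet below (our own, illustrative) checks this for a dyadic distribution, where the bound is met exactly:

```python
import numpy as np

def shannon_code_lengths(p):
    """Codeword lengths ceil(-log2 p(x)); such lengths always satisfy the Kraft inequality."""
    p = np.asarray(p, dtype=float)
    return np.ceil(-np.log2(p)).astype(int)

p = np.array([0.5, 0.25, 0.125, 0.125])
lengths = shannon_code_lengths(p)        # [1, 2, 3, 3]
kraft_sum = np.sum(2.0 ** -lengths)      # 1.0 <= 1, so a prefix code with these lengths exists
avg_len = np.sum(p * lengths)            # 1.75 bits/symbol
H = -np.sum(p * np.log2(p))              # 1.75 bits: entropy equals average length here
print(kraft_sum, avg_len, H)
```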
Information-theoretic quantities in machine learning
- Cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ is the standard training loss for classification models and language models. Minimising cross-entropy with respect to $q$ drives $q$ toward $p$; when $p$ is the empirical data distribution, cross-entropy is the average negative log-likelihood, so minimising it is maximum-likelihood estimation (a numeric check of these quantities follows this list).
- KL divergence $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log\frac{p(x)}{q(x)}$ measures how far $q$ is from $p$. It is non-negative, zero iff $p = q$, and asymmetric. It underlies variational autoencoders (the ELBO), generative adversarial training (the JSD), and reinforcement-learning policy regularisation (PPO, TRPO).
- Mutual information $I(X; Y) = H(X) - H(X \mid Y) = D_{\mathrm{KL}}(p_{XY} \,\|\, p_X p_Y)$ measures how much one variable tells you about another; it underlies decision-tree splits (information gain), the information bottleneck framework (Tishby et al., 1999) and many feature-selection methods.
- Conditional entropy $H(X \mid Y) = H(X, Y) - H(Y)$ measures residual uncertainty in $X$ given $Y$.
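A small numeric check of these identities on toy distributions (all values illustrative):

```python
import numpy as np

log2 = np.log2

# Two distributions over the same three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * log2(p))              # entropy H(p)
H_pq = -np.sum(p * log2(q))              # cross-entropy H(p, q)
KL   =  np.sum(p * log2(p / q))          # D_KL(p || q)
print(np.isclose(KL, H_pq - H_p))        # True: KL divergence = cross-entropy - entropy

# Mutual information from a toy joint distribution p(x, y).
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = np.sum(pxy * log2(pxy / np.outer(px, py)))   # D_KL(p_XY || p_X p_Y)
H_x, H_y = -np.sum(px * log2(px)), -np.sum(py * log2(py))
H_xy = -np.sum(pxy * log2(pxy))
print(np.isclose(I, H_x + H_y - H_xy))   # True: I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
```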
Differential entropy
For continuous variables with density $p(x)$, the differential entropy is $h(X) = -\int p(x) \log p(x)\,dx$. It can be negative (a tight Gaussian has negative differential entropy) and is not invariant under change of coordinates, but it plays an analogous role: a Gaussian $\mathcal{N}(\mu, \sigma^2)$ has $h = \tfrac{1}{2} \log(2 \pi e \sigma^2)$ and maximises differential entropy among all distributions with a given variance; this is the maximum-entropy justification for Gaussian assumptions.
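A short sketch (assumed toy values, arbitrary seed) comparing the closed form with a Monte Carlo estimate of $\mathbb{E}[-\log p(X)]$, using a $\sigma$ small enough that $h < 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.1                  # a tight Gaussian: 2*pi*e*sigma^2 < 1, so h is negative

# Closed form, in nats: h = 0.5 * log(2*pi*e*sigma^2) ~ -0.88
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of E[-log p(X)] from samples of the same Gaussian.
x = rng.normal(mu, sigma, size=200_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
h_mc = -log_p.mean()

print(h_closed, h_mc)                 # both approximately -0.88 nats
```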
Modern relevance
The training objective of every modern language model, including GPT-4, Claude and Gemini, is to minimise the cross-entropy between the model's predicted next-token distribution and the empirical distribution of the training text; the model's reported "loss" is an estimate of this cross-entropy per token. Perplexity is the exponentiated entropy: $2^H$ when $H$ is measured in bits, $e^H$ in nats. Neural scaling laws (Kaplan et al., 2020; the Chinchilla study, Hoffmann et al., 2022) are stated in terms of cross-entropy loss as a function of parameters, data and compute.
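A minimal sketch of how such a loss and perplexity can be computed from raw logits, assuming a tiny vocabulary and NumPy only (the function name is ours; real frameworks provide equivalent routines):

```python
import numpy as np

def cross_entropy_and_perplexity(logits, targets):
    """Average next-token cross-entropy (in nats) and perplexity from raw logits.

    logits: (num_tokens, vocab_size) array; targets: (num_tokens,) int array of true token ids.
    """
    # Log-softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # per-token negative log-likelihood
    loss = nll.mean()                                     # the reported "loss", in nats/token
    return loss, np.exp(loss)                             # perplexity = e^loss when loss is in nats

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
targets = np.array([0, 1])
print(cross_entropy_and_perplexity(logits, targets))
```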
Related terms: Cross-Entropy Loss, KL Divergence, Mutual Information, Information Theory, Perplexity
Discussed in:
- Chapter 5: Statistics, Information Theory