Glossary

Entropy

Entropy, introduced by Claude Shannon in 1948, quantifies the average "surprise" or "uncertainty" in a probability distribution. For a discrete random variable $X$ with PMF $p$:

$$H(X) = -\sum_x p(x) \log p(x)$$

With log base 2, entropy is measured in bits; with base $e$, in nats (by convention, $0 \log 0 = 0$). A uniform distribution over $K$ outcomes has maximum entropy $\log K$ (maximum uncertainty); a distribution concentrated on a single outcome has entropy zero (no uncertainty). Among all continuous distributions with a given mean and variance, the Gaussian has maximum differential entropy—one of several justifications for Gaussian assumptions.
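The definition above translates directly into code. A minimal sketch (the function name `entropy` is our own choice, not from the source):

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Terms with p(x) = 0 contribute nothing, per the convention 0 log 0 = 0.
    return -sum(px * math.log(px, base) for px in p if px > 0)

# Uniform over 4 outcomes: maximum entropy, log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))
# Concentrated on a single outcome: zero entropy.
print(entropy([1.0, 0.0, 0.0, 0.0]))
```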

Entropy has a concrete operational interpretation: it is the average number of bits needed to encode a message from the distribution using an optimal code. This connects information theory to compression. The cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ measures the cost of encoding data from $p$ using a code optimised for $q$; minimising cross-entropy is the standard loss function for classification.
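The coding-cost interpretation can be checked numerically. The sketch below (names `cross_entropy`, `p`, `q` are illustrative) shows that encoding data from $p$ with a code optimised for a mismatched $q$ costs at least as many bits on average as the entropy of $p$ itself:

```python
import math

def cross_entropy(p, q, base=2.0):
    """Average code length (bits for base 2) when data drawn from p
    is encoded with a code optimised for q."""
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # true data distribution
q = [1/3, 1/3, 1/3]     # mismatched model

print(cross_entropy(p, q))  # cost under the wrong model: log2(3) ≈ 1.585 bits
print(cross_entropy(p, p))  # cost under the true model: H(p) = 1.5 bits
```

Note that $H(p, p) = H(p)$: cross-entropy against the true distribution recovers the entropy, which is why minimising cross-entropy drives a classifier's predicted distribution toward the data distribution.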

Entropy also relates to KL divergence via $H(p, q) = H(p) + D_{KL}(p \parallel q)$, so minimising cross-entropy with fixed $p$ is equivalent to minimising KL divergence. Entropy appears throughout AI: in decision tree splitting (information gain), in exploration bonuses for reinforcement learning (encouraging high-entropy policies), and in the entropy term of the variational autoencoder's evidence lower bound.
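The decomposition $H(p, q) = H(p) + D_{KL}(p \parallel q)$ can be verified directly. A minimal sketch, with illustrative distributions of our own choosing:

```python
import math

def H(p):
    """Entropy in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def H_cross(p, q):
    """Cross-entropy H(p, q) in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """KL divergence D_KL(p || q) in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# The identity holds: the excess cost of the wrong code is exactly the KL divergence.
assert abs(H_cross(p, q) - (H(p) + kl(p, q))) < 1e-12
```

Since $D_{KL}(p \parallel q) \ge 0$ with equality only at $q = p$, the identity also shows that cross-entropy is minimised exactly when the model matches the data distribution.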

Related terms: Cross-Entropy, KL Divergence, Information Theory

Also defined in: Textbook of AI