The mutual information between two random variables $X$ and $Y$ is
$$I(X; Y) \;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X) \;=\; \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$
It is non-negative ($I(X; Y) \geq 0$), symmetric in $X$ and $Y$, and zero exactly when $X$ and $Y$ are statistically independent. The continuous-variable analogue replaces the sum with an integral and uses differential entropies.
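As a concrete illustration of the discrete formula, here is a minimal sketch that computes $I(X; Y)$ from a joint probability table. The table values and the choice of base-2 logarithm (bits) are illustrative, not taken from the text:

```python
import numpy as np

def mutual_information(p_xy: np.ndarray) -> float:
    """I(X; Y) in bits for a joint table p_xy[i, j] = P(X = i, Y = j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = p_xy > 0                          # convention: 0 * log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

# Perfectly correlated bits: I(X; Y) = H(X) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0
# Independent bits: I(X; Y) = 0.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```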
Interpretation
Intuitively, $I(X; Y)$ measures how much knowing one variable reduces uncertainty about the other. Equivalently, it is the Kullback–Leibler divergence of the joint distribution from the product of the marginals,
$$I(X; Y) \;=\; D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, p(x)\, p(y)\bigr),$$
a direct measure of statistical dependence that, unlike Pearson correlation, captures arbitrary non-linear relationships. A correlation of zero between two variables does not imply independence, but a mutual information of zero does.
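The classic counterexample can be checked numerically: with $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, the Pearson correlation is zero but $Y$ is completely determined by $X$. The sketch below (illustrative values, reusing the `mutual_information` helper above) makes this concrete:

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X**2: uncorrelated, yet fully dependent.
x_vals = np.array([-1, 0, 1])
print(np.corrcoef(x_vals, x_vals ** 2)[0, 1])  # 0.0: zero Pearson correlation

# Joint table: rows index X in {-1, 0, 1}, columns index Y in {0, 1}.
p_xy = np.array([[0.0, 1/3],
                 [1/3, 0.0],
                 [0.0, 1/3]])
print(mutual_information(p_xy))  # ~0.918 bits: Y is a function of X, so I = H(Y)
```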
Origins
The quantity was introduced by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication* (though not yet under the name mutual information) as the central object in the analysis of noisy channels. The channel capacity
$$C \;=\; \max_{p(x)} I(X; Y)$$
is the maximum mutual information between input and output achievable by any input distribution, and Shannon's celebrated channel-coding theorem shows that $C$ is exactly the highest rate at which information can be transmitted with arbitrarily low error.
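A standard worked instance of this maximisation (an illustrative choice, not from the text) is the binary symmetric channel with crossover probability $\varepsilon$, whose capacity has the closed form $C = 1 - H_b(\varepsilon)$. The sketch below recovers it by a brute-force sweep over input distributions, again reusing the `mutual_information` helper above:

```python
import numpy as np

eps = 0.1                                    # crossover probability (illustrative)
channel = np.array([[1 - eps, eps],          # p(y | x): row is x, column is y
                    [eps, 1 - eps]])

best = 0.0
for q in np.linspace(0.001, 0.999, 999):     # q = P(X = 1)
    p_x = np.array([1 - q, q])
    p_xy = p_x[:, None] * channel            # joint p(x, y) = p(x) p(y | x)
    best = max(best, mutual_information(p_xy))

# Closed form: C = 1 - H_b(eps), achieved at the uniform input q = 0.5.
h_eps = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(best, 1 - h_eps)                       # both ~0.531 bits
```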
Applications in machine learning
Mutual information has become central to modern machine learning:
- Decision trees use it under the name information gain as the splitting criterion in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993); a worked computation appears after this list.
- Feature selection algorithms rank candidate features by their mutual information with the target.
- The information bottleneck principle (Tishby, Pereira and Bialek, 1999) frames representation learning as finding a compressed representation $T$ of $X$ that maximises $I(T; Y)$ while minimising $I(T; X)$.
- Contrastive self-supervised learning objectives such as InfoNCE, introduced with Contrastive Predictive Coding (van den Oord et al., 2018) and adopted in SimCLR and MoCo, are essentially lower bounds on the mutual information between different views of the same datum.
- Independent Component Analysis (ICA) recovers sources by minimising the mutual information between the estimated components, pushing them towards statistical independence.
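As promised above, here is a minimal information-gain computation of the kind ID3 uses to score a candidate split; the toy weather-style data and function names are illustrative, not taken from Quinlan's papers. The gain of a split is exactly the mutual information between the split feature and the label, which is also why the same quantity serves for feature ranking:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Gain = H(labels) - sum_v P(feature = v) * H(labels | feature = v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: "outlook" perfectly predicts the label, "wind" is uninformative.
outlook = ["sunny", "sunny", "rain", "rain"]
wind    = ["weak", "strong", "weak", "strong"]
label   = ["yes", "yes", "no", "no"]
print(information_gain(outlook, label))  # 1.0 bit
print(information_gain(wind, label))     # 0.0 bits
```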
Estimation
Estimating mutual information from samples in high dimensions is genuinely hard: the empirical plug-in estimator is badly biased when the support is large relative to the sample size. Modern methods include the neural estimator MINE (Belghazi et al., 2018), the k-nearest-neighbour estimator of Kraskov, Stögbauer and Grassberger (2004), and noise-contrastive bounds such as InfoNCE.
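In practice one rarely codes these from scratch; scikit-learn ships a k-nearest-neighbour estimator in the Kraskov et al. family. A quick sanity check (the correlation, sample size, and neighbour count are illustrative choices) compares it against the closed-form value for correlated Gaussians, $I(X; Y) = -\tfrac{1}{2}\log(1 - \rho^2)$ nats:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
rho, n = 0.8, 5000                          # correlation and sample size
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# kNN-based estimator (Kraskov et al. family); scikit-learn returns nats.
mi_hat = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0]
print(mi_hat, -0.5 * np.log(1 - rho**2))    # estimate vs. true ~0.511 nats
```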
Related terms: Entropy, KL Divergence, Cross-Entropy Loss
Discussed in:
- Chapter 6: ML Fundamentals, Information and Uncertainty