Glossary

Mutual Information

The mutual information between two random variables $X$ and $Y$ is

$$I(X; Y) \;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X) \;=\; \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

It is non-negative ($I(X; Y) \geq 0$), symmetric in $X$ and $Y$, and zero exactly when $X$ and $Y$ are statistically independent. The continuous-variable analogue replaces the sum with an integral and uses differential entropies.
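For discrete variables the definition translates directly into code. The following sketch (Python with NumPy; the function name `mutual_information` is illustrative, not from any library) computes $I(X; Y)$ in bits from a joint probability table:

```python
# Minimal sketch: mutual information of a discrete joint distribution,
# computed straight from the definition above (log base 2 -> bits).
import numpy as np

def mutual_information(p_xy: np.ndarray) -> float:
    """p_xy[i, j] = P(X = i, Y = j); entries must sum to 1."""
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)     # marginal of Y
    mask = p_xy > 0                           # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Two perfectly correlated fair bits share 1 bit of information ...
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0
# ... while two independent fair bits share none.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```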

Interpretation

Intuitively, $I(X; Y)$ measures how much knowing one variable reduces uncertainty about the other. Equivalently it is the Kullback–Leibler divergence of the joint distribution from the product of the marginals,

$$I(X; Y) \;=\; D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, p(x)\, p(y)\bigr),$$

a direct measure of statistical dependence that, unlike Pearson correlation, captures arbitrary non-linear relationships. A correlation of zero between two variables does not imply independence, but a mutual information of zero does.
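To make that last point concrete, here is a small numeric check (reusing the `mutual_information` helper sketched above): with $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, the Pearson correlation is exactly zero, yet the mutual information is about 0.918 bits because $Y$ is fully determined by $X$.

```python
# Zero correlation does not imply zero mutual information: Y = X^2.
import numpy as np

# Joint table: rows index X in {-1, 0, 1}, columns index Y in {0, 1}.
p_xy = np.array([[0.0, 1/3],    # X = -1  ->  Y = 1
                 [1/3, 0.0],    # X =  0  ->  Y = 0
                 [0.0, 1/3]])   # X = +1  ->  Y = 1

x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
cov = np.sum(p_xy * np.outer(x_vals - p_x @ x_vals, y_vals - p_y @ y_vals))

print(cov)                        # 0.0          -> uncorrelated
print(mutual_information(p_xy))   # ~0.918 bits  -> clearly dependent
```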

Origins

Mutual information was introduced by Claude Shannon in his 1948 paper A Mathematical Theory of Communication as the central quantity in the analysis of noisy channels. The channel capacity

$$C \;=\; \max_{p(x)} I(X; Y)$$

is the maximum mutual information between input and output achievable by any input distribution, and Shannon's celebrated channel-coding theorem shows that $C$ is exactly the highest rate at which information can be transmitted with arbitrarily low error.
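As a worked example under assumed parameters, the sketch below finds the capacity of a binary symmetric channel with crossover probability $\varepsilon = 0.1$ by sweeping the input distribution (again reusing the `mutual_information` helper); the numerical maximum matches the closed form $C = 1 - H_b(\varepsilon)$.

```python
# Sketch: capacity of a binary symmetric channel BSC(eps), found by
# maximising I(X; Y) over the input distribution P(X = 1) = q.
import numpy as np

eps = 0.1  # crossover probability (assumed value for illustration)

def bsc_joint(q: float, eps: float) -> np.ndarray:
    """Joint p(x, y) when P(X = 1) = q and each bit flips with prob eps."""
    return np.array([[(1 - q) * (1 - eps), (1 - q) * eps],
                     [q * eps,             q * (1 - eps)]])

qs = np.linspace(0.0, 1.0, 1001)
capacity = max(mutual_information(bsc_joint(q, eps)) for q in qs)
h_eps = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)

print(capacity)    # ~0.531 bits per channel use (attained at q = 0.5)
print(1 - h_eps)   # closed form agrees
```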

Applications in machine learning

Mutual information has become central to modern machine learning:

  • Decision trees use it under the name information gain as the splitting criterion in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993); a small sketch follows this list.
  • Feature selection algorithms rank candidate features by their mutual information with the target.
  • The information bottleneck principle (Tishby, Pereira and Bialek, 1999) frames representation learning as finding a compressed representation $T$ of $X$ that maximises $I(T; Y)$ while minimising $I(T; X)$.
  • Contrastive self-supervised learning objectives such as InfoNCE (van den Oord et al., 2018), used in CPC, SimCLR, and MoCo, are essentially lower bounds on the mutual information between different views of the same datum.
  • Independent Component Analysis (ICA) minimises mutual information between sources.
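As promised above, here is a small sketch of information gain, the decision-tree use of mutual information. The helper names and the toy data are illustrative only:

```python
# Information gain = I(feature; label) = H(label) - H(label | feature).
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature: np.ndarray, labels: np.ndarray) -> float:
    h_cond = sum(
        (feature == v).mean() * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - h_cond

# Toy data: "outlook" predicts "play" perfectly; a coin flip tells us nothing.
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
coin    = np.array(["h", "t", "h", "t", "h", "t"])
play    = np.array(["no", "no", "yes", "yes", "yes", "yes"])

print(information_gain(outlook, play))  # ~0.918 bits
print(information_gain(coin, play))     # 0.0 bits
```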

Estimation

Estimating mutual information from samples in high dimensions is genuinely hard: the empirical plug-in estimator is badly biased when the support is large relative to the sample size. Modern methods include the MINE neural estimator (Belghazi et al., 2018), nearest-neighbour estimators such as Kraskov–Stögbauer–Grassberger (KSG), and noise-contrastive bounds such as InfoNCE.
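As one concrete route, scikit-learn ships a KSG-style nearest-neighbour estimator; the sketch below applies it to samples from a known linear-Gaussian relationship (the data and parameter choices are assumptions for illustration, and the estimates come back in nats):

```python
# Sketch: estimating mutual information from samples with scikit-learn's
# KSG-style nearest-neighbour estimator (returns values in nats).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_dep = 2.0 * x + rng.normal(size=5000)   # strongly dependent on x
y_ind = rng.normal(size=5000)             # independent of x

mi_dep = mutual_info_regression(x.reshape(-1, 1), y_dep, n_neighbors=3)
mi_ind = mutual_info_regression(x.reshape(-1, 1), y_ind, n_neighbors=3)

print(mi_dep[0])  # ~0.8 nats (true value is -0.5 * ln(1 - rho^2) ~= 0.80)
print(mi_ind[0])  # close to 0
```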

Related terms: Entropy, KL Divergence, Cross-Entropy Loss

Discussed in:

Textbook of Usability · Textbook of Digital Health