The mutual information between two random variables $X$ and $Y$ is
$$I(X; Y) \;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X) \;=\; \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$
It is non-negative ($I(X; Y) \geq 0$), symmetric in $X$ and $Y$, and zero exactly when $X$ and $Y$ are statistically independent. The continuous-variable analogue replaces the sum with an integral and uses differential entropies.
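As a concrete illustration of the discrete formula, here is a minimal sketch that computes $I(X; Y)$ from a joint probability table. The table values and the choice of base-2 logarithm (bits) are illustrative, not taken from the text:

```python
import numpy as np

def mutual_information(p_xy: np.ndarray) -> float:
    """I(X; Y) in bits for a joint table p_xy[i, j] = P(X = i, Y = j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = p_xy > 0                          # convention: 0 * log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

# Perfectly correlated bits: I(X; Y) = H(X) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0
# Independent bits: I(X; Y) = 0.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
```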
Interpretation
Intuitively, $I(X; Y)$ measures how much knowing one variable reduces uncertainty about the other. Equivalently, it is the Kullback–Leibler divergence of the joint distribution from the product of the marginals,
$$I(X; Y) \;=\; D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, p(x)\, p(y)\bigr),$$
a direct measure of statistical dependence that, unlike Pearson correlation, captures arbitrary non-linear relationships. A correlation of zero between two variables does not imply independence, but a mutual information of zero does.
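The classic counterexample can be checked numerically: with $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, the Pearson correlation is zero but $Y$ is completely determined by $X$. The sketch below (illustrative values, reusing the `mutual_information` helper above) makes this concrete:

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X**2: uncorrelated, yet fully dependent.
x_vals = np.array([-1, 0, 1])
print(np.corrcoef(x_vals, x_vals ** 2)[0, 1])  # 0.0: zero Pearson correlation

# Joint table: rows index X in {-1, 0, 1}, columns index Y in {0, 1}.
p_xy = np.array([[0.0, 1/3],
                 [1/3, 0.0],
                 [0.0, 1/3]])
print(mutual_information(p_xy))  # ~0.918 bits: Y is a function of X, so I = H(Y)
```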
Origins
The quantity was introduced by Claude Shannon in his 1948 paper *A Mathematical Theory of Communication* (though not yet under the name mutual information) as the central object in the analysis of noisy channels. The channel capacity
$$C \;=\; \max_{p(x)} I(X; Y)$$
is the maximum mutual information between input and output achievable by any input distribution, and Shannon's celebrated channel-coding theorem shows that $C$ is exactly the highest rate at which information can be transmitted with arbitrarily low error.
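A standard worked instance of this maximisation (an illustrative choice, not from the text) is the binary symmetric channel with crossover probability $\varepsilon$, whose capacity has the closed form $C = 1 - H_b(\varepsilon)$. The sketch below recovers it by a brute-force sweep over input distributions, again reusing the `mutual_information` helper above:

```python
import numpy as np

eps = 0.1                                    # crossover probability (illustrative)
channel = np.array([[1 - eps, eps],          # p(y | x): row is x, column is y
                    [eps, 1 - eps]])

best = 0.0
for q in np.linspace(0.001, 0.999, 999):     # q = P(X = 1)
    p_x = np.array([1 - q, q])
    p_xy = p_x[:, None] * channel            # joint p(x, y) = p(x) p(y | x)
    best = max(best, mutual_information(p_xy))

# Closed form: C = 1 - H_b(eps), achieved at the uniform input q = 0.5.
h_eps = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(best, 1 - h_eps)                       # both ~0.531 bits
```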
Applications in machine learning
Mutual information has become central to modern machine learning:
- Decision trees use it under the name information gain as the splitting criterion in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993); a worked computation appears after this list.
- Feature selection algorithms rank candidate features by their mutual information with the target.
- The information bottleneck principle (Tishby, Pereira and Bialek, 1999) frames representation learning as finding a compressed representation $T$ of $X$ that maximises $I(T; Y)$ while minimising $I(T; X)$.
- Contrastive self-supervised learning objectives such as InfoNCE, introduced with Contrastive Predictive Coding (van den Oord et al., 2018) and adopted in SimCLR and MoCo, are essentially lower bounds on the mutual information between different views of the same datum.
- Independent Component Analysis (ICA) recovers sources by minimising the mutual information between the estimated components, pushing them towards statistical independence.
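As promised above, here is a minimal information-gain computation of the kind ID3 uses to score a candidate split; the toy weather-style data and function names are illustrative, not taken from Quinlan's papers. The gain of a split is exactly the mutual information between the split feature and the label, which is also why the same quantity serves for feature ranking:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Gain = H(labels) - sum_v P(feature = v) * H(labels | feature = v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: "outlook" perfectly predicts the label, "wind" is uninformative.
outlook = ["sunny", "sunny", "rain", "rain"]
wind    = ["weak", "strong", "weak", "strong"]
label   = ["yes", "yes", "no", "no"]
print(information_gain(outlook, label))  # 1.0 bit
print(information_gain(wind, label))     # 0.0 bits
```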
Estimation
Estimating mutual information from samples in high dimensions is genuinely hard: the empirical plug-in estimator is badly biased when the support is large relative to the sample size. Modern methods include the neural estimator MINE (Belghazi et al., 2018), the k-nearest-neighbour estimator of Kraskov, Stögbauer and Grassberger (2004), and noise-contrastive bounds such as InfoNCE.
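In practice one rarely codes these from scratch; scikit-learn ships a k-nearest-neighbour estimator in the Kraskov et al. family. A quick sanity check (the correlation, sample size, and neighbour count are illustrative choices) compares it against the closed-form value for correlated Gaussians, $I(X; Y) = -\tfrac{1}{2}\log(1 - \rho^2)$ nats:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
rho, n = 0.8, 5000                          # correlation and sample size
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# kNN-based estimator (Kraskov et al. family); scikit-learn returns nats.
mi_hat = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0]
print(mi_hat, -0.5 * np.log(1 - rho**2))    # estimate vs. true ~0.511 nats
```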
Related terms: Entropy, KL Divergence, Cross-Entropy Loss
Discussed in:
- Chapter 6: ML Fundamentals, Information and Uncertainty