Variance, Glossary, Textbook of AI

The variance of a random variable measures how spread out its distribution is around its mean. Formally, for a random variable $X$ with expectation $\mu = \mathbb{E}[X]$,

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - \mu^2.$$

The variance is always non-negative, and equals zero if and only if $X$ is almost surely constant. The standard deviation $\sigma_X = \sqrt{\text{Var}(X)}$ has the same units as $X$, so it is usually easier to interpret than the variance itself.
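As a small worked example, the sketch below evaluates both forms of the definition for a fair six-sided die. The die is an illustrative choice, and the names `var_definition` and `var_shortcut` are introduced here only for the example. Exact fractions are used so that the two forms can be compared without rounding error.

```python
from fractions import Fraction

# Fair six-sided die: outcomes 1..6, each with probability 1/6 (illustrative choice).
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(p * x for x in outcomes)                          # E[X]
var_definition = sum(p * (x - mean) ** 2 for x in outcomes)  # E[(X - mu)^2]
var_shortcut = sum(p * x * x for x in outcomes) - mean ** 2  # E[X^2] - mu^2

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```

Both expressions give $35/12 \approx 2.92$, so the standard deviation is about $1.71$ pips.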

Algebraic properties

Unlike expectation, variance is not linear:

$\text{Var}(aX + b) = a^2 \, \text{Var}(X)$: adding a constant $b$ shifts the mean but leaves the spread unchanged, while scaling by $a$ multiplies the variance by $a^2$.
$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$.
For independent random variables, $\text{Cov}(X, Y) = 0$, so variances add: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.

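This last property underlies the fact that the mean of $n$ i.i.d. observations, each with variance $\sigma^2$, has variance $\sigma^2 / n$: the variances of the $n$ terms add to $n\sigma^2$, and dividing the sum by $n$ scales the variance by $1/n^2$. The square root, $\sigma / \sqrt{n}$, is the standard error of the mean.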
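The identities in this section can be checked exactly on small finite distributions. The sketch below is one way to do so. The helper names `mean`, `cov` and `var` and the numbers in `x`, `y`, `a` and `b` are made up for illustration. Each list is treated as a set of equally likely outcomes, and the sample-mean result is verified by enumerating every outcome of three independent die rolls.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    """Mean of equally likely outcomes, as an exact fraction."""
    return Fraction(sum(values), len(values))

def cov(u, v):
    """Covariance of paired, equally likely outcomes (u[i], v[i])."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def var(values):
    """Variance is the covariance of a variable with itself."""
    return cov(values, values)

# Arbitrary illustrative values; x and y are deliberately not independent.
x = [1, 4, 2, 8, 5]
y = [3, 1, 4, 1, 5]
a, b = 3, 10

affine_lhs = var([a * xi + b for xi in x])
affine_rhs = a ** 2 * var(x)
sum_lhs = var([xi + yi for xi, yi in zip(x, y)])
sum_rhs = var(x) + var(y) + 2 * cov(x, y)
print(affine_lhs == affine_rhs, sum_lhs == sum_rhs)

# Mean of n independent rolls of a fair die: enumerate all 6**n outcomes.
die = list(range(1, 7))
n = 3
all_means = [mean(rolls) for rolls in product(die, repeat=n)]
print(var(die), var(all_means))  # sigma^2 and sigma^2 / n
```

Because every quantity is a `Fraction`, the comparisons are exact equalities rather than approximate floating-point checks.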

Covariance and correlation

The covariance of two random variables generalises variance to the bivariate setting:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)].$$

Its magnitude depends on the units of $X$ and $Y$, so it is usually normalised to the Pearson correlation coefficient

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1],$$

which is scale-invariant and measures linear association.
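To make the contrast between covariance and correlation concrete, the sketch below computes both for the same data in two different units. The measurements are invented for illustration and are not real data, and `cov_and_corr` is a hypothetical helper written for this example.

```python
import math

def cov_and_corr(u, v):
    """Return (population covariance, Pearson correlation) of two samples."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    sd_u = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sd_v = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov, cov / (sd_u * sd_v)

# Made-up illustrative measurements, not real data.
height_m = [1.60, 1.72, 1.81, 1.68, 1.90]
weight_kg = [58.0, 66.0, 85.0, 70.0, 90.0]
height_cm = [100.0 * h for h in height_m]  # same heights, different unit

cov_m, rho_m = cov_and_corr(height_m, weight_kg)
cov_cm, rho_cm = cov_and_corr(height_cm, weight_kg)

print(cov_m, cov_cm)  # covariance grows by a factor of 100
print(rho_m, rho_cm)  # correlation is unchanged (up to rounding)
```

Changing metres to centimetres multiplies the covariance by 100 but leaves $\rho$ untouched, because the same factor appears in $\sigma_X$ in the denominator.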

For a vector-valued random variable $\mathbf{X} = (X_1, \ldots, X_d)$, the covariance matrix is the symmetric positive semi-definite matrix $\boldsymbol\Sigma$ with entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. It encodes all pairwise linear dependencies and is fundamental to:

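Principal component analysis (PCA): the leading eigenvectors of $\boldsymbol\Sigma$ give the orthogonal directions of maximum variance, and the corresponding eigenvalues give the variance along each direction.
Multivariate Gaussian models: $\boldsymbol\Sigma$ parametrises the shape and orientation of the elliptical density contours.
Whitening transformations: when $\boldsymbol\Sigma$ is invertible, applying $\boldsymbol\Sigma^{-1/2}$ to centred data rotates and scales it so that its covariance becomes the identity (isotropic unit variance).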
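The PCA and whitening items can be sketched together in a few lines, assuming NumPy is available. The covariance `true_cov`, the sample size and the random seed are arbitrary choices for illustration. The symmetric inverse square root built from the eigendecomposition is one of several valid whitening matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary positive-definite covariance used to generate synthetic data.
true_cov = np.array([[4.0, 1.5],
                     [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=5000)

# Centre the data and estimate the covariance matrix (rows are observations).
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# eigh is for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
first_pc = eigvecs[:, -1]  # direction of maximum variance

# Symmetric inverse square root: V diag(lambda^{-1/2}) V^T.
Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
Z = Xc @ Sigma_inv_sqrt
whitened_cov = np.cov(Z, rowvar=False)

print(np.round(Sigma, 3))
print(np.round(whitened_cov, 3))  # identity matrix up to rounding
```

The variance of the data projected onto `first_pc` equals the largest eigenvalue, which is the sense in which that eigenvector is the first principal component.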

Variance in machine learning

Variance is central to the bias–variance decomposition of expected prediction error:

$$\mathbb{E}[(y - \hat f(x))^2] = \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat f(x))}_{\text{variance}} + \sigma^2_{\text{noise}}.$$

Here $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2_{\text{noise}}$, and the expectation is taken over both the noise and the random draw of the training set. The first term measures how far the average prediction is from the truth; the second measures how much the prediction wobbles when the training set changes; the third is the irreducible noise. A model with high prediction variance (for example, an overgrown decision tree or a deep network with too few training examples) is unstable across resamples of the training data and is likely to overfit.
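The decomposition can be checked numerically with a small simulation. The sketch below is a toy setup chosen for illustration, not taken from the text: the true function `f`, the noise level `SIGMA`, the query point `X0` and the two predictors (a constant predictor and a 1-nearest-neighbour predictor) are all assumptions. It draws many training sets, records each predictor's output at `X0`, and compares the Monte Carlo estimate of the expected squared error with the sum of the three terms.

```python
import math
import random
import statistics

random.seed(0)

def f(x):
    """Assumed true regression function for this toy example."""
    return math.sin(2 * math.pi * x)

SIGMA, N_TRAIN, TRIALS, X0 = 0.3, 30, 5000, 0.25

def draw_training_set():
    xs = [random.random() for _ in range(N_TRAIN)]
    ys = [f(x) + random.gauss(0.0, SIGMA) for x in xs]
    return xs, ys

def predict_constant(xs, ys, x0):
    """Rigid model: ignore x0 and predict the mean training target."""
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    """Flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[nearest]

def decompose(predict):
    """Estimate (bias^2, variance, expected squared error) at X0."""
    preds, sq_errors = [], []
    for _ in range(TRIALS):
        xs, ys = draw_training_set()
        pred = predict(xs, ys, X0)
        y0 = f(X0) + random.gauss(0.0, SIGMA)  # fresh noisy test target
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance, statistics.fmean(sq_errors)

results = {
    "constant": decompose(predict_constant),
    "1-NN": decompose(predict_1nn),
}
for name, (bias_sq, variance, mse) in results.items():
    total = bias_sq + variance + SIGMA ** 2
    print(f"{name}: bias^2={bias_sq:.3f} var={variance:.3f} "
          f"sum={total:.3f} mse={mse:.3f}")
```

In this setup the constant predictor is dominated by bias and the 1-nearest-neighbour predictor by variance, while in both cases the three terms add up to the simulated error within Monte Carlo noise.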

Many machine-learning methods can be understood as variance-reduction techniques:

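Bagging averages predictions across bootstrap-resampled training sets, reducing variance while leaving bias roughly unchanged.
Random forests add feature subsampling to further decorrelate trees, which makes the averaging more effective.
Cross-validation does not reduce variance itself; it estimates out-of-sample prediction error by holding out test folds, which exposes high-variance models and guides the choice of regularisation strength.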
L2 regularisation (weight decay) shrinks parameter estimates, trading bias for reduced variance.
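Why averaging and decorrelation help can be seen in an idealised model, sketched below. It is not an implementation of bagging or random forests: it assumes each ensemble member's prediction is a unit-variance Gaussian and that every pair of members has the same correlation `rho`. Under that assumption the variance of the average of $B$ members is $\rho + (1 - \rho)/B$, so adding members only removes the uncorrelated part. The function names are hypothetical.

```python
import math
import random
import statistics

random.seed(0)

def simulated_variance(rho, n_members, trials=20_000):
    """Monte Carlo variance of the average of equicorrelated predictions."""
    averages = []
    for _ in range(trials):
        shared = random.gauss(0.0, 1.0)  # component common to all members
        members = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * random.gauss(0.0, 1.0)
            for _ in range(n_members)
        ]
        averages.append(statistics.fmean(members))
    return statistics.pvariance(averages)

def theoretical_variance(rho, n_members):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B with sigma^2 = 1."""
    return rho + (1.0 - rho) / n_members

for rho in (0.8, 0.2):
    for n_members in (1, 10):
        sim = simulated_variance(rho, n_members)
        theory = theoretical_variance(rho, n_members)
        print(f"rho={rho} B={n_members}: simulated={sim:.3f} theory={theory:.3f}")
```

With ten members, strongly correlated predictions (`rho=0.8`) keep most of their variance, whereas weakly correlated ones (`rho=0.2`) lose most of it, which is the motivation for decorrelating trees.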

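The 68-95-99.7 rule. For a Gaussian, about 68%, 95% and 99.7% of the probability lies within one, two and three standard deviations of the mean respectively; the tails fall off so fast that three standard deviations cover virtually all the probability.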
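The three percentages follow from the Gaussian cumulative distribution function: for any mean and standard deviation, $P(|X - \mu| \le k\sigma) = \operatorname{erf}(k / \sqrt{2})$. A minimal check using only the standard library (the helper name `prob_within` is introduced here for illustration):

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a Gaussian X, via the error function."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# prints 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3
```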

Discussed in:

Chapter 5: Statistics, Probability and statistics
