The variance of a random variable measures how spread out its distribution is around its mean. Formally, for a random variable $X$ with expectation $\mu = \mathbb{E}[X]$,
$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - \mu^2.$$
The variance is always non-negative, and equals zero if and only if $X$ is almost surely constant. The standard deviation $\sigma_X = \sqrt{\text{Var}(X)}$ shares the units of $X$ and is generally preferred for human interpretation.
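As a small worked check of the two equivalent formulas above, the sketch below computes the variance of a fair six-sided die both ways. The die is an illustrative choice, not part of the definition, and the variable names are arbitrary.

```python
import numpy as np

# Illustrative example: a fair six-sided die.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mu = np.sum(probs * values)                              # E[X] = 3.5
var_definition = np.sum(probs * (values - mu) ** 2)      # E[(X - mu)^2]
var_shortcut = np.sum(probs * values ** 2) - mu ** 2     # E[X^2] - mu^2
sigma = np.sqrt(var_definition)                          # standard deviation

# Both formulas give 35/12 (about 2.92), up to floating-point rounding.
print(mu, var_definition, var_shortcut, sigma)
```

The shortcut form $\mathbb{E}[X^2] - \mu^2$ subtracts two potentially large numbers, so the definitional form is usually the safer one numerically when the mean is large relative to the spread.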
Algebraic properties
Unlike expectation, variance is not linear:
- $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$: an additive constant shifts the mean but does not affect the spread, while a multiplicative factor $a$ scales the variance by $a^2$.
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$.
- For independent random variables, $\text{Cov}(X, Y) = 0$, so variances add: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
This last property underlies the fact that the variance of a sample mean of $n$ i.i.d. observations is $\sigma^2 / n$; its square root, $\sigma / \sqrt{n}$, is the standard error of the mean.
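The properties above can be checked numerically. The following is a minimal sketch using simulated Gaussian data; the distributions, constants, and sample sizes are arbitrary assumptions chosen for illustration. The first two identities hold exactly for sample moments (up to rounding), while the independence and sample-mean results hold only approximately in a finite simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Two independent samples (illustrative parameters).
x = rng.normal(loc=1.0, scale=2.0, size=n_samples)    # Var(X) = 4
y = rng.normal(loc=-3.0, scale=1.5, size=n_samples)   # Var(Y) = 2.25

var_x, var_y = np.var(x), np.var(y)
cov_xy = np.cov(x, y, ddof=0)[0, 1]   # population-style covariance, matching np.var

# Var(aX + b) = a^2 Var(X): the shift b drops out.
a, b = 3.0, 10.0
var_affine = np.var(a * x + b)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
var_sum = np.var(x + y)

# Variance of the mean of n i.i.d. draws is sigma^2 / n.
sigma, n, n_repeats = 2.0, 25, 20_000
sample_means = rng.normal(0.0, sigma, size=(n_repeats, n)).mean(axis=1)
var_of_mean = np.var(sample_means)

print(var_affine, a**2 * var_x)
print(var_sum, var_x + var_y + 2 * cov_xy)
print(cov_xy)                        # close to 0 because x and y are independent
print(var_of_mean, sigma**2 / n)     # both close to 0.16
print(np.std(sample_means), sigma / np.sqrt(n))  # standard error, close to 0.4
```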
Covariance and correlation
The covariance of two random variables generalises variance to the bivariate setting:
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)].$$
Its magnitude depends on the units of $X$ and $Y$, so it is usually normalised to the Pearson correlation coefficient
$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1],$$
which is scale-invariant and measures linear association.
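To illustrate the units dependence, the sketch below computes covariance and correlation for synthetic height and weight data, once with height in metres and once in centimetres. The data-generating numbers and the helper names `covariance` and `correlation` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical numbers, for illustration only).
height_m = rng.normal(1.70, 0.10, size=5_000)
weight_kg = 70 + 50 * (height_m - 1.70) + rng.normal(0, 8, size=5_000)

def covariance(u, v):
    """Sample analogue of E[(U - mu_U)(V - mu_V)]."""
    return np.mean((u - u.mean()) * (v - v.mean()))

def correlation(u, v):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(u, v) / (u.std() * v.std())

height_cm = 100 * height_m

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)    # 100 times larger than cov_m
rho_m = correlation(height_m, weight_kg)
rho_cm = correlation(height_cm, weight_kg)   # unchanged by the change of units

print(cov_m, cov_cm)
print(rho_m, rho_cm)
```

The hand-written `correlation` should agree with `np.corrcoef(height_m, weight_kg)[0, 1]`, which is the more idiomatic call in practice.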
For a vector-valued random variable $\mathbf{X} = (X_1, \ldots, X_d)$, the covariance matrix is the symmetric positive semi-definite matrix $\boldsymbol\Sigma$ with entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. It encodes all pairwise linear dependencies and is fundamental to:
- Principal component analysis (PCA): the eigenvectors of $\boldsymbol\Sigma$ with the largest eigenvalues give the directions of maximum variance.
- Multivariate Gaussian models: $\boldsymbol\Sigma$ parametrises the elliptical density.
- Whitening transformations: applying $\boldsymbol\Sigma^{-1/2}$ to centred data (assuming $\boldsymbol\Sigma$ is invertible) rotates and rescales it so that its covariance is the identity.
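The PCA and whitening uses of the covariance matrix can be sketched in a few lines. The example below assumes two-dimensional Gaussian data with an arbitrarily chosen covariance, estimates $\boldsymbol\Sigma$ from the sample, and uses its eigendecomposition for both tasks; the symmetric inverse square root used here is one common choice of whitening matrix, not the only one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 2-D Gaussian data with a known (assumed) covariance.
true_cov = np.array([[4.0, 1.5],
                     [1.5, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=10_000)

# Sample covariance matrix of the centred data.
centered = data - data.mean(axis=0)
sigma_hat = centered.T @ centered / len(centered)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PCA: the leading eigenvector is the direction of maximum variance, and the
# variance of the data projected onto it equals the largest eigenvalue.
first_pc = eigvecs[:, 0]
proj_var = np.var(centered @ first_pc)

# Whitening with the symmetric inverse square root Sigma^{-1/2}.
inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
whitened = centered @ inv_sqrt
whitened_cov = whitened.T @ whitened / len(whitened)

print(proj_var, eigvals[0])
print(np.round(whitened_cov, 6))   # approximately the 2x2 identity
```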
Variance in machine learning
Variance is central to the bias–variance decomposition of expected prediction error:
$$\mathbb{E}[(y - \hat f(x))^2] = \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat f(x))}_{\text{variance}} + \sigma^2_{\text{noise}}.$$
The first term measures how far the average prediction is from the truth; the second measures how much the prediction varies when the training set changes; the third is the irreducible noise. A model with high prediction variance, such as an overgrown decision tree or a deep network trained on too few examples, is unstable across resamples of the training data and is likely to overfit.
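The decomposition can be estimated by simulation. The sketch below is one possible setup under stated assumptions: a sine target, Gaussian noise, a fixed grid of training inputs (so only the noisy responses change between training sets), and polynomial least-squares fits of two degrees evaluated at a single point `x0`. The function name `decomposition_at_x0` and all constants are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def f(x):
    """Assumed true regression function."""
    return np.sin(2 * np.pi * x)

noise_sd = 0.3
n_train, n_repeats, x0 = 30, 2_000, 0.25
x_train = np.linspace(0.0, 1.0, n_train)   # fixed design: only the noise is resampled

def decomposition_at_x0(degree):
    """Monte Carlo estimates of bias^2, variance and expected squared error at x0."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0.0, noise_sd, n_train)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[r] = model(x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    # Fresh noisy responses at x0, independent of the training data.
    y_test = f(x0) + rng.normal(0.0, noise_sd, n_repeats)
    mse = np.mean((y_test - preds) ** 2)
    return bias_sq, variance, mse

for degree in (1, 9):
    bias_sq, variance, mse = decomposition_at_x0(degree)
    total = bias_sq + variance + noise_sd**2
    print(f"degree {degree}: bias^2={bias_sq:.4f} variance={variance:.4f} "
          f"sum with noise={total:.4f} mse={mse:.4f}")
```

In this setup the straight line should show large bias and small variance, and the degree-9 fit the reverse; the sum of the three terms should match the estimated squared error up to Monte Carlo error.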
Many machine-learning methods can be understood as variance-reduction techniques:
- Bagging averages predictions across bootstrap-resampled training sets, reducing variance while leaving bias roughly unchanged.
- Random forests add feature subsampling to further decorrelate trees.
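- Cross-validation does not reduce variance itself, but estimates out-of-sample error honestly by holding out test folds; the spread of the fold scores gives a rough indication of how variable a model's performance is.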
- L2 regularisation (weight decay) shrinks parameter estimates, trading bias for reduced variance.
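Why averaging helps, and why decorrelating the averaged models helps further, follows from the formula for the variance of a sum: the average of $B$ estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$, has variance $\rho\sigma^2 + (1-\rho)\sigma^2/B$. The sketch below checks this with simulated equicorrelated Gaussian "predictions"; it is a stylised stand-in for an ensemble, not an implementation of bagging or random forests, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulated_variance_of_average(rho, n_estimators=50, sigma=1.0, n_repeats=20_000):
    """Variance of the average of equicorrelated estimators, by simulation."""
    # A shared component induces pairwise correlation rho between estimators.
    shared = rng.normal(size=(n_repeats, 1))
    individual = rng.normal(size=(n_repeats, n_estimators))
    estimates = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * individual)
    return estimates.mean(axis=1).var()

def theoretical_variance_of_average(rho, n_estimators=50, sigma=1.0):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B."""
    return rho * sigma**2 + (1 - rho) * sigma**2 / n_estimators

for rho in (0.0, 0.3, 0.8):
    print(rho,
          simulated_variance_of_average(rho),
          theoretical_variance_of_average(rho))
```

As $B$ grows the second term vanishes but the first does not, so the correlation between ensemble members sets a floor on the variance of the average. This is the motivation for decorrelating trees through feature subsampling.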
Related terms: Expectation, Probability Distribution, Bias-Variance Tradeoff, Principal Component Analysis
Discussed in:
- Chapter 5: Statistics, Probability and statistics