4.7 Expectation, variance, covariance

A probability distribution is a complete object. It tells you, in principle, everything there is to know about a random quantity: how likely each possible value is, how the values cluster, where the rare events live. But complete information is unwieldy. When you train a model, debug a loss curve, or read a paper, you rarely want the whole distribution. You want a few well-chosen numbers that summarise it. The three most important summaries are the expectation, the variance, and the covariance. Together they answer three questions: where is the centre, how spread out is it, and how do two quantities move together?

These quantities are not abstract conveniences. They appear in every loss function (the mean squared error is an expectation), in every regulariser (variance penalties tame overfitting), and in every analysis of training dynamics (gradient noise has a mean and a covariance). If you understand expectation, variance, and covariance properly, half of the formal machinery of machine learning becomes transparent.

This section extracts numerical summaries from the distributions of §§4.4–4.6. These summaries are the basic vocabulary of every later chapter.

A useful way to picture what we are about to do is to imagine a probability distribution as a landscape and the summary statistics as the small set of measurements a surveyor would take. The expectation is the location of the peak's centre of mass; the variance is how broadly the landscape spreads on either side; the covariance describes how two such landscapes drift in tandem when laid side by side. None of these numbers reproduces the full topography, but together they capture the features you need most often, and they are the only features that survive the algebraic shuffling that goes on inside loss functions, gradient updates, and Bayesian posteriors.

Symbols Used Here
  • $X, Y$: random variables
  • $\mathbb{E}[X]$: expectation
  • $\text{Var}(X)$: variance
  • $\sigma$: standard deviation, $\sqrt{\text{Var}(X)}$
  • $\text{Cov}(X, Y)$: covariance
  • $\rho_{X,Y}$: correlation coefficient
  • $f(X)$: function of $X$

Expectation

The expectation of a random variable, written $\mathbb{E}[X]$, is its long-run average. If you sampled $X$ a million times and took the arithmetic mean of the samples, you would get a number very close to $\mathbb{E}[X]$. As the number of samples grows without bound, the agreement becomes exact. This is why the expectation is also called the mean: it is the value around which observed averages settle.

For a discrete random variable taking values $x$ with probability $P(X = x)$, the expectation is a weighted sum:

$$ \mathbb{E}[X] = \sum_x x \, P(X = x). $$

Each possible outcome contributes its value, weighted by how often it occurs. For a continuous random variable with density $p(x)$, sums become integrals:

$$ \mathbb{E}[X] = \int x \, p(x) \, dx. $$

Geometrically, the expectation is the centre of mass of the distribution. If you imagined the probability density as a thin sheet of metal of varying thickness, the expectation is the point at which the sheet would balance on a knife edge.

Worked example: the fair die. Let $X$ be the outcome of rolling a fair six-sided die, so $X$ is uniform on $\{1, 2, 3, 4, 5, 6\}$ and each value has probability $1/6$. Then

$$ \mathbb{E}[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5. $$

Notice that 3.5 is not a value the die can ever produce. The expectation is a property of the distribution, not a particular outcome. A long run of rolls will average out near 3.5, but no single roll lands there.
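
This is easy to check numerically. The sketch below (a minimal Python/NumPy example; the seed and variable names are illustrative) computes the exact weighted sum and compares it with the average of a large number of simulated rolls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact expectation: each value weighted by its probability.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
exact_mean = np.sum(values * probs)           # 3.5

# Monte Carlo estimate: the arithmetic mean of many simulated rolls.
rolls = rng.integers(1, 7, size=1_000_000)    # uniform on {1, ..., 6}
print(exact_mean, rolls.mean())               # 3.5 and something very close to it
```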

Linearity of expectation. One of the deepest and most useful facts in all of probability is that the expectation operator is linear:

$$ \mathbb{E}[aX + bY + c] = a \, \mathbb{E}[X] + b \, \mathbb{E}[Y] + c. $$

Constants pull out, sums distribute. No independence assumption is required. This holds even when $X$ and $Y$ are heavily correlated, even when they are deterministic functions of one another. Linearity is what lets us decompose complicated expectations into sums of simple ones, and it is the workhorse of nearly every calculation in this chapter.
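
The "no independence required" part is worth seeing with your own eyes. In the sketch below (illustrative constants $a$, $b$, $c$), $Y$ is a deterministic function of $X$, about as far from independent as two variables can be, yet the identity still holds up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = 2.0, -3.0, 5.0                      # arbitrary illustrative constants

x = rng.uniform(0, 1, size=1_000_000)
y = x ** 2                                    # y is a deterministic function of x

lhs = np.mean(a * x + b * y + c)              # estimate of E[aX + bY + c]
rhs = a * x.mean() + b * y.mean() + c         # a E[X] + b E[Y] + c, estimated separately
print(lhs, rhs)                               # agree up to Monte Carlo noise
```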

Expectation of a function. If $f$ is any function, the expectation of $f(X)$ is computed by weighting the outputs of $f$ by the probabilities of the inputs (in the continuous case, the sum below becomes an integral against the density):

$$ \mathbb{E}[f(X)] = \sum_x f(x) \, P(X = x). $$

For the fair die,

$$ \mathbb{E}[X^2] = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17. $$

Note that $\mathbb{E}[X^2] \neq (\mathbb{E}[X])^2$: squaring is non-linear, and squaring before averaging is not the same as averaging before squaring. We will exploit this gap in a moment to define variance.
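
In code, the gap is a one-line check (a minimal sketch for the fair die):

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

e_x = np.sum(values * probs)                  # E[X] = 3.5
e_x2 = np.sum(values ** 2 * probs)            # E[X^2] = 91/6 ≈ 15.17
print(e_x2, e_x ** 2)                         # 15.17 versus 12.25: not equal
```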

Variance

The expectation tells you where the distribution sits. It says nothing about how spread out it is. Two distributions can share the same mean and look completely different: one might be sharply concentrated at the mean, the other diffuse across a wide range. We need a second number to capture spread. That number is the variance.

The variance is the expected squared deviation from the mean:

$$ \text{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2. $$

The two forms are algebraically identical; the right-hand form, often called the computational formula, is usually easier in practice because it requires only $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$.

Squaring the deviation does two things. It makes every contribution non-negative, so positive and negative deviations do not cancel. And it punishes large deviations disproportionately, so the variance is sensitive to outliers. Because the squared deviation has units of $X^2$, we often quote the standard deviation

$$ \sigma_X = \sqrt{\text{Var}(X)} $$

instead, which has the same units as $X$ itself and is directly comparable to the mean.

Worked example: the fair die. We already computed $\mathbb{E}[X] = 3.5$ and $\mathbb{E}[X^2] \approx 15.17$. So

$$ \text{Var}(X) = \frac{91}{6} - (3.5)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92, $$

and the standard deviation is $\sigma_X = \sqrt{35/12} \approx 1.71$. A typical roll lands within about 1.7 of the mean of 3.5: the faces 2 to 5 sit within one standard deviation of the centre, while the extremes 1 and 6 lie just outside it.
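
The same two weighted sums give the variance by either formula. A short sketch, continuing the die example, confirms that the definitional and computational forms agree:

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

e_x = np.sum(values * probs)                             # E[X]
e_x2 = np.sum(values ** 2 * probs)                       # E[X^2]

var_computational = e_x2 - e_x ** 2                      # E[X^2] - (E[X])^2
var_definitional = np.sum((values - e_x) ** 2 * probs)   # E[(X - E[X])^2]
print(var_computational, var_definitional, np.sqrt(var_computational))
# 2.9167  2.9167  1.7078
```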

Properties of variance. Variance behaves quite differently from expectation under linear transformations:

  • $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$. Adding a constant shifts the distribution but does not change its spread, so $b$ disappears. Multiplying by a constant scales every deviation by $a$, and squaring sends $a$ to $a^2$.
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$ in general.
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ when $X$ and $Y$ are independent.

The independence case explains why averaging reduces noise: the variance of the mean of $n$ independent copies of $X$ is $\text{Var}(X)/n$, so the standard deviation falls as $1/\sqrt n$. This $1/\sqrt n$ rate governs everything from Monte Carlo estimation to mini-batch gradient descent. It is the reason large batch sizes give smoother gradient estimates, and it is the reason that quadrupling your sample size only halves the noise, a fact that has cost machine-learning practitioners many compute hours of disappointment.
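
The $1/\sqrt n$ rate is easy to watch in a simulation. The sketch below (arbitrary batch sizes, illustrative seed) averages $n$ independent die rolls, repeats that experiment many times, and compares the empirical spread of the sample means with the predicted $\sigma/\sqrt n$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.sqrt(35 / 12)                       # standard deviation of one fair-die roll

for n in [1, 4, 16, 64, 256]:
    # 10,000 independent sample means, each averaging n rolls.
    means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))  # empirical spread vs predicted sigma / sqrt(n)
```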

Covariance and correlation

Expectation summarises one variable. Variance summarises one variable's spread. To describe how two variables move together, we need a third quantity: the covariance.

$$ \text{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y], $$

where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$. The covariance is positive when $X$ and $Y$ tend to be on the same side of their means together; negative when they tend to be on opposite sides; and zero when there is no linear association.

Properties.

  • $\text{Cov}(X, X) = \text{Var}(X)$. Variance is the special case of covariance with itself.
  • $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$. Constants $b$ and $d$ shift the means but not the joint behaviour, so they vanish; multiplicative constants pull out.
  • If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$. The converse is false: zero covariance does not imply independence. A classic counter-example is $X \sim \mathcal{N}(0, 1)$ with $Y = X^2$, which gives $\text{Cov}(X, Y) = 0$ even though $Y$ is a deterministic function of $X$. Covariance captures only the linear part of dependence; non-linear association can hide from it entirely. The sketch below verifies the counter-example numerically.
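
A minimal simulation of the counter-example (illustrative seed and sample size): draw $X$ from a standard normal, set $Y = X^2$, and compute the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.standard_normal(1_000_000)            # X ~ N(0, 1)
y = x ** 2                                    # Y is a deterministic function of X

# Sample covariance is near zero despite total dependence.
cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(cov_xy)                                 # ≈ 0, up to Monte Carlo noise
```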

Correlation. Covariance has awkward units: if $X$ is a height in metres and $Y$ a weight in kilograms, then $\text{Cov}(X, Y)$ is in kilogram-metres, which tells you nothing about how strong the relationship is. The fix is to standardise. The Pearson correlation coefficient divides covariance by the product of standard deviations:

$$ \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \in [-1, 1]. $$

It is dimensionless and bounded. A correlation of $+1$ means $Y$ is an exact increasing linear function of $X$; $-1$ means an exact decreasing linear function; $0$ means no linear association at all.

Worked example: spam features. In a spam classifier, let $X$ be the count of the word "free" in an email and $Y$ the count of "click". Across a corpus of marketing emails we might find $\rho_{X,Y} \approx 0.4$. Both words are weak indicators of spam, and emails that contain one tend to contain the other, but neither uniquely determines the other. The positive correlation tells you the two features carry overlapping information, which matters when designing a classifier: you cannot treat them as independent evidence. A naive Bayes model that assumes independence will systematically over-count the joint signal; a logistic regression with both features as inputs will instead spread the weight between them. Knowing that $\rho > 0$ is what allows you to predict, before training, which model will be miscalibrated and which will not.
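
There is no real corpus behind the numbers above, but the calculation itself is routine. The sketch below fabricates correlated word counts from a shared latent "marketing intensity" (every name, distribution, and parameter here is an illustrative assumption, not real data) and computes the Pearson correlation from its definition.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic corpus: a shared latent intensity induces correlation between the two counts.
intensity = rng.gamma(shape=2.0, scale=1.0, size=50_000)
free_count = rng.poisson(0.5 + 0.5 * intensity)    # occurrences of "free"
click_count = rng.poisson(0.5 + 0.5 * intensity)   # occurrences of "click"

cov = np.mean(free_count * click_count) - free_count.mean() * click_count.mean()
rho = cov / (free_count.std() * click_count.std())
print(rho)                                         # moderately positive, well below 1
# np.corrcoef(free_count, click_count)[0, 1] gives essentially the same number.
```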

The covariance matrix

Real models do not deal with one or two variables; they deal with hundreds, thousands, sometimes billions. For a vector-valued random variable $\mathbf{X} = (X_1, \ldots, X_d)$, the natural extension of variance is the covariance matrix, a $d \times d$ object $\boldsymbol{\Sigma}$ with entries

$$ \Sigma_{ij} = \text{Cov}(X_i, X_j). $$

The diagonal entries $\Sigma_{ii} = \text{Var}(X_i)$ record each variable's spread. The off-diagonal entries $\Sigma_{ij}$ record how each pair varies together.

Two structural facts about $\boldsymbol{\Sigma}$ are essential. First, it is symmetric: $\Sigma_{ij} = \Sigma_{ji}$ because covariance does not care about argument order. Second, it is positive-semi-definite: $\mathbf{a}^\top \boldsymbol{\Sigma} \mathbf{a} \geq 0$ for every vector $\mathbf{a}$, because that quadratic form equals the variance of the linear combination $\mathbf{a}^\top \mathbf{X}$, and variance cannot be negative.

These two properties unlock the spectral toolkit from Chapter 2. A symmetric positive-semi-definite matrix has real, non-negative eigenvalues and an orthogonal basis of eigenvectors. Applied to the covariance matrix, the eigenvectors are the principal directions of variation in the data, and the eigenvalues are the variances along each direction. This is exactly the construction behind principal component analysis (§2.8): PCA diagonalises the covariance matrix and keeps the directions of largest variance. The covariance matrix is also the parameter that shapes the multivariate Gaussian (§4.10), determines the geometry of whitening transforms, and controls the curvature term in natural-gradient optimisation.
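
All of this is easy to inspect on data. The sketch below (synthetic three-dimensional data with an arbitrary mixing matrix) estimates a covariance matrix, checks symmetry, evaluates the quadratic form $\mathbf{a}^\top \boldsymbol{\Sigma} \mathbf{a}$ against the variance of $\mathbf{a}^\top \mathbf{X}$, and reads off the principal directions from the eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: 3 correlated features, 10,000 samples, arbitrary mixing matrix A.
A = rng.standard_normal((3, 3))
X = rng.standard_normal((10_000, 3)) @ A.T

Sigma = np.cov(X, rowvar=False)                # d x d sample covariance matrix

print(np.allclose(Sigma, Sigma.T))             # symmetric: True
a = rng.standard_normal(3)
print(a @ Sigma @ a, (X @ a).var(ddof=1))      # the quadratic form equals Var(a^T X) >= 0

eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh exploits the symmetry
print(eigvals)                                 # real and non-negative
# Columns of eigvecs are the principal directions; eigvals are the variances along them.
```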

Conditional expectation

Sometimes you do not want the global average of $X$; you want the average given that you have observed $Y$. This is the conditional expectation $\mathbb{E}[X \mid Y]$. Crucially, $\mathbb{E}[X \mid Y]$ is itself a random variable: it is a function of $Y$, and so its value depends on which value of $Y$ happens to occur.

The fundamental identity is the law of total expectation, also called the tower rule:

$$ \mathbb{E}\!\left[\mathbb{E}[X \mid Y]\right] = \mathbb{E}[X]. $$

Reading from the inside out: first compute the conditional mean of $X$ given each value of $Y$; then average those conditional means over the distribution of $Y$; you recover the unconditional mean. The tower rule lets you split a hard expectation into an inner conditioning step and an outer averaging step, which is often vastly easier than tackling the joint distribution directly.
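
A simulation makes the two-step reading concrete. In the sketch below (an arbitrary illustrative joint distribution), $Y$ is a binary group label and $X \mid Y = y \sim \mathcal{N}(2y, 1)$; averaging the per-group conditional means, weighted by how often each group occurs, recovers the unconditional mean.

```python
import numpy as np

rng = np.random.default_rng(6)

# Joint sample: Y is a binary group label, X depends on Y.
y = rng.binomial(1, 0.3, size=1_000_000)
x = rng.normal(loc=2.0 * y, scale=1.0)        # X | Y = y  ~  N(2y, 1)

# Inner step: the conditional mean of X within each value of Y (a function of Y).
cond_mean = np.array([x[y == 0].mean(), x[y == 1].mean()])

# Outer step: average the conditional means over the distribution of Y.
p_y = np.array([np.mean(y == 0), np.mean(y == 1)])
print(np.sum(cond_mean * p_y), x.mean())      # both ≈ 0.6 = E[X]
```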

Conditional expectation is everywhere in machine learning. In dynamic programming, the value function of a state is the expected return conditional on starting there. In reinforcement learning, the action-value function $Q(s, a) = \mathbb{E}[\text{return} \mid s, a]$ is exactly a conditional expectation, and the Bellman equation is a recursive application of the tower rule. In the EM algorithm, the E-step computes the conditional expectation of the log-likelihood with respect to the latent variables, and the M-step maximises it. Whenever a problem has hidden structure that you cannot observe directly, conditional expectation is the tool that lets you reason about it.

Where these appear in AI

The summaries of this section are not optional vocabulary; they are the building blocks of nearly every objective and analysis you will meet.

  • Mean squared error loss. The MSE objective is the expectation $\mathbb{E}[(y - \hat{y})^2]$, an expectation of a squared deviation. Training minimises an empirical estimate of it.
  • Variance reduction in policy gradient methods. Reinforcement-learning gradients are notoriously noisy. Subtracting a baseline that has zero conditional mean leaves the gradient unbiased while reducing its variance, often dramatically. The whole theory of actor-critic methods rests on this trick.
  • Covariance matrices in PCA. Principal components are the eigenvectors of the data covariance matrix; the explained variance along each component is the corresponding eigenvalue. PCA is variance, dressed in linear algebra.
  • Bias-variance decomposition. Generalisation error decomposes additively into a squared bias term and a variance term (plus irreducible noise). This decomposition is the conceptual frame for understanding overfitting, regularisation, and the double-descent phenomenon.
  • Gaussian likelihoods. Multivariate Gaussians are parametrised entirely by a mean vector and a covariance matrix. Variational autoencoders, Bayesian linear regression, Gaussian processes, and Kalman filters all live inside this two-parameter family.

What you should take away

  1. Expectation is the centre of mass. It is linear, requires no independence assumption, and is the value around which long-run averages settle.
  2. Variance is the expected squared deviation. It measures spread, scales as $a^2$ under multiplication by $a$, and adds across independent variables, which is why averaging reduces noise as $1/\sqrt n$.
  3. Covariance measures linear co-variation. Independence implies zero covariance, but zero covariance does not imply independence; non-linear dependence can hide from it.
  4. The covariance matrix is symmetric and positive-semi-definite. Its eigenvectors are principal directions of variation; its eigenvalues are the variances along them. This is the bridge from probability into the linear algebra of Chapter 2.
  5. Conditional expectation is itself a random variable, and the tower rule $\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$ is the core identity of dynamic programming, reinforcement learning value functions, and the EM algorithm.
