Glossary

Expectation

The expectation (also called the mean, expected value, or first moment) of a random variable $X$ is its probability-weighted average. For a discrete random variable taking values $x_1, x_2, \ldots$ with probabilities $p(x_i)$,

$$\mathbb{E}[X] = \sum_i x_i \, p(x_i),$$

and for a continuous random variable with density $f(x)$,

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx,$$

provided the sum or integral converges absolutely. Intuitively, the expectation is the centre of mass of the distribution: if one were to repeat an experiment infinitely many times and average the outcomes, the result would converge to $\mathbb{E}[X]$ by the law of large numbers. The expectation need not be an attainable value of $X$ itself, the expected number of pips on a fair die is $3.5$, even though no face shows $3.5$.

Linearity

Expectation is a linear operator:

$$\mathbb{E}[aX + bY + c] = a \, \mathbb{E}[X] + b \, \mathbb{E}[Y] + c,$$

regardless of whether $X$ and $Y$ are independent. This linearity is among the most useful facts in probability, it underpins the unbiasedness of the sample mean as an estimator of the population mean, decomposes the bias–variance trade-off, and licenses the clean analyses of randomised algorithms (e.g. the linearity-of-expectation proof that the expected number of fixed points of a random permutation is $1$, regardless of $n$).

The law of the unconscious statistician (LOTUS) extends expectation to functions: $\mathbb{E}[g(X)] = \sum_x g(x) p(x)$ for any (measurable) function $g$. Higher moments are expectations of powers: $\mathbb{E}[X^2]$ is the second moment, the variance is $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, and the moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$ encodes all moments at once.

Conditional expectation

The conditional expectation $\mathbb{E}[X \mid Y]$ is itself a random variable, a function of $Y$ that, on the event $Y = y$, equals $\mathbb{E}[X \mid Y = y]$. The tower property $\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$ underlies the EM algorithm, importance sampling, and the Bellman equation in reinforcement learning.

In machine learning

Loss functions are almost always defined as expectations. The expected risk is

$$R(f) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[L(Y, f(X))],$$

and training a model is the task of minimising an empirical risk estimate of this quantity over a finite training set. Stochastic gradient descent approximates the expected gradient by averaging over mini-batches; the law of large numbers ensures that with enough samples, the estimate becomes accurate, while the central limit theorem characterises the noise as roughly Gaussian. The REINFORCE gradient estimator in reinforcement learning, the evidence lower bound (ELBO) in variational inference, the entropy $H(p) = -\mathbb{E}[\log p(X)]$ and KL divergence of an information-theoretic objective, and the importance-weighted estimator of an off-policy value are all expectations of various quantities. Whenever a quantity in machine learning admits an unbiased Monte Carlo estimator, that quantity is, structurally, an expectation.

Related terms: Variance, Central Limit Theorem, Random Variable, Stochastic Gradient Descent, KL Divergence

Discussed in:

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.