4.10 The multivariate Gaussian

If you only ever learn one continuous distribution well enough to recognise it on sight, make it the multivariate Gaussian. It is the most important continuous distribution in artificial intelligence, and it is hard to find a corner of modern machine learning where it does not show up. The weights of a freshly created neural network are usually drawn from a Gaussian. The posterior in Bayesian linear regression is Gaussian. The prior we put over weights when we do MAP estimation is almost always Gaussian. The latent codes that variational autoencoders learn to produce are Gaussian. The noise that diffusion models add at every step of their forward process, and the noise they predict and subtract during sampling, is Gaussian. Gaussian processes are, well, Gaussian. Kalman filters, which are still the workhorse of robotics state estimation, propagate Gaussians. Even when our final model is something exotic, like a Transformer, the maths we use to analyse it (initialisation variance, signal propagation, neural tangent kernels) is built on multivariate Gaussian reasoning.

Why does the same distribution keep turning up? Three reasons. First, it is the maximum-entropy distribution given a fixed mean and covariance, so it is the least committal continuous distribution we can pick once we have decided to specify those two summary statistics. Second, the central limit theorem (§4.9) says sums of many small independent contributions tend to be Gaussian, and a great deal of what happens in deep learning is sums of many small contributions. Third, and most practically, the Gaussian is closed under almost every operation we like to do: marginalisation, conditioning, linear transformation, addition of independent variables. You can compose Gaussian operations all day and stay inside the Gaussian world, with closed-form formulas at every step. Almost no other continuous distribution behaves so well.

In §4.5 we met the univariate Gaussian, the bell curve with a single mean $\mu$ and variance $\sigma^2$. In this section we generalise it to vectors. Instead of a number on the real line we have a point $\mathbf{x}$ in $\mathbb{R}^d$, instead of a mean we have a mean vector $\boldsymbol{\mu}$, and instead of a variance we have a covariance matrix $\boldsymbol{\Sigma}$ that records how each pair of coordinates varies together. In §4.11 we will use the multivariate Gaussian to derive clean formulas for entropy and KL divergence, both of which collapse to neat expressions in the Gaussian case.

Symbols Used Here
  • $\mathbf{x}$: vector in $\mathbb{R}^d$
  • $\boldsymbol{\mu}$: mean vector
  • $\boldsymbol{\Sigma}$: covariance matrix, $d \times d$, symmetric positive-definite
  • $d$: dimension
  • $\det$, $^{-1}$, $\top$: determinant, inverse, transpose

Definition

We say a random vector $\mathbf{x}$ in $\mathbb{R}^d$ is multivariate Gaussian, written $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, if its probability density function is $$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\boldsymbol{\Sigma})}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right).$$

The formula looks intimidating the first time you see it, so let us walk through it slowly. The exponential on the right is the most important bit. Inside it sits the quadratic form $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$, which is a single non-negative number that measures how far $\mathbf{x}$ is from the mean $\boldsymbol{\mu}$, where the notion of distance is reshaped by the inverse covariance $\boldsymbol{\Sigma}^{-1}$. This number is called the squared Mahalanobis distance, and it is the multivariate generalisation of the familiar quantity $(x - \mu)^2 / \sigma^2$ from the one-dimensional Gaussian. When $\mathbf{x} = \boldsymbol{\mu}$ it is zero, and the density takes its maximum. As $\mathbf{x}$ moves away in any direction, the quadratic form grows, the exponential shrinks, and the density falls off. The factor of one-half in front of the quadratic form is the same one-half that appears in the univariate case, and it is what makes the variance (rather than twice the variance) come out right.

The constant in front, $1/\sqrt{(2\pi)^d \det(\boldsymbol{\Sigma})}$, is the normalising constant. Its only job is to make the density integrate to one. The $\sqrt{(2\pi)^d}$ factor is the multivariate version of the $\sqrt{2\pi}$ factor in one dimension. The $\det(\boldsymbol{\Sigma})$ factor accounts for how stretched or squashed the distribution is: a covariance matrix with a large determinant describes a Gaussian that is spread over a large volume, so each individual point must have a smaller density to compensate.
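As a sanity check, here is a minimal NumPy/SciPy sketch (with illustrative numbers) that evaluates the density by hand, term by term, and compares the result against SciPy's built-in implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters: a 2-D Gaussian and a query point.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.0, 0.0])

d = len(mu)
diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)        # squared Mahalanobis distance
norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
p_manual = np.exp(-0.5 * maha_sq) / norm_const

p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)  # the two values agree
```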

Five properties define the way Gaussians behave under the operations we care about, and they are worth committing to memory.

  • The expected value is the mean vector: $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$.
  • The covariance is the covariance matrix: $\operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Sigma}$.
  • Marginal distributions are Gaussian. If you ignore some coordinates and only look at others, what remains is still Gaussian.
  • Conditional distributions are Gaussian, with closed-form mean and covariance. If you fix some coordinates to known values and ask what the others look like given that information, the answer is still Gaussian.
  • Linear (and more generally affine) transformations of Gaussians are Gaussian. Multiply by a matrix, add a vector, and you stay inside the family.

The covariance matrix $\boldsymbol{\Sigma}$ must be symmetric and positive-definite for the formula to make sense as a density. Symmetric because $\operatorname{Cov}(x_i, x_j) = \operatorname{Cov}(x_j, x_i)$ in any setting. Positive-definite because we need $\boldsymbol{\Sigma}^{-1}$ to exist and we need the quadratic form to be strictly positive away from the mean, otherwise the density would not integrate to a finite value. In practice, when we compute a sample covariance from data and the matrix turns out to be singular or numerically close to singular, the standard remedy is to add a small multiple of the identity to the diagonal (often called jitter, or a ridge term) to restore positive-definiteness before using it.
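Here is a minimal sketch of that remedy, with an illustrative jitter size of $10^{-6}$:

```python
import numpy as np

# With fewer samples than dimensions the sample covariance is rank-deficient,
# so factorising it directly will typically fail.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))         # 5 samples in 10 dimensions
S = np.cov(X, rowvar=False)              # 10x10 sample covariance, rank at most 4

jitter = 1e-6                            # illustrative size; tune to your scale
S_reg = S + jitter * np.eye(S.shape[0])  # now positive-definite
L = np.linalg.cholesky(S_reg)            # succeeds; S itself would typically raise
```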

Geometric interpretation

The cleanest way to picture a multivariate Gaussian is to look at its level sets. A level set is the locus of points where the density takes a fixed value, and because the density is a monotonic function of the Mahalanobis distance, level sets are exactly the level sets of that quadratic form. In two dimensions they are ellipses; in three dimensions they are ellipsoids; in higher dimensions they are hyperellipsoids. They are all centred at the mean $\boldsymbol{\mu}$, because the quadratic form is zero only there.

The orientation and shape of the ellipses is set by the covariance matrix through its eigendecomposition. Because $\boldsymbol{\Sigma}$ is symmetric and positive-definite, the spectral theorem gives us $\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$, where the columns of $\mathbf{Q}$ are orthonormal eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix of strictly positive eigenvalues $\lambda_1, \ldots, \lambda_d$. The principal axes of every level-set ellipsoid point along the eigenvectors, and the semi-axis lengths are proportional to $\sqrt{\lambda_i}$. Big eigenvalue: long axis, lots of variance in that direction. Small eigenvalue: short axis, little variance. Equal eigenvalues: a sphere, no preferred direction.

A small worked example will fix this picture. Take $\boldsymbol{\mu} = \mathbf{0}$ and $$\boldsymbol{\Sigma} = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}.$$ The covariance matrix is already diagonal, so the eigenvectors are the standard basis vectors $(1,0)$ and $(0,1)$ and the eigenvalues are $\lambda_1 = 4$ and $\lambda_2 = 1$. The Mahalanobis distance is $x_1^2 / 4 + x_2^2 / 1$, so the contour where this equals $1$ is the ellipse with horizontal semi-axis $\sqrt{4} = 2$ and vertical semi-axis $\sqrt{1} = 1$. This is the so-called one-sigma ellipse. In two dimensions it contains only about $39$ per cent of the probability mass (not the univariate $68$ per cent), while the two-sigma and three-sigma ellipses obtained by enlarging the contour contain about $86$ and $99$ per cent. The first coordinate has standard deviation $2$, the second has standard deviation $1$, and there is no correlation, so samples cluster in a horizontal cigar shape twice as wide as it is tall.
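A short sketch (illustrative NumPy code) verifies both the eigen-structure and the one-sigma mass by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)   # [1. 4.]: semi-axis lengths sqrt(1) = 1 and sqrt(4) = 2

samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
maha_sq = samples[:, 0] ** 2 / 4 + samples[:, 1] ** 2 / 1
print(np.mean(maha_sq <= 1.0))   # roughly 0.39, matching 1 - exp(-1/2)
```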

If you rotate this picture by adding off-diagonal entries to $\boldsymbol{\Sigma}$, the ellipse tilts. Positive correlation tilts the cigar so it slopes up and to the right; negative correlation tilts it so it slopes down and to the right. The amount of tilt is set by the off-diagonal entries; the eigenvalues still tell you the lengths of the axes, but you must compute the eigenvectors to find their directions. This is exactly the calculation that PCA performs, and one way to understand PCA (Chapter 2) is as the operation that finds the principal axes of the empirical Gaussian fitted to the data.

The Mahalanobis distance also gives us a coordinate-invariant notion of how unusual a point is. If a point sits at Mahalanobis distance $3$ from the mean, the squared distance is $9$ and the density there is $\exp(-9/2) \approx 0.011$ times the peak density, regardless of which direction the point lies in. In a univariate Gaussian, about $99.7$ per cent of the probability famously lies within three standard deviations of the mean. The multivariate analogue is more subtle (the relevant tail probabilities are chi-squared rather than normal), but the basic intuition that Mahalanobis distance is the right way to measure unusualness still applies.
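For the exact numbers, one can use the fact that the squared Mahalanobis distance of a $d$-dimensional Gaussian is chi-squared with $d$ degrees of freedom; a short sketch using SciPy:

```python
from scipy.stats import chi2

# Mass inside the k-sigma ellipsoid in d dimensions: chi2.cdf(k**2, df=d).
for d in (1, 2, 3, 10):
    masses = [round(chi2.cdf(k ** 2, df=d), 4) for k in (1, 2, 3)]
    print(d, masses)
# d = 1 recovers the familiar 0.6827 / 0.9545 / 0.9973;
# larger d pushes mass outward, so the one-sigma ellipsoid holds ever less.
```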

Marginals and conditionals

Now suppose we partition our random vector $\mathbf{x}$ into two blocks, $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$, with corresponding partitions of the mean $\boldsymbol{\mu} = (\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ and the covariance $$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}.$$ The diagonal blocks $\boldsymbol{\Sigma}_{11}$ and $\boldsymbol{\Sigma}_{22}$ are the within-block covariances; the off-diagonal blocks $\boldsymbol{\Sigma}_{12}$ and $\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^\top$ record the cross-covariances.

The marginal distribution of $\mathbf{x}_1$ is the simplest of all multivariate Gaussian results. You ignore the other block entirely: $$\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}).$$ Just chop out the relevant block of the mean and the relevant diagonal block of the covariance. There is no integral to do, no formula to remember; the answer is staring at you in the partition. This is one of the things that makes Gaussians so pleasant: marginalisation, which for an arbitrary joint density requires solving an integral, here is an act of bookkeeping.

The conditional distribution is more interesting. Suppose we observe $\mathbf{x}_2$ and want the distribution of $\mathbf{x}_1$ given that observation. The answer is again Gaussian, $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$, with mean and covariance $$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),$$ $$\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}.$$

Look at what this is saying. The conditional mean is the prior mean $\boldsymbol{\mu}_1$ plus a correction that is linear in the deviation $(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ between what we observed and what we expected to observe. The matrix $\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}$ is the multivariate generalisation of a regression coefficient: in one dimension, with $\boldsymbol{\Sigma}_{12} = \rho \sigma_1 \sigma_2$ and $\boldsymbol{\Sigma}_{22} = \sigma_2^2$, this matrix collapses to $\rho \sigma_1 / \sigma_2$, the familiar slope of the linear regression of $X_1$ on $X_2$. The conditional covariance $\boldsymbol{\Sigma}_{1|2}$ is the prior covariance $\boldsymbol{\Sigma}_{11}$ minus a positive-semidefinite correction term that captures how much uncertainty the observation removed.

Two features deserve attention. First, the conditional covariance does not depend on the observed value $\mathbf{x}_2$. Whatever you observe, the residual uncertainty about $\mathbf{x}_1$ is the same; only the centre of the distribution shifts. This is a uniquely Gaussian feature and would be wrong for almost any other distribution. Second, the conditional mean is linear in $\mathbf{x}_2$. Linear regression is not just a convenient model that happens to work well; it is the exact Bayes-optimal predictor when the joint distribution is Gaussian.

These formulas are the engine that makes Gaussian processes tractable. A Gaussian process places a multivariate Gaussian prior over function values at any finite set of input points. To predict at a new input, you partition the joint Gaussian into the observed function values $\mathbf{x}_2$ and the unobserved value $\mathbf{x}_1$, and apply the conditional formulas. The predictive mean and predictive variance fall out in closed form, with the kernel matrix playing the role of $\boldsymbol{\Sigma}$. The same formulas drive the Kalman filter, where $\mathbf{x}_2$ is a noisy measurement and $\mathbf{x}_1$ is the hidden state we wish to update.
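The conditioning formulas translate directly into a few lines of code. Here is a minimal sketch with an illustrative partition: $\mathbf{x}_1$ is one coordinate, $\mathbf{x}_2$ the remaining two:

```python
import numpy as np

mu1 = np.array([0.0])
mu2 = np.array([1.0, -1.0])
S11 = np.array([[2.0]])
S12 = np.array([[0.8, 0.3]])
S22 = np.array([[1.5, 0.2],
                [0.2, 1.0]])

x2_obs = np.array([2.0, 0.0])            # the observed block

gain = S12 @ np.linalg.inv(S22)          # the "regression coefficient" matrix
mu_cond = mu1 + gain @ (x2_obs - mu2)    # conditional mean, linear in x2
Sigma_cond = S11 - gain @ S12.T          # conditional covariance: no x2_obs here
print(mu_cond, Sigma_cond)
```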

Linear transformations

Suppose $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and we form the affine transformation $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ with a matrix $\mathbf{A}$ and a vector $\mathbf{b}$. Then $\mathbf{y}$ is also Gaussian, with mean and covariance $$\mathbf{y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\ \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top).$$ Affine maps stay inside the family. The mean transforms by $\mathbf{A}\boldsymbol{\mu} + \mathbf{b}$, exactly the same way the input transformed; the covariance picks up an $\mathbf{A}$ on the left and an $\mathbf{A}^\top$ on the right, which is the way bilinear forms transform under linear maps.

This single property powers a surprising amount of machinery. The reparameterisation trick used to train variational autoencoders is a special case (we will see it shortly). The Kalman filter prediction step is the affine map of the previous Gaussian state through the dynamics matrix. Normalising flows compose many such affine maps with non-linear activations to bend simple Gaussians into rich shapes. Whitening transforms a Gaussian with arbitrary covariance into one with identity covariance by choosing $\mathbf{A}$ as the inverse of a square root of $\boldsymbol{\Sigma}$.
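As a concrete instance of the last point, here is a sketch of whitening with illustrative numbers, using the Cholesky factor as the square root of $\boldsymbol{\Sigma}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, -1.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=50_000)  # correlated samples
L = np.linalg.cholesky(Sigma)                        # Sigma = L L^T
Y = np.linalg.solve(L, (X - mu).T).T                 # y = L^{-1} (x - mu)

print(np.round(np.cov(Y, rowvar=False), 2))          # close to the identity
```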

Sampling and the reparameterisation trick

How do we actually draw a sample from $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ on a computer? The recipe is short. First, compute a Cholesky factorisation $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$, where $\mathbf{L}$ is a lower-triangular matrix with positive diagonal. Cholesky factorisation is a standard linear-algebra routine, more numerically stable and roughly twice as fast as a general LU factorisation, and it exists precisely because $\boldsymbol{\Sigma}$ is symmetric positive-definite. Second, draw a vector $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ of independent standard-normal entries; every numerical library can do this with a Box-Muller or Ziggurat algorithm. Third, set $\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$. The result is a sample from the desired distribution.

To verify this is correct, apply the affine-transformation rule. The mean of $\mathbf{x}$ is $\boldsymbol{\mu} + \mathbf{L}\,\mathbb{E}[\mathbf{z}] = \boldsymbol{\mu}$. The covariance is $\mathbf{L}\,\mathbf{I}\,\mathbf{L}^\top = \mathbf{L}\mathbf{L}^\top = \boldsymbol{\Sigma}$. And because $\mathbf{x}$ is an affine function of a Gaussian, it is itself Gaussian. The recipe works.
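The whole recipe, sketched in NumPy with illustrative numbers and a Monte Carlo check of the mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

L = np.linalg.cholesky(Sigma)           # step 1: Sigma = L L^T
Z = rng.standard_normal((100_000, 3))   # step 2: rows of independent N(0, 1) draws
X = mu + Z @ L.T                        # step 3: x = mu + L z, vectorised over rows

print(np.round(X.mean(axis=0), 2))           # close to mu
print(np.round(np.cov(X, rowvar=False), 2))  # close to Sigma
```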

This same recipe is the heart of the reparameterisation trick that Kingma and Welling introduced in 2014 to make variational autoencoders trainable by gradient descent. The problem they faced was this: a VAE wants to learn a posterior $q(\mathbf{z} \mid \mathbf{x})$ over latent codes, parameterised by a neural network, and to train it one needs gradients of an expectation $\mathbb{E}_{\mathbf{z} \sim q}[f(\mathbf{z})]$ with respect to the parameters of $q$. Naively, one would draw a sample $\mathbf{z}$ from $q$ and use $f(\mathbf{z})$ as a Monte Carlo estimate, but the sampling step blocks the gradient: you cannot back-propagate through a random number generator.

The reparameterisation trick rewrites the sampling step. Instead of drawing $\mathbf{z}$ directly from $\mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{x}), \boldsymbol{\Sigma}_\theta(\mathbf{x}))$, where $\theta$ are the parameters of the encoder network, you draw a parameter-free $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\mathbf{z} = \boldsymbol{\mu}_\theta(\mathbf{x}) + \mathbf{L}_\theta(\mathbf{x})\boldsymbol{\epsilon}$. The randomness now sits entirely in $\boldsymbol{\epsilon}$, which has no parameters to differentiate. The mapping from $\theta$ to $\mathbf{z}$ is deterministic and differentiable, so back-propagation flows through it just like any other layer of the network. With this single rearrangement, gradient-based training of latent-variable models becomes possible. Diffusion models lean on the same idea: every reverse-diffusion step samples an update by reparameterising a Gaussian.
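Here is a minimal PyTorch sketch of the trick for a diagonal covariance. The tensors `mu` and `log_var` stand in for the outputs of a hypothetical encoder network, and the loss is a placeholder for the term $f(\mathbf{z})$ we want gradients of:

```python
import torch

mu = torch.zeros(8, requires_grad=True)       # mean of q(z | x)
log_var = torch.zeros(8, requires_grad=True)  # log of the diagonal variances

eps = torch.randn(8)                          # parameter-free noise, N(0, I)
z = mu + torch.exp(0.5 * log_var) * eps       # z = mu + sigma * eps

loss = (z ** 2).sum()                         # placeholder for f(z)
loss.backward()                               # gradients flow through the sample

print(mu.grad, log_var.grad)                  # both populated: the step is differentiable
```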

Where multivariate Gaussians appear in AI

It is worth listing, in one place, the parts of artificial intelligence that depend on multivariate Gaussian machinery, so that you start to recognise it as you read the rest of the book.

  • Gaussian processes use it as the prior over function values, with the kernel matrix as $\boldsymbol{\Sigma}$. Posterior predictions are conditional Gaussians; Bayesian optimisation builds on top of these.
  • Bayesian linear regression has a Gaussian likelihood and a Gaussian prior over weights; the posterior is also Gaussian, computed by the conditioning formula above.
  • Variational autoencoders represent the approximate posterior over latents as a multivariate Gaussian with a diagonal (or low-rank) covariance, and use the reparameterisation trick to train end to end.
  • Diffusion models add Gaussian noise on a fixed schedule during the forward process, and parameterise the reverse process with a Gaussian whose mean (and sometimes covariance) is predicted by a neural network.
  • Kalman filters and their non-linear cousins (extended, unscented) propagate Gaussian beliefs over hidden states through linear (or linearised) dynamics and observations.
  • Neural network weight initialisation draws weights from a Gaussian whose variance is scaled with layer width (Xavier, He) to keep activations and gradients well-behaved at the start of training.
  • Energy-based models often use a Gaussian negative log-density as the simplest possible parametric energy.
  • Score matching and denoising score matching frame learning as estimating the gradient of $\log p(\mathbf{x})$, with Gaussian noise as the perturbation that makes the objective tractable.
  • Probabilistic PCA, factor analysis and Gaussian mixture models all sit inside the multivariate-Gaussian world, and form the bridge between classical statistics and modern unsupervised learning.

Wherever you see a covariance matrix, an ellipsoid, a Mahalanobis distance, or a closed-form posterior, the multivariate Gaussian is doing the work.

What you should take away

  1. The multivariate Gaussian generalises the bell curve to vectors, with a mean vector $\boldsymbol{\mu}$ and a symmetric positive-definite covariance matrix $\boldsymbol{\Sigma}$; its density depends on $\mathbf{x}$ only through the squared Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$.
  2. Level sets of the density are ellipsoids centred at $\boldsymbol{\mu}$, with axes along the eigenvectors of $\boldsymbol{\Sigma}$ and semi-axis lengths $\sqrt{\lambda_i}$; this is the same eigendecomposition that PCA uses.
  3. Marginals are obtained by chopping out the relevant block; conditionals are Gaussian with mean $\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$ and covariance $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$, which makes Gaussian processes and Kalman filters closed-form.
  4. Affine maps preserve Gaussianity: $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ is Gaussian with mean $\mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and covariance $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top$, which underlies whitening, the Kalman prediction step and normalising flows.
  5. Sampling is done by Cholesky factorisation, $\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$; the same construction is the reparameterisation trick that makes variational autoencoders and diffusion models trainable by gradient descent.
