Gaussian Distribution, Glossary, Textbook of AI

Also known as: normal distribution, bell curve

The Gaussian (or normal) distribution on $\mathbb{R}$ has density

$$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

parameterised by mean $\mu$ and variance $\sigma^2 > 0$. The standard normal $\mathcal{N}(0, 1)$ is the special case $\mu = 0$, $\sigma^2 = 1$.
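As a minimal sketch, the density above can be evaluated directly. The function name `gaussian_pdf` and its argument names are illustrative choices, not from the text. Note that it takes the variance $\sigma^2$, whereas Python's `statistics.NormalDist` takes the standard deviation $\sigma$.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x: float, mu: float = 0.0, sigma_sq: float = 1.0) -> float:
    """Density of N(mu, sigma_sq) at x; sigma_sq is the variance."""
    if sigma_sq <= 0:
        raise ValueError("variance must be positive")
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma_sq)
    return norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma_sq))


# Cross-check against the standard library (parameterised by standard deviation).
reference = NormalDist(mu=1.0, sigma=2.0).pdf(0.5)
ours = gaussian_pdf(0.5, mu=1.0, sigma_sq=4.0)
```

The two values should agree to floating-point precision.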

The multivariate Gaussian on $\mathbb{R}^d$ has density

$$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

with mean vector $\mu \in \mathbb{R}^d$ and positive-definite covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.
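As a worked special case (an illustration, not part of the definition), take a diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$. Then $|\Sigma| = \prod_i \sigma_i^2$ and the quadratic form is $\sum_i (x_i - \mu_i)^2 / \sigma_i^2$, so the density factorises into univariate Gaussians:

$$\mathcal{N}(x | \mu, \Sigma) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp\!\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right) = \prod_{i=1}^{d} \mathcal{N}(x_i | \mu_i, \sigma_i^2)$$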

Key properties:

Linear combinations: if $X \sim \mathcal{N}(\mu, \Sigma)$ then $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$. The Gaussian family is therefore closed under affine transformations, a property exploited in Kalman filters and other linear-Gaussian state-space models.
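The mean and variance formulas can be checked by simulation. The sketch below assumes a hypothetical example with two independent components, $X_1 \sim \mathcal{N}(1, 4)$ and $X_2 \sim \mathcal{N}(-2, 9)$, and the scalar map $Y = 3X_1 - 2X_2 + 5$. It checks only the first two moments, not Gaussianity itself.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000

# random.gauss takes a standard deviation, so N(1, 4) is gauss(1, 2), etc.
ys = [3.0 * rng.gauss(1.0, 2.0) - 2.0 * rng.gauss(-2.0, 3.0) + 5.0
      for _ in range(n)]

# Predictions from A mu + b and A Sigma A^T with A = [3, -2], b = 5.
predicted_mean = 3.0 * 1.0 - 2.0 * (-2.0) + 5.0        # 12
predicted_var = 3.0 ** 2 * 4.0 + (-2.0) ** 2 * 9.0      # 72

empirical_mean = sum(ys) / n
empirical_var = sum((y - empirical_mean) ** 2 for y in ys) / (n - 1)
```

With this many samples the empirical values should sit close to the predicted 12 and 72.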

Independence: jointly Gaussian variables are independent iff their covariance is zero. (Zero covariance does not imply independence in general, but for jointly Gaussian variables it does.)

Maximum-entropy: the Gaussian maximises differential entropy among distributions with given mean and variance. This is one justification for Gaussian assumptions when only mean and variance are known.
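To make the maximum-entropy claim concrete, the sketch below compares closed-form differential entropies (in nats) of three distributions matched to the same variance $\sigma^2$. The Laplace and uniform comparisons are chosen here for illustration; the function names are likewise illustrative.

```python
import math


def entropy_gaussian(sigma: float) -> float:
    # h = 0.5 * ln(2 * pi * e * sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def entropy_laplace(sigma: float) -> float:
    b = sigma / math.sqrt(2.0)       # Laplace with scale b has variance 2 b^2
    return 1.0 + math.log(2.0 * b)


def entropy_uniform(sigma: float) -> float:
    width = sigma * math.sqrt(12.0)  # uniform of this width has variance width^2 / 12
    return math.log(width)


sigma = 1.0
h_gauss = entropy_gaussian(sigma)
h_laplace = entropy_laplace(sigma)
h_uniform = entropy_uniform(sigma)
```

For any fixed $\sigma$ the Gaussian value is the largest of the three, consistent with the property above.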

Central limit theorem: the standardised sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian. This justifies Gaussian approximations to many empirical phenomena.
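A small simulation illustrates the theorem. The sketch assumes fair six-sided dice (mean 3.5, variance 35/12). It standardises the sum of 30 rolls and measures how often the result lands within one standard deviation of zero. For a Gaussian that fraction is about 0.68.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_dice, n_trials = 30, 20_000

die_mean = 3.5
die_var = 35.0 / 12.0   # variance of one fair six-sided die


def standardised_sum() -> float:
    """Sum of n_dice rolls, centred and scaled to zero mean and unit variance."""
    total = sum(rng.randint(1, 6) for _ in range(n_dice))
    return (total - n_dice * die_mean) / math.sqrt(n_dice * die_var)


zs = [standardised_sum() for _ in range(n_trials)]
frac_within_one_sd = sum(abs(z) <= 1.0 for z in zs) / n_trials
```

Because the sum is discrete, the fraction will not match the Gaussian value exactly, but it should be close.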

Conjugate prior: for a Gaussian likelihood with known variance, a Gaussian prior on the mean gives a Gaussian posterior. The Normal-Inverse-Wishart is the conjugate prior for joint mean and covariance.
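For the known-variance case, the posterior has a closed form: precisions (inverse variances) add, and the posterior mean is a precision-weighted average of the prior mean and the data. The sketch below uses illustrative names (`mu0`, `tau0_sq` for the prior mean and variance, `sigma_sq` for the known likelihood variance) that do not appear in the text.

```python
def gaussian_mean_posterior(mu0, tau0_sq, sigma_sq, xs):
    """Posterior N(mu_n, tau_n_sq) over the unknown mean.

    Prior: mean ~ N(mu0, tau0_sq).
    Likelihood: each x in xs is drawn i.i.d. from N(mean, sigma_sq), sigma_sq known.
    """
    n = len(xs)
    precision = 1.0 / tau0_sq + n / sigma_sq   # prior precision + data precision
    tau_n_sq = 1.0 / precision
    mu_n = tau_n_sq * (mu0 / tau0_sq + sum(xs) / sigma_sq)
    return mu_n, tau_n_sq


# One observation x = 2 with a standard normal prior and unit noise variance.
mu_n, tau_n_sq = gaussian_mean_posterior(mu0=0.0, tau0_sq=1.0, sigma_sq=1.0, xs=[2.0])
```

Here the posterior mean lies halfway between the prior mean and the observation, and the posterior variance is half the prior variance.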

In AI / machine learning:

Gaussian noise in linear regression: maximum-likelihood estimation coincides with ordinary least squares (MLE = OLS).
Gaussian process regression: models functions as samples from a Gaussian process.
VAEs use Gaussian encoders/decoders.
Diffusion models add Gaussian noise during training.
Reparameterisation trick: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ samples from a Gaussian while keeping $z$ differentiable with respect to $\mu$ and $\sigma$.
Initialisation: weights are commonly initialised as $\mathcal{N}(0, 2/\mathrm{fan\_in})$ (He) or $\mathcal{N}(0, 2/(\mathrm{fan\_in} + \mathrm{fan\_out}))$ (Xavier).
Gaussian mixture models for clustering and density estimation.
Kalman filtering: exact inference in the linear-Gaussian state-space model.
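The reparameterisation trick from the list above can be sketched without an autodiff library. The example assumes a scalar $z$ and a hypothetical objective $f(z) = z^2$, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives exact gradients $2\mu$ and $2\sigma$ to compare against. The chain rule is applied by hand through $z = \mu + \sigma \epsilon$.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma, n = 1.5, 0.5, 100_000

grad_mu = 0.0
grad_sigma = 0.0
for _ in range(n):
    eps = rng.gauss(0.0, 1.0)    # noise drawn independently of the parameters
    z = mu + sigma * eps         # reparameterised sample, z ~ N(mu, sigma^2)
    df_dz = 2.0 * z              # derivative of f(z) = z^2
    grad_mu += df_dz * 1.0       # chain rule: dz/dmu = 1
    grad_sigma += df_dz * eps    # chain rule: dz/dsigma = eps
grad_mu /= n
grad_sigma /= n

# Analytic values for comparison: d/dmu = 2 * mu, d/dsigma = 2 * sigma.
exact_grad_mu, exact_grad_sigma = 2.0 * mu, 2.0 * sigma
```

Because the randomness enters only through `eps`, the Monte Carlo averages are estimates of the gradients of the expectation with respect to `mu` and `sigma`. In a VAE the same idea lets gradients flow through the sampling step.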

Sampling: $z \sim \mathcal{N}(\mu, \Sigma)$ via $z = \mu + L \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$ and $L L^\top = \Sigma$ is a Cholesky factorisation. The factorisation costs $O(d^3)$, so for large $d$ alternatives are used, such as exploiting structure in the covariance (for example, in squared-exponential kernel matrices) or approximations based on random Fourier features (RFF).
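The Cholesky recipe can be sketched in pure Python. The helper names `cholesky` and `sample_mvn` and the 2-by-2 covariance are illustrative assumptions. In practice a library routine would normally compute the factor; the hand-written loop here only keeps the example self-contained.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L L^T = cov, for a positive-definite matrix."""
    d = len(cov)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_mvn(mu, L, rng):
    """Draw z = mu + L eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * eps[k] for k in range(i + 1))
            for i, m in enumerate(mu)]


mu = [1.0, -1.0]
cov = [[4.0, 1.2], [1.2, 1.0]]
L = cholesky(cov)                 # approximately [[2.0, 0.0], [0.6, 0.8]]
rng = random.Random(0)            # fixed seed so the run is repeatable
samples = [sample_mvn(mu, L, rng) for _ in range(50_000)]
```

The empirical mean and covariance of `samples` should approach `mu` and `cov` as the sample count grows.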

Heavy-tailed alternatives (Student-$t$, Laplace, generalised Gaussian) are sometimes preferred when robustness to outliers matters more than analytical tractability.

Interactive

A zoo of distributions. Bernoulli, Gaussian, exponential and beta side by side, each shaped by its own parameters.

Sums of independent draws become Gaussian. Roll one die, then two, then ten. The distribution of the average approaches a bell curve.

The 68-95-99.7 rule. A Gaussian's tails fall off so fast that three standard deviations cover virtually all the probability.
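The three percentages in the rule follow from the Gaussian cumulative distribution function. As a short sketch, the probability of falling within $k$ standard deviations of the mean is $\mathrm{erf}(k/\sqrt{2})$, which the standard library can evaluate directly. The function name `prob_within` is illustrative.

```python
import math


def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a Gaussian X with any mu and sigma."""
    return math.erf(k / math.sqrt(2.0))


# Coverage for one, two and three standard deviations.
coverage = {k: prob_within(k) for k in (1, 2, 3)}
```

Rounded to one decimal place as percentages, the three values give the 68, 95 and 99.7 in the rule's name.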

Discussed in:

Chapter 4: Probability, Probability
