Also known as: normal distribution, bell curve
The Gaussian (or normal) distribution on $\mathbb{R}$ has density
$$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
parameterised by mean $\mu$ and variance $\sigma^2 > 0$. The standard normal $\mathcal{N}(0, 1)$ is the special case $\mu = 0$, $\sigma^2 = 1$.
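As a minimal sketch, the univariate density can be evaluated directly from the formula. The function name `normal_pdf` and its argument names are illustrative choices, not taken from any library; note that the second parameter is the variance $\sigma^2$, matching the notation here, not the standard deviation.

```python
import math

def normal_pdf(x: float, mu: float = 0.0, sigma2: float = 1.0) -> float:
    """Density of N(mu, sigma2) at x; sigma2 is the variance, not the standard deviation."""
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    normaliser = math.sqrt(2.0 * math.pi * sigma2)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / normaliser

# At the mode x = mu the exponent vanishes, so the density equals 1 / sqrt(2 * pi * sigma2).
peak_standard = normal_pdf(0.0)                      # standard normal N(0, 1)
peak_shifted = normal_pdf(3.0, mu=3.0, sigma2=4.0)   # N(3, 4) evaluated at its mean
```

For numerical work far into the tails, evaluating the log-density is usually the safer choice, since the density itself underflows quickly.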
The multivariate Gaussian on $\mathbb{R}^d$ has density
$$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
with mean vector $\mu \in \mathbb{R}^d$ and positive-definite covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.
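The multivariate density is rarely evaluated by forming $\Sigma^{-1}$ and $|\Sigma|$ explicitly. One common approach, sketched below with NumPy, works with the log-density and a Cholesky factor of $\Sigma$. The helper name `mvn_logpdf` is hypothetical, and the sketch handles a single point $x$ only.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at a single point x (1-D array of length d)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d = mu.shape[0]
    # Sigma = L @ L.T; np.linalg.cholesky raises LinAlgError if Sigma is not positive definite.
    L = np.linalg.cholesky(Sigma)
    # y = L^{-1} (x - mu), so y @ y equals (x - mu)^T Sigma^{-1} (x - mu).
    y = np.linalg.solve(L, x - mu)
    # log|Sigma| = 2 * sum(log(diag(L))).
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + y @ y)

# With Sigma = I and x = mu, only the normalising constant remains.
value_at_mean = mvn_logpdf(np.zeros(3), np.zeros(3), np.eye(3))
```

A general-purpose solver is used for clarity; a dedicated triangular solve would exploit the structure of `L` but is omitted to keep the sketch dependent on NumPy alone.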
Key properties:
Linear combinations: if $X \sim \mathcal{N}(\mu, \Sigma)$ then $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top)$. Gaussians are therefore closed under affine transformations, a property exploited in Kalman filters and in linear-Gaussian state-space models more generally.
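The affine-transformation rule can be checked empirically. The sketch below, using NumPy, compares the closed-form mean and covariance of $AX + b$ with sample estimates; the particular values of `mu`, `Sigma`, `A` and `b` are arbitrary illustrations, and `A` is chosen invertible so that $A\Sigma A^\top$ stays positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([0.5, -1.0])

# Closed-form parameters of Y = A X + b.
mean_y = A @ mu + b
cov_y = A @ Sigma @ A.T

# Empirical check: transform samples of X row by row and estimate the moments of Y.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b
emp_mean = Y.mean(axis=0)
emp_cov = np.cov(Y, rowvar=False)
```

Matching first and second moments does not by itself prove that $Y$ is Gaussian; the sketch only illustrates the parameter formulas.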
Independence: jointly Gaussian variables are independent iff their covariance is zero. (Zero covariance does not imply independence in general, and joint Gaussianity is essential here: variables that are each marginally Gaussian can be uncorrelated yet dependent.)
Maximum-entropy: the Gaussian maximises differential entropy among distributions with given mean and variance. This is one justification for Gaussian assumptions when only mean and variance are known.
Central limit theorem: the standardised sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian. This justifies Gaussian approximations to many empirical phenomena.
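A small simulation illustrates the central limit theorem. The sketch below, using only the Python standard library, standardises sums of uniform variables and compares a few summary statistics with those of a standard normal; the choices of `n`, `trials` and the uniform distribution are arbitrary.

```python
import math
import random
import statistics

random.seed(0)

n = 50            # number of terms in each sum
trials = 20_000   # number of standardised sums to draw

# Uniform(0, 1) has mean 1/2 and variance 1/12.
mean_u, var_u = 0.5, 1.0 / 12.0

def standardised_sum() -> float:
    """Sum n uniform draws, then subtract the mean and divide by the standard deviation of the sum."""
    s = sum(random.random() for _ in range(n))
    return (s - n * mean_u) / math.sqrt(n * var_u)

samples = [standardised_sum() for _ in range(trials)]

sample_mean = statistics.fmean(samples)       # should be near 0
sample_var = statistics.pvariance(samples)    # should be near 1
# Fraction within one standard deviation; about 0.6827 for an exact standard normal.
within_one = sum(abs(z) <= 1.0 for z in samples) / trials
```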
Conjugate prior: for a Gaussian likelihood with known variance, a Gaussian prior on the mean gives a Gaussian posterior. The Normal-Inverse-Wishart is the conjugate prior for joint mean and covariance.
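For the known-variance case, the posterior has a closed form: precisions (inverse variances) add, and the posterior mean is a precision-weighted average of the prior mean and the data. A minimal sketch follows; the function name and the example data are illustrative assumptions.

```python
def posterior_mean_known_variance(xs, sigma2, prior_mean, prior_var):
    """Posterior over the mean of N(mean, sigma2) data under a N(prior_mean, prior_var) prior.

    Returns (post_mean, post_var); sigma2 is the known observation variance.
    """
    n = len(xs)
    post_precision = 1.0 / prior_var + n / sigma2   # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(xs) / sigma2)
    return post_mean, post_var

data = [2.1, 1.9, 2.4, 2.0]   # made-up observations
post_mean, post_var = posterior_mean_known_variance(
    data, sigma2=1.0, prior_mean=0.0, prior_var=1.0
)
```

With no observations the function returns the prior unchanged, and as the number of observations grows the posterior mean approaches the sample mean while the posterior variance shrinks towards zero.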
In AI / machine learning:
- Linear regression: under Gaussian noise, the maximum-likelihood estimate (MLE) of the weights coincides with ordinary least squares (OLS).
- Gaussian process regression: models functions as samples from a Gaussian process.
- VAEs use Gaussian encoders/decoders.
- Diffusion models add Gaussian noise during training.
- Reparameterisation trick: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which samples from a Gaussian while keeping $z$ differentiable with respect to $\mu$ and $\sigma$.
- Initialisation: weights are commonly drawn from $\mathcal{N}(0, 2/\mathrm{fan\_in})$ (He) or $\mathcal{N}(0, 2/(\mathrm{fan\_in} + \mathrm{fan\_out}))$ (Xavier/Glorot).
- Gaussian mixture models for clustering and density estimation.
- Kalman filtering as the linear-Gaussian state-space special case.
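The reparameterisation trick listed above can be illustrated without an autodiff framework. The sketch below uses NumPy and a toy objective $f(z) = \sum_j z_j^2$, chosen only because its expected value $\sum_j (\mu_j^2 + \sigma_j^2)$ has known gradients $2\mu$ and $2\sigma$; the parameter values are arbitrary. Because $z$ is a deterministic function of $(\mu, \sigma)$ once $\epsilon$ is fixed, the chain rule can be applied sample by sample.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])
sigma = np.array([1.5, 0.3])

# Noise is drawn independently of the parameters.
eps = rng.standard_normal(size=(100_000, 2))
# Elementwise product: each row of z is a draw from N(mu, diag(sigma**2)).
z = mu + sigma * eps

# Pathwise gradients of f(z) = sum_j z_j**2 via the chain rule:
# df/dz = 2 z, dz/dmu = 1, dz/dsigma = eps.
df_dz = 2.0 * z
grad_mu = df_dz.mean(axis=0)              # Monte Carlo estimate of dE[f]/dmu = 2 * mu
grad_sigma = (df_dz * eps).mean(axis=0)   # Monte Carlo estimate of dE[f]/dsigma = 2 * sigma
```

In practice an automatic-differentiation library would compute these gradients; the manual chain rule here only makes explicit why the sampling step does not block them.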
Sampling: draw $z \sim \mathcal{N}(\mu, \Sigma)$ via $z = \mu + L \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $L L^\top = \Sigma$ is a Cholesky factorisation. For large $d$, where the $O(d^3)$ cost of Cholesky is prohibitive, alternatives are used, for example exploiting structure in $\Sigma$ or, for covariances built from stationary kernels such as the squared-exponential, random Fourier features (RFF).
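The Cholesky-based sampler can be sketched in a few lines of NumPy. The function name `sample_mvn` and the example parameters are illustrative; samples are stored as rows, so $L\epsilon$ for each draw becomes a right-multiplication by $L^\top$.

```python
import numpy as np

def sample_mvn(mu, Sigma, n, rng):
    """Draw n samples from N(mu, Sigma) as rows, using z = mu + L eps with L L^T = Sigma."""
    mu = np.asarray(mu, dtype=float)
    # Lower-triangular factor; raises LinAlgError if Sigma is not positive definite.
    L = np.linalg.cholesky(np.asarray(Sigma, dtype=float))
    # Each row of eps is an independent draw from N(0, I).
    eps = rng.standard_normal(size=(n, mu.shape[0]))
    # Row i of the result is mu + L @ eps[i].
    return mu + eps @ L.T

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
samples = sample_mvn(mu, Sigma, n=100_000, rng=rng)
```

The factorisation is the expensive step, so when many batches are drawn from the same $\Sigma$ it is worth computing `L` once and reusing it.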
Heavy-tailed alternatives (Student-$t$, Laplace, generalised Gaussian) are sometimes preferred when robustness to outliers matters more than analytical tractability.
Related terms: Probability Distribution, Variational Autoencoder, Kalman Filter
Discussed in:
- Chapter 4: Probability, Probability