4.5 Common distributions

If you flick through enough machine-learning papers, an odd thing strikes you: the same handful of probability distributions keep showing up, again and again, dressed in different notation. The output layer of a binary classifier reaches for the Bernoulli. The output of a softmax classifier reaches for the categorical. The noise terms in a linear regression, the priors on the weights of a Bayesian neural network, the latent codes inside a variational autoencoder, and the reverse process of a diffusion model all reach for the Gaussian. The number of clicks on a banner advert per minute, or the number of photons hitting a detector, reaches for the Poisson. There is no point pretending each new model invents a new distribution. Almost everything is a recombination of a small standard zoo, and once you know the zoo by name you can read most papers on sight.

This section catalogues the zoo. For each distribution we give the support, the formula, the mean, the variance, a worked numerical example, and a note on where it appears in AI. Section 4.10 returns to the multivariate Gaussian in detail.

Read this section once for orientation; come back to it as a reference. Do not try to memorise every formula. The pattern that matters is which distribution to reach for, given the shape of the data.

Symbols used here

  • $\sim$ : "is distributed as"
  • $p, q$ : Bernoulli probability of success
  • $\mu$ : mean
  • $\sigma^2$ : variance
  • $\lambda$ : rate (Poisson, exponential)
  • $\alpha, \beta$ : shape parameters (Beta, Gamma)
  • $\boldsymbol{\pi}$ : vector of categorical probabilities
  • $\boldsymbol{\Sigma}$ : covariance matrix

Discrete distributions

Discrete distributions describe quantities that take values in a countable set: yes/no, the integers 0, 1, 2, ..., or one of $K$ class labels. Probabilities sit on individual points (the PMF) and sum to one.

Bernoulli. The simplest distribution there is. A random variable $X$ takes the value $1$ with probability $p$ and $0$ with probability $1 - p$. We write $X \sim \text{Bern}(p)$. The mean is $p$ and the variance is $p(1-p)$. The variance is largest at $p = 0.5$ (a fair coin: maximum uncertainty) and smallest at $p = 0$ or $p = 1$ (no uncertainty at all). In AI, every binary classifier (spam vs ham, malignant vs benign, fraud vs not) has a Bernoulli on its output. With $p = 0.7$, the mean is $0.7$ and the variance is $0.7 \times 0.3 = 0.21$. Where does the $p = 0.7$ come from in practice? It is the output of a sigmoid applied to a logit produced by the model, and the model is trained to push that probability towards $1$ on positive examples and towards $0$ on negative examples. The cross-entropy loss used for training is exactly the negative log-likelihood of a Bernoulli, with $p$ standing in for the model's predicted probability.
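A minimal sketch in Python of the numbers above (the $p = 0.7$ would normally come from a sigmoid over the model's logit):

```python
import math

p = 0.7                      # model's predicted probability of the positive class
mean = p                     # E[X] for X ~ Bern(p)
var = p * (1 - p)            # Var[X] = p(1 - p) = 0.21

# The cross-entropy loss for one example is the Bernoulli negative log-likelihood.
def bernoulli_nll(y, p):
    """y is the observed label (0 or 1); p is the predicted probability that y = 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(mean, var)              # 0.7 0.21
print(bernoulli_nll(1, 0.7))  # ~0.36: confident and correct, small loss
print(bernoulli_nll(0, 0.7))  # ~1.20: confident and wrong, larger loss
```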

Binomial. Run $n$ independent Bernoulli trials, each with success probability $p$, and count the successes. The total is $X \sim \text{Bin}(n, p)$, with PMF $$ P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n. $$ The mean is $np$ and the variance is $np(1-p)$. With $n = 10$ and $p = 0.5$ the mean is $5$ and the variance is $2.5$. The standard deviation is $\sqrt{2.5} \approx 1.58$, so most of the mass falls within roughly $5 \pm 3$ successes, a useful intuition. The binomial is the workhorse behind A/B-test confidence intervals: if 80 of 200 visitors click, you are observing a binomial sample. The maximum-likelihood estimate of $p$ is simply $\hat p = k/n$, and a 95% confidence interval is approximately $\hat p \pm 1.96 \sqrt{\hat p (1 - \hat p)/n}$, itself a Gaussian approximation that the central limit theorem licenses for moderate $n$.
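A sketch of the A/B-test arithmetic above (80 clicks from 200 visitors), using only the standard library:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 200, 80
p_hat = k / n                                  # maximum-likelihood estimate: 0.4
se = math.sqrt(p_hat * (1 - p_hat) / n)        # standard error, about 0.035
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # approximate 95% CI: (0.332, 0.468)

print(p_hat, ci)
print(binom_pmf(5, 10, 0.5))                   # P(X = 5) for Bin(10, 0.5), about 0.246
```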

Categorical. Generalise Bernoulli from two outcomes to $K$ outcomes. The variable $X$ takes a value in $\{1, 2, \ldots, K\}$ with $P(X = k) = \pi_k$, subject to $\sum_k \pi_k = 1$. The vector $\boldsymbol{\pi}$ lives on the probability simplex. Every multi-class classifier (image classification, next-token prediction in language models, news-topic labelling) emits a categorical distribution at the output, typically by passing logits through a softmax. Some texts call this the multinoulli, but categorical is the standard name.

Multinomial. Multinomial is to categorical what binomial is to Bernoulli. Draw $n$ independent categorical samples and count how many fell into each category. The result is the count vector $(N_1, \ldots, N_K)$, with PMF $$ P(N_1 = n_1, \ldots, N_K = n_K) = \binom{n}{n_1\, n_2\, \cdots\, n_K} \prod_{k=1}^{K} \pi_k^{n_k}. $$ This is the natural distribution over bag-of-words counts in document modelling and the basis for naive Bayes text classifiers.
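A small sketch with a hypothetical three-word vocabulary and invented counts, just to make the PMF concrete:

```python
import math
import numpy as np

pi = [0.5, 0.3, 0.2]          # categorical probabilities over a 3-word vocabulary
counts = [6, 3, 1]            # observed bag-of-words counts; n = 10

# Multinomial PMF: multinomial coefficient times the product of pi_k ** n_k.
n = sum(counts)
coeff = math.factorial(n)
for c in counts:
    coeff //= math.factorial(c)
pmf = coeff * math.prod(p**c for p, c in zip(pi, counts))
print(pmf)                    # about 0.071

# Drawing a synthetic 10-word document from the same distribution.
rng = np.random.default_rng(0)
print(rng.multinomial(10, pi))   # a random count vector summing to 10
```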

Poisson. When events arrive independently at a constant average rate, the count of events in a fixed window follows a Poisson distribution. The PMF is $$ P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 0, 1, 2, \ldots $$ The mean and the variance are both equal to $\lambda$, which is a useful coincidence: it gives an immediate diagnostic for whether real count data are Poisson at all (compute the sample mean and the sample variance; if they differ, the data are over- or under-dispersed and a negative-binomial may fit better). With $\lambda = 2$, the probability of zero events is $e^{-2} \approx 0.135$, of exactly one event $2 e^{-2} \approx 0.271$, and of exactly two events $2 e^{-2} \approx 0.271$, the same as one event: since $P(X = k) = P(X = k - 1)\,\lambda/k$, the probabilities at $k = \lambda - 1$ and $k = \lambda$ coincide whenever $\lambda$ is an integer. Poisson models photons per pixel, requests per second, mutations per genome, goals scored per football match, and any other count of rare events. It is also the limit of the Binomial as $n \to \infty$, $p \to 0$, with $np = \lambda$ held fixed, which is why "rare events in a large pool" lands here. As a worked example, suppose a clinic sees a mean of $\lambda = 3$ severe-allergy presentations per day. The probability of seeing zero on a given day is $e^{-3} \approx 0.0498$, so roughly one day in twenty will be empty. The probability of six or more is about $0.084$, computed from $1 - \sum_{k=0}^{5} e^{-3} 3^{k}/k!$, meaning that even with a peaceful average of three per day, the clinic should expect a busy day with six or more presentations roughly once a fortnight.
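The clinic numbers can be reproduced in a few lines:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0                                                    # mean presentations per day
p_zero = poisson_pmf(0, lam)                                 # about 0.0498: an empty day
p_six_plus = 1 - sum(poisson_pmf(k, lam) for k in range(6))  # about 0.084: a busy day

print(p_zero, p_six_plus)
# Diagnostic for real count data: a Poisson forces mean == variance, so compare
# the sample mean and sample variance before assuming it.
```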

Geometric. Keep tossing a biased coin (success probability $p$) until the first success; let $X$ be the trial on which it occurs. Then $P(X = k) = (1 - p)^{k-1} p$, with mean $1/p$ and variance $(1 - p)/p^{2}$. The geometric distribution is memoryless, which is to say that having failed $m$ times tells you nothing about how long you must wait next. Variable-length codes in information theory and the runtime analysis of randomised algorithms both lean on it. As an example, a hashing scheme that resolves collisions by retrying with a fresh hash function until success will, on average, need $1/p$ attempts where $p$ is the per-trial success probability, a one-line consequence of the geometric distribution's mean.
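A quick simulation of the retry example, assuming a per-trial success probability of $p = 0.25$ (an invented number):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25                          # hypothetical per-trial success probability

# Number of trials up to and including the first success, repeated many times.
trials = rng.geometric(p, size=100_000)
print(trials.mean())              # close to 1/p = 4
print(trials.var())               # close to (1 - p)/p**2 = 12
```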

Continuous distributions

Continuous distributions describe quantities measured on a continuum: lengths, weights, errors, durations, probabilities. Probabilities now sit on intervals (the PDF) and integrate to one. Probability at a single point is always zero.

Uniform. $X \sim \text{Uniform}(a, b)$ has constant density $1/(b - a)$ on $[a, b]$ and zero elsewhere. The mean is $(a + b)/2$ and the variance is $(b - a)^{2}/12$. The uniform is the maximum-entropy distribution on a bounded interval; it is what you choose when you know the support and nothing else. Pseudo-random number generators produce uniforms on $[0, 1]$, from which any other distribution can be sampled by inverse-CDF transformation.

Gaussian (Normal). The single most important distribution in statistics. $X \sim \mathcal{N}(\mu, \sigma^{2})$ has density $$ f(x) = \frac{1}{\sqrt{2\pi \sigma^{2}}} \exp\!\left(- \frac{(x - \mu)^{2}}{2\sigma^{2}}\right). $$ The mean is $\mu$ and the variance is $\sigma^{2}$. The standard normal has $\mu = 0$ and $\sigma^{2} = 1$. Three numbers worth memorising once and remembering forever: a Gaussian places about 68% of its mass within one standard deviation of the mean, about 95% within two, and about 99.7% within three. So if exam scores are $\mathcal{N}(100, 15^{2})$, then about 68% of candidates score between 85 and 115, and 99.7% score between 55 and 145.

The Gaussian is everywhere because of the central limit theorem (section 4.9): sums of many small independent effects tend to a Gaussian regardless of the underlying distribution. Hence measurement noise, reaction times, and aggregate effects look Gaussian. In AI the Gaussian is the default likelihood for continuous targets in regression, the default prior on neural-network weights, the noise process in diffusion models, the latent-code distribution in variational autoencoders, and the perturbation in countless robustness analyses.

A quick worked example to fix the 68/95/99.7 rule. Suppose neural-network weights at initialisation are drawn from $\mathcal{N}(0, 0.01)$, so $\sigma = 0.1$. About 68% of weights will lie in $[-0.1, 0.1]$, about 95% in $[-0.2, 0.2]$, and about 99.7% in $[-0.3, 0.3]$. A weight of magnitude $0.5$, five standard deviations from zero, would be vanishingly rare under this initialisation: it should occur with probability less than one in a million per weight. This is the kind of back-of-envelope reasoning that turns "Gaussian" from a word into a tool.
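The 68/95/99.7 figures and the five-sigma claim can be checked directly from the standard normal CDF, which needs nothing beyond the error function:

```python
import math

def normal_cdf(z):
    """CDF of the standard normal at z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    mass = normal_cdf(k) - normal_cdf(-k)
    print(k, round(mass, 4))          # 0.6827, 0.9545, 0.9973

# Probability that a weight drawn from N(0, 0.1**2) exceeds 0.5 in magnitude
# (five standard deviations): about 5.7e-7, i.e. under one in a million.
print(2 * (1 - normal_cdf(5)))
```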

Exponential. $X \sim \text{Exp}(\lambda)$ has density $\lambda e^{-\lambda x}$ on $[0, \infty)$. Mean $1/\lambda$, variance $1/\lambda^{2}$. Like the geometric (its discrete cousin), the exponential is memoryless: $P(X > s + t \mid X > s) = P(X > t)$. It is the distribution of waiting times between Poisson events. If a server gets a Poisson($\lambda$) stream of requests per second, the gaps between requests are Exponential($\lambda$).
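This is also where the inverse-CDF trick from the uniform pays off: the exponential CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form, so uniform draws on $[0, 1]$ convert directly into exponential waiting times. A sketch, with an illustrative rate:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                 # hypothetical rate: requests per second

u = rng.uniform(0.0, 1.0, size=100_000)   # raw uniforms
waits = -np.log(1.0 - u) / lam            # inverse CDF of Exp(lam)

print(waits.mean())                       # close to 1/lam = 0.5
print(waits.var())                        # close to 1/lam**2 = 0.25
```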

Gamma. A flexible non-negative distribution with two parameters: $X \sim \text{Gamma}(\alpha, \beta)$ has density proportional to $x^{\alpha - 1} e^{-\beta x}$ on $[0, \infty)$, mean $\alpha/\beta$, and variance $\alpha/\beta^{2}$. The exponential is the special case $\alpha = 1$. The Gamma is the conjugate prior for the Poisson rate and for the precision (inverse variance) of a Gaussian, which is why it appears all over Bayesian inference.
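The Gamma-Poisson update has the same pseudo-count flavour as the Beta-Bernoulli one below: starting from a Gamma$(\alpha, \beta)$ prior on the rate and observing counts $k_1, \ldots, k_n$, the posterior is Gamma$(\alpha + \sum_i k_i,\ \beta + n)$. A minimal sketch with invented counts:

```python
alpha, beta = 2.0, 1.0            # prior on the Poisson rate: Gamma(2, 1), prior mean 2
counts = [3, 5, 2, 4, 3]          # hypothetical daily event counts

alpha_post = alpha + sum(counts)  # 2 + 17 = 19
beta_post = beta + len(counts)    # 1 + 5 = 6

print(alpha_post / beta_post)     # posterior mean rate, about 3.17
```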

Beta. $X \sim \text{Beta}(\alpha, \beta)$ lives on $[0, 1]$ with density proportional to $x^{\alpha - 1} (1 - x)^{\beta - 1}$. Mean $\alpha/(\alpha + \beta)$. The shape varies enormously: $\alpha = \beta = 1$ gives the uniform; $\alpha, \beta < 1$ gives a U-shape; $\alpha = \beta$ large gives a tight bell around $0.5$. The Beta is the conjugate prior for any Bernoulli or binomial parameter $p$, which means: if you start with a Beta$(\alpha, \beta)$ prior and observe $s$ successes in $n$ trials, your posterior is Beta$(\alpha + s,\ \beta + n - s)$. The hyperparameters act as prior pseudo-counts. This update rule alone unlocks Bayesian A/B testing, Thompson sampling, and bandit algorithms.
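The conjugate update, plus one Thompson-sampling step for a two-arm bandit, fits in a few lines (the counts are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from Beta(1, 1) priors (uniform), then fold in hypothetical observations.
arms = {"A": (1 + 42, 1 + 158),   # 42 successes in 200 trials
        "B": (1 + 30, 1 + 90)}    # 30 successes in 120 trials

# Thompson sampling: draw once from each posterior, pull the arm with the largest draw.
draws = {name: rng.beta(a, b) for name, (a, b) in arms.items()}
chosen = max(draws, key=draws.get)

print({name: round(a / (a + b), 3) for name, (a, b) in arms.items()})   # posterior means
print(chosen)
```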

Dirichlet. The multivariate generalisation of the Beta. A draw from a Dirichlet$(\boldsymbol{\alpha})$ is a vector of probabilities summing to one, a point on the simplex. It is the conjugate prior for the categorical and multinomial. Dirichlets are the prior over topic distributions in latent Dirichlet allocation (LDA) and over policies in some reinforcement-learning algorithms. The hyperparameter vector $\boldsymbol{\alpha}$ has an intuitive reading: large equal entries concentrate the prior near a uniform mixture; small entries push the prior towards sparse, spiky vectors with most mass on one or two categories. Sparse Dirichlets are exactly the inductive bias topic models use to make individual documents focus on a handful of topics rather than smearing across all of them.
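The effect of the concentration parameter is easiest to see by drawing from two Dirichlets, one with large equal entries and one with small entries (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

dense = rng.dirichlet([10.0] * 5)    # large alphas: draws sit near the uniform (0.2 each)
sparse = rng.dirichlet([0.1] * 5)    # small alphas: most mass lands on one or two entries

print(np.round(dense, 3))
print(np.round(sparse, 3))
```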

Distributions over functions and beyond

Once you understand the univariate menu above, four further distributions cover almost all of what you will meet in modern AI papers.

Multivariate Gaussian. A distribution over vectors $\mathbf{x} \in \mathbb{R}^{d}$, written $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The mean is a vector $\boldsymbol{\mu}$ and the covariance is a positive semi-definite matrix $\boldsymbol{\Sigma}$. The density involves $\det(\boldsymbol{\Sigma})$ and $\boldsymbol{\Sigma}^{-1}$. The simplest case is the isotropic Gaussian with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$: each coordinate is an independent standard normal. Modern generative models, diffusion, normalising flows, VAEs, work by learning a smooth transformation from such a simple isotropic Gaussian into the complicated distribution of images, sounds, or text embeddings. Section 4.10 treats the multivariate Gaussian carefully.
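Sampling from a general multivariate Gaussian reduces to sampling the isotropic one and applying a linear map: if $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\top}$ is a Cholesky factorisation, then $\boldsymbol{\mu} + \mathbf{L}\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. A sketch with an invented two-dimensional covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # must be positive definite for Cholesky

L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((100_000, 2))   # isotropic standard normal draws
x = mu + z @ L.T                        # correlated draws: mean mu, covariance Sigma

print(x.mean(axis=0))                   # close to mu
print(np.cov(x, rowvar=False))          # close to Sigma
```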

Mixture distributions. A mixture is a weighted sum of simpler component distributions: $p(x) = \sum_{k=1}^{K} \pi_{k}\, p_{k}(x)$, with mixing weights $\pi_{k} \geq 0$ summing to one. Gaussian mixture models (GMMs) are the canonical case and form the backbone of soft clustering. Mixtures handle multi-modal data (distributions with more than one peak) that a single Gaussian cannot. The expectation-maximisation algorithm fits the components and weights of a GMM iteratively, and the same machinery underpins many older speech-recognition systems where each phoneme is modelled by a small Gaussian mixture.
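A mixture density is literally the weighted sum in the formula above. A two-component Gaussian mixture with invented parameters, evaluated at a few points:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical two-component GMM with peaks near -2 and +3.
weights = [0.4, 0.6]
means = [-2.0, 3.0]
sigmas = [1.0, 0.5]

def gmm_pdf(x):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

print(gmm_pdf(-2.0), gmm_pdf(0.5), gmm_pdf(3.0))   # high, low, high: two separate peaks
```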

Student's $t$. The $t$-distribution looks like a Gaussian but with heavier tails. The parameter $\nu$ controls the tail weight; as $\nu \to \infty$ it becomes Gaussian. With small $\nu$ (say 3 or 4) it accommodates outliers gracefully, which is why robust regressions and Bayesian posteriors with unknown variance feature the $t$.

Laplace. Density proportional to $e^{-|x - \mu|/b}$. Sharper peak than the Gaussian, heavier tails. Two AI roles: it is the noise distribution that the differential-privacy mechanism adds for $\epsilon$-DP, and a Laplace prior on weights yields the L1 (LASSO) penalty. The connection between a Laplace prior and L1 regularisation is exact, not metaphorical: maximising the log-posterior of weights under a Laplace prior is equivalent to minimising the squared-error loss plus a $\lambda \sum_i |w_i|$ penalty. This is one of the cleanest examples of a modelling choice (heavy-tailed prior over weights) translating directly into an algorithmic choice (sparse fitting).
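The equivalence is worth writing out once. Assuming a Gaussian likelihood with noise variance $\sigma^{2}$ and independent Laplace$(0, b)$ priors on the weights, the negative log-posterior is, up to additive constants, $$ -\log p(\mathbf{w} \mid \mathcal{D}) = \frac{1}{2\sigma^{2}} \sum_{n} \left(y_n - \mathbf{w}^{\top}\mathbf{x}_n\right)^{2} + \frac{1}{b} \sum_{i} |w_i| + \text{const}, $$ so minimising it is squared-error loss plus an L1 penalty with $\lambda = \sigma^{2}/b$ after multiplying through by $\sigma^{2}$. Swapping the Laplace prior for a Gaussian prior turns the $\sum_i |w_i|$ term into $\sum_i w_i^{2}$, the ridge (L2) penalty, by the same argument.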

Choosing a distribution: a small flowchart

Most modelling decisions begin with a single question: what shape does the data take? The list below covers most cases.

  • Binary outcome (yes/no, click/no-click) → Bernoulli.
  • One of $K$ classes → Categorical.
  • Counts in $K$ classes from $n$ trials → Multinomial.
  • Counts of independent rare events in a window → Poisson.
  • Trials until first success → Geometric.
  • Continuous, symmetric, light-tailed (errors, sums) → Gaussian.
  • Continuous, non-negative, waiting times → Exponential (memoryless) or Gamma (flexible).
  • A probability or proportion in $[0, 1]$ → Beta.
  • A vector of probabilities on the simplex → Dirichlet.
  • Continuous with known outliers or fat tails → Student-$t$ or Laplace.
  • Multivariate continuous → Multivariate Gaussian.
  • Multi-modal continuous → Mixture of Gaussians.

In practice, the choice of distribution is often baked into the model. A logistic-regression layer assumes Bernoulli outputs; a softmax layer assumes categorical outputs; a regression head with squared-error loss assumes Gaussian noise. Recognising which assumption is in play is the first step to questioning it.

A second useful habit is to ask three questions of any candidate distribution before reaching for it. First, what is the support, the set of values where the density or PMF is non-zero? A negative loss makes no sense; a probability greater than one makes no sense; a count of $-3$ makes no sense. Matching support to data is a non-negotiable first check. Second, what are the moments, the mean and the variance, and do they roughly match the data? A Poisson assumes mean equals variance; if the sample variance is twice the sample mean, Poisson is the wrong shape. Third, what are the tails like? Gaussian tails fall off as $e^{-x^2/2}$, which is fast; a single rare event ten standard deviations from the mean is essentially impossible under a Gaussian but routine under a Student-$t$ with three degrees of freedom. If your data have outliers (financial returns, network latencies, sensor errors), a heavy-tailed distribution is usually safer than a Gaussian.
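The tail comparison can be made concrete. Under a standard Gaussian, the probability of landing more than five standard deviations above the mean is about $3 \times 10^{-7}$; a Student-$t$ with three degrees of freedom puts orders of magnitude more mass out there. A sketch, assuming SciPy is available:

```python
from scipy import stats

for x in (3, 5, 10):
    gauss_tail = stats.norm.sf(x)     # P(X > x) under a standard normal
    t_tail = stats.t.sf(x, df=3)      # P(X > x) under a Student-t with 3 degrees of freedom
    print(x, gauss_tail, t_tail, t_tail / gauss_tail)
# At x = 5 the ratio is already in the tens of thousands.
```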

Where these appear in AI

The same six or seven distributions account for almost every model you will meet. A short tour, by application area:

  • Logistic regression outputs a Bernoulli per sample, with the success probability set by a sigmoid of a linear function of the features. The cross-entropy loss is the negative log-likelihood of the Bernoulli.
  • Multi-class classifiers (and language models for next-token prediction) output a categorical distribution per sample, with the probability vector set by a softmax of logits. Cross-entropy is again the negative log-likelihood, this time of the categorical.
  • Linear regression under a squared-error loss is the maximum-likelihood fit of a Gaussian likelihood with constant variance. L2 regularisation corresponds to a Gaussian prior on the weights; L1 regularisation corresponds to a Laplace prior.
  • Variational autoencoders combine an isotropic Gaussian prior over latent codes, a Gaussian (or Bernoulli, for binary pixels) likelihood for the data, and a Gaussian variational posterior. Three Gaussians in one model.
  • Diffusion models add Gaussian noise across many steps to destroy the data, and learn a reverse process whose conditional transitions are also Gaussian. The forward process is a Gaussian Markov chain by construction.
  • Topic models (LDA) put a Dirichlet prior on per-document topic distributions and a Dirichlet prior on per-topic word distributions, with a categorical likelihood for each observed word.
  • Bayesian neural networks typically place independent Gaussian priors on every weight, exactly the prior whose maximum-a-posteriori estimate is L2-regularised training.
  • Bandits and Bayesian A/B testing maintain a Beta posterior over each arm's success probability, updating in closed form after every observation.
  • Robust regression swaps the Gaussian likelihood for a Student-$t$ likelihood, which prevents a single large residual from dominating the fit.
  • Differential privacy adds Laplace (or Gaussian) noise calibrated to the sensitivity of the query, granting a formal privacy guarantee.

The recurring lesson is that the choice of distribution is a modelling choice, not a fact about the world. We use Gaussians not because data are exactly Gaussian, but because the assumption is convenient and often nearly true. Real heights are not perfectly Gaussian (they are bounded below by zero and above by physical limits), yet a Gaussian model gives accurate predictions across most of the population, and that is the bargain modelling always strikes.

It is also worth noticing how often the same distribution recurs across very different parts of an AI system. A modern transformer language model uses a Gaussian initialisation for its weights, a Bernoulli or categorical at every output position, a categorical over tokens at decoding time, and (if it has been fine-tuned with RLHF) a Bernoulli over preference judgements during training. A self-driving stack might use a Gaussian for state estimation in a Kalman filter, a categorical over manoeuvres in a high-level planner, a Poisson for incoming sensor messages, and a mixture of Gaussians to represent multi-modal beliefs about other vehicles' future trajectories. The same six or seven shapes, recombined.

One last warning. Real data are messy. They have missing values, contamination, censoring, batch effects, and structure that no single tidy distribution captures. The catalogue here is a set of building blocks, not a set of claims about how the world really works. The job is to pick blocks whose shape matches the data, combine them sensibly, and check that the resulting model makes calibrated predictions on held-out examples. Whenever a model is going badly, one of the first questions to ask is whether the assumed distribution actually fits.

What you should take away

  1. A small zoo covers almost everything. Bernoulli, categorical, Poisson, Gaussian, exponential, Beta and Dirichlet, plus the multivariate Gaussian, account for the vast majority of models in modern AI. Learn this list cold.
  2. Match the distribution to the data shape. Binary outcomes call for Bernoulli, $K$-class outcomes for categorical, counts of rare events for Poisson, continuous symmetric variables for Gaussian, non-negative waiting times for exponential or gamma, probabilities for Beta.
  3. Memorise the Gaussian rule of thumb: 68/95/99.7. A Gaussian places about 68% of its mass within one standard deviation, 95% within two, and 99.7% within three. This rule is the quickest sanity check in applied statistics.
  4. Conjugate priors give cheap Bayesian updates. Beta updates Bernoulli, Dirichlet updates categorical, Gamma updates Poisson, and Gaussian updates the mean of a Gaussian. The hyperparameters behave as pseudo-counts, and the posterior stays in the same family.
  5. Most loss functions are negative log-likelihoods of a distribution. Cross-entropy for Bernoulli or categorical, squared error for Gaussian, absolute error for Laplace. Recognising the implicit distribution lets you swap losses sensibly, change priors, and reason about uncertainty rather than just point predictions.
