4.13 Sampling
A probability distribution is a mathematical object: a function that assigns weights to outcomes. To turn that object into something a computer can use, we need to do more than write down the formula. We need to be able to draw samples from it, to ask the machine, "Give me a value of $X$ that behaves as if it came from this distribution," and to repeat that request thousands or millions of times.
Almost every interesting calculation in modern AI eventually reduces to a sample. We sample from a Gaussian to initialise the weights of a neural network. We sample tokens from a softmax to make a language model speak. We sample latent codes from a posterior to do Bayesian inference. We sample noisy images and gradually denoise them to make a diffusion model paint. We sample trajectories from a policy to estimate the value of a reinforcement-learning agent. The mathematics in earlier sections told us what the distribution is; sampling tells us how to use it.
The key conceptual move is this. Many of the quantities we care about are expectations: averages of some function $f$ taken under a distribution $\pi$. When the integral $$\mathbb{E}_\pi[f(X)] = \int f(x)\, \pi(x)\, dx$$ has no closed form, which is the rule rather than the exception once neural networks enter the picture, we approximate it by drawing samples $X_1, \dots, X_n$ from $\pi$ and averaging $f(X_i)$. The law of large numbers (§4.9) guarantees this works in the limit. So the practical question becomes: how do we draw the samples?
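To make this concrete, here is a minimal Monte Carlo sketch in NumPy. The choices of $f$ and $\pi$ are purely illustrative: $f(x) = x^2$ under a standard Gaussian, so the true answer is $\mathbb{E}[X^2] = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: f(x) = x^2 under a standard Gaussian target,
# so the true expectation is Var(X) = 1.
f = lambda x: x ** 2
samples = rng.standard_normal(1_000_000)  # X_1, ..., X_n ~ pi
estimate = f(samples).mean()              # (1/n) * sum_i f(X_i)

print(estimate)  # close to 1.0, by the law of large numbers
```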
This section is a catalogue of the four foundational sampling methods: the inverse-CDF method, rejection sampling, importance sampling, and Markov chain Monte Carlo (MCMC). Alongside them sit two specialisations: Box-Muller for Gaussians, and the family of deep generative samplers. Each method appears repeatedly in the rest of the book. §4.10 introduced the multivariate Gaussian; here we show how to sample from it. Chapter 14 (generative models) builds whole architectures around sampling. Chapter 15 (reinforcement learning) lives or dies by the variance of importance-sampled estimators. Anywhere an integral resists pen and paper, a sampler steps in.
Inverse-CDF method
The simplest sampler in existence rests on a single observation: if a continuous random variable $X$ has cumulative distribution function $F$, then $F(X)$ is itself uniformly distributed on $[0, 1]$. Run the argument backwards and the recipe falls out. Sample $U \sim \text{Uniform}(0, 1)$, then return $X = F^{-1}(U)$. The resulting $X$ has CDF $F$, exactly the target distribution. Two lines of code, no waste, no rejection.
Worked example: the Exponential distribution with rate $\lambda$. Its CDF is $$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0.$$ To invert, set $u = 1 - e^{-\lambda x}$ and solve for $x$: $$1 - u = e^{-\lambda x} \implies -\ln(1-u) = \lambda x \implies x = -\frac{\ln(1-u)}{\lambda}.$$ So the quantile function is $F^{-1}(u) = -\ln(1-u)/\lambda$. To sample one waiting time from an Exponential$(\lambda)$, draw $U$ from a uniform random-number generator and return $X = -\ln(1-U)/\lambda$. (In practice, since $1-U$ is also Uniform$(0,1)$, implementations often write $-\ln(U)/\lambda$ to save a subtraction.)
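A minimal NumPy sketch of this recipe (the function name sample_exponential is ours; log1p(-u) computes $\ln(1-u)$ accurately for small $u$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam: float, n: int) -> np.ndarray:
    """Inverse-CDF samples from Exponential(lam)."""
    u = rng.uniform(size=n)       # U ~ Uniform(0, 1)
    return -np.log1p(-u) / lam    # X = F^{-1}(U) = -ln(1 - U) / lam

x = sample_exponential(lam=2.0, n=100_000)
print(x.mean())  # should be close to 1/lam = 0.5
```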
The Gumbel distribution is another classic: $F^{-1}(u) = -\ln(-\ln u)$. This underwrites the Gumbel-Max trick, a useful identity for sampling from a discrete distribution. If you want to sample from a Categorical with probabilities $\pi_1, \dots, \pi_K$, draw independent Gumbels $g_1, \dots, g_K$ and return $\arg\max_k (\log \pi_k + g_k)$. The trick reappears in attention mechanisms, in reinforcement learning, and in every paper that needs a differentiable approximation to a discrete sample.
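A sketch of the trick for a small categorical; the probabilities and the helper name gumbel_max_sample are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(log_probs: np.ndarray) -> int:
    """One Categorical draw via the Gumbel-Max trick."""
    # Standard Gumbels by inverse CDF: g = -ln(-ln U).
    g = -np.log(-np.log(rng.uniform(size=log_probs.shape)))
    return int(np.argmax(log_probs + g))

probs = np.array([0.1, 0.6, 0.3])
draws = [gumbel_max_sample(np.log(probs)) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))  # approximately [0.1, 0.6, 0.3]
```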
The catch with inverse-CDF sampling is the flip side of its strength. It works only where $F^{-1}$ is computable, and many useful distributions, including the Gaussian, have no closed-form quantile function. For those, we need other tools. But where it applies, inverse-CDF is the gold standard: each uniform draw produces exactly one target sample, with no rejection, no chains, no asymptotics.
Box-Muller and Gaussian sampling
The Gaussian is the most-sampled distribution in computing, and yet its CDF, the error function, has no elementary inverse. The classical workaround, due to Box and Muller in 1958, is a change of variables. Take two independent uniforms $U_1, U_2 \sim \text{Uniform}(0, 1)$ and form $$Z_1 = \sqrt{-2\ln U_1}\cos(2\pi U_2), \qquad Z_2 = \sqrt{-2\ln U_1}\sin(2\pi U_2).$$ Then $Z_1$ and $Z_2$ are independent standard Gaussians. The proof is a change of variables in polar coordinates: $\sqrt{-2\ln U_1}$ generates a Rayleigh-distributed radius (the marginal length of a 2D Gaussian) and $2\pi U_2$ generates a uniform angle.
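A direct transcription into NumPy, for illustration only (library routines are faster, as discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n: int) -> np.ndarray:
    """Turn 2n uniforms into 2n independent standard Gaussians."""
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))   # Rayleigh-distributed radius
    theta = 2.0 * np.pi * u2         # uniform angle
    return np.concatenate([r * np.cos(theta), r * np.sin(theta)])

z = box_muller(50_000)
print(z.mean(), z.std())  # approximately 0 and 1
```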
In production code, Box-Muller has been superseded by faster algorithms: Marsaglia's polar method avoids the trigonometric calls, and the Ziggurat algorithm uses precomputed rectangles to make the common case of "small $|z|$" essentially branch-free. NumPy's np.random.standard_normal and PyTorch's torch.randn use these internally. You almost never code a Gaussian sampler by hand; you just call randn. But knowing that two uniforms become one Gaussian via Box-Muller is the kind of structural fact that demystifies what "the GPU sampling Gaussians" actually means.
For a multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the recipe extends cleanly. First, factor the covariance: $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$, where $\mathbf{L}$ is lower-triangular (the Cholesky decomposition; see §2.4a). Sample a vector of independent standard normals $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Return $\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}$. A short calculation confirms the mean and covariance: $\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}$ and $\operatorname{Cov}(\mathbf{X}) = \mathbf{L}\operatorname{Cov}(\mathbf{Z})\mathbf{L}^\top = \mathbf{L}\mathbf{L}^\top = \boldsymbol{\Sigma}$.
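In code, with an illustrative mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])           # illustrative parameters
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)        # Sigma = L @ L.T, L lower-triangular
z = rng.standard_normal((10_000, 2)) # each row: independent standard normals
x = mu + z @ L.T                     # each row: mu + L z, one sample

print(np.cov(x, rowvar=False))       # approximately Sigma
```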
This recipe has a second life as the reparameterisation trick, which makes variational autoencoders trainable. Instead of asking "sample $\mathbf{X}$ from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$", which leaves no path for gradients to flow through $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, we ask "sample $\mathbf{Z}$ from a fixed $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and then compute $\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}$." Now $\mathbf{X}$ is a deterministic function of the parameters and a parameter-free random source, so backpropagation flows through it as if it were any other differentiable layer. We meet this idea again in Chapter 14.
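A minimal PyTorch sketch of the trick, using the diagonal-covariance case common in VAE encoders (a full-covariance version would use the Cholesky factor above); the parameter values and the loss are placeholders:

```python
import torch

# Illustrative diagonal-Gaussian parameters, as a VAE encoder might emit.
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

eps = torch.randn(2)                     # parameter-free noise Z ~ N(0, I)
x = mu + torch.exp(0.5 * log_var) * eps  # X = mu + sigma * Z

loss = (x ** 2).sum()   # stand-in for any downstream differentiable loss
loss.backward()         # gradients flow into mu and log_var
print(mu.grad, log_var.grad)
```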
Rejection sampling
What if you cannot invert the CDF, and there is no Cholesky shortcut? Rejection sampling offers a brute-force solution. Pick a proposal distribution $q$ that you can sample from and that covers your target $\pi$ in the sense that $\pi(x) \le M\, q(x)$ for some constant $M$ and all $x$. The recipe (sketched in code after the list):
- Sample $X \sim q$.
- Sample $U \sim \text{Uniform}(0, 1)$.
- If $U \le \pi(X) / (M\, q(X))$, accept $X$. Otherwise reject and return to the first step.
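A sketch with an illustrative target, the Beta$(2, 2)$ density $\pi(x) = 6x(1-x)$ on $[0, 1]$, and a Uniform$(0, 1)$ proposal, for which the tight bound is $M = 1.5$ (so about two of every three proposals are accepted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: Beta(2, 2), pi(x) = 6 x (1 - x) on [0, 1].
# Proposal: Uniform(0, 1), so q(x) = 1 and the tight bound is M = 1.5.
pi = lambda x: 6.0 * x * (1.0 - x)
M = 1.5

def rejection_sample(n: int) -> np.ndarray:
    accepted = []
    while len(accepted) < n:
        x = rng.uniform()        # X ~ q
        u = rng.uniform()        # U ~ Uniform(0, 1)
        if u <= pi(x) / M:       # accept with probability pi(x) / (M q(x))
            accepted.append(x)
    return np.array(accepted)

x = rejection_sample(100_000)
print(x.mean())  # Beta(2, 2) has mean 0.5
```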
The accepted samples are exactly distributed according to $\pi$. The geometric intuition is that we are throwing darts uniformly under the curve $M\, q(x)$ and keeping only those that also fall under $\pi(x)$. The ratio of accepted to thrown is $1/M$, so $M$ is literally the inefficiency factor: if $M = 100$, you expect to draw a hundred proposals for every accepted sample.
The fatal weakness is the curse of dimensionality. To bound $\pi(x) \le M\, q(x)$ everywhere, $M$ must accommodate the worst-case ratio, and in high dimensions even modest mismatches between $\pi$ and $q$ make this constant astronomically large. For a standard Gaussian target in $d$ dimensions and a Gaussian proposal with twice the variance, the tightest constant is $M = 2^{d/2}$, exponential in $d$. This is why rejection sampling has largely vanished from modern deep learning: nobody samples a 1024-dimensional latent vector by rejection.
Where it survives is in low-dimensional special cases: sampling from truncated distributions (a Gaussian restricted to $[a, b]$), inverse problems with bounded support, and as a teaching device. It also lives on inside library code: scipy.stats uses rejection sampling internally for several distributions, so even when you do not invoke it directly, it may be running on your behalf.
Importance sampling
Sometimes we do not actually need samples from $\pi$; we only need to estimate an expectation under $\pi$. Importance sampling exploits this. It says: sample from a convenient $q$, and correct for the mismatch with a weight.
The algebra is one line: $$\mathbb{E}_\pi[f(X)] = \int f(x)\, \pi(x)\, dx = \int f(x)\, \frac{\pi(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\!\left[\frac{\pi(X)}{q(X)} f(X)\right].$$ So if we draw $X_1, \dots, X_n \sim q$, the estimator $$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n \frac{\pi(X_i)}{q(X_i)} f(X_i)$$ is unbiased for $\mathbb{E}_\pi[f(X)]$. The ratio $w_i = \pi(X_i)/q(X_i)$ is the importance weight. Areas where $\pi$ exceeds $q$ get up-weighted; areas where $\pi$ falls below $q$ get down-weighted.
The catch is variance. The variance of $\hat{\mu}_n$ depends on how well $q$ matches $\pi f$ over the regions where $|f|$ is large. If $q$ has thin tails where $\pi$ has fat ones, a single sample can land in a low-$q$ high-$\pi$ region, generate an enormous weight, and dominate the sum. The estimator remains unbiased, but its variance can be effectively infinite, and for any finite $n$ you might never see the rare event that should swing the estimate.
The standard remedy is self-normalised importance sampling: replace the weights with $\tilde{w}_i = w_i / \sum_j w_j$ and report $\sum_i \tilde{w}_i f(X_i)$. This trades a small bias (it is now consistent rather than unbiased) for far better variance, and removes the need to know the normalising constant of $\pi$.
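A sketch of both estimators, with illustrative Gaussian choices of target and proposal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative setup: target pi = N(3, 1), proposal q = N(0, 2),
# estimating E_pi[f(X)] for f(x) = x (true answer: 3).
f = lambda x: x
x = rng.normal(loc=0.0, scale=2.0, size=100_000)                       # X_i ~ q
w = norm.pdf(x, loc=3.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=2.0)  # pi/q

plain = np.mean(w * f(x))                  # unbiased estimator
self_norm = np.sum(w * f(x)) / np.sum(w)   # self-normalised, consistent
print(plain, self_norm)                    # both approximately 3.0
```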
Importance sampling is everywhere in modern AI. Off-policy reinforcement learning corrects for the gap between behaviour policy and target policy with importance ratios. The REINFORCE estimator is importance sampling in disguise. PPO (the workhorse of LLM fine-tuning) clips importance ratios to bound their variance. Particle filters in robotics propagate weighted samples. Whenever you see a weighted average over draws from a "wrong" distribution, importance sampling is the engine.
MCMC: Markov Chain Monte Carlo
The most powerful, and most patient, sampling family is Markov chain Monte Carlo. The idea is counterintuitive. Instead of constructing a sampler that emits independent draws from $\pi$, construct a Markov chain whose long-run stationary distribution is $\pi$. Run the chain for many steps. After a burn-in period, the states of the chain are (approximately) samples from $\pi$. The samples are correlated, not independent, but for estimating expectations that often does not matter.
The headline algorithm is Metropolis-Hastings. Pick a proposal kernel $q(x' \mid x)$ that suggests a new state $x'$ given the current state $x$. At each step (a minimal sketch in code follows the list):
- Propose $X' \sim q(\cdot \mid X)$.
- Compute the acceptance ratio $$\alpha = \min\!\left(1,\ \frac{\pi(X')}{\pi(X)} \cdot \frac{q(X \mid X')}{q(X' \mid X)}\right).$$
- Accept $X'$ with probability $\alpha$; otherwise stay at $X$.
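A minimal random-walk Metropolis sketch, targeting an illustrative unnormalised bimodal density (the symmetric proposal makes the $q$-ratio cancel):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target: equal mixture of N(-2, 1) and N(2, 1) kernels,
# known only up to a constant; logaddexp keeps it numerically stable.
log_pi = lambda x: np.logaddexp(-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2)

def metropolis_hastings(n_steps: int, step: float = 1.0) -> np.ndarray:
    x = 0.0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()  # symmetric proposal
        # alpha = min(1, pi(x')/pi(x)); accept if log U < log ratio
        if np.log(rng.uniform()) < log_pi(x_prop) - log_pi(x):
            x = x_prop
        chain[t] = x
    return chain

chain = metropolis_hastings(50_000)
print(chain[10_000:].mean())  # approximately 0 by symmetry, after burn-in
```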
A short calculation (the detailed balance condition) shows the chain has $\pi$ as its stationary distribution. Crucially, the acceptance ratio depends on $\pi$ only through ratios, so the unknown normalising constant of $\pi$ cancels out. This is what makes MCMC indispensable for Bayesian inference, where the posterior $p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta)$ is known only up to a constant.
Gibbs sampling is the special case where the proposal is "resample one coordinate from its conditional distribution given all the others". Cycle through coordinates and acceptance is automatic ($\alpha = 1$). Gibbs is the engine behind Latent Dirichlet Allocation and many graphical-model posteriors.
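A sketch for an illustrative bivariate Gaussian target with correlation $\rho$, chosen because its full conditionals are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard bivariate Gaussian with correlation rho. Each full
# conditional is Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2), and vice versa.
rho = 0.8
sd = np.sqrt(1.0 - rho ** 2)

def gibbs(n_steps: int) -> np.ndarray:
    x1, x2 = 0.0, 0.0
    chain = np.empty((n_steps, 2))
    for t in range(n_steps):
        x1 = rng.normal(rho * x2, sd)  # resample coordinate 1 | coordinate 2
        x2 = rng.normal(rho * x1, sd)  # resample coordinate 2 | coordinate 1
        chain[t] = (x1, x2)
    return chain

chain = gibbs(50_000)
print(np.corrcoef(chain[5_000:].T)[0, 1])  # approximately rho = 0.8
```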
Hamiltonian Monte Carlo (HMC) is the modern champion for continuous high-dimensional posteriors. It uses the gradient of $\log \pi$ to simulate a fictional physical system, letting proposals travel large distances through the state space while preserving the right distribution. HMC and its adaptive variant NUTS (No-U-Turn Sampler) power Stan, PyMC, and most production Bayesian software.
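A bare-bones HMC sketch under simplifying assumptions: unit mass matrix, fixed step size and path length, and a standard 2D Gaussian target so the gradient is trivial. Production samplers such as NUTS adapt all of these.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: standard 2D Gaussian, log pi(x) = -||x||^2 / 2.
log_pi = lambda x: -0.5 * np.sum(x ** 2)
grad_log_pi = lambda x: -x

def hmc_step(x: np.ndarray, step: float = 0.2, n_leapfrog: int = 10) -> np.ndarray:
    p = rng.standard_normal(x.shape)            # resample momentum
    x_new, p_new = x.copy(), p.copy()
    for _ in range(n_leapfrog):                 # leapfrog integration
        p_new += 0.5 * step * grad_log_pi(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_log_pi(x_new)
    # Metropolis correction on the Hamiltonian (potential + kinetic energy)
    log_alpha = (log_pi(x_new) - 0.5 * p_new @ p_new) \
              - (log_pi(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_alpha else x

x = np.zeros(2)
samples = []
for _ in range(5_000):
    x = hmc_step(x)
    samples.append(x)
print(np.mean(samples, axis=0))  # approximately [0, 0]
```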
The price MCMC pays is correlation and convergence diagnostics. You never know exactly when the chain has reached its stationary distribution, and consecutive samples are dependent. Practitioners run multiple chains, compute $\hat{R}$ statistics, discard burn-in, and thin samples to manage these pathologies. MCMC is slow per effective sample, but it can sample from anything you can evaluate, even only up to a constant. That generality is unique.
Sampling in deep generative models
Modern deep generative models do not fit neatly into the four-method taxonomy above. Instead, they replace the explicit sampler with a learned neural network whose forward pass is itself the sampling procedure.
- Autoregressive models (GPT-style language models, PixelCNN, WaveNet) sample one element at a time, each conditional on its predecessors. The "sampling formula" is just a softmax at every step (see the sketch after this list); the cleverness is in the architecture that produces each conditional.
- Diffusion models start from pure Gaussian noise and iteratively denoise over many steps, each step a small conditional Gaussian whose mean is predicted by a U-Net. After a few hundred steps, what began as noise has become a sample from the data distribution.
- Normalising flows apply an invertible neural transformation to a base sample (typically Gaussian). Because the transformation is invertible and tractable, both sampling and density evaluation are exact.
- GANs train a generator network that maps Gaussian noise directly to data. There is no explicit density, only a sampler, and an adversarial loss that pushes its outputs to match the data distribution.
- Variational autoencoders sample a latent code $\mathbf{z}$ from an approximate posterior, then push it through a decoder. The reparameterisation trick from earlier in this section is the mathematical key.
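As a taste of the autoregressive case, here is a single softmax sampling step with a temperature knob; the logits are hypothetical stand-ins for a network's output given the prefix:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """One autoregressive step: softmax over logits, then a categorical draw."""
    z = logits / temperature
    probs = np.exp(z - z.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical logits for a 5-token vocabulary; a real model would
# recompute them after appending each sampled token to the prefix.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
tokens = [sample_next_token(logits) for _ in range(10)]
print(tokens)
```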
What unites these is that they sidestep the question "how do I sample from this complicated $\pi$?" by learning the sampler itself. The price is that the underlying probability density may be implicit, intractable, or only approximately defined. The reward is that, once trained, sampling is a single neural-network forward pass, and the resulting samples can be photorealistic images, fluent prose, or molecules with desired properties. Chapter 14 develops each family in full.
What you should take away
- Inverse-CDF sampling is the cleanest method when the quantile function is available; for the Exponential$(\lambda)$ it gives $X = -\ln(1-U)/\lambda$ from a single uniform draw.
- Box-Muller and Cholesky together let us sample from any multivariate Gaussian, and the same algebra, viewed from a different angle, becomes the reparameterisation trick that makes VAEs trainable.
- Rejection sampling is exact but its acceptance rate $1/M$ collapses in high dimensions, so it is rare in modern deep learning and common in textbook examples.
- Importance sampling trades direct sampling for reweighting, is unbiased in principle, but its variance hinges on how well the proposal $q$ matches the target $\pi$; PPO, REINFORCE, and off-policy RL are all importance sampling at heart.
- MCMC (Metropolis-Hastings, Gibbs, HMC) gives you samples from any distribution you can evaluate up to a constant, slowly, with correlated draws, and underwrites most modern Bayesian inference.