Also known as: Markov chain Monte Carlo
Markov chain Monte Carlo (MCMC) methods sample from a target distribution $\pi(x)$, known only up to a normalising constant, by constructing a Markov chain whose stationary distribution is $\pi$. After the chain has run long enough to reach equilibrium ("mixing"), the states it visits are approximately samples from $\pi$.
Metropolis–Hastings algorithm. Choose a proposal distribution $q(x' | x)$ from which it is easy to sample. Iterate:
- Sample candidate $x' \sim q(x' | x_t)$.
- Compute acceptance probability
$$\alpha = \min\!\left(1, \frac{\pi(x') q(x_t | x')}{\pi(x_t) q(x' | x_t)}\right)$$
- With probability $\alpha$, accept ($x_{t+1} = x'$); else reject ($x_{t+1} = x_t$).
The chain provably converges to $\pi$ as its stationary distribution under mild conditions. A symmetric proposal $q(x' | x) = q(x | x')$ simplifies the acceptance to the Metropolis ratio $\pi(x') / \pi(x_t)$.
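A minimal sketch of the algorithm in Python, assuming a symmetric Gaussian random-walk proposal and an unnormalised standard-normal target; the function names and the proposal scale are illustrative choices, not part of the definition above.

```python
import numpy as np

def log_target(x):
    # Unnormalised log density of the target pi(x); here a standard normal (assumed for illustration).
    return -0.5 * x**2

def metropolis_hastings(n_samples, x0=0.0, proposal_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Symmetric Gaussian random-walk proposal, so q cancels in the acceptance ratio.
        x_prop = x + proposal_scale * rng.normal()
        # Metropolis acceptance probability min(1, pi(x') / pi(x)), computed in log space.
        log_alpha = log_target(x_prop) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop          # accept
        samples[t] = x          # on rejection the chain repeats the old state
    return samples

samples = metropolis_hastings(10_000)
print(samples[2_000:].mean(), samples[2_000:].std())  # roughly 0 and 1 after burn-in
```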
Gibbs sampling is the special case where each step updates one variable at a time, sampling from its conditional distribution given the others:
$$x_i^{(t+1)} \sim P(x_i \mid x_{-i}^{(t)})$$
Gibbs sampling always accepts its proposals (acceptance probability 1); it is applicable whenever the conditionals are tractable to sample from. It is the standard MCMC method for Bayesian networks, latent Dirichlet allocation (collapsed Gibbs sampling), and many other graphical-model inference problems.
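As a hedged illustration, the sketch below runs Gibbs sampling on a bivariate normal with correlation $\rho$, a toy target chosen because both conditionals are Gaussian and can be sampled exactly.

```python
import numpy as np

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]) (illustrative target)."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for t in range(n_samples):
        # Each update draws from the exact conditional, so every move is accepted.
        x1 = rng.normal(rho * x2, cond_std)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_std)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(10_000)
print(np.corrcoef(samples[1_000:].T))  # off-diagonal entries close to 0.8 after burn-in
```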
Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal 2010) introduces auxiliary momentum variables and uses Hamiltonian dynamics to propose distant moves with high acceptance rates. The No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) automatically tunes HMC's step size and trajectory length and is the engine of probabilistic programming languages like Stan and PyMC.
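A bare-bones sketch of one HMC transition with a leapfrog integrator and an identity mass matrix; the fixed step size and trajectory length are hand-picked assumptions here, which is exactly what NUTS automates.

```python
import numpy as np

def hmc_step(x, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20, rng=None):
    """One Hamiltonian Monte Carlo transition with unit (identity) mass matrix."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal(size=x.shape)                     # resample auxiliary momentum
    x_new, p_new = x.copy(), p.copy()

    # Leapfrog integration of Hamiltonian dynamics.
    p_new += 0.5 * step_size * grad_log_prob(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new
        p_new += step_size * grad_log_prob(x_new)
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(x_new)

    # Metropolis correction for discretisation error in the Hamiltonian.
    current_h = -log_prob(x) + 0.5 * p @ p
    proposed_h = -log_prob(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < current_h - proposed_h:
        return x_new                                  # accept
    return x                                          # reject

# Example: sample a 2-D standard normal (illustrative target).
log_prob = lambda x: -0.5 * x @ x
grad_log_prob = lambda x: -x
rng = np.random.default_rng(0)
x = np.zeros(2)
draws = []
for _ in range(2_000):
    x = hmc_step(x, log_prob, grad_log_prob, rng=rng)
    draws.append(x)
draws = np.array(draws)
print(draws[500:].mean(axis=0), draws[500:].std(axis=0))
```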
Practical concerns:
- Burn-in: discard the initial samples drawn while the chain is still converging to its stationary distribution.
- Thinning: keep every $k$-th sample to reduce autocorrelation.
- Convergence diagnostics: $\hat R$ (Gelman–Rubin), effective sample size, trace plots (see the sketch after this list).
- Multiple chains: run 4–8 chains from different starting points to detect non-convergence.
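A sketch of the basic (non-split) Gelman–Rubin $\hat R$; modern practice uses split chains and rank normalisation, so treat this as the textbook formula, and the synthetic chains as an illustration only.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Basic Gelman–Rubin R-hat for `chains` of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)               # values near 1.0 suggest convergence

# Synthetic example: well-mixed chains give R-hat near 1;
# chains stuck around different means give R-hat well above 1.
rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 2_000))
stuck = rng.normal(size=(4, 2_000)) + np.array([[0.0], [0.0], [3.0], [3.0]])
print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```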
Modern uses in AI:
- Bayesian deep learning: sampling from the posterior over network weights (HMC, SGLD).
- Energy-based models: sampling via Langevin dynamics, $x_{t+1} = x_t + \epsilon \nabla_x \log p(x_t) + \sqrt{2\epsilon} \, z$ with $z \sim \mathcal{N}(0, I)$ (sketched after this list).
- Diffusion models use a closely related stochastic process at sampling time.
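A sketch of the (unadjusted) Langevin update above; the quadratic energy standing in for an energy-based model, the step size, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def langevin_sample(grad_log_p, x0, n_steps=1_000, eps=0.01, seed=0):
    """Unadjusted Langevin dynamics: x <- x + eps * grad log p(x) + sqrt(2 * eps) * z."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal(size=x.shape)
    return x

# Stand-in "energy-based model": energy E(x) = 0.5 * ||x||^2, so grad log p(x) = -x.
grad_log_p = lambda x: -x
samples = np.array([langevin_sample(grad_log_p, np.zeros(2), seed=s) for s in range(500)])
print(samples.mean(axis=0), samples.std(axis=0))   # roughly 0 and 1
```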
Stochastic gradient Langevin dynamics (SGLD) combines SGD with Langevin noise to sample from an approximate posterior, scaling MCMC to neural-network-sized parameter spaces.
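A sketch of SGLD in the style of Welling and Teh (2011) on a toy Gaussian-mean problem: the minibatch gradient is rescaled by $N/n$, the step size decays over time, and the injected noise has variance matching the step size. The model, data, and step-size schedule are assumptions for illustration, not a definitive recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 10_000, 100
data = rng.normal(loc=2.0, scale=1.0, size=N)     # toy data with unknown mean theta

theta = 0.0
thetas = []
for t in range(5_000):
    eps_t = 1e-4 / (1 + t) ** 0.55                 # decaying step size (polynomial schedule)
    batch = rng.choice(data, size=batch_size, replace=False)
    # Gradient of the log posterior: N(0, 1) prior term plus the minibatch
    # likelihood term rescaled by N / batch_size.
    grad = -theta + (N / batch_size) * np.sum(batch - theta)
    # SGLD update: half-step-size gradient move plus Gaussian noise of variance eps_t.
    theta += 0.5 * eps_t * grad + rng.normal(scale=np.sqrt(eps_t))
    thetas.append(theta)

print(np.mean(thetas[1_000:]))   # approaches the posterior mean, roughly data.mean()
```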
Related terms: Gibbs Sampling, Variational Inference
Discussed in:
- Chapter 4: Probability