Also known as: Markov chain Monte Carlo
Markov chain Monte Carlo (MCMC) methods sample from a target distribution $\pi(x)$, known only up to a normalising constant, by constructing a Markov chain whose stationary distribution is $\pi$. After the chain has run long enough to reach equilibrium ("mixing"), the states it visits are approximately samples from $\pi$.
Metropolis–Hastings algorithm. Choose a proposal distribution $q(x' | x)$ from which it is easy to sample. Iterate:
- Sample candidate $x' \sim q(x' | x_t)$.
- Compute acceptance probability
$$\alpha = \min\!\left(1, \frac{\pi(x') q(x_t | x')}{\pi(x_t) q(x' | x_t)}\right)$$
- With probability $\alpha$, accept ($x_{t+1} = x'$); else reject ($x_{t+1} = x_t$).
The chain provably converges to $\pi$ as its stationary distribution under mild conditions. A symmetric proposal $q(x' | x) = q(x | x')$ simplifies the acceptance to the Metropolis ratio $\pi(x') / \pi(x_t)$.
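A minimal sketch of the algorithm in Python, assuming a symmetric Gaussian random-walk proposal and an unnormalised standard-normal target; the function names and the proposal scale are illustrative choices, not part of the definition above.

```python
import numpy as np

def log_target(x):
    # Unnormalised log density of the target pi(x); here a standard normal (assumed for illustration).
    return -0.5 * x**2

def metropolis_hastings(n_samples, x0=0.0, proposal_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        # Symmetric Gaussian random-walk proposal, so q cancels in the acceptance ratio.
        x_prop = x + proposal_scale * rng.normal()
        # Metropolis acceptance probability min(1, pi(x') / pi(x)), computed in log space.
        log_alpha = log_target(x_prop) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop          # accept
        samples[t] = x          # on rejection the chain repeats the old state
    return samples

samples = metropolis_hastings(10_000)
print(samples[2_000:].mean(), samples[2_000:].std())  # roughly 0 and 1 after burn-in
```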
Gibbs sampling is the special case where each step updates one variable at a time, sampling from its conditional distribution given the others:
$$x_i^{(t+1)} \sim P(x_i \mid x_{-i}^{(t)})$$
Gibbs sampling always accepts its proposals (acceptance probability 1); it is applicable whenever the conditionals are tractable to sample from. It is the standard MCMC method for Bayesian networks, latent Dirichlet allocation (collapsed Gibbs sampling), and many other graphical-model inference problems.
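As a hedged illustration, the sketch below runs Gibbs sampling on a bivariate normal with correlation $\rho$, a toy target chosen because both conditionals are Gaussian and can be sampled exactly.

```python
import numpy as np

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]) (illustrative target)."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for t in range(n_samples):
        # Each update draws from the exact conditional, so every move is accepted.
        x1 = rng.normal(rho * x2, cond_std)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_std)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(10_000)
print(np.corrcoef(samples[1_000:].T))  # off-diagonal entries close to 0.8 after burn-in
```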
Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal 2010) introduces auxiliary momentum variables and uses Hamiltonian dynamics to propose distant moves with high acceptance rates. The No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014) automatically tunes HMC's step size and trajectory length and is the engine of probabilistic programming languages like Stan and PyMC.
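A bare-bones sketch of one HMC transition with a leapfrog integrator and an identity mass matrix; the fixed step size and trajectory length are hand-picked assumptions here, which is exactly what NUTS automates.

```python
import numpy as np

def hmc_step(x, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20, rng=None):
    """One Hamiltonian Monte Carlo transition with unit (identity) mass matrix."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal(size=x.shape)                     # resample auxiliary momentum
    x_new, p_new = x.copy(), p.copy()

    # Leapfrog integration of Hamiltonian dynamics.
    p_new += 0.5 * step_size * grad_log_prob(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new
        p_new += step_size * grad_log_prob(x_new)
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(x_new)

    # Metropolis correction for discretisation error in the Hamiltonian.
    current_h = -log_prob(x) + 0.5 * p @ p
    proposed_h = -log_prob(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < current_h - proposed_h:
        return x_new                                  # accept
    return x                                          # reject

# Example: sample a 2-D standard normal (illustrative target).
log_prob = lambda x: -0.5 * x @ x
grad_log_prob = lambda x: -x
rng = np.random.default_rng(0)
x = np.zeros(2)
draws = []
for _ in range(2_000):
    x = hmc_step(x, log_prob, grad_log_prob, rng=rng)
    draws.append(x)
draws = np.array(draws)
print(draws[500:].mean(axis=0), draws[500:].std(axis=0))
```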
Practical concerns:
- Burn-in: discard the initial samples drawn while the chain is still converging to its stationary distribution.
- Thinning: keep every $k$-th sample to reduce autocorrelation.
- Convergence diagnostics: $\hat R$ (Gelman–Rubin), effective sample size, trace plots (see the sketch after this list).
- Multiple chains: run 4–8 chains from different starting points to detect non-convergence.
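A sketch of the basic (non-split) Gelman–Rubin $\hat R$; modern practice uses split chains and rank normalisation, so treat this as the textbook formula, and the synthetic chains as an illustration only.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Basic Gelman–Rubin R-hat for `chains` of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)               # values near 1.0 suggest convergence

# Synthetic example: well-mixed chains give R-hat near 1;
# chains stuck around different means give R-hat well above 1.
rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 2_000))
stuck = rng.normal(size=(4, 2_000)) + np.array([[0.0], [0.0], [3.0], [3.0]])
print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```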
Modern uses in AI:
- Bayesian deep learning: sampling from the posterior over network weights (HMC, SGLD).
- Energy-based models: sampling via Langevin dynamics, $x_{t+1} = x_t + \epsilon \nabla_x \log p(x_t) + \sqrt{2\epsilon} \, z$ with $z \sim \mathcal{N}(0, I)$ (sketched after this list).
- Diffusion models use a closely related stochastic process at sampling time.
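A sketch of the (unadjusted) Langevin update above; the quadratic energy standing in for an energy-based model, the step size, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def langevin_sample(grad_log_p, x0, n_steps=1_000, eps=0.01, seed=0):
    """Unadjusted Langevin dynamics: x <- x + eps * grad log p(x) + sqrt(2 * eps) * z."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal(size=x.shape)
    return x

# Stand-in "energy-based model": energy E(x) = 0.5 * ||x||^2, so grad log p(x) = -x.
grad_log_p = lambda x: -x
samples = np.array([langevin_sample(grad_log_p, np.zeros(2), seed=s) for s in range(500)])
print(samples.mean(axis=0), samples.std(axis=0))   # roughly 0 and 1
```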
Stochastic gradient Langevin dynamics (SGLD) combines SGD with Langevin noise to sample from an approximate posterior, scaling MCMC to neural-network-sized parameter spaces.
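A sketch of SGLD in the style of Welling and Teh (2011) on a toy Gaussian-mean problem: the minibatch gradient is rescaled by $N/n$, the step size decays over time, and the injected noise has variance matching the step size. The model, data, and step-size schedule are assumptions for illustration, not a definitive recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 10_000, 100
data = rng.normal(loc=2.0, scale=1.0, size=N)     # toy data with unknown mean theta

theta = 0.0
thetas = []
for t in range(5_000):
    eps_t = 1e-4 / (1 + t) ** 0.55                 # decaying step size (polynomial schedule)
    batch = rng.choice(data, size=batch_size, replace=False)
    # Gradient of the log posterior: N(0, 1) prior term plus the minibatch
    # likelihood term rescaled by N / batch_size.
    grad = -theta + (N / batch_size) * np.sum(batch - theta)
    # SGLD update: half-step-size gradient move plus Gaussian noise of variance eps_t.
    theta += 0.5 * eps_t * grad + rng.normal(scale=np.sqrt(eps_t))
    thetas.append(theta)

print(np.mean(thetas[1_000:]))   # approaches the posterior mean, roughly data.mean()
```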
Related terms: Gibbs Sampling, Variational Inference
Discussed in:
- Chapter 4: Probability