Also known as: VI
Variational inference (VI) is the alternative to MCMC for approximate Bayesian inference: rather than sampling from the true posterior $p(z | x)$, fit an approximate posterior $q_\phi(z)$ from a tractable family by optimisation.
The objective is to minimise the KL divergence
$$\phi^* = \arg\min_\phi D_{\mathrm{KL}}(q_\phi(z) \| p(z | x))$$
Direct minimisation requires the (intractable) marginal $p(x)$. Algebraic manipulation gives an equivalent problem: maximise the evidence lower bound (ELBO):
$$\mathcal{L}(\phi) = \mathbb{E}_{z \sim q_\phi}[\log p(x, z)] - \mathbb{E}_{z \sim q_\phi}[\log q_\phi(z)]$$
$$= \log p(x) - D_{\mathrm{KL}}(q_\phi(z) \| p(z | x))$$
Maximising the ELBO is equivalent to minimising the KL divergence, and the ELBO is computable because $\log p(x, z)$ is the (tractable) joint log-density.
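As a concrete illustration, the ELBO can be estimated by plain Monte Carlo whenever $\log p(x, z)$ and $\log q_\phi(z)$ can be evaluated. A minimal sketch, assuming a toy conjugate model (standard-normal prior on $z$, unit-variance Gaussian likelihood, a single observation) chosen so the result can be checked against the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, 1),
# observed x = 1.5. The exact posterior is N(x/2, 1/2).
x = 1.5

def log_joint(z):
    log_prior = -0.5 * (z**2 + np.log(2 * np.pi))
    log_lik = -0.5 * ((x - z)**2 + np.log(2 * np.pi))
    return log_prior + log_lik

def log_q(z, mu, sigma):
    return -0.5 * (((z - mu) / sigma)**2 + np.log(2 * np.pi)) - np.log(sigma)

def elbo(mu, sigma, n_samples=100_000):
    z = rng.normal(mu, sigma, size=n_samples)            # z ~ q_phi
    return np.mean(log_joint(z) - log_q(z, mu, sigma))   # Monte Carlo ELBO

# At the exact posterior the KL term vanishes, so the ELBO equals log p(x).
print(elbo(0.75, np.sqrt(0.5)))   # about log N(1.5; 0, 2) = -1.83
print(elbo(0.0, 1.0))             # strictly smaller
```

Because the model is conjugate, the gap between the two printed values is exactly the KL divergence from the second $q$ to the true posterior.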
Mean-field approximation: assume the variational posterior factorises across coordinates:
$$q_\phi(z) = \prod_i q_{\phi_i}(z_i)$$
Iteratively updating each $q_{\phi_i}$ given the others (the coordinate-ascent variational inference, CAVI, algorithm) gives optimal updates of the form
$$q^*(z_i) \propto \exp\bigl(\mathbb{E}_{q_{-i}}[\log p(x, z)]\bigr)$$
For exponential-family models with conjugate priors, these expectations have closed forms.
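A sketch of CAVI for the textbook conjugate case, a univariate Gaussian with unknown mean $\mu$ and precision $\tau$, with $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, under the factorisation $q(\mu)\,q(\tau)$; the synthetic data and hyperparameter values below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=200)     # synthetic data (placeholder)
N, xbar = len(x), x.mean()

# Prior hyperparameters (placeholder values):
# mu | tau ~ N(mu0, (lam0 * tau)^-1),  tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0                            # initial guess for E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_N, 1 / lam_N) given the current q(tau)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau

    # Update q(tau) = Gamma(a_N, b_N) given the current q(mu),
    # using E[mu] = mu_N and E[mu^2] = mu_N^2 + 1/lam_N
    E_mu2 = mu_N**2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * mu_N * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * mu_N + mu0**2))
    E_tau = a_N / b_N

print("E_q[mu] =", mu_N, " E_q[tau] =", E_tau)   # near the sample mean / precision
```

Each sweep updates $q(\mu)$ holding $q(\tau)$ fixed and vice versa; every update increases (or leaves unchanged) the ELBO, so the iteration converges to a local optimum.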
Stochastic variational inference (SVI; Hoffman, Blei, Wang & Paisley, 2013) uses noisy gradients of the ELBO computed on mini-batches of data, scaling VI to massive datasets.
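Full SVI pairs this with natural-gradient updates for conjugate-exponential-family models; the sketch below shows only the core ingredient, an unbiased mini-batch estimate of the data-dependent term of the ELBO obtained by rescaling a mini-batch sum by $N/|B|$ (the function name and per-point term are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_data_term(x_all, batch_size, per_point_term):
    """Unbiased mini-batch estimate of sum_n E_q[log p(x_n | z)].

    per_point_term(x_n) is a placeholder for the per-observation expectation
    under the current variational distribution.
    """
    N = len(x_all)
    batch = rng.choice(x_all, size=batch_size, replace=False)
    # Scaling by N / batch_size makes the expectation over random mini-batches
    # equal the full-data sum, so gradients of this estimate are unbiased.
    return (N / batch_size) * np.sum([per_point_term(xn) for xn in batch])

# Example usage with a dummy per-point term:
x_all = rng.normal(size=10_000)
print(noisy_data_term(x_all, batch_size=100, per_point_term=lambda xn: -0.5 * xn**2))
```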
Black-box VI (BBVI; Ranganath et al., 2014) uses the score-function gradient
$$\nabla_\phi \mathcal{L} = \mathbb{E}_{q_\phi}[(\log p(x, z) - \log q_\phi(z)) \nabla_\phi \log q_\phi(z)]$$
with control variates for variance reduction. BBVI works for any probabilistic model where one can sample from $q$ and compute log-densities, with no model-specific derivations required.
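A minimal BBVI sketch for the same toy model as in the ELBO example above, using a Gaussian $q$ with parameters $\mu$ and $\log\sigma$ and the plain score-function estimator (no control variate, so the gradient estimate is noisy):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (assumed, as above): z ~ N(0, 1), x | z ~ N(z, 1), observed x = 1.5.
x = 1.5

def log_joint(z):
    return -0.5 * (z**2 + (x - z)**2) - np.log(2 * np.pi)

def log_q(z, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * (((z - mu) / sigma)**2 + np.log(2 * np.pi)) - log_sigma

def grad_log_q(z, mu, log_sigma):
    # Gradients of log q(z) with respect to mu and log_sigma (the "score")
    sigma = np.exp(log_sigma)
    d_mu = (z - mu) / sigma**2
    d_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    return np.array([d_mu, d_log_sigma])

mu, log_sigma = 0.0, 0.0
for _ in range(2000):
    z = rng.normal(mu, np.exp(log_sigma), size=64)       # z ~ q_phi
    weights = log_joint(z) - log_q(z, mu, log_sigma)     # log p(x, z) - log q(z)
    grad = (weights * grad_log_q(z, mu, log_sigma)).mean(axis=1)
    mu += 0.01 * grad[0]
    log_sigma += 0.01 * grad[1]

print(mu, np.exp(log_sigma))   # should end up near the exact posterior (0.75, ~0.71)
```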
Amortised VI parameterises $q_\phi(z | x)$ as a neural-network function of the observed data (an encoder), so inference for a new $x$ is a single forward pass rather than a per-example optimisation. The VAE is the canonical example: the encoder produces the $\mu, \sigma$ of a Gaussian variational posterior, and the reparameterisation trick enables gradient flow through the sampling step.
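A sketch of the amortised, reparameterised sampling step; the linear "encoder" and its weights below are placeholders standing in for a trained neural network:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear "encoder" mapping an observation x to variational parameters
# (mu(x), log_sigma(x)); the weights are placeholders, not trained values.
W_mu, b_mu = 0.4, 0.1
W_ls, b_ls = -0.2, 0.0

def encode(x):
    return W_mu * x + b_mu, W_ls * x + b_ls

def sample_z(x, n_samples):
    mu, log_sigma = encode(x)
    eps = rng.normal(size=n_samples)          # noise independent of phi
    return mu + np.exp(log_sigma) * eps       # reparameterised sample

# Because z is a deterministic, differentiable function of (mu, log_sigma)
# given eps, a Monte Carlo ELBO built from these samples can be
# backpropagated through the encoder weights by any autodiff framework.
print(sample_z(1.5, n_samples=5))
```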
VI is generally biased (the chosen variational family typically does not contain the true posterior) but fast, much faster than MCMC for large models. The choice between MCMC and VI is a methodological one: MCMC for accuracy when computation permits, VI for speed and scale.
Related terms: Variational Autoencoder, MCMC, KL Divergence
Discussed in:
- Chapter 4: Probability