Also known as: VAE
A variational autoencoder (VAE), introduced by Diederik Kingma and Max Welling in 2013, is a deep generative model that learns a probabilistic mapping between data and a continuous latent space. The model assumes data x is generated by sampling a latent z from a prior p(z) (typically standard Gaussian) and then sampling x from a likelihood p_θ(x | z) parameterised by a decoder neural network.
Direct maximum-likelihood training is intractable because the marginal p_θ(x) = ∫ p_θ(x | z) p(z) dz cannot be computed. The VAE introduces an encoder network producing parameters of an approximate posterior q_φ(z | x), and trains both networks jointly to maximise the evidence lower bound (ELBO): ELBO = E_{z ~ q_φ(z|x)} [log p_θ(x | z)] − KL(q_φ(z | x) || p(z)). The first term is a reconstruction quality measure; the second regularises the encoder towards the prior.
The crucial innovation that made VAEs trainable is the reparameterisation trick: rather than sampling z directly from q_φ(z | x), sample ε from a fixed noise distribution and compute z = μ_φ(x) + σ_φ(x) ⊙ ε. The gradient now flows through μ and σ in the standard way; backpropagation through stochastic latent variables becomes straightforward.
VAEs trade off sample quality (typically blurrier than GAN outputs) against a tractable likelihood bound and stable training. They have been a foundational building block in deep generative modelling and underlie many subsequent developments, including VQ-VAE (van den Oord et al., 2017), the encoder side of latent diffusion models (Stable Diffusion's autoencoder is essentially a high-quality VAE), and many disentangled-representation learning methods.
Mathematics
A VAE assumes data $x$ is generated by sampling latent $z \sim p(z)$ (prior, typically $\mathcal{N}(0, I)$) and then $x \sim p_\theta(x | z)$ (decoder, parameterised by neural network $\theta$). Direct maximum-likelihood training requires the marginal
$$p_\theta(x) = \int p_\theta(x | z) p(z) \, dz$$
which is intractable for non-trivial models.
The VAE introduces an encoder $q_\phi(z | x)$ approximating the true posterior, and trains both networks jointly to maximise the evidence lower bound (ELBO):
$$\mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x | z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right).$$
The first term is a reconstruction quality measure; the second is a regulariser pulling the encoder posterior towards the prior. By Jensen's inequality, $\log p_\theta(x) \geq \mathcal{L}_{\mathrm{ELBO}}$, so maximising the ELBO maximises a lower bound on the data log-likelihood.
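The bound follows directly: writing the marginal likelihood as an expectation over the encoder and applying Jensen's inequality to the concave logarithm gives
$$\log p_\theta(x) = \log \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi).$$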
The reparameterisation trick makes the ELBO differentiable through the stochastic latent $z$. For a Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$, sample
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$
The randomness is now in $\epsilon$ (independent of parameters) and gradients flow through $\mu_\phi$ and $\sigma_\phi$ in the standard way.
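A minimal sketch of this sampling step in PyTorch (an illustration only, assuming the encoder outputs the mean and the log-variance $\log \sigma^2$, a common parameterisation for numerical stability):

```python
import torch

def reparameterise(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterisation trick)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)     # noise drawn independently of the parameters
    return mu + std * eps           # gradients flow through mu and std as usual
```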
For a diagonal-Gaussian encoder against a standard Gaussian prior, the KL term has a closed form:
$$D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\bigr) = \frac{1}{2} \sum_i \bigl(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\bigr).$$
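Putting the pieces together, a sketch of the resulting training loss (the negative ELBO) in PyTorch might look as follows. It assumes binary-valued data such as binarised MNIST, so the reconstruction term is a Bernoulli log-likelihood; `recon_x` is the decoder output for a single reparameterised sample of $z$, and the hypothetical `beta` argument recovers the β-weighting mentioned below (β = 1 is the standard VAE):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction term plus (beta-weighted) KL to N(0, I)."""
    # -E_q[log p(x|z)], one-sample Monte Carlo estimate with a Bernoulli decoder
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + beta * kl
```

Minimising this loss with a standard optimiser trains encoder and decoder jointly; generating new data afterwards only requires drawing $z \sim \mathcal{N}(0, I)$ and running the decoder.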
Modern variants include $\beta$-VAE (weight the KL term, $\beta \neq 1$), VQ-VAE (discrete latents via vector quantisation), and the encoder-decoder pair underlying Stable Diffusion's latent space.
Related terms: Autoencoder, Diederik Kingma, Max Welling, Generative Adversarial Network, Diffusion Model
Discussed in:
- Chapter 14: Generative Models, Autoencoders
- Chapter 14: Generative Models, Generative Models