Also known as: VAE
A variational autoencoder (VAE), introduced by Diederik Kingma and Max Welling in 2013, is a deep generative model that learns a probabilistic mapping between data and a continuous latent space. The model assumes data x is generated by sampling a latent z from a prior p(z) (typically standard Gaussian) and then sampling x from a likelihood p_θ(x | z) parameterised by a decoder neural network.
Direct maximum-likelihood training is intractable because the marginal p_θ(x) = ∫ p_θ(x | z) p(z) dz cannot be computed. The VAE introduces an encoder network producing parameters of an approximate posterior q_φ(z | x), and trains both networks jointly to maximise the evidence lower bound (ELBO): ELBO = E_{z ~ q_φ(z|x)} [log p_θ(x | z)] − KL(q_φ(z | x) || p(z)). The first term is a reconstruction quality measure; the second regularises the encoder towards the prior.
The crucial innovation that made VAEs trainable is the reparameterisation trick: rather than sampling z directly from q_φ(z | x), sample ε from a fixed noise distribution and compute z = μ_φ(x) + σ_φ(x) ⊙ ε. The gradient now flows through μ and σ in the standard way; backpropagation through stochastic latent variables becomes straightforward.
VAEs trade off sample quality (typically blurrier than GAN outputs) against a tractable likelihood bound and stable training. They have been a foundational building block in deep generative modelling and underlie many subsequent developments, including VQ-VAE (van den Oord et al., 2017), the encoder side of latent diffusion models (Stable Diffusion's autoencoder is essentially a high-quality VAE), and many disentangled-representation learning methods.
Mathematics
A VAE assumes data $x$ is generated by sampling latent $z \sim p(z)$ (prior, typically $\mathcal{N}(0, I)$) and then $x \sim p_\theta(x | z)$ (decoder, parameterised by neural network $\theta$). Direct maximum-likelihood training requires the marginal
$$p_\theta(x) = \int p_\theta(x | z) p(z) \, dz$$
which is intractable for non-trivial models.
The VAE introduces an encoder $q_\phi(z | x)$ approximating the true posterior, and trains both networks jointly to maximise the evidence lower bound (ELBO):
$$\mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x | z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right).$$
The first term is a reconstruction quality measure; the second is a regulariser pulling the encoder posterior towards the prior. By Jensen's inequality, $\log p_\theta(x) \geq \mathcal{L}_{\mathrm{ELBO}}$, so maximising the ELBO maximises a lower bound on the data log-likelihood.
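The bound follows directly: writing the marginal likelihood as an expectation over the encoder and applying Jensen's inequality to the concave logarithm gives
$$\log p_\theta(x) = \log \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi).$$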
The reparameterisation trick makes the ELBO differentiable through the stochastic latent $z$. For a Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$, sample
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$
The randomness is now in $\epsilon$ (independent of parameters) and gradients flow through $\mu_\phi$ and $\sigma_\phi$ in the standard way.
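A minimal sketch of this sampling step in PyTorch (an illustration only, assuming the encoder outputs the mean and the log-variance $\log \sigma^2$, a common parameterisation for numerical stability):

```python
import torch

def reparameterise(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterisation trick)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)     # noise drawn independently of the parameters
    return mu + std * eps           # gradients flow through mu and std as usual
```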
For a diagonal-Gaussian encoder against a standard Gaussian prior, the KL term has a closed form:
$$D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\bigr) = \frac{1}{2} \sum_i \bigl(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\bigr).$$
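Putting the pieces together, a sketch of the resulting training loss (the negative ELBO) in PyTorch might look as follows. It assumes binary-valued data such as binarised MNIST, so the reconstruction term is a Bernoulli log-likelihood; `recon_x` is the decoder output for a single reparameterised sample of $z$, and the hypothetical `beta` argument recovers the β-weighting mentioned below (β = 1 is the standard VAE):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction term plus (beta-weighted) KL to N(0, I)."""
    # -E_q[log p(x|z)], one-sample Monte Carlo estimate with a Bernoulli decoder
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + beta * kl
```

Minimising this loss with a standard optimiser trains encoder and decoder jointly; generating new data afterwards only requires drawing $z \sim \mathcal{N}(0, I)$ and running the decoder.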
Modern variants include $\beta$-VAE (weight the KL term, $\beta \neq 1$), VQ-VAE (discrete latents via vector quantisation), and the encoder-decoder pair underlying Stable Diffusion's latent space.
Related terms: Autoencoder, Diederik Kingma, Max Welling, Generative Adversarial Network, Diffusion Model
Discussed in:
- Chapter 14: Generative Models, Autoencoders
- Chapter 14: Generative Models, Generative Models