14.1 What is generative modelling?

Let $\mathcal{X}$ denote the data space: the space of natural images, of English sentences, of protein structures, or of audio waveforms. We have access to a finite training set $\{x_1, x_2, \ldots, x_N\} \subset \mathcal{X}$, drawn i.i.d. from some true distribution $p_{\text{data}}(x)$ that we cannot inspect directly. Our task is to construct a model distribution $p_\theta(x)$, parametrised by weights $\theta$, that we can evaluate, sample from, or both, and that is in some sense close to $p_{\text{data}}$.
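"Close" can be made precise in several ways; one standard choice (not the only one, and different model families implicitly optimise different notions) is the Kullback-Leibler divergence, which connects directly to maximum likelihood:

$$\mathrm{KL}\big(p_{\text{data}} \,\|\, p_\theta\big) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big].$$

The first term does not depend on $\theta$, so minimising this divergence is equivalent to maximising $\mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)]$, whose Monte Carlo estimate on the training set, $\frac{1}{N}\sum_i \log p_\theta(x_i)$, is exactly the maximum-likelihood objective.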

There are two core operations one might wish to perform on such a model:

  1. Density evaluation: given a query point $x$, return $p_\theta(x)$ (or its log).
  2. Sampling: draw fresh $x \sim p_\theta(\cdot)$, ideally efficiently.
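As a concrete (if trivial) illustration of these two operations, here is a minimal sketch using PyTorch's `torch.distributions`; the library choice and the univariate Gaussian standing in for $p_\theta$ are illustrative, not anything prescribed by the text:

```python
import torch
from torch.distributions import Normal

# Toy stand-in for p_theta: a univariate Gaussian whose two
# parameters (mean, std) play the role of the weights theta.
p_theta = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))

# 1. Density evaluation: return log p_theta(x) at a query point x.
x = torch.tensor(0.5)
log_px = p_theta.log_prob(x)     # tensor(-1.0439...)

# 2. Sampling: draw fresh x ~ p_theta.
samples = p_theta.sample((10,))  # ten i.i.d. draws
```

For a Gaussian both operations are trivially cheap; the families below differ precisely in which of the two stays cheap once $p_\theta$ is a deep network.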

Different model families make different trade-offs between these operations.

Explicit-density models

An explicit-density model writes down $p_\theta(x)$ as an explicit mathematical expression, although evaluating that expression may or may not be tractable. Within explicit-density models we distinguish two sub-families:

  • Tractable density: $p_\theta(x)$ can be evaluated exactly in polynomial time. Autoregressive models (PixelCNN, GPT) and normalising flows belong here. Maximum-likelihood training $\theta^* = \arg\max_\theta \sum_i \log p_\theta(x_i)$ is then directly possible (see the sketch after this list).
  • Approximate / variational density: $p_\theta(x)$ has a closed form, but evaluating it requires marginalising over latent variables, an integral that is intractable in general. The variational autoencoder is the canonical example. Training optimises a lower bound on the log-likelihood (the evidence lower bound, ELBO) rather than the log-likelihood itself.
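To make the tractable case concrete, here is a minimal maximum-likelihood training sketch, assuming PyTorch. It uses the chain-rule factorisation $p_\theta(x) = \prod_t p_\theta(x_t \mid x_{<t})$ that autoregressive models rely on; the GRU architecture, the sizes, and the random batch are all illustrative stand-ins, not the method of any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN = 32, 16  # illustrative sizes

# A tiny autoregressive model over token sequences: a GRU reads the
# prefix x_{<t} and a linear head predicts a categorical distribution
# over x_t. Chain rule: log p(x) = sum_t log p(x_t | x_{<t}).
class TinyAR(nn.Module):
    def __init__(self, vocab=VOCAB, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def log_prob(self, x):
        # x: (batch, seq) integer tokens.
        inp, tgt = x[:, :-1], x[:, 1:]            # predict x_t from x_{<t}
        h, _ = self.rnn(self.embed(inp))
        logits = self.head(h)                     # (batch, seq-1, vocab)
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
        return tok_logp.sum(dim=-1)               # exact log p(x_{2:} | x_1)

model = TinyAR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One gradient step of maximum likelihood on a fake batch:
# theta* = argmax sum_i log p_theta(x_i), i.e. minimise the NLL.
x_batch = torch.randint(0, VOCAB, (8, SEQ_LEN))
opt.zero_grad()
loss = -model.log_prob(x_batch).mean()
loss.backward()
opt.step()
```

Because the density is exact, the training loss is itself the quantity we care about; nothing here is a bound or an approximation beyond the minibatch estimate.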

Implicit-density models

An implicit-density model never writes down $p_\theta(x)$ at all. It defines a sampling procedure, typically pushing a noise vector through a neural network, and trains so that samples become indistinguishable from data by some criterion. GANs are implicit-density models; energy-based models occupy a middle ground: they specify an unnormalised density $\tilde p_\theta(x)$, but the partition function $Z_\theta$ that normalises it is intractable.
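A minimal sketch of that sampling procedure, again assuming PyTorch and an entirely illustrative architecture; the point is that the generator below defines samples without ever assigning them a density:

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 64, 784  # illustrative sizes

# An implicit model is just a sampler: push noise z through a network.
# There is no log_prob to evaluate; p_theta(x) exists only implicitly
# as the distribution of generator(z) with z ~ N(0, I).
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, DATA_DIM),
    nn.Tanh(),
)

z = torch.randn(16, LATENT_DIM)  # 16 noise vectors
x_fake = generator(z)            # 16 fresh samples; no density available
```

Training (for a GAN, the adversarial game against a discriminator) is what makes these samples resemble data; the sketch covers only the sampling side.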

Why this distinction matters

The four operations one might want (training, sampling, density evaluation, and latent inference) have very different costs in different families. An autoregressive image model gives you exact log-likelihood and stable training but slow sampling (one pixel at a time). A GAN gives you fast sampling but no likelihood and famously fragile training. A diffusion model gives you stable training and excellent sample quality but slow sampling (hundreds of steps). A normalising flow gives you exact likelihood and fast sampling, but pays a cost in expressiveness.

There is no universally best generative model. There is only the right tool for the question.
