Chapter Fourteen

Generative Models

Learning Objectives
  1. Explain the encoder–decoder structure of autoencoders and the variational extension (VAE) with its ELBO loss
  2. Describe the generator–discriminator adversarial game of GANs and common training pathologies
  3. Outline the forward (noising) and reverse (denoising) processes of diffusion models
  4. Introduce normalising flows and their use of invertible transformations for exact likelihood
  5. Compare generative approaches for text, images, and audio on sample quality, diversity, and likelihood

A classifier looks at a picture and says "cat." A generative model does the reverse: it creates a picture of a cat that never existed. It learns the pattern behind the training data and then draws new examples from that pattern.

This power to create has made generative modelling one of the most visible parts of AI. Text-to-image tools make lifelike art from a sentence. Language models write smooth prose. Drug pipelines propose new molecules. The uses span science and art alike.

This chapter covers the main families of generative models: autoencoders and VAEs, GANs, diffusion models, normalising flows, and language generation.

14.1   Autoencoders

An autoencoder is a neural network trained to copy its input through a bottleneck. It has two parts:

  • Encoder fθ: maps input x to a lower-dimensional latent code z = fθ(x).
  • Decoder gφ: maps z back to a reconstruction x̂ = gφ(z).

Training minimises the reconstruction loss — typically mean squared error ‖x − x̂‖². If the latent space is smaller than the input, the autoencoder must learn to compress the data into its most important features. This is nonlinear dimensionality reduction, generalising PCA.

The Problem with Plain Autoencoders

Standard autoencoders learn useful representations but are poor generators. Their latent space is unstructured. Interpolating between two codes may produce garbage. A random sample from latent space may decode to nothing useful.

Variational Autoencoders (VAEs)

The VAE (Kingma, 2013) fixes this by imposing structure on the latent space. Instead of encoding to a single point, the encoder outputs the parameters of a Gaussian: mean μ and variance σ². The latent code is sampled: z ~ N(μ, σ²I).

The loss function is the evidence lower bound (ELBO):

ℒ = 𝔼[log p(x|z)] − KL(q(z|x) || p(z))

The first term is reconstruction quality. The second term pulls the latent distribution toward a standard Gaussian prior, keeping the space smooth and regular.
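For a diagonal Gaussian posterior q(z|x) = N(μ, σ²I) and a standard Gaussian prior, the KL term has a closed form: 0.5 Σ(σ² + μ² − 1 − log σ²). A minimal numpy check (function and variable names are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2 I) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# When the posterior already equals the prior, the penalty vanishes.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0
```

Any deviation of the posterior from N(0, I) makes the term strictly positive, which is what pulls the latent space toward the prior.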

Why "Evidence Lower Bound"?

You want to maximise the marginal log-likelihood log p(x), but the integral over z is intractable. By introducing an approximate posterior q(z|x) and applying Jensen's inequality, you get a lower bound on log p(x). That is the ELBO. Maximising it simultaneously improves the likelihood and tightens the approximation.

The Reparameterisation Trick

Sampling z ~ N(μ, σ²I) blocks gradient flow. The trick: rewrite it as z = μ + σε, where ε ~ N(0, I). Now the randomness comes from a fixed noise source, and the dependence on μ and σ is differentiable. The whole model trains end-to-end with backpropagation.
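The trick is a one-liner; averaging many reparameterised samples recovers the intended statistics (the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 0.1])

eps = rng.standard_normal(2)   # fixed noise source, carries no parameters
z = mu + sigma * eps           # differentiable in mu and sigma

# Many samples recover the intended mean and spread.
eps_many = rng.standard_normal((100_000, 2))
z_many = mu + sigma * eps_many
print(z_many.mean(axis=0), z_many.std(axis=0))
```

Because μ and σ enter only through the deterministic expression μ + σε, gradients flow through them while ε is treated as a constant input.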

Limitations and Extensions

VAEs tend to produce blurry images. The pixel-wise reconstruction loss averages over uncertainty, smearing detail. The KL term can also limit expressiveness — posterior collapse occurs when the encoder ignores the input and maps everything to the prior.

Extensions:

  • β-VAE: adjusts the KL weight to trade reconstruction against disentanglement.
  • VQ-VAE (Oord, 2017): replaces the continuous latent space with a discrete codebook. Produces sharper outputs.
  • Hierarchical VAEs: stack multiple latent levels to capture structure at different scales.

Why Autoencoders Still Matter

Despite lower sample quality than GANs or diffusion models, autoencoders remain key. VQ-VAE is the image tokeniser in Stable Diffusion (Rombach, 2022) — it compresses images into a latent space where the diffusion model runs, cutting compute by a large factor. The idea of "learned compression" is one of the most useful concepts in the field.

14.2   Generative Adversarial Networks

GANs (Goodfellow, 2014) recast generation as a two-player game:

  • Generator G: takes random noise z and produces a fake sample G(z).
  • Discriminator D: takes either a real or fake sample and tries to tell them apart.

The generator tries to fool the discriminator. The discriminator tries not to be fooled. At equilibrium, the generator produces samples indistinguishable from real data.

The objective: min_G max_D 𝔼_x[log D(x)] + 𝔼_z[log(1 − D(G(z)))].
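In practice the generator usually maximises log D(G(z)) (the non-saturating variant) rather than minimising log(1 − D(G(z))), which gives stronger gradients early in training. A numerical sketch with made-up discriminator scores:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))]; we negate to minimise."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def g_loss_nonsaturating(d_fake):
    """Generator maximises log D(G(z)) instead of minimising log(1 - D(G(z)))."""
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])   # discriminator scores on real samples
d_fake = np.array([0.2, 0.3])   # scores on generated samples
print(d_loss(d_real, d_fake), g_loss_nonsaturating(d_fake))
```

A near-perfect discriminator drives its own loss toward zero, and the generator's loss falls as its fakes earn higher scores.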

Training Difficulties

In theory, GANs converge to the data distribution. In practice, training was notoriously unstable:

  • Mode collapse: the generator produces only a few types of output, ignoring large parts of the data distribution.
  • Vanishing gradients: if the discriminator becomes too strong, it gives the generator no useful signal.
  • Balancing: generator and discriminator must stay roughly matched. Too much of either causes problems.

Taming GAN Training

A series of innovations improved stability:

  • DCGAN (Radford, 2015): architectural guidelines (batch norm, strided convolutions, specific activations).
  • WGAN: replaced the JS divergence with the Wasserstein distance for smoother gradients.
  • Spectral normalisation: constrained the discriminator's Lipschitz constant.
  • Progressive growing: trained at increasing resolutions, from 4×4 up to 1024×1024.

StyleGAN

StyleGAN (Karras, 2019) and its successors reached the peak of GAN image quality. A mapping network transforms noise into an intermediate latent space W, injected at multiple scales. This disentangles high-level attributes (pose, identity) from fine details (freckles, hair strands). The images were realistic enough to pass casual inspection — raising serious concerns about deepfakes.

Conditional GANs

Provide extra information (class label, text, segmentation map) to both generator and discriminator:

  • Pix2pix: paired image-to-image translation (edges → photos).
  • CycleGAN: unpaired translation using cycle-consistency (photos → Monet paintings).
  • Text-to-image GANs: laid the groundwork for DALL·E and its successors.

GANs Today

Diffusion models have largely surpassed GANs for image generation, offering more stable training and better mode coverage. But GANs are still faster — one forward pass, not hundreds of denoising steps. This keeps them useful for real-time tasks like game art and live design tools.

14.3   Diffusion Models

Diffusion models (Ho, 2020) generate data by learning to reverse a gradual noising process.

Forward Process

Start with a clean sample x0. Add small amounts of Gaussian noise over T steps, producing x1, x2, …, xT. By step T, the result is pure noise. The noise schedule β1, …, βT controls how much noise is added at each step.

A key property: you can jump directly to any step without simulating all previous ones. The distribution at step t is q(xt|x0) = N(xt; √ᾱt x0, (1 − ᾱt)I), where ᾱt = Π_{s=1}^t (1 − βs).
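The closed-form jump can be checked numerically; `betas` below is an assumed linear schedule in the spirit of DDPM:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # alpha_bar_t = product of (1 - beta_s)

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(8)
x_T = q_sample(x0, T - 1, rng.standard_normal(8))
print(alpha_bar[0], alpha_bar[-1])  # near 1 at the start, near 0 at the end
```

By the final step the signal coefficient √ᾱT is tiny, so xT is essentially pure Gaussian noise, matching the claim above.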

Reverse Process

A neural network learns to denoise: given a noisy xt and the time step t, predict the noise that was added. Starting from pure noise xT ~ N(0, I), the network iteratively removes noise to produce a clean sample.

The training loss is simple: mean squared error between predicted and actual noise:

ℒ = 𝔼t,x0,ε[‖ε − εθ(xt, t)‖²]
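One training example in outline, with a placeholder standing in for the network εθ (the ᾱ value and shapes here are illustrative; in practice ᾱt is looked up from the noise schedule):

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_theta(x_t, t):
    """Placeholder for the denoising U-Net's noise prediction."""
    return np.zeros_like(x_t)

# One training example: pick a random step, noise the sample, score the prediction.
x0 = rng.standard_normal(8)
t = int(rng.integers(0, 1000))
eps = rng.standard_normal(8)
a_bar = 0.5                                          # illustrative; from the schedule at step t
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)
print(loss)
```

With the zero predictor the loss reduces to the mean squared norm of the true noise; training pushes εθ toward that noise.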

Architecture

The denoising network is typically a U-Net (Ronneberger, 2015): a convolutional encoder-decoder with skip connections, plus self-attention layers. The time step is encoded with sinusoidal embeddings (borrowed from the Transformer). For conditional generation, text embeddings are injected via cross-attention.

Classifier-free guidance (Ho, 2022) trains the model both with and without the conditioning signal, then amplifies the difference at inference. This sharpens the model's adherence to the condition (e.g., "a photo of a dog wearing a hat").
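The guidance rule itself is one line: the conditional and unconditional noise predictions are combined with a guidance scale w (the numbers below are illustrative):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Amplify the conditional direction: eps_hat = eps_u + w * (eps_c - eps_u)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, 0.0])    # prediction without the condition
eps_c = np.array([0.3, -0.2])   # prediction with the condition
print(cfg(eps_u, eps_c, 1.0))   # w = 1 recovers the conditional prediction
print(cfg(eps_u, eps_c, 7.5))   # larger w exaggerates the conditioning signal
```

w = 0 ignores the condition entirely, w = 1 is plain conditional sampling, and larger values trade diversity for fidelity to the prompt.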

Faster Sampling

The original DDPM needs hundreds of denoising steps — much slower than a GAN. Solutions:

  • DDIM: reinterprets the reverse process as deterministic, allowing larger steps.
  • Consistency models (Song et al., 2023): map any noisy sample directly to the clean data in one step.
  • Latent diffusion (Rombach, 2022): run diffusion in the compressed latent space of a pre-trained autoencoder, reducing dimensionality by 48× or more. This is how Stable Diffusion works.

Applications

Diffusion models have achieved state-of-the-art results across many domains:

  • Image synthesis: DALL·E 2, Stable Diffusion, Imagen, Midjourney
  • Video generation: Sora and others extend diffusion to the temporal domain
  • Audio: high-quality speech and music synthesis
  • Science: candidate molecular structures, protein conformations, new materials

The mix of a simple training loss, stable learning (no two-player game), and full mode coverage has made diffusion the leading method for generation as of the mid-2020s.

Theoretical Foundation

Song et al. (2021) showed that the forward and reverse processes can be described by continuous-time stochastic differential equations (SDEs). The model learns a time-dependent score function ∇x log pt(x). Generation means solving the reverse-time SDE from noise. This continuous view enables adaptive ODE solvers for faster sampling and provides formal convergence guarantees.

14.4   Normalising Flows

A normalising flow learns an invertible transformation between a simple base distribution (typically a standard Gaussian) and the complex data distribution.

If fθ is invertible and maps z to x, the density of x is given by the change-of-variables formula:

px(x) = pz(fθ⁻¹(x)) |det ∂fθ⁻¹/∂x|

This gives exact log-likelihood — a major advantage over GANs (no density) and VAEs (only a lower bound).
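A one-dimensional sanity check of the change-of-variables formula, using an affine map as the "flow" (the constants are chosen arbitrarily):

```python
import numpy as np

# A one-dimensional affine "flow": x = a*z + b with base density z ~ N(0, 1).
a, b = 2.0, 1.0

def log_px(x):
    """log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1}/dx|."""
    z = (x - b) / a                              # invert the flow
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard normal log-density
    return log_pz + np.log(1.0 / abs(a))         # log |Jacobian of the inverse|

# Agrees with the analytic N(b, a^2) log-density.
x = 0.7
analytic = -0.5 * (((x - b) / a) ** 2 + np.log(2 * np.pi * a**2))
print(log_px(x), analytic)
```

The Jacobian factor is what keeps the density normalised after the transformation; dropping it would over- or under-count probability mass wherever the map stretches or compresses space.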

The Jacobian Challenge

A general d-dimensional transformation has a d × d Jacobian. Computing its determinant costs O(d³) — too expensive for images. The solution: design transformations with triangular Jacobians, whose determinant is just the product of the diagonal entries, computable in O(d).
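A quick numpy check that, for a triangular matrix, the diagonal product equals the full determinant:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 6
J = np.tril(rng.standard_normal((d, d)))   # a triangular Jacobian

# O(d) diagonal product matches the full O(d^3) determinant.
print(np.prod(np.diag(J)), np.linalg.det(J))
```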

Coupling Layers

NICE and RealNVP (Dinh, 2016) achieve this by splitting the input in half. One half is transformed (conditioned on the other); the other half passes through unchanged. The Jacobian is triangular by construction.
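A sketch of an affine coupling layer and its exact inverse, with a toy function standing in for the conditioning network (names and the conditioner are illustrative):

```python
import numpy as np

def coupling_forward(x, conditioner):
    """Affine coupling: transform x2 conditioned on x1; x1 passes through."""
    x1, x2 = np.split(x, 2)
    log_s, t = conditioner(x1)
    return np.concatenate([x1, x2 * np.exp(log_s) + t])

def coupling_inverse(y, conditioner):
    """Exact inverse: y1 == x1, so the same scale and shift can be recomputed."""
    y1, y2 = np.split(y, 2)
    log_s, t = conditioner(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-log_s)])

def net(h):
    """Toy conditioner standing in for a small neural network."""
    return np.tanh(h), h**2

x = np.array([0.5, -1.0, 2.0, 0.3])
y = coupling_forward(x, net)
print(coupling_inverse(y, net))   # recovers x exactly
```

Note that `net` itself need not be invertible: invertibility comes from the coupling structure, and the log-determinant is simply the sum of `log_s`.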

Autoregressive Flows

Each output dimension depends only on preceding dimensions, producing a triangular Jacobian naturally. MAF is fast for density evaluation but slow for sampling. IAF reverses this trade-off. Neural spline flows (Durkan et al., 2019) use monotonic splines for more flexible transformations.

Continuous Normalising Flows

Chen et al. (2018) defined the transformation as the solution to an ODE: dx/dt = vθ(x, t). The density change depends only on the trace of the Jacobian (cheaper than the determinant). Flow matching (Lipman et al., 2023) avoids expensive ODE integration during training by regressing directly onto a target velocity field.

Where Flows Excel

Flows are strongest where exact likelihood matters or where invertibility is directly useful:

  • Variational inference: flexible approximate posteriors beyond simple Gaussians.
  • Physics simulations: efficient sampling of Boltzmann distributions.
  • Anomaly detection: the exact log-likelihood serves as an anomaly score.

For image generation, flows have lagged behind GANs and diffusion models. The invertibility constraint prevents the generator from discarding information, limiting what it can learn. But the continuous flow perspective has merged with diffusion models, and the boundaries between these approaches continue to blur.

14.5   Language Generation

Language is different from images. The output space is a vocabulary of discrete tokens, and generation happens one token at a time. The dominant approach is autoregressive modelling:

p(x1, …, xn) = Π_{t=1}^n p(xt | x1, …, xt−1)

A neural network (the Transformer decoder with a causal mask) parameterises each conditional distribution.

Training

Training is straightforward: given a sequence of tokens, predict the next one at each position using cross-entropy loss. Although generation is sequential, training computes all positions in parallel (thanks to the causal mask). The model learns from a massive corpus — web text, books, code — absorbing knowledge about language, facts, reasoning, and style.
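The per-position loss can be sketched for a toy vocabulary (shapes and numbers are illustrative; a real model would produce the logits):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy over positions; logits has shape (positions, vocab)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Three positions, a vocabulary of four tokens; targets are the "next" tokens.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.0]])
targets = np.array([0, 1, 3])
print(next_token_loss(logits, targets))   # small: the model favours the targets
```

All positions are scored in one pass, which is exactly the parallelism the causal mask makes possible at training time.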

Sampling Strategies

How you sample from the predicted distribution matters enormously:

  • Greedy decoding: always pick the most probable token. Produces repetitive, degenerate text.
  • Random sampling: sample from the full distribution. Diverse but often incoherent.
  • Temperature scaling: temperature < 1 makes the distribution sharper (more conservative); > 1 makes it flatter (more creative).
  • Top-k sampling: restrict to the k most probable tokens.
  • Nucleus (top-p) sampling: restrict to the smallest set of tokens whose cumulative probability exceeds p. With p between 0.9 and 0.95, this tends to give the best balance of coherence and diversity.
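Temperature scaling and nucleus filtering compose naturally; a minimal sketch over a toy distribution (logit values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_p=1.0):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most probable first
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]   # smallest set with mass >= p
    kept = probs[keep] / probs[keep].sum()            # renormalise the nucleus
    return rng.choice(keep, p=kept)

logits = np.array([3.0, 1.5, 1.0, -2.0])
token = sample(logits, temperature=0.8, top_p=0.9)
```

As temperature and top_p shrink, the sampler collapses toward greedy decoding; as both grow, it approaches sampling from the full distribution.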

Beam Search

Beam search maintains k candidate sequences and extends each by the most probable next tokens. It produces more fluent text than greedy decoding but tends toward bland, generic output. Holtzman et al. (2019) showed that human text has surprising local unpredictability, and that sampling strategies mimicking this produce more natural-sounding text.

Improving Generation

  • Prompt engineering: shape the output by crafting the input context.
  • Constrained decoding: enforce structure (valid JSON, rhyme scheme) by modifying token probabilities.
  • Speculative decoding: a small "draft" model proposes several tokens at once. The large model verifies them in parallel, accepting correct ones and resampling the rest. This can cut latency by 2× or more without changing the output distribution.

The Bigger Picture

Modern language models write text of striking quality across nearly any topic. They power chatbots, code tools, writing aids, and summary systems. But fluency is not understanding. Language models make things up, reflect training biases, and cannot check their own claims. Chapter 15 covers the fixes: grounding, alignment, and retrieval.