Diffusion Model, Glossary, Textbook of AI

Diffusion models are a family of generative models that learn to invert a forward noise-adding process. The forward process gradually adds Gaussian noise to data over many timesteps until the result is pure noise; a neural network is trained to predict the noise added at each step (equivalently, to predict the denoised data). At sampling time, the network iteratively denoises pure noise back to clean samples.

The modern formulation was set out by Ho, Jain and Abbeel in their 2020 paper Denoising Diffusion Probabilistic Models (DDPM). Earlier work by Sohl-Dickstein et al. (2015) had introduced the basic idea but had not reached competitive sample quality. DDPM and closely related work (improved DDPM, classifier-free guidance, score-based generative models) went on to surpass GANs on many image-generation benchmarks by 2022.

Diffusion models displaced GANs as the dominant image-generation paradigm in 2022 with the release of DALL-E 2 (April 2022, OpenAI), Imagen (May 2022, Google), Midjourney (July 2022) and Stable Diffusion (August 2022, Stability AI). Stable Diffusion's release under an open licence, with weights that could be downloaded and run on consumer GPUs, was particularly transformative and seeded an enormous ecosystem of open-source image-generation tools.

Diffusion models have since been extended to video generation (Sora, Veo, Runway Gen-3, Stable Video Diffusion); 3D generation (Magic3D, DreamFusion); audio generation (AudioLDM, MusicLDM); protein structure design (RFdiffusion, Chroma); robotics (Diffusion Policy); and language modelling (Diffusion-LM, though autoregressive Transformers remain dominant).

The mathematical theory of diffusion models has unified perspectives from variational inference, score matching, stochastic differential equations and Langevin dynamics, making diffusion one of the most theoretically rich generative-modelling frameworks.

Mathematics

A denoising diffusion probabilistic model (DDPM) defines a forward Markov chain that gradually adds Gaussian noise to data over $T$ timesteps:

$$q(x_t | x_{t-1}) = \mathcal{N}\bigl(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I\bigr)$$

where $\beta_t \in (0, 1)$ is the noise schedule. With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$, the marginal at time $t$ has the closed form

$$q(x_t | x_0) = \mathcal{N}\bigl(x_t; \sqrt{\bar\alpha_t} \, x_0, (1 - \bar\alpha_t) I\bigr)$$

so we can sample $x_t$ directly from $x_0$ as $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
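The closed-form marginal above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the linear schedule with endpoints 1e-4 and 0.02 over $T = 1000$ steps is an assumption (it is the commonly cited DDPM default, but the text above does not fix a schedule), and the helper names `linear_beta_schedule` and `q_sample` are hypothetical. It draws $x_t$ in one shot from $x_0$ and compares the result with running the forward chain one step at a time.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas. The endpoints are an assumed, commonly used default."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly. Timestep t is 1-indexed, as in the text."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar[t-1] = prod_{s=1}^{t} alpha_s

rng = np.random.default_rng(0)
x0 = np.full(50_000, 2.0)               # toy "data": many copies of the scalar 2.0
t = 300

# One-shot sample from the closed-form marginal.
x_t, _ = q_sample(x0, t, alpha_bar, rng)

# The same distribution reached by applying q(x_s | x_{s-1}) for s = 1..t.
x_iter = x0.copy()
for s in range(t):
    noise = rng.standard_normal(x_iter.shape)
    x_iter = np.sqrt(alphas[s]) * x_iter + np.sqrt(betas[s]) * noise

print("closed form: mean", x_t.mean(), "var", x_t.var())
print("iterated   : mean", x_iter.mean(), "var", x_iter.var())
print("predicted  : mean", np.sqrt(alpha_bar[t - 1]) * 2.0, "var", 1.0 - alpha_bar[t - 1])
```

Both routes should agree with the predicted mean $\sqrt{\bar\alpha_t}\,x_0$ and variance $1 - \bar\alpha_t$ up to sampling error, which is why training can jump to any timestep without simulating the chain.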

The reverse process is a Markov chain of Gaussian transitions whose mean and covariance are produced by a neural network:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\bigl(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\bigr).$$

Training minimises a simplified version of the variational lower bound, a denoising objective:

$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\bigl\| \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon, t) \bigr\|^2\right]$$

with $t \sim \mathrm{Uniform}\{1, \ldots, T\}$. The network $\epsilon_\theta$ predicts the noise that was added; sampling iterates

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \!\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$

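starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and denoising step by step. Here $\sigma_t^2$ is the variance of the reverse step; Ho et al. use either $\sigma_t^2 = \beta_t$ or the posterior variance $\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t} \beta_t$, and no noise is added at the final step $t = 1$.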
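The objective and the sampling loop above can be sketched together in NumPy. This is an illustration under stated assumptions rather than a training recipe: instead of a trained network it uses a hypothetical stand-in `eps_oracle`, which is the exact noise predictor for a toy dataset in which every sample equals a fixed vector `c`. The schedule, the choice $\sigma_t^2 = \beta_t$, and the names `simple_loss` and `ddpm_sample` are likewise assumptions made for the example.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.05, T)      # assumed schedule for this toy example
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

c = np.array([1.5, -0.5])               # toy dataset: every data point equals c

def eps_oracle(x_t, t):
    """Stand-in for eps_theta: the exact added noise when all data equals c."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * c) / np.sqrt(1.0 - a)

def simple_loss(eps_model, x0, rng, n_draws=256):
    """Monte Carlo estimate of L_simple with t ~ Uniform{1, ..., T}."""
    total = 0.0
    for _ in range(n_draws):
        t = int(rng.integers(1, T + 1))             # upper bound is exclusive
        eps = rng.standard_normal(x0.shape)
        a = alpha_bar[t - 1]
        x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
        total += np.sum((eps - eps_model(x_t, t)) ** 2)
    return total / n_draws

def ddpm_sample(eps_model, shape, rng):
    """Ancestral sampling: iterate the update rule from t = T down to t = 1."""
    x = rng.standard_normal(shape)                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps_hat = eps_model(x, t)
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t - 1])
        sigma = np.sqrt(betas[t - 1])               # assumed choice: sigma_t^2 = beta_t
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + sigma * z
    return x

rng = np.random.default_rng(0)
loss = simple_loss(eps_oracle, c, rng)
sample = ddpm_sample(eps_oracle, c.shape, rng)
print("loss with exact predictor:", loss)
print("sample:", sample, "target:", c)
```

With an exact noise predictor the loss is zero up to rounding and the sampler lands on `c`. In practice `eps_oracle` would be replaced by a neural network trained by gradient descent on the same loss.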

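Classifier-free guidance (Ho and Salimans, 2022) trains a single conditional model $\epsilon_\theta(x_t, t, c)$ in which the conditioning $c$ is randomly dropped during training (replaced by a null input $\emptyset$), so the same network also learns the unconditional prediction; sampling combines conditional and unconditional predictions: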

$$\tilde \epsilon_\theta(x_t, t, c) = (1 + w) \epsilon_\theta(x_t, t, c) - w \epsilon_\theta(x_t, t, \emptyset)$$

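with guidance scale $w$. Setting $w = 0$ recovers the plain conditional model, while larger $w$ pushes samples further towards the condition at the cost of diversity. Classifier-free guidance underlies the text conditioning of Stable Diffusion, Imagen and DALL-E 3.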
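The guidance rule above is a single linear combination of two network outputs. A minimal sketch follows, with made-up prediction vectors standing in for $\epsilon_\theta(x_t, t, c)$ and $\epsilon_\theta(x_t, t, \emptyset)$; the function name `cfg_combine` is hypothetical.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Illustrative values only; a real model would produce these at each sampling step.
eps_cond = np.array([0.2, -0.4, 1.0])     # prediction given the condition c
eps_uncond = np.array([0.1, -0.1, 0.5])   # prediction with the condition dropped

guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
print(guided)
```

The same expression can be read as `eps_cond + w * (eps_cond - eps_uncond)`: the conditional prediction is extrapolated away from the unconditional one. Each sampling step therefore needs two forward passes of the network, one with and one without the condition.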

DDIM (Song et al., 2021) reformulates sampling as a deterministic, non-Markovian process, which allows the number of sampling steps to be cut from around a thousand to tens. Latent diffusion (Rombach et al., 2022), the basis of Stable Diffusion, runs the diffusion process in the learned, compressed latent space of an autoencoder rather than in pixel space, substantially reducing compute.
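A deterministic DDIM-style sampler can be sketched as follows. At each step the current noise estimate is used to form a prediction of $x_0$, which is then re-noised to an earlier timestep using the same noise estimate, with no fresh randomness. The sketch makes several assumptions: a linear schedule, ten evenly spaced timesteps, and a hypothetical stand-in `eps_oracle` (the exact noise predictor for a toy dataset whose samples all equal `c`) in place of a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed schedule
alpha_bar = np.cumprod(1.0 - betas)

c = np.array([1.5, -0.5])               # toy dataset: every data point equals c

def eps_oracle(x_t, t):
    """Stand-in for eps_theta: the exact added noise when all data equals c."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * c) / np.sqrt(1.0 - a)

def ddim_sample(eps_model, x_T, n_steps):
    """Deterministic DDIM-style sampling over a short subsequence of timesteps."""
    steps = np.linspace(T, 1, n_steps).round().astype(int)   # descending, 1-indexed
    x = x_T
    for i, t in enumerate(steps):
        t_prev = steps[i + 1] if i + 1 < len(steps) else 0
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[t_prev - 1] if t_prev > 0 else 1.0  # alpha_bar_0 = 1
        eps_hat = eps_model(x, t)
        # Predict the clean sample, then move it to the earlier noise level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_hat
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(c.shape)
out_a = ddim_sample(eps_oracle, x_T, n_steps=10)
out_b = ddim_sample(eps_oracle, x_T, n_steps=10)
print("10-step sample:", out_a, "target:", c)
```

Because no noise is injected, the same starting $x_T$ always maps to the same output, so the only source of randomness is the initial draw.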

Interactive

Diffusion sampling, from noise to image. Start at pure Gaussian noise, denoise step by step, and a structure emerges.

The forward diffusion process: adding noise step by step. An image gradually corrupted by Gaussian noise becomes pure static.

Discussed in:

Chapter 14: Generative Models
