A Diffusion Model generates data by learning to reverse a gradual noising process. The forward process takes a clean sample $\mathbf{x}_0$ and progressively adds Gaussian noise over $T$ time steps, producing $\mathbf{x}_1, \ldots, \mathbf{x}_T$, where $\mathbf{x}_T$ is approximately pure noise. The reverse process, parameterised by a neural network (typically a U-Net), learns to denoise: starting from noise, it iteratively removes noise to recover a sample from the data distribution.
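Concretely, in the common DDPM parameterisation (a standard formulation, with a variance schedule $\beta_1, \ldots, \beta_T$), each forward step adds a small amount of Gaussian noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\big)$$

and, writing $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, the marginal at any step $t$ has the closed form

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),$$

so $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ without simulating the intermediate steps.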
Training is surprisingly simple. A crucial property of the forward process is that the marginal at any time step has a closed form, so training reduces to three steps: sample a random time step $t$, add the corresponding amount of noise to a clean example, and train the network to predict the noise that was added. The loss is just the mean squared error between predicted and actual noise. This simple objective of predicting the noise, together with stable training and excellent mode coverage, has made diffusion models the dominant generative paradigm of the 2020s.
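The training procedure above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `model` is a hypothetical callable `model(x_t, t) -> predicted_noise` (in practice a U-Net), and the linear $\beta$ schedule is one common choice among several.

```python
import numpy as np

def training_step(model, x0, T=1000):
    """One diffusion training step: return the noise-prediction MSE loss.

    model -- hypothetical denoiser: model(x_t, t) -> predicted noise
    x0    -- a clean data sample (numpy array)
    """
    # Linear variance schedule beta_1..beta_T (a common choice).
    betas = np.linspace(1e-4, 0.02, T)
    # Cumulative product alpha_bar_t = prod(1 - beta_s), giving the
    # closed-form marginal q(x_t | x_0).
    alphas_bar = np.cumprod(1.0 - betas)

    # 1. Sample a random time step t.
    t = np.random.randint(0, T)
    # 2. Add the appropriate amount of Gaussian noise in one shot,
    #    using the closed-form marginal.
    eps = np.random.randn(*x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    # 3. Train the network to predict the noise that was added:
    #    the loss is mean squared error between predicted and actual noise.
    pred = model(x_t, t)
    return np.mean((pred - eps) ** 2)
```

In a real training loop this loss would be backpropagated through the network; here the gradient step is omitted to keep the sketch framework-agnostic.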
Diffusion models power DALL·E 2, Stable Diffusion, Imagen, Midjourney, and Sora. Classifier-free guidance trains the model with and without conditioning, then amplifies the difference between the two predictions at inference time to sharpen conditional generation. Latent diffusion runs the diffusion process in a VAE's compressed latent space rather than pixel space, dramatically reducing cost. DDIM and consistency models reduce the many denoising steps to a few or even one. Diffusion models have extended beyond images to video (Sora), audio, molecules, protein structures, and materials.
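The classifier-free guidance step mentioned above can be sketched as follows. This is an illustrative snippet under assumed names: `model` is a hypothetical conditional denoiser `model(x_t, t, cond)` that accepts `cond=None` for the unconditional prediction, and `w` is the guidance scale.

```python
def guided_noise(model, x_t, t, cond, w=7.5):
    """Classifier-free guidance: combine conditional and unconditional
    noise predictions, amplifying their difference by the scale w.

    model -- hypothetical denoiser: model(x_t, t, cond) -> predicted noise,
             where cond=None gives the unconditional prediction
    w     -- guidance scale (w=1 recovers plain conditional prediction)
    """
    eps_uncond = model(x_t, t, None)   # prediction without conditioning
    eps_cond = model(x_t, t, cond)     # prediction with conditioning
    # Extrapolate past the conditional prediction in the direction that
    # the conditioning pushed it.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At `w = 1` this reduces to the ordinary conditional prediction; larger values trade sample diversity for fidelity to the conditioning signal.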
Related terms: Variational Autoencoder, Generative Adversarial Network, Generative Model
Discussed in:
- Chapter 14: Generative Models — Diffusion Models
Also defined in: Textbook of AI