Start at pure Gaussian noise, denoise step by step, and a structure emerges.
From the chapter: Chapter 14, Generative Models
Glossary: diffusion model, DDPM, score matching, classifier-free guidance
People: Jonathan Ho, Yang Song
References: Ho et al., 2020
Transcript
A diffusion model generates images by reversing a noise process. Training adds noise to images; sampling removes it.
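The training side ("adds noise to images") can be written in closed form: a noisy sample at step t is a weighted mix of the clean image and fresh Gaussian noise. The sketch below assumes the standard DDPM parameterization with a linear beta schedule; the specific schedule values are illustrative, not prescribed by the transcript.

```python
import numpy as np

def add_noise(x0, t, alpha_bar, rng):
    """Forward (noising) process q(x_t | x_0):
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps,
    with eps ~ N(0, I). alpha_bar is the cumulative product of
    (1 - beta) over the noise schedule."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the regression target for the denoiser

# Illustrative linear schedule, as in the original DDPM setup.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
```

Training then amounts to sampling a random t, noising an image with `add_noise`, and regressing the network's output onto the returned `eps`.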
The starting point is pure Gaussian noise. There is no image yet, just static.
At each step, the trained network predicts the noise present in the current image, and a scaled portion of that prediction is subtracted.
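One such reverse step, in the standard DDPM epsilon-prediction form, looks like the sketch below. The predicted noise is passed in as an argument so the step stays model-agnostic; the schedule values here are a tiny illustrative example, not a tuned configuration.

```python
import numpy as np

def reverse_step(xt, t, eps_pred, betas, alpha_bar, rng):
    """One DDPM reverse step: subtract a scaled portion of the
    predicted noise, rescale, and (for t > 0) re-inject a small
    amount of fresh Gaussian noise."""
    alpha_t = 1.0 - betas[t]
    # Mean of p(x_{t-1} | x_t), per the DDPM parameterization.
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # final step is deterministic
    sigma = np.sqrt(betas[t])  # one common simple choice of step variance
    return mean + sigma * rng.standard_normal(xt.shape)

# Tiny demo schedule (illustrative values only).
betas = np.linspace(1e-4, 0.02, 10)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
x_prev = reverse_step(x, 5, np.zeros_like(x), betas, alpha_bar, rng)
```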
Early steps barely change the picture. The schedule deliberately removes only a little, letting the model form a coarse guess at the overall structure.
Mid-way through, low-frequency content appears. A rough shape begins to emerge from the noise.
Toward the end, fine details snap into place. After a few dozen steps the image is complete (the original DDPM sampler used a thousand steps; faster samplers need far fewer).
That is one sample. Stable Diffusion, DALL-E, and recent video models all use this same idea: train a denoiser, then run it iteratively from noise.
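The full recipe, "run it iteratively from noise," can be sketched as a single loop. The `toy_denoiser` below is a placeholder standing in for a trained network (a real model would predict the noise from the current sample and the timestep); everything else follows the DDPM reverse update, with an illustrative short schedule.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)

def toy_denoiser(xt, t):
    # Placeholder for a trained network. A real denoiser predicts the
    # noise component of xt given t; this stub only lets the loop run.
    return xt * np.sqrt(1.0 - alpha_bar[t])

def sample(shape, rng):
    x = rng.standard_normal(shape)       # start from pure Gaussian noise
    for t in reversed(range(T)):         # iterate the denoiser backward in t
        eps_pred = toy_denoiser(x, t)
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
        if t > 0:                        # re-inject a little noise except at the end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

out = sample((8, 8), np.random.default_rng(0))
```

Swapping `toy_denoiser` for a trained network turns this sketch into an actual DDPM sampler; fast samplers such as DDIM change the update rule to take fewer, larger steps.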