14.9 Diffusion models, DDPM

Video: Diffusion sampling, from noise to image (1:10). Start at pure Gaussian noise, denoise step by step, and structure emerges.
If you trained a generative network in 2018 you reached for a GAN. If you trained one in 2021 you reached, perhaps reluctantly, for a VAE. If you train one in 2026 you almost certainly reach for a diffusion model. Stable Diffusion, DALL-E 3, Midjourney, Sora, Veo, Luma, AlphaFold 3, Stable Audio, RFDiffusion and the diffusion policies powering modern robot manipulation all share the same algorithmic spine. The dominant generative architecture of the 2020s is, by quite a margin, this one.

The idea, when one strips away the variational machinery and the score-based reformulations, is short. Take a clean image. Add a tiny amount of Gaussian noise. Add a little more. Add a little more. Keep going until what remains is indistinguishable from pure white noise. That is the forward process: a destruction of structure, performed gradually so that each step is small. Now train a neural network to undo a single step, to look at a slightly noisy image and predict the noise that was added. Once the network can do this reliably, you have a generative model. Sample a fresh draw of pure noise, ask the network to denoise it, repeat, and out the other end emerges a coherent image.

What makes diffusion different from its predecessors is not just that it works, but that it works stably. A GAN is a two-player game whose equilibrium is fragile and whose training can collapse without warning. A VAE forces a single bottleneck through which all of the world's variation must squeeze, and trades sharpness for coverage in ways that no engineer ever quite controls. A normalising flow buys exact likelihoods at the cost of an architecture too constrained to discard nuisance information. Diffusion training, by contrast, is just denoising regression, a task on which deep networks have been reliably good since the 1990s. There is no minimax. There is no partition function. There is no posterior collapse. Sample a clean image, sample a timestep, sample a noise vector, take a gradient step on mean-squared error. That is the entire algorithm.

The price for this stability is computational. A standard DDPM at full fidelity runs the denoising network a thousand times per sample. Latent diffusion (§14.12), DDIM (§14.11), DPM-Solver and consistency-model distillation have driven this number down to twenty, ten, and in some cases one, but the underlying compute cost remains the chief reason that pixel-space diffusion at $1024 \times 1024$ resolution is still expensive.

This section establishes the mathematics. The VAE machinery of §14.4 reappears here, since the diffusion training objective derives from the same evidence lower bound. §14.10 picks up classifier-free guidance and §14.12 covers latent-space diffusion.

Symbols used here

  • $\mathbf{x}_0$: clean image
  • $\mathbf{x}_t$: image at noise level $t$
  • $T$: total number of noise steps
  • $\beta_t$: noise schedule (small positive numbers)
  • $\alpha_t = 1 - \beta_t$: per-step signal retention
  • $\bar\alpha_t = \prod_{s \le t}\alpha_s$: cumulative signal retention
  • $\boldsymbol{\epsilon}$: Gaussian noise
  • $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$: noise predictor (a neural network)
  • $\mathcal{N}$: Gaussian distribution

Forward (noising) process

The forward process is a Markov chain that begins at a clean datum $\mathbf{x}_0 \sim p_{\text{data}}$ and progressively adds Gaussian noise. At each step, the previous image is shrunk slightly and topped up with fresh isotropic noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right).$$

The schedule $0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1$ controls how aggressively the noise grows. The most common choice, the one in Ho et al.'s original 2020 paper, is linear from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps. The shrinkage factor $\sqrt{1 - \beta_t}$ keeps the marginal variance bounded as the chain proceeds: without it, repeatedly adding noise would cause the variance to blow up unboundedly. With it, after a thousand steps the variance has settled to (approximately) that of a unit Gaussian, regardless of where it started.
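To make the schedule concrete, here is a minimal sketch in PyTorch; the tensor names beta, alpha and alpha_bar are mine, reused in the later snippets, and mirror the definitions above rather than any particular library's API:

import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)        # linear schedule, beta_1 .. beta_T
alpha = 1.0 - beta                          # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)     # \bar\alpha_t: cumulative product up to t

print(alpha_bar[-1])                        # ~4e-5: x_T is essentially unit-Gaussian noise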

So far this is just simulation, and an expensive one: to obtain $\mathbf{x}_t$ for some $t$ during training, we would have to step through the chain from $\mathbf{x}_0$ to $\mathbf{x}_t$, simulating each link. For $t$ near $T$ this is a thousand sequential operations per training example. Diffusion models are saved from this fate by a small algebraic miracle: the closed-form marginal.

Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s = 1}^{t} \alpha_s$. Then the distribution of $\mathbf{x}_t$ given $\mathbf{x}_0$ is itself Gaussian, and we can write down its mean and variance directly:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\, (1 - \bar\alpha_t)\mathbf{I}\right).$$

Equivalently, the reparameterised sample is

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

This expression says: the noisy image at step $t$ is a deterministic blend of the clean image (scaled by $\sqrt{\bar\alpha_t}$) and a fresh Gaussian noise vector (scaled by $\sqrt{1 - \bar\alpha_t}$). At $t = 0$ the second factor is zero and we recover $\mathbf{x}_0$; at $t = T$ the first factor is essentially zero and we recover pure noise. The intermediate points trace a smooth interpolation.

The proof is by induction. The base case is the definition of $q(\mathbf{x}_1 \mid \mathbf{x}_0)$. For the inductive step, assume the formula holds at $t-1$, so $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar\alpha_{t-1}}\boldsymbol{\eta}$ for some $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Apply one further forward step: $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\xi}$, with $\boldsymbol{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ independent of $\boldsymbol{\eta}$. Substituting and collecting noise terms, the variance contribution is $\alpha_t(1 - \bar\alpha_{t-1}) + \beta_t = \alpha_t - \alpha_t \bar\alpha_{t-1} + 1 - \alpha_t = 1 - \bar\alpha_t$, since $\alpha_t \bar\alpha_{t-1} = \bar\alpha_t$. Two independent Gaussians sum to a Gaussian whose variance is the sum, so the noise terms collapse into a single $\sqrt{1 - \bar\alpha_t}\,\boldsymbol{\epsilon}$ contribution. The mean is $\sqrt{\alpha_t \bar\alpha_{t-1}}\,\mathbf{x}_0 = \sqrt{\bar\alpha_t}\,\mathbf{x}_0$. Done.

The practical importance of this is hard to overstate. To train a diffusion model we need to be able to produce, at low cost, a noised version of every training image at a randomly selected timestep. Without the closed form, generating one training pair would take $O(T)$ network evaluations; with it, the cost is $O(1)$. A thousand-fold speedup of the training pipeline turns out to make the difference between "diffusion is an interesting curiosity" and "diffusion is the basis of every text-to-image system in production".
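A quick numerical check of the closed form, reusing the schedule tensors defined above on a toy one-dimensional "dataset": stepping through 500 links of the chain and jumping directly via the marginal should produce samples with matching mean and spread.

x0 = torch.full((100_000,), 5.0)            # 100k copies of a "pixel" with value 5
t = 500

x = x0.clone()                              # simulate the chain link by link
for s in range(t):
    x = torch.sqrt(1 - beta[s]) * x + torch.sqrt(beta[s]) * torch.randn_like(x)

x_jump = torch.sqrt(alpha_bar[t - 1]) * x0 \
       + torch.sqrt(1 - alpha_bar[t - 1]) * torch.randn_like(x0)   # one-step jump

print(x.mean(), x.std())                    # both ~1.41 and ~0.96
print(x_jump.mean(), x_jump.std())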

Reverse (denoising) process

The forward process destroys information; the reverse process must rebuild it. We want to learn the conditional distribution $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$: given a slightly noisier image, recover the slightly less noisy one. Apply this conditional repeatedly, starting from pure noise, and the chain runs in reverse all the way back to a sample from the data distribution.

There is a difficulty: the true reverse conditional $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ depends on $p_{\text{data}}$ in a complicated way and is not Gaussian in general. (To see this, imagine running the forward process on a multimodal distribution: at intermediate noise levels several modes blur together, and the reverse must somehow disambiguate them.) But, and this is the second algebraic gift of the linear-Gaussian forward process, the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, conditioned on knowing the original clean image as well as the noised one, is Gaussian, with a closed-form mean and variance:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde\beta_t \mathbf{I}\right),$$

with

$$\tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\mathbf{x}_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t.$$

This Gaussian, the true reverse posterior, is what our learned reverse process should match. We parametrise

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \boldsymbol\Sigma_\theta(\mathbf{x}_t, t)\right),$$

and ask the network to learn $\boldsymbol{\mu}_\theta$ (and possibly $\boldsymbol\Sigma_\theta$, though Ho et al. simply set the variance to a fixed schedule). The functional form is fixed by the structure of the forward process; only the mean and (optionally) variance are learned.

To draw a sample, apply the learned reverse chain from pure noise back to data:

  1. Initialise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
  2. For $t = T, T-1, \ldots, 1$, predict the noise $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, compute the posterior mean from it, and sample $\mathbf{x}_{t-1}$ from the resulting Gaussian (or take the mean if $t = 1$).
  3. Return $\mathbf{x}_0$.

In equations, the per-step update is

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}_t, \qquad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \text{ if } t > 1,$$

with $\sigma_t^2$ taken either as $\tilde\beta_t$ (the posterior variance) or $\beta_t$ (the forward variance); both choices appear in the literature and produce indistinguishable samples in practice.
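In code, the ancestral sampler is a direct transcription of this update. A minimal sketch, assuming a trained noise predictor network(x, t) and the schedule tensors defined earlier, with $\sigma_t^2 = \beta_t$:

@torch.no_grad()
def sample(network, shape):
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = network(x, torch.full((shape[0],), t))        # predicted noise
        coef = beta[t - 1] / torch.sqrt(1 - alpha_bar[t - 1])
        mean = (x - coef * eps) / torch.sqrt(alpha[t - 1])  # posterior mean from eps
        if t > 1:
            x = mean + torch.sqrt(beta[t - 1]) * torch.randn_like(x)  # sigma_t^2 = beta_t
        else:
            x = mean                                        # last step: take the mean
    return x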

A useful intuition: each step pulls the noisy image a small amount in the direction of the predicted clean image, then adds back a small amount of independent noise. The pull is conservative; the network never claims to know the answer in one shot. The noise injection is deliberate: it gives the chain the variability it needs to explore the data distribution rather than collapsing onto a single mode. Setting $\sigma_t = 0$ everywhere yields a deterministic trajectory; this is the DDIM sampler (§14.11), which trades stochasticity for the ability to use far fewer steps without quality loss.

Training: simple denoising loss

The training objective falls out of the variational bound for a hierarchical latent-variable model whose latents are the noised images $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$. The same ELBO machinery that powered the VAE (§14.4) applies, with one twist: most of the conditional distributions in the chain are fixed by the forward process and need not be learned. The only unknowns are the reverse Gaussians $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.

After several pages of bookkeeping (the full derivation is in Ho et al., 2020), the bound reduces to a sum of per-step KL divergences between the true posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the learned $p_\theta$, plus boundary terms that turn out not to matter. For two Gaussians with the same variance, the KL is proportional to the squared difference of their means. Hence the loss is, up to constants,

$$\mathcal{L}_t \;\propto\; \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\bigl[\,\|\tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2\,\bigr].$$

So far this is just MSE on the posterior mean. The clever bit is the reparametrisation. Using the closed form $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1 - \bar\alpha_t}\,\boldsymbol{\epsilon})/\sqrt{\bar\alpha_t}$, the true posterior mean rewrites as

$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\boldsymbol{\epsilon}\right),$$

a function of the noised image $\mathbf{x}_t$ and the original noise $\boldsymbol{\epsilon}$ alone. Parametrise the network in the same form: predict $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and plug it into the same expression. The squared-mean-difference loss then reduces, after some algebra, to a squared-noise-difference loss. Ho et al. went one step further and dropped the $t$-dependent prefactor entirely. The training objective becomes the simple loss:

$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{\mathbf{x}_0,\, t,\, \boldsymbol{\epsilon}}\!\left[\,\bigl\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\,\boldsymbol{\epsilon},\, t\right)\bigr\|^2\,\right].$$

The expectation is over a clean image $\mathbf{x}_0$ from the dataset, a timestep $t$ uniform on $\{1, \ldots, T\}$, and a Gaussian noise vector $\boldsymbol{\epsilon}$. The network sees the noised image and the timestep, and must predict the noise that was added. Mean-squared error. That is all.

The training loop is correspondingly minimal. In PyTorch, with the schedule tensors as above and network, data_loader and optimiser assumed in scope:

for x0 in data_loader:                                 # x0: batch of clean images (B, C, H, W)
    t = torch.randint(1, T + 1, (x0.shape[0],))        # random timestep per example
    epsilon = torch.randn_like(x0)                     # random noise
    a_bar = alpha_bar[t - 1].view(-1, 1, 1, 1)         # \bar\alpha_t, broadcast over pixels
    xt = a_bar.sqrt() * x0 \
       + (1 - a_bar).sqrt() * epsilon                  # add noise in one jump
    eps_pred = network(xt, t)                          # predict the noise
    loss = ((epsilon - eps_pred) ** 2).mean()          # MSE
    optimiser.zero_grad(); loss.backward(); optimiser.step()

There is no adversarial game. There is no posterior to approximate. There is no partition function. There is no constraint on the network architecture beyond being able to take an image-shaped input and a scalar timestep, and produce an image-shaped output. The network is typically a U-Net with sinusoidal time embeddings and self-attention at the lower resolutions; for a Diffusion Transformer (DiT) the U-Net is replaced by a Transformer over patch tokens, but the loss is identical.

The dropping of the $t$-weight is empirical, not principled. The full variational bound puts more weight on smaller $t$ (where the residual signal is high and small errors matter most). Dropping the weight effectively up-weights large-$t$ losses, which improves perceptual sample quality at some cost to log-likelihood. Modern training pipelines often reintroduce schedule-dependent weightings (e.g. min-SNR weighting from Hang et al., 2023) that interpolate between $\mathcal{L}_{\text{simple}}$ and the variational bound and recover both good likelihoods and good samples.
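As a sketch of one such weighting, min-SNR-$\gamma$ in its $\epsilon$-prediction form following Hang et al. (2023), with the clamp constant $\gamma = 5$ taken from that paper, the loss in the training loop above could be reweighted like this:

snr = alpha_bar / (1 - alpha_bar)                      # SNR(t) for every timestep
gamma = 5.0
w_minsnr = torch.clamp(snr, max=gamma) / snr           # min(SNR, gamma) / SNR
per_example = ((epsilon - eps_pred) ** 2).mean(dim=(1, 2, 3))
loss = (w_minsnr[t - 1] * per_example).mean()          # t is the per-example timestep batch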

Why this works

Why does noise prediction, of all things, give us a generative model? The answer is the score function.

Recall from §14.8 that the score of a distribution $p$ at $\mathbf{x}$ is the gradient of the log-density: $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$. A vector field that, at every point, points uphill on the log-density, towards regions of higher probability. If we can estimate the score, we can ascend it to find typical samples; this is the basis of Langevin sampling, energy-based models, and score-based generative modelling generally.

The closed-form forward marginal lets us read off the score of the noisy distribution. Recall $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\mathbf{x}_0,\, (1 - \bar\alpha_t)\mathbf{I})$. The score is the gradient of the log-density of this Gaussian with respect to $\mathbf{x}_t$:

$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0}{1 - \bar\alpha_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar\alpha_t}}.$$

The last equality uses the reparameterisation $\mathbf{x}_t - \sqrt{\bar\alpha_t}\mathbf{x}_0 = \sqrt{1 - \bar\alpha_t}\boldsymbol{\epsilon}$. Marginalising over $\mathbf{x}_0$ (which we cannot do analytically, but which Tweedie's identity handles in expectation) leaves the unconditional score $\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t)$ as $-\mathbb{E}[\boldsymbol{\epsilon} \mid \mathbf{x}_t]/\sqrt{1 - \bar\alpha_t}$. Predicting the noise is equivalent, up to a $t$-dependent constant, to estimating the score of the noisy distribution.
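In code, the conversion is a one-liner on top of the trained noise predictor, continuing the earlier snippets:

# estimated score of q_t at x_t, from the noise prediction
score = -network(xt, t) / torch.sqrt(1 - alpha_bar[t - 1]).view(-1, 1, 1, 1)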

This is the link to score matching (Hyvärinen, 2005; Vincent, 2011). Denoising score matching trains a network to denoise corrupted samples; the optimal denoiser is the conditional expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$, which by Tweedie's identity determines the score exactly. Diffusion models are score matching, dressed up in different notation. Song & Ermon (2019) and Song et al. (2021) made this equivalence formal and unified DDPM with score-based models under a single SDE framework (§14.13).

The geometric picture: at each noise level $t$, the noisy distribution $q_t$ is a smoothed version of the data distribution. As $t$ grows, the smoothing becomes more aggressive; at $t = T$ the distribution is essentially a unit Gaussian. The score field of $q_t$ points, at every point, towards the local centre of mass of nearby data. The reverse SDE walks this score field upward while simultaneously reducing the noise level: at large $t$ the score is broad and uninformative; at small $t$ it is sharp and pointed at specific data examples. The trajectory is a noisy gradient ascent on a sequence of progressively sharper landscapes, and the sample emerges where the sharpest landscape has its mode.

This is also why diffusion models are good at coverage. A score field is global: every mode of the data distribution contributes to it, and the chain's stochasticity ensures that all modes are visited with roughly the right probability. GANs lose modes because the discriminator only ever sees a few of them; diffusion sees the whole distribution at every training step, in the form of the average noise to subtract. There is no incentive to drop a mode: doing so would make the noise prediction worse on examples from that mode.

Worked numerical sketch

Take a single one-dimensional pixel for clarity. Set $\mathbf{x}_0 = 5$ (a moderately bright value), $T = 1000$, with linear $\beta_t$ from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. We want to follow what happens to $\mathbf{x}_t$ as $t$ marches from $0$ to $T$.

At $t = 1$, $\bar\alpha_1 = \alpha_1 = 1 - 10^{-4} = 0.9999$, so $\mathbf{x}_1 \approx \sqrt{0.9999}\cdot 5 + \sqrt{0.0001}\cdot \boldsymbol{\epsilon} \approx 5 + 0.01\,\boldsymbol{\epsilon}$. Essentially unchanged.

At $t = 100$, $\bar\alpha_{100} = \prod_{s=1}^{100}\alpha_s \approx 0.9$ (for the linear schedule the product decays smoothly). So $\mathbf{x}_{100} \approx \sqrt{0.9}\cdot 5 + \sqrt{0.1}\cdot \boldsymbol{\epsilon} \approx 4.74 + 0.32\,\boldsymbol{\epsilon}$. Still recognisably bright, with a noticeable wobble.

At $t = 500$, $\bar\alpha_{500} \approx 0.08$. Hence $\mathbf{x}_{500} \approx \sqrt{0.08}\cdot 5 + \sqrt{0.92}\cdot \boldsymbol{\epsilon} \approx 1.4 + 0.96\,\boldsymbol{\epsilon}$. The original signal has decayed substantially; the noise dominates. If you sampled this many times you would see a Gaussian centred near $1.4$ with spread close to one; the original value of $5$ is no longer reliably recoverable from a single look.

At $t = T = 1000$, $\bar\alpha_T \approx 4 \times 10^{-5}$, effectively zero. So $\mathbf{x}_T \approx 0 + 1 \cdot \boldsymbol{\epsilon} \sim \mathcal{N}(0, 1)$. The original information is gone; what remains is white noise.

The signal-to-noise ratio $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$ summarises this picture compactly. It starts at roughly $10^4$ for $t = 1$ (almost pure signal), passes through unity around $t \approx 260$ (noise and signal in balance), and decays to $\sim 4 \times 10^{-5}$ at $t = T$ (almost pure noise). Modern weighting schemes (min-SNR, EDM, v-prediction) all involve careful choices about how to weight the loss as a function of SNR, and these choices have proven to matter more than was once thought.
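These three numbers are easy to confirm from the schedule tensors defined earlier:

snr = alpha_bar / (1 - alpha_bar)
print(snr[0])                           # ~1e4: almost pure signal at t = 1
print(int((snr < 1).nonzero()[0]) + 1)  # first t with SNR < 1, around t ~ 260
print(snr[-1])                          # ~4e-5: almost pure noise at t = T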

The linear schedule used here is the original DDPM choice. In their 2021 paper Nichol and Dhariwal pointed out that, especially at lower resolutions ($64 \times 64$ and below), the linear schedule destroys information too quickly at the end of the chain: by the time you reach $t = T$ the SNR has been so small for so long that many of the late steps are wasted. Their fix is the cosine schedule:

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s}\cdot \frac{\pi}{2}\right),$$

with a small offset $s = 0.008$ to prevent $\beta_t$ from being too small near $t = 0$. The cosine schedule keeps the SNR high for longer in the middle of the chain and decays smoothly to zero only at the very end. On CIFAR-10 and ImageNet $64 \times 64$ this single change produces measurable improvements. It is now a common default in modern diffusion implementations.
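A sketch of the cosine schedule in the same style as the earlier snippets; the clip of $\beta_t$ at $0.999$ follows Nichol and Dhariwal's implementation:

import math

def cosine_alpha_bar(T, s=0.008):
    t = torch.arange(T + 1)
    f = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f / f[0]                                     # \bar\alpha_t = f(t) / f(0)

abar = cosine_alpha_bar(1000)
beta_cosine = torch.clamp(1 - abar[1:] / abar[:-1], max=0.999)   # recover per-step betas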

Latent diffusion (Stable Diffusion)

Operating in pixel space is wasteful. Most of the bits in a high-resolution image carry imperceptible high-frequency detail (sensor noise, fine texture, JPEG artefacts) that no human would notice were it perturbed. Yet the diffusion model must learn to model all of it, because every pixel contributes to the loss equally. Worse, the compute scales with the number of pixels, that is, with the square of the side length: doubling the side length of the image quadruples the cost per step. A model that handles $512 \times 512$ in pixel space will choke on $1024 \times 1024$.

Rombach et al. (2022) had the right idea: do the diffusion in a compressed latent space, not in pixels. Train a VAE-style autoencoder once, on a separate objective, that compresses a $512 \times 512 \times 3$ image into a $64 \times 64 \times 4$ latent: an 8-fold reduction along each spatial axis and a roughly $48 \times$ reduction in tensor volume. Then train a diffusion model in the latent space: $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, y)$, where $\mathbf{z}_t$ is a noised latent. To sample, run the latent diffusion process from noise, then decode the final latent through the autoencoder back to pixels.
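Schematically, with a hypothetical decoder standing in for the frozen autoencoder's decoder half and the sample function sketched earlier running in latent space, generation looks like:

z = sample(latent_network, shape=(1, 4, 64, 64))   # reverse diffusion over latents
image = decoder(z)                                 # decode 64x64x4 latent to 512x512x3 pixels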

The autoencoder used in Stable Diffusion is not the basic VAE of §14.4. It is a KL-regularised continuous VAE with a carefully tuned reconstruction loss combining pixel-MSE, an LPIPS perceptual loss (deep features from a pretrained VGG), and a small adversarial term in the style of VQGAN (Esser et al., 2021). The adversarial component pushes the decoder to produce sharp textures rather than blurry MSE-optimal averages; the perceptual loss prevents the adversary from fixating on irrelevant high-frequency detail; the KL regularisation keeps the latent space well-distributed (close to a standard Gaussian) so that the diffusion model has an easy target distribution. The whole autoencoder is trained once and frozen.

The $48\times$ compression translates directly into compute savings. A pixel-space DDPM that needs eight A100 GPUs running for several weeks to converge becomes a latent-space model that trains in days on a much smaller cluster. Sampling becomes correspondingly cheap: a $1024 \times 1024$ generation with 50 DDIM steps takes a couple of seconds on a consumer GPU rather than a couple of minutes. This compute reduction was the unlock that made Stable Diffusion's open-weights release in mid-2022 the cultural event it became.

The architecture of the latent-space U-Net is otherwise standard: residual blocks, group normalisation, sinusoidal time embeddings, and self-attention at lower resolutions. The text conditioning enters via cross-attention: each Transformer-style block in the U-Net has self-attention over latent tokens and cross-attention over the text-token embeddings produced by a frozen CLIP encoder. The cross-attention layer is where the prompt actually steers the generation; everything else is the same diffusion machinery.

Subsequent latent-diffusion systems (SDXL, DiT-based models, Stable Diffusion 3 with its rectified-flow Transformer) refine this template in many directions but do not change the basic structure: a frozen autoencoder, a diffusion model in latent space, text conditioning via cross-attention.

Classifier-free guidance

One more piece is needed before a diffusion model becomes a prompt-following generator. A naively conditioned diffusion model trained with $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$ produces samples that are roughly faithful to the conditioning $y$ but not crisply so. The samples drift; the prompt's specifics get smoothed out by the denoising chain's stochasticity. Classifier-free guidance (Ho & Salimans, 2022) is the trick that fixes this.

The mechanism is simple. During training, drop the conditioning $y$ with some probability, typically 10 to 20 per cent of the time. The same network thus learns to produce both a conditional noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$ and an unconditional one $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$, sharing all parameters. At inference, mix them:

$$\hat{\boldsymbol{\epsilon}} = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset),$$

with guidance scale $w \geq 0$. At $w = 0$ this is just the conditional prediction; at $w > 0$ we extrapolate away from the unconditional, amplifying whatever the prompt contributes. Stable Diffusion typically uses $w \approx 6.5$ (commonly reported as guidance scale $7.5$ in the alternate convention).
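A minimal sketch of the guided prediction, assuming the network accepts a conditioning argument and that a null_cond token stands for the empty conditioning used during training dropout:

def guided_eps(network, x, t, y, w):
    eps_cond = network(x, t, y)                     # conditional prediction
    eps_uncond = network(x, t, null_cond)           # unconditional prediction
    return (1 + w) * eps_cond - w * eps_uncond      # extrapolate away from unconditional

In the alternate convention this same mix is written eps_uncond + s * (eps_cond - eps_uncond) with $s = 1 + w$.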

The Bayesian justification: by Bayes' rule, the conditional score is $\nabla_{\mathbf{x}}\log p(\mathbf{x} \mid y) = \nabla_{\mathbf{x}}\log p(\mathbf{x}) + \nabla_{\mathbf{x}}\log p(y \mid \mathbf{x})$. Amplifying the second term by a factor $s = 1 + w$ corresponds to sampling from a sharpened posterior $\propto p(\mathbf{x} \mid y)^s p(\mathbf{x})^{1-s}$, which concentrates mass on configurations where the prompt likelihood is high. Empirically, larger $w$ produces more prompt-faithful but less diverse samples, the eternal coverage-versus-quality trade-off. Section 14.10 develops the full derivation and discusses negative prompts (where the unconditional is replaced by a deliberately bad conditional).

Where diffusion is used in 2026

Six years on from Ho et al., diffusion has reached well beyond its origin in image generation.

  • Image generation. Stable Diffusion, DALL-E 3, Midjourney, Imagen, Adobe Firefly. All latent diffusion with cross-attention text conditioning. Resolutions of $1024 \times 1024$ and above are routine; $4096 \times 4096$ is at the frontier with multi-stage cascades.
  • Video. Sora 2 (OpenAI, September 2025, with synchronised audio), Veo 3 (Google DeepMind, native audio), Luma, Runway Gen-4, Kling 2, and Pika 2. Latent diffusion in spatio-temporal latent spaces, with patch-based Diffusion Transformers (DiT) replacing U-Nets. Sora's open-source descendants (HunyuanVideo, Wan) have brought this to consumer hardware in 2025–26.
  • 3D structure. AlphaFold 3 (Abramson et al., 2024) replaced AlphaFold 2's structure module with a diffusion module, generalising from proteins to ligands, nucleic acids, and ions. RFDiffusion designs new proteins by reverse diffusion in backbone-coordinate space.
  • Audio. AudioLDM, Stable Audio, MusicGen-Diffusion. Diffusion in latent codec spaces (Encodec, SoundStream). Text-to-speech increasingly uses diffusion (e.g. NaturalSpeech 3).
  • Robotics. Diffusion policies (Chi et al., 2023) treat action sequences as the data and learn to denoise them conditioned on observations. On real robot-arm manipulation tasks, closed-loop diffusion policies have surpassed transformer-based policies in many benchmarks.
  • Materials, drug discovery, weather. GNoME and equivariant diffusion for crystal structure prediction; GeoDiff and torsional diffusion for conformer generation; GraphCast and diffusion-based weather models for ensemble forecasting.

The pattern is consistent: any domain where you can define a meaningful Gaussian noising process, which is to say, any domain with continuous data and a well-behaved metric, admits a diffusion model. The same code, sometimes the same network, ports from images to video to molecular structure with surprisingly minor modifications.

What you should take away

  1. The forward process is a fixed, hand-designed Markov chain that gradually adds Gaussian noise. The closed-form marginal $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\boldsymbol{\epsilon}$ lets us jump to any noise level in one step and is the algebraic miracle that makes training fast.
  2. The reverse process is learned, but its functional form is fixed by the structure of the forward process. The true reverse posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is Gaussian with closed-form mean and variance, so the network only has to learn the mean (equivalently, the noise).
  3. Training is mean-squared error on noise prediction. Sample a clean image, sample a timestep, sample a noise vector, add the noise, ask the network to predict it, take MSE. There are no minimax games, no partition functions, no posterior collapse; the stability of this objective is the chief reason diffusion eclipsed GANs and VAEs.
  4. Predicting noise is the same as estimating the score of the noisy distribution, up to a $t$-dependent constant. Diffusion is score matching dressed up in different notation, and the two perspectives connect via the SDE framework of Song et al. (2021).
  5. Latent diffusion and classifier-free guidance are what made diffusion practical. Latent diffusion gives an order-of-magnitude compute reduction by operating in a compressed autoencoder space; classifier-free guidance gives crisp prompt-following at the cost of one extra forward pass per step. Stable Diffusion, DALL-E 3, Midjourney, Sora and AlphaFold 3 all rely on both.

A closing caveat: rectified flow / flow matching has displaced DDPM as the dominant training objective for new image and video models in 2025–26 (Stable Diffusion 3, Flux, Veo 3 and Sora 2 all use flow matching rather than DDPM).
