14.11 DDIM

The denoising diffusion probabilistic model of §14.9 produces high-quality samples but at a steep computational price. To turn pure noise into a single image the network must be evaluated hundreds, sometimes a full thousand, times. Each call passes the current noisy image through a U-Net that may contain a billion or more parameters. On a desktop GPU this turns what feels conceptually like "generate one picture" into something closer to "render a short film". For a generative model that hopes to live inside a chat interface, a phone, or a real-time creative tool, that latency is fatal.

Denoising diffusion implicit models, introduced by Song, Meng and Ermon in 2021, address the cost head-on. They keep the forward noising process untouched and re-derive the reverse process so that the same trained network can be sampled in fifty steps, sometimes ten, instead of a thousand, with image quality that is essentially indistinguishable to the human eye. No retraining is required. A model trained as a DDPM can be sampled as a DDIM the same afternoon. This is why every production diffusion system you are likely to use, Stable Diffusion, Midjourney, DALL-E variants, video diffusers, relies on DDIM or one of its descendants rather than on the original DDPM sampling loop.

Classifier-free guidance (§14.10) taught us how to steer a diffusion sampler without an external classifier; DDIM teaches us how to accelerate one without retraining. DDPM defines what the model knows, classifier-free guidance defines what we ask it to draw, and DDIM defines how patiently we are willing to wait for the answer. All three knobs are independent: you can combine any guidance scale with any DDIM step count on top of any DDPM-trained checkpoint. That flexibility is exactly what turned diffusion from a research curiosity into a deployable technology in roughly eighteen months.

Symbols Used Here

$\mathbf{x}_t$: noisy image at diffusion step $t$
$\boldsymbol{\epsilon}_\theta$: the trained noise-prediction network
$\hat{\mathbf{x}}_0$: the network's implicit estimate of the clean image
$\bar\alpha_t$: cumulative signal-retention coefficient at step $t$
$\eta$: stochasticity dial; $\eta = 0$ is deterministic, $\eta = 1$ recovers DDPM

The DDIM update

Given a noisy state $\mathbf{x}_t$, the network produces a single prediction, $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, of the noise that was added on the way in. From this prediction we can immediately estimate the original clean image. The forward process tells us $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$; rearranging and substituting the network's noise estimate gives

$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\hat{\boldsymbol{\epsilon}}}{\sqrt{\bar\alpha_t}}.$$

This is the same $\hat{\mathbf{x}}_0$ that lives implicitly inside DDPM: the network always thinks it knows what the clean image looks like; we are simply making that estimate explicit. The DDIM update then rebuilds a less-noisy image at the next timestep by mixing $\hat{\mathbf{x}}_0$ with the predicted noise, scaled to match the marginal at $t-1$:

$$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\hat{\boldsymbol{\epsilon}} + \sigma_t\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$

The first term places us at the predicted clean image scaled to the next noise level. The second term re-injects the predicted noise direction at exactly the magnitude the marginal $q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)$ requires. The third term is fresh noise of variance $\sigma_t^2$, controlled by the stochasticity dial. Setting $\sigma_t = \eta\sqrt{\tilde\beta_t}$ with $\eta\in[0,1]$ interpolates between the two extremes that matter: $\eta=0$ removes the random term and gives a deterministic sampler in which the same starting noise $\mathbf{x}_T$ always produces the same image, while $\eta=1$ recovers the original DDPM update exactly. In the deterministic case the formula simplifies to

$$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\hat{\boldsymbol{\epsilon}}.$$
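The update above translates directly into code. The sketch below is a hypothetical helper, not taken from any particular library; it implements the general $\eta$-parameterised step, with $\sigma_t = \eta\sqrt{\tilde\beta_t}$ computed from the cumulative coefficients:

```python
import numpy as np

def ddim_step(x_t, eps_hat, abar_t, abar_prev, eta=0.0, rng=None):
    """One DDIM update from the current step to the previous one.

    x_t       : current noisy sample
    eps_hat   : noise prediction epsilon_theta(x_t, t)
    abar_t    : cumulative coefficient alpha_bar at the current step
    abar_prev : alpha_bar at the previous (possibly much earlier) step
    eta       : stochasticity dial; 0 = deterministic, 1 = DDPM-like
    """
    # Explicit clean-image estimate implied by the noise prediction.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    # sigma_t = eta * sqrt(beta_tilde_t); exactly zero when eta = 0.
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) \
                * np.sqrt(1.0 - abar_t / abar_prev)
    rng = rng or np.random.default_rng()
    # Predicted image + re-injected noise direction + (optional) fresh noise.
    return (np.sqrt(abar_prev) * x0_hat
            + np.sqrt(1.0 - abar_prev - sigma**2) * eps_hat
            + sigma * rng.standard_normal(x_t.shape))
```

With `eta=0.0` the random term vanishes and the function reduces to the deterministic update above; with `eta=1.0` it matches the DDPM posterior variance.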

The crucial theoretical fact is that for any choice of $\sigma_t$ in the allowed range, the marginal distribution of $\mathbf{x}_t$ given $\mathbf{x}_0$ is identical to the DDPM marginal. The training objective only ever cared about those marginals, so the network we trained as a DDPM is also a valid DDIM. Sampling becomes a hyperparameter choice, not a model choice. This is a useful separation of concerns: the expensive, multi-week training run produces an artefact that does not commit to any one sampling strategy, and individual users are free to pick the deterministic regime, the stochastic regime, or anything in between, at inference time. Few results in modern deep learning give the practitioner this much freedom for free.

Why fewer steps work

The deterministic DDIM update is, in disguise, a discretisation of an ordinary differential equation. The forward noising process is a Markov chain in DDPM, but DDIM reframes it as a non-Markovian process whose marginals match, a process that admits a deterministic flow from data to noise and back. Once you are integrating an ODE, the question "how many steps?" stops being about Markov-chain mixing and becomes the familiar question of numerical analysis: how coarsely can I discretise this trajectory before the integration error becomes visible?

The answer turns out to be: surprisingly coarsely. Because the forward marginal $q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar\alpha_t}\,\mathbf{x}_0,(1-\bar\alpha_t)\mathbf{I})$ is defined for every $t$, we are free to choose any subset of timesteps and apply the DDIM update on that subset. Pick a strictly increasing schedule $\{\tau_1 < \tau_2 < \cdots < \tau_S\}$, typically fifty values evenly spaced between 1 and $T=1000$, and run the update from $\tau_S$ down to $\tau_1$. Each step now jumps roughly twenty timesteps of the original schedule rather than a single one.
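Constructing the subsampled schedule is a one-liner. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def ddim_timesteps(T=1000, S=50):
    # Evenly spaced subset tau_1 < ... < tau_S of the original schedule,
    # e.g. 20, 40, ..., 1000 for T = 1000 and S = 50.
    step = T // S
    return np.arange(step, T + 1, step)

taus = ddim_timesteps()   # sampling then walks this array in reverse
```

Other spacings (quadratic, log) are sometimes used to spend more steps at low noise levels, where fine detail is resolved; the sampler is agnostic to the choice.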

The trajectory through image space stays close to the data manifold throughout. In the deterministic regime, every intermediate $\hat{\mathbf{x}}_0$ is the network's best guess of the final image; at high noise levels this guess is blurry and generic, at low noise levels it is sharp and specific, and the path traced by the partially-denoised states curves smoothly between the two. Skipping ahead in the schedule simply takes a longer chord of the same curve. The U-Net was trained to be a good noise predictor at every $t$, so it remains a competent predictor at the coarse subset.

There is, however, a quality–speed trade-off, and it is informative. With $S=1000$ DDIM and DDPM produce nearly identical samples. With $S=100$ the difference is invisible. With $S=50$, the standard production setting for many years, quality holds up well on natural images. Below about ten steps, quality from naive DDIM begins to degrade visibly, and you need a higher-order ODE solver (see below) to keep the trajectory accurate. Stochastic DDIM with $\eta>0$ degrades faster as the step count shrinks; this is one reason the deterministic regime dominates in practice.

A pleasant side benefit: deterministic DDIM gives you a meaningful latent space. The map $\mathbf{x}_T \mapsto \mathbf{x}_0$ becomes a bijection (modulo numerical error). Interpolating between two starting noise tensors and running DDIM on the interpolated path produces a smooth sequence of images, the kind of latent-space cinematography that GANs were once celebrated for and that stochastic DDPM cannot offer. The standard recipe is spherical linear interpolation, slerp, on the unit hypersphere where Gaussian noise tensors approximately live; following the resulting path through the deterministic sampler yields the smooth morphs familiar from demos of Stable Diffusion. Deterministic DDIM also enables image editing by inversion: run the deterministic update in reverse to lift an existing image back to its noise representation, perturb that representation, then run forward again. The clean closed-form invertibility is impossible under stochastic sampling.
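The slerp recipe is short enough to show in full. A sketch, assuming nothing beyond numpy:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two noise tensors.

    High-dimensional Gaussian noise concentrates near a sphere of radius
    sqrt(d), so interpolating along the great circle keeps intermediate
    points at a plausible noise magnitude, unlike straight linear
    interpolation, which passes too close to the origin.
    """
    z0f, z1f = z0.ravel(), z1.ravel()
    cos_omega = np.dot(z0f, z1f) / (np.linalg.norm(z0f) * np.linalg.norm(z1f))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    so = np.sin(omega)
    if so < 1e-8:                    # nearly parallel: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / so
```

Feeding `slerp(z0, z1, t)` for a sweep of `t` values into the deterministic sampler produces the smooth image morphs described above.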

Worked example

Take a model trained on the standard $T=1000$ cosine schedule. To sample with fifty steps we choose the subset $\tau \in \{1000, 980, 960, \ldots, 40, 20\}$, fifty timesteps spaced by twenty.

Start with $\mathbf{x}_{1000}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. At step one the network sees $(\mathbf{x}_{1000}, t=1000)$ and predicts $\hat{\boldsymbol{\epsilon}}$. Compute $\hat{\mathbf{x}}_0 = (\mathbf{x}_{1000} - \sqrt{1-\bar\alpha_{1000}}\,\hat{\boldsymbol{\epsilon}})/\sqrt{\bar\alpha_{1000}}$; since $\bar\alpha_{1000}\approx 0$ at this stage, $\hat{\mathbf{x}}_0$ is a noisy, blurry sketch. Then jump to $\mathbf{x}_{980} = \sqrt{\bar\alpha_{980}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{980}}\,\hat{\boldsymbol{\epsilon}}$.

Repeat. At step twenty-five we are at $t=500$; $\hat{\mathbf{x}}_0$ now resembles a recognisable but soft scene. By step forty we are at $t=200$ and the predictions are sharp; the residual updates are small refinements rather than wholesale guesses. The final step lands at $\mathbf{x}_0$, the generated image.
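The whole fifty-step loop fits in a few lines. The sketch below uses a toy stand-in for the U-Net, a predictor that behaves as if the clean image were all zeros, since the point here is the control flow, not the network; `abar_cosine` is an assumed implementation of the cosine schedule, and both names are hypothetical:

```python
import numpy as np

def abar_cosine(t, T=1000, s=0.008):
    # Cumulative cosine schedule (Nichol & Dhariwal, 2021),
    # normalised so that alpha_bar(0) = 1.
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def ddim_sample(eps_model, shape, T=1000, S=50, seed=0):
    """Deterministic (eta = 0) DDIM sampling over an S-step subset."""
    rng = np.random.default_rng(seed)
    taus = np.arange(T // S, T + 1, T // S)[::-1]   # 1000, 980, ..., 20
    x = rng.standard_normal(shape)                  # x_T ~ N(0, I)
    for i, t in enumerate(taus):
        abar_t = abar_cosine(t)
        abar_prev = abar_cosine(taus[i + 1]) if i + 1 < len(taus) else 1.0
        eps_hat = eps_model(x, t)
        # Explicit clean-image estimate, then the deterministic update.
        x0_hat = (x - np.sqrt(1 - abar_t) * eps_hat) / np.sqrt(abar_t)
        x = np.sqrt(abar_prev) * x0_hat + np.sqrt(1 - abar_prev) * eps_hat
    return x

# Toy predictor: if the clean image were all zeros, the implied noise
# is just the rescaled current state.
toy_eps = lambda x, t: x / np.sqrt(1 - abar_cosine(t))
sample = ddim_sample(toy_eps, shape=(8, 8))
```

With a real trained `eps_model` this loop is, up to bookkeeping, the production sampling path; the toy predictor simply makes the code self-contained.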

Total cost: fifty U-Net evaluations, against a thousand for the equivalent DDPM run, a twenty-fold speed-up. On an A100, a 512×512 pixel-space diffusion sample drops from roughly twenty seconds to one. Combine with latent diffusion (§14.12) and you reach the sub-second per-image budgets that interactive applications demand. The same code path supports image-to-image and inpainting workflows by initialising at an intermediate timestep $\tau_S < T$ rather than at pure noise, then running the remaining DDIM steps; this is how creative tools accept a rough sketch and turn it into a finished render in a fraction of a second.

Other fast samplers

DDIM was first; it is no longer the floor. Treating sampling as ODE integration invites the full toolkit of numerical analysis. DPM-Solver (Lu et al., 2022) uses a second- or third-order semi-linear solver tailored to the diffusion ODE and produces sharp samples in ten to twenty steps. UniPC (Zhao et al., 2023) is a unified predictor–corrector framework that pushes the floor a little lower again. Karras et al. (2022) advocate a second-order Heun method on a re-parameterised noise schedule and obtain state-of-the-art FID with as few as thirty-six function evaluations. PLMS, the linear multi-step method that powered early Stable Diffusion releases, sits in the same family.

All of these samplers share the DDIM lineage: they assume a pre-trained noise predictor and exploit the deterministic ODE structure that DDIM exposed. None of them require retraining. Choosing a sampler has become a deployment decision, similar to choosing a quantisation scheme for an LLM, made independently of the model itself. The user-facing UIs of Stable Diffusion and its forks now expose a sampler dropdown, Euler, Heun, DPM++ 2M, UniPC, DDIM, and most users never realise they are picking between competing numerical integrators of the same underlying ODE. A more recent line of work, distillation-based samplers such as progressive distillation and consistency models, compresses the trajectory into one or two steps by training a new student network; that route trades the freedom of DDIM-style hyperparameter choice for raw speed and is the engine behind the real-time diffusion demos that began appearing in 2024.

What you should take away

  1. DDIM keeps the forward noising process unchanged and rewrites the reverse step so that the same trained DDPM can be sampled deterministically.
  2. The single dial $\eta$ slides smoothly from deterministic ($\eta=0$) to fully stochastic DDPM ($\eta=1$); the marginals are preserved throughout.
  3. Because the forward marginal is defined at every timestep, you are free to subsample the schedule, typically to fifty steps, and obtain a roughly twenty-fold speed-up at near-identical quality.
  4. The deterministic regime exposes a smooth latent-to-image bijection, enabling meaningful interpolation between starting noise tensors.
  5. DDIM is the gateway to the modern fast-sampler family, DPM-Solver, UniPC, Heun, PLMS, that powers virtually every production diffusion system in use today.

