14.10 Classifier-free guidance
A diffusion model trained as in §14.9 will happily generate plausible images, but plausibility on its own is not what users of Stable Diffusion, DALL-E or Midjourney are after. They want an image of this specific thing, "a watercolour of a heron standing on a jetty at dawn", and they want the generator to lean hard into that description rather than producing some loosely related average. The mechanism that turns a merely conditional diffusion model into something that actually pays attention to its prompt is called classifier-free guidance, introduced by Ho and Salimans (2022). It is, without exaggeration, the trick that made modern text-to-image diffusion practical. The idea: train a single model that sometimes sees the conditioning and sometimes does not; at inference time, mix its conditional and unconditional predictions in a way that exaggerates the contribution of the prompt. The effect on samples is dramatic, the implementation is short, and the trade-off it exposes, prompt fidelity against sample diversity, is the dial every diffusion-based product quietly turns up or down for you.
Section 14.9 set up the denoising diffusion probabilistic model: forward noising, a noise predictor $\boldsymbol{\epsilon}_\theta$, and the reverse-time sampler. That whole machinery is unconditional. To make it conditional we feed in a side input $y$, typically a text embedding from a frozen language encoder such as CLIP or T5, and train $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$. This section covers what happens next: how to make sure the model actually uses $y$ rather than treating it as polite advice.
Bayesian motivation
Start from Bayes' rule applied to the score function, the gradient of the log density in $\mathbf{x}$. Differentiating $\log p(\mathbf{x} \mid y) = \log p(\mathbf{x}) + \log p(y \mid \mathbf{x}) - \log p(y)$ with respect to $\mathbf{x}$ kills the prompt-only term, leaving
$$\nabla_\mathbf{x} \log p(\mathbf{x} \mid y) = \nabla_\mathbf{x} \log p(\mathbf{x}) + \nabla_\mathbf{x} \log p(y \mid \mathbf{x}).$$
The conditional score decomposes cleanly into the unconditional score plus a "classifier gradient" telling us how to tweak $\mathbf{x}$ so that $y$ becomes more likely under it. Now imagine we are not satisfied with the conditional likelihood ratio at its native strength, we want to sharpen it. Introduce a guidance scale $s \ge 1$ and form the amplified score
$$\hat\nabla = \nabla_\mathbf{x} \log p(\mathbf{x}) + s \, \nabla_\mathbf{x} \log p(y \mid \mathbf{x}).$$
A line of algebra shows this equals
$$\hat\nabla = \nabla_\mathbf{x} \log p(\mathbf{x} \mid y) + (s - 1)\bigl(\nabla_\mathbf{x} \log p(\mathbf{x} \mid y) - \nabla_\mathbf{x} \log p(\mathbf{x})\bigr).$$
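To spell out that line of algebra, substitute the Bayes decomposition $\nabla_\mathbf{x} \log p(y \mid \mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x} \mid y) - \nabla_\mathbf{x} \log p(\mathbf{x})$ into the amplified score:
$$\hat\nabla = \nabla_\mathbf{x} \log p(\mathbf{x}) + s\bigl(\nabla_\mathbf{x} \log p(\mathbf{x} \mid y) - \nabla_\mathbf{x} \log p(\mathbf{x})\bigr) = s\,\nabla_\mathbf{x} \log p(\mathbf{x} \mid y) + (1-s)\,\nabla_\mathbf{x} \log p(\mathbf{x}),$$
and writing $s = 1 + (s-1)$ regroups this into the expression above.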
The amplified score is the conditional score plus a multiple $(s-1)$ of the difference between conditional and unconditional scores. That difference is a vector pointing from "what is plausible in general" toward "what is plausible given the prompt"; by adding extra copies of it we walk further in the prompt-favoured direction. Integrating, $\hat\nabla$ is the score of an unnormalised distribution proportional to $p(\mathbf{x} \mid y)^s p(\mathbf{x})^{1-s}$. At $s = 1$ the second factor drops out (its exponent is zero, so the factor equals 1) and we recover plain conditional sampling. As $s$ grows the posterior is sharpened: probability mass concentrates where the likelihood ratio $p(\mathbf{x} \mid y) / p(\mathbf{x})$ is large, the regions the prompt singles out as distinctively likely. This sharpening trades diversity for fidelity.
The earlier route to this idea, classifier guidance (Dhariwal and Nichol, 2021), required an actual classifier $p(y \mid \mathbf{x}_t)$ trained on noisy inputs, an awkward extra component. Training such a classifier is fiddly: it has to be robust at every noise level, its gradients need to be calibrated against the diffusion score, and it introduces a second source of optimisation error. Classifier-free guidance achieves the same effect without one. The "classifier-free" name is therefore literal: the classifier has been absorbed into the same network that does the denoising, recovered implicitly via the difference between conditional and unconditional predictions.
It is worth pausing on why the difference $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$ has any particular meaning. Up to a known sign and scaling, the noise predictor $\boldsymbol{\epsilon}_\theta$ is an estimator of the score $\nabla_\mathbf{x} \log p_t(\mathbf{x}_t)$ of the marginal density at timestep $t$. The conditional version estimates $\nabla_\mathbf{x} \log p_t(\mathbf{x}_t \mid y)$. Their difference therefore estimates $\nabla_\mathbf{x} \log p_t(y \mid \mathbf{x}_t)$, exactly the implicit classifier gradient that classifier guidance computes explicitly. Subtracting one head's prediction from the other extracts the prompt's contribution; scaling that contribution by $s$ and adding it back to the unconditional prediction is the guidance step.
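Assuming the standard variance-preserving forward process of §14.9, with $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$, the sign and scaling can be made explicit:
$$\nabla_\mathbf{x} \log p_t(\mathbf{x}_t \mid y) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)}{\sqrt{1-\bar\alpha_t}}, \qquad \nabla_\mathbf{x} \log p_t(y \mid \mathbf{x}_t) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)}{\sqrt{1-\bar\alpha_t}}.$$
The common factor $-1/\sqrt{1-\bar\alpha_t}$ cancels out of any ratio of guidance terms, which is why the arithmetic below can be done directly on noise predictions.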
The classifier-free implementation
The implementation trick is simple. Train a single noise predictor that handles both the conditional case and the unconditional case. During training, with probability $p_{\text{drop}}$, typically 10 to 20 per cent, replace the conditioning $y$ with a special null token $\emptyset$ before feeding it to the network. Otherwise pass the real $y$. The same parameters $\theta$ thus learn two functions: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$ when conditioning is present and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$ when it has been dropped. There is no second network and no separate classifier.
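As a concrete sketch, here is what one such training step might look like in PyTorch. The names `model`, `scheduler` and `null_token` are illustrative placeholders for the noise predictor, the forward-noising schedule of §14.9 and a learned null embedding; none of them are specified in this chapter.

```python
import torch

def training_step(model, x0, y, null_token, scheduler, p_drop=0.1):
    """One DDPM training step with conditioning dropout (sketch).

    model, scheduler and null_token are placeholders: a noise predictor,
    a forward-noising schedule, and a learned null embedding.
    Assumes y is a batch of conditioning embeddings of shape (b, d).
    """
    b = x0.shape[0]
    t = torch.randint(0, scheduler.num_timesteps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)          # forward process q(x_t | x_0)

    # With probability p_drop, replace the conditioning with the null token,
    # so the same network also learns the unconditional score.
    drop = torch.rand(b, device=x0.device) < p_drop
    y = torch.where(drop[:, None], null_token, y)

    eps_pred = model(x_t, t, y)                      # predict the added noise
    return torch.nn.functional.mse_loss(eps_pred, noise)
```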
At inference, rather than choose between the two, combine them. Adopting the Ho-Salimans convention with offset $w = s - 1 \ge 0$,
$$\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset) + s\bigl(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)\bigr) = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset).$$
The right-hand form is the one used in practice. It is a linear extrapolation: start at the unconditional prediction, draw the line through the conditional one, and step $1 + w$ of the way along that line. At $w = 0$ (equivalently $s = 1$) we get the plain conditional prediction, no guidance. At $w = 6.5$, the canonical default for Stable Diffusion ($s = 7.5$), we step seven and a half times along the conditional-minus-unconditional vector. The cost is two forward passes per denoising step rather than one; many practical samplers batch them so the wall-clock penalty is below 2x.
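A minimal PyTorch sketch of that guided prediction, assuming the same placeholder `model` and `null_token` as in the training sketch above. The two passes are batched together, which is how most samplers keep the wall-clock penalty below 2x.

```python
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, y, null_token, w=6.5):
    """Classifier-free-guided noise prediction (sketch).

    Runs the conditional and unconditional passes as one doubled batch,
    then extrapolates: (1 + w) * conditional - w * unconditional.
    """
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    y_in = torch.cat([y, null_token.expand_as(y)], dim=0)

    eps_cond, eps_uncond = model(x_in, t_in, y_in).chunk(2, dim=0)

    # Linear extrapolation from the unconditional through the conditional.
    return (1 + w) * eps_cond - w * eps_uncond
```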
A worked example fixes the arithmetic. Suppose at some timestep the conditional predictor outputs $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) = 0.30$ on some pixel and the unconditional predictor outputs $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset) = 0.10$. With $w = 6.5$ the guided prediction is $(1 + 6.5)(0.30) - 6.5(0.10) = 2.25 - 0.65 = 1.60$: the conditional-minus-unconditional difference of $0.20$ has been stretched by a factor of $7.5$, pushing the prediction from $0.30$ to $1.60$. Plug that into the reverse-time update and the sample drifts noticeably toward the prompt. Repeat across every pixel and every step and the cumulative effect is what users perceive as crisp prompt adherence. Substitute $w = 0$ and the formula collapses to $0.30$, the unguided conditional value.
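The same arithmetic as a two-line check, with the values from the example above:

```python
w = 6.5
eps_cond, eps_uncond = 0.30, 0.10

guided = (1 + w) * eps_cond - w * eps_uncond     # 2.25 - 0.65 = 1.60
unguided = (1 + 0) * eps_cond - 0 * eps_uncond   # collapses to 0.30
```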
The training-time dropout rate $p_{\text{drop}}$ deserves a remark. Set it too low and the unconditional head is starved of data: its predictions become noisy and the difference vector points in unreliable directions. Set it too high and the conditional head is starved instead: the model spends so much of its capacity on the prior that prompt fidelity suffers even before guidance is applied. The sweet spot of 10 to 20 per cent is the empirical compromise: enough unconditional examples to estimate $p(\mathbf{x})$ accurately, few enough that the model's primary task remains conditional generation. Production systems often use 10 per cent and rely on the sheer scale of training data to keep the unconditional head sharp.
Why it works
Empirically, classifier-free guidance produces dramatically sharper prompt adherence. Compare two grids of samples from the same model, one at $s = 1$ and one at $s = 7.5$, and the difference is unmistakable. The unguided grid is a blurrier, more eclectic mix of plausible images that nod toward the prompt; the guided grid hits the prompt much more squarely. Theoretically we are sampling, approximately, from the sharpened posterior $p(\mathbf{x} \mid y)^s p(\mathbf{x})^{1-s}$. This places mass on regions where the conditional likelihood ratio $p(\mathbf{x} \mid y) / p(\mathbf{x})$ is high, points that are distinctively typical under the prompt rather than generically typical. Equivalently, in score-function terms, we move further along the direction in which the prompt makes a difference relative to the prior.
There are two caveats. First, the equation $\hat\nabla = \nabla \log\bigl[p(\mathbf{x} \mid y)^s p(\mathbf{x})^{1-s}\bigr]$ is exact only when the noise predictors correspond to genuine score functions and the score arithmetic commutes with the discretised reverse process; in practice both assumptions hold approximately, which is why guidance behaves well over a useful range of $s$ but degrades at extreme values. Second, the sharpening is multiplicative in the likelihood ratio rather than additive: points where the prompt is mildly favoured become strongly favoured, and points where it is mildly disfavoured become heavily disfavoured. This concentrates the distribution on a smaller subset of the prior's support, which is exactly what we want when "this prompt" is a narrow target, and exactly what we should worry about when we still want sample variety.
A useful intuition: the unconditional model knows what images look like in general, and the conditional model knows what images look like given the prompt. The guided sampler is told to walk in whichever direction the conditional model disagrees with the unconditional one, and to walk further in that direction than either model alone would suggest. This works because the disagreement between the two heads is, by construction, almost entirely about the prompt. Anything they agree on (faces have two eyes, skies are above grounds, light casts shadows) cancels out of the difference and is contributed solely by the conditional starting point; anything they disagree on (this is a heron, not a stork; this is dawn, not noon; this is a watercolour, not a photograph) is amplified.
Trade-off
Higher $w$ buys prompt fidelity at the cost of diversity. At $w = 0$ samples drawn for the same prompt look meaningfully different from one another. At $w = 6.5$ they share more structure: the same pose, lighting, palette, composition, varying only in details. Push $w$ higher still and the cost compounds. Around $w = 10$ to $w = 15$ samples often start to look saturated, oversharpened, or visibly artefact-laden: limbs duplicate, textures crisp into chrome, shadows blacken. By $w = 20$ the model collapses onto a small handful of attractor images that recur across seeds. These are the failure modes of an overcooked posterior: when the exponent on $p(\mathbf{x} \mid y)$ is large enough, almost all probability mass concentrates on a tiny region whose internal variety the model cannot resolve.
The optimal $w$ depends on the model, the dataset, and the user's purpose. For Stable Diffusion 1.x and 2.x, $w \approx 6.5$ ($s \approx 7.5$) is the de facto default. SDXL prefers slightly lower values around $5$ to $7$. Users producing many candidates per prompt, say, generating thumbnails, sometimes drop $w$ to $3$ or $4$ to widen the diversity. Users producing one final hero image often raise it. Treat the default as a sensible starting point, not as a setting to leave untouched.
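In practice the scale is exposed as a single sampler argument. A minimal sketch using the Hugging Face diffusers library; the model identifier and defaults are that library's conventions, not part of this chapter, and note that diffusers' `guidance_scale` follows the $s$ convention, so $7.5$ here corresponds to $w = 6.5$.

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

prompt = "a watercolour of a heron standing on a jetty at dawn"
image_faithful = pipe(prompt, guidance_scale=7.5).images[0]  # default: high fidelity
image_varied = pipe(prompt, guidance_scale=3.0).images[0]    # lower: more diversity
```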
There are also dynamic strategies. Several practitioners and papers have shown that varying $w$ across timesteps helps: high guidance early in the reverse process (when coarse content is being decided) and lower guidance late (when fine textures are being settled) tends to give cleaner samples than a constant scale; a sketch of such a schedule follows below. Other variants (autoguidance, perpendicular guidance, rescaled guidance) patch specific failure modes such as colour shift or contrast clipping at high $s$. The basic two-pass arithmetic remains the same; only the way the conditional and unconditional vectors are combined changes.
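An illustrative linearly decaying schedule, following the early-strong, late-weak pattern described above; the endpoint values are assumptions for illustration, not a published recipe.

```python
def guidance_scale_at(t, T, w_early=9.0, w_late=2.0):
    """Linearly decaying guidance schedule (illustrative, not a published recipe).

    t counts down from T (pure noise, start of sampling) to 0 (finished sample):
    strong guidance while coarse content is decided, weaker for fine textures.
    """
    return w_late + (w_early - w_late) * (t / T)
```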
Where CFG is used
Classifier-free guidance is the universal default for modern text-to-image and text-to-video diffusion. Stable Diffusion (every version, including SDXL and SD3) uses it; DALL-E 2 and 3 use it; Imagen uses it; Midjourney uses it; eDiff-I uses it; the text-conditioned video models, Sora, Veo, Gen-3, Kling, all use it. It also generalises beyond text: depth-conditioned, edge-conditioned and pose-conditioned models (ControlNet variants) apply guidance over their conditioning channels in the same way. Audio models use the same machinery: AudioLDM is a guided latent diffusion model, and even MusicGen, an autoregressive model over EnCodec tokens rather than a diffusion model, applies classifier-free guidance to its text conditioning. The two-pass cost is universal across these systems and is one reason inference-time efficiency work, distillation, consistency models, latent diffusion, matters so much: every saved step is two saved forward passes.
Classifier-free guidance has even leaked outside diffusion. The same trick, train one model with conditioning sometimes dropped, extrapolate at inference using the difference between conditional and unconditional outputs, has been adapted to autoregressive language models for tasks such as instruction following and detoxification, where it appears under names such as classifier-free guidance for language models and, in a related form, negative prompting. The mechanism is the same: amplify the conditioning's influence beyond what the model would naturally apply. The fact that a technique originally derived for diffusion score functions transfers, with only cosmetic changes, to discrete autoregressive models is itself instructive: guidance is fundamentally about emphasising the delta a condition makes, and that delta is well-defined wherever a model can be queried with and without the condition.
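A sketch of the autoregressive analogue, with `model` a placeholder language model returning logits of shape (batch, seq, vocab); the extrapolation is identical to the diffusion case, applied to next-token logits instead of noise predictions.

```python
import torch

def cfg_next_token_logits(model, ids_with_context, ids_without_context, w=1.5):
    """Classifier-free guidance for an autoregressive LM (sketch).

    model is a placeholder returning logits of shape (batch, seq, vocab).
    The two input sequences are identical except that one includes the
    conditioning context (instruction, prompt) and the other drops it.
    """
    logits_cond = model(ids_with_context)[:, -1, :]      # conditioned on context
    logits_uncond = model(ids_without_context)[:, -1, :] # context dropped

    # Same linear extrapolation as the diffusion case.
    return (1 + w) * logits_cond - w * logits_uncond
```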
What you should take away
- Classifier-free guidance is the trick that makes diffusion models actually obey their prompts; without it, conditional diffusion is only loosely conditional.
- Train one network with conditioning dropped 10 to 20 per cent of the time; at inference combine the conditional and unconditional predictions as $\hat{\boldsymbol{\epsilon}} = (1 + w)\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$, with $w = s - 1$.
- The Bayesian reading is sampling from $p(\mathbf{x} \mid y)^s p(\mathbf{x})^{1-s}$, a sharpened posterior that concentrates mass where the prompt's likelihood ratio is high.
- The default for Stable Diffusion is $w = 6.5$ (i.e. $s = 7.5$); higher values trade diversity for fidelity, and beyond $w \approx 15$ samples become oversaturated and collapse onto a small set of attractors.
- The cost is two forward passes per denoising step, which is why distillation and consistency-model methods that compress this overhead are so actively pursued.