14.12 Latent diffusion

The diffusion machinery developed in §14.9 trains a denoiser that operates directly on pixels. For a $512\times 512$ colour image, that is $786{,}432$ scalar values per sample, and the U-Net must shovel all of them through every layer at every denoising step of the (typically) one-thousand-step schedule. The result is a magnificent generative model that almost nobody can afford to train. The original DDPM paper used eight V100 GPUs for a week to fit CIFAR-10 at $32\times 32$. Scaling pixel-space diffusion to the resolutions photographers and designers actually want ($512\times 512$, $1024\times 1024$, and beyond) pushes the compute bill into hundreds of A100-days, well past the budget of any individual researcher and most academic labs.

Rombach and colleagues (2022) noticed something that, in retrospect, ought to have been obvious. Most of the bits in a natural image carry imperceptible high-frequency detail: micro-textures, sensor noise, sub-pixel edges. Human perception, and the captions humans write, depend on the semantic content (composition, objects, lighting, style), which lives on a far smaller manifold. Why force the diffusion model to spend its capacity modelling the noise on a leaf when nobody can tell whether it is correct? Their solution, latent diffusion, separates perceptual compression from generative modelling. A pretrained autoencoder squeezes the image into a small latent tensor, and the diffusion model lives entirely inside that latent space. Decoding back to pixels happens exactly once, at the end. This is the architecture of Stable Diffusion, and the reason high-resolution image generation runs on a laptop rather than in a data centre.

This section assembles the components introduced earlier (DDPM in §14.9, classifier-free guidance in §14.10, DDIM in §14.11) and shows how they combine once the diffusion process moves into latent space.

Symbols used here

$\mathbf{x}$: pixel-space image, $H\times W\times 3$
$\mathbf{z}$: latent representation, $h\times w\times c$ with $h=H/f$, $w=W/f$
$\mathcal{E}$: encoder mapping pixels to latents
$\mathcal{D}$: decoder mapping latents to pixels
$f$: spatial downsampling factor (typically $8$)
$\boldsymbol{\epsilon}_\theta$: noise predictor (the U-Net)
$y$: conditioning input (text caption, class label, depth map)

The architecture

Latent diffusion is trained in two stages, with the second wholly dependent on the first.

Stage 1, autoencoder. Train a convolutional encoder–decoder pair $(\mathcal{E}, \mathcal{D})$ to compress images and reconstruct them faithfully. Stable Diffusion 1.x uses a KL-regularised continuous VAE: the encoder outputs the mean and log-variance of a Gaussian over latents, with a small KL penalty against $\mathcal{N}(\mathbf{0}, \mathbf{I})$ to keep the latent distribution well-behaved. Crucially, the training objective is not a vanilla VAE loss. The reconstruction term combines an $L_1$ pixel loss, an LPIPS perceptual loss (a distance measured in pretrained VGG feature space), and an adversarial term from a PatchGAN discriminator, borrowed wholesale from VQGAN (Esser et al. 2021). The KL weight is set very low, so the autoencoder is closer to a regularised AE than a true VAE. The result is sharp, high-fidelity reconstructions that retain enough perceptual structure to support downstream generation. With $f=8$, a $512\times 512\times 3$ image becomes a $64\times 64\times 4$ latent, a forty-eight-fold reduction in tensor size.
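To make the combined objective concrete, here is a minimal PyTorch sketch. The encoder, decoder, lpips, and discriminator modules are assumed to exist elsewhere, and the loss weights are illustrative placeholders rather than the published values.

```python
import torch
import torch.nn.functional as F

# Sketch of the Stage-1 objective: L1 + LPIPS + adversarial + small KL.
# `encoder`, `decoder`, `lpips`, and `discriminator` are assumed modules;
# the weights below are illustrative, not the published values.
def autoencoder_loss(x, encoder, decoder, lpips, discriminator,
                     kl_weight=1e-6, adv_weight=0.5):
    mean, logvar = encoder(x)                      # Gaussian posterior over latents
    z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
    x_rec = decoder(z)

    rec = F.l1_loss(x_rec, x)                      # L1 pixel loss
    perc = lpips(x_rec, x).mean()                  # LPIPS perceptual distance
    adv = -discriminator(x_rec).mean()             # PatchGAN generator term
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    return rec + perc + adv_weight * adv + kl_weight * kl
```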

Stage 2, latent diffusion. Freeze $(\mathcal{E}, \mathcal{D})$ and train a U-Net $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, y)$ to denoise corrupted latents. The forward process is the familiar Gaussian noising schedule, applied to $\mathbf{z}_0 = \mathcal{E}(\mathbf{x})$: $$\mathbf{z}_t = \sqrt{\bar\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0}, \mathbf{I}).$$ The training loss is the same simplified DDPM objective: $$\mathcal{L} = \mathbb{E}_{\mathbf{z}_0, t, \boldsymbol{\epsilon}}\bigl[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, y)\|^2\bigr].$$ Conditioning $y$ enters via cross-attention layers wired into every block of the U-Net.
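A single training step then looks roughly like the following sketch, where encoder is the frozen Stage-1 encoder, unet stands in for $\boldsymbol{\epsilon}_\theta$, and alpha_bar is a tensor holding the cumulative schedule $\bar\alpha_t$ from §14.9; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def training_step(x, y, encoder, unet, alpha_bar, T=1000):
    with torch.no_grad():
        z0 = encoder(x)                            # frozen perceptual compressor
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps   # forward noising, in latent space
    return F.mse_loss(unet(z_t, t, y), eps)        # simplified DDPM objective
```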

Generation. Sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ in latent space, apply iterative denoising (e.g. DDIM with fifty steps), and run a single decode pass $\mathbf{x}_0 = \mathcal{D}(\mathbf{z}_0)$ at the very end. The pixel-space decoder never sees noise; it only ever decodes a clean latent. That separation of duties is the whole trick.

A subtle but important detail: the latents are rescaled before training and unscaled before decoding. Empirically, the encoder's outputs have a standard deviation of roughly $5.5$ in Stable Diffusion 1.x, far from the unit variance the diffusion schedule assumes. Multiplying by a constant scaling factor (the scale_factor of $0.18215 \approx 1/5.5$) brings the latents to approximately unit variance, after which the noising schedule designed in §14.9 transfers across without modification. Skipping this step does not break training, but it shifts the effective signal-to-noise ratio of every timestep and silently degrades quality.
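In code, the convention amounts to two thin wrappers around the frozen autoencoder (a sketch; encoder and decoder stand in for $\mathcal{E}$ and $\mathcal{D}$):

```python
SCALE = 0.18215  # Stable Diffusion 1.x latent scaling factor

def to_latent(x, encoder):
    # Encode, then rescale so the latents are approximately unit variance,
    # matching what the §14.9 noise schedule assumes.
    return SCALE * encoder(x)

def from_latent(z, decoder):
    # Undo the scaling before the single decode back to pixels.
    return decoder(z / SCALE)
```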

Why latent

The compute saving is dramatic and easy to state. A convolutional layer's cost scales with the spatial resolution of its input. A U-Net operating on $512\times 512$ inputs handles $(512/64)^2 = 64\times$ more spatial positions than the same U-Net at $64\times 64$, and that factor multiplies through every step of the thousand-step diffusion process. Rombach's paper reports that pixel-space diffusion at $256\times 256$ took roughly $250$ V100-days to train; latent diffusion at the same effective resolution took about $5$ V100-days for comparable FID. That is the difference between a research project and a graduate-student weekend, and it is why Stable Diffusion 1 was released as an open checkpoint trainable on academic budgets.

There is, however, no free lunch: the autoencoder is a hard ceiling on quality. Whatever $\mathcal{D}$ cannot reconstruct, the diffusion model cannot generate, no matter how good the U-Net. Faces, fine text, and high-frequency repeating patterns are the usual victims. Stable Diffusion 1.4's notorious difficulty with hands and small text traces partly to autoencoder limits. Subsequent versions improved the autoencoder rather than the diffusion model: Stable Diffusion XL retrained the VAE with a larger batch size and EMA weight averaging, and Stable Diffusion 3 adopted a sixteen-channel latent (versus the original four) for a meaningful jump in fidelity. The lesson: in latent diffusion, the autoencoder does the perceptual work and the U-Net does the semantic work, and the system is bottlenecked by whichever is weaker.

A related design choice is the downsampling factor $f$. Rombach's ablations showed a sweet spot around $f=8$: smaller factors (less compression) keep more pixel-space pathology in the latents and waste compute on unimportant detail; larger factors throw away too much information and the autoencoder reconstructions degrade. The $f=8$, $c=4$ configuration of Stable Diffusion 1.x is a deliberate choice on this Pareto curve, not an accident.

It is worth comparing latent diffusion to two adjacent ideas. Cascaded diffusion (used by Imagen and DALL-E 2) trains a base diffusion model at low resolution and one or more pixel-space super-resolution diffusion models on top, also a form of multi-scale decomposition, but every stage is itself a full diffusion process and the super-resolution models still pay the pixel-space price. Progressive distillation (Salimans and Ho 2022) reduces the number of sampling steps but keeps the cost per step. Latent diffusion attacks the other axis, the cost per step, and composes cleanly with both. Modern systems combine all three: a latent-space diffusion model, distilled to a small number of steps, sometimes followed by a lightweight pixel refiner.

Text conditioning

Latent diffusion is generative, but to be useful it must be steerable. The mechanism is cross-attention from a frozen text encoder into the U-Net. In Stable Diffusion 1.x the encoder is the text tower of CLIP ViT-L/14, returning a sequence of $77$ token embeddings of dimension $768$. SD 2 uses OpenCLIP ViT-H/14; SDXL concatenates two CLIP encoders (ViT-L plus ViT-bigG) for a richer conditioning signal; Imagen, a closely related model from Google, used a frozen T5-XXL instead, arguing that a stronger language model matters more than a stronger image–text aligner.
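For reference, a sketch of how that $77\times 768$ conditioning sequence is obtained with the Hugging Face transformers library; the hub identifier is the standard one for CLIP ViT-L/14:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Pad/truncate to the fixed 77-token context used by SD 1.x.
tokens = tokenizer("a cat on a windowsill", padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    y = text_encoder(**tokens).last_hidden_state   # shape (1, 77, 768)
```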

Within each U-Net block, after the usual convolutional residual stack and self-attention over latent positions, a cross-attention layer takes its queries from the latent feature map and its keys and values from the text embeddings: $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$ where $Q$ is a linear projection of latent positions and $K, V$ are projections of the $77$ text tokens. This lets each spatial location decide, layer by layer, which words to attend to. The attention maps are not random: visualisations show that at intermediate timesteps the cross-attention for the word "cat" concentrates over pixels that will become a cat. This is the hook that ControlNet, prompt-to-prompt editing, and attention-based localisation methods latch onto.
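A minimal single-head sketch of such a layer in PyTorch; real U-Nets use multi-head attention and interleave a feed-forward sublayer, and the dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_latent, d_text, d=64):
        super().__init__()
        self.q = nn.Linear(d_latent, d, bias=False)   # queries from latent positions
        self.k = nn.Linear(d_text, d, bias=False)     # keys from text tokens
        self.v = nn.Linear(d_text, d, bias=False)     # values from text tokens
        self.out = nn.Linear(d, d_latent, bias=False)

    def forward(self, feats, text):
        # feats: (B, h*w, d_latent) flattened feature map; text: (B, 77, d_text)
        q, k, v = self.q(feats), self.k(text), self.v(text)
        att = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        return feats + self.out(att @ v)              # residual connection
```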

Classifier-free guidance, from §14.10, sits on top. During training, the text caption is dropped to the empty string with probability $0.1$, so the network learns both conditional and unconditional scores with a single set of weights. At inference, the guided score is $$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \emptyset) + w\bigl(\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \emptyset)\bigr).$$ A guidance weight $w$ between $5$ and $9$ is typical; higher values produce more prompt-faithful but less diverse, more saturated images.
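In code, guidance is just two U-Net calls and one linear combination (a sketch; y_emb and null_emb are the caption and empty-string embeddings):

```python
def guided_eps(unet, z_t, t, y_emb, null_emb, w=7.5):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by the guidance weight w.
    eps_uncond = unet(z_t, t, null_emb)
    eps_cond = unet(z_t, t, y_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)
```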

Inference

Generating a single image is a sequence of small operations rather than one large one. With DDIM sampling and CFG, generation proceeds as follows. Encode the prompt and the empty string through CLIP. Sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. For fifty timesteps, run two forward passes through the U-Net (conditional and unconditional), combine via the CFG formula, and apply the deterministic DDIM update. After the final step, run $\mathcal{D}(\mathbf{z}_0)$ once. Total: one hundred U-Net evaluations and a single decode. On an RTX 3090 in mid-2022 that took about five seconds at $512\times 512$; on contemporary hardware with FlashAttention, half-precision weights, and compiled kernels, the same recipe finishes in under a second.
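Putting the pieces together, the whole sampler fits in a dozen lines. This sketch reuses guided_eps from above, follows the deterministic ($\eta=0$) DDIM update of §14.11, and treats unet and decoder as placeholders:

```python
import torch

@torch.no_grad()
def sample(unet, decoder, y_emb, null_emb, alpha_bar,
           steps=50, w=7.5, shape=(1, 4, 64, 64)):
    z = torch.randn(shape)                                # z_T in latent space
    ts = torch.linspace(len(alpha_bar) - 1, 0, steps).long()
    for i, t in enumerate(ts):
        eps = guided_eps(unet, z, t, y_emb, null_emb, w)  # two U-Net passes
        ab_t = alpha_bar[t]
        z0_pred = (z - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        z = ab_prev.sqrt() * z0_pred + (1 - ab_prev).sqrt() * eps  # DDIM step
    return decoder(z / 0.18215)                           # the single decode
```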

Memory, not just compute, is the practical bottleneck. A $64\times 64\times 4$ latent and its activations fit comfortably in the eight gigabytes of VRAM available on a mid-range consumer card; the same forward pass at $512\times 512\times 3$ would not. This is why the original Stable Diffusion release was greeted as a paradigm shift in who could run image generation, not merely how fast it ran. Within a few months, optimised forks were running it on six gigabytes, then four, then on Apple Silicon laptops via Metal. None of those engineering wins would have mattered if the model had to denoise pixels.

The breakdown of cost is illuminating. Each U-Net evaluation at $f=8$ requires roughly $10^{12}$ FLOPs, dominated by the cross-attention and convolution layers operating on $64\times 64$ feature maps. Performing the same computation in pixel space would multiply that figure by roughly the factor of sixty-four in spatial positions that motivated the design. The decoder, by contrast, is run once and is small relative to the diffusion U-Net. The text encoder is also run once. Almost all the inference budget, typically over ninety per cent, is spent inside the U-Net, which is why every serious deployment optimisation (xFormers, FlashAttention, INT8 quantisation, distillation to four-step or even one-step samplers like LCM and Turbo) targets the U-Net specifically.
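A back-of-envelope check of that ninety-per-cent claim, using the $10^{12}$ U-Net figure from the text; the decoder and text-encoder costs are hypothetical round numbers for illustration only:

```python
unet_flops_per_eval = 1e12
steps, passes = 50, 2                               # DDIM steps x (cond + uncond)
unet_total = steps * passes * unet_flops_per_eval   # 1e14 FLOPs in the U-Net
decoder_flops = 2e12                                # hypothetical: one decode pass
text_flops = 1e11                                   # hypothetical: one CLIP pass
share = unet_total / (unet_total + decoder_flops + text_flops)
print(f"U-Net share of inference FLOPs: {share:.1%}")   # ~98%
```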

Where it is used

Stable Diffusion 1.4, 1.5, 2.0, 2.1, XL, 3, and 3.5 (October 2024); Black Forest Labs' Flux.1 family (pro/dev/schnell, August 2024); and Imagen 3 and Imagen 4 are all latent or near-latent diffusion models, differing primarily in the autoencoder, the text encoder, the backbone (a U-Net, or in the case of SD 3 a Diffusion Transformer), and the training data. The architecture has also influenced almost every modern open image generator: PixArt-$\alpha$, Würstchen, and Playground v2.5 all build on the latent-space recipe. DALL-E 3, while closed, is widely understood to follow the same pattern, and Adobe Firefly, Midjourney v5 onwards, and several commercial APIs are reported to be variants of it. Beyond images, the recipe generalises: latent video diffusion (Stable Video Diffusion, Sora's reported VAE-then-DiT structure), latent audio diffusion (Stable Audio uses an audio VAE plus a latent diffusion transformer), and even 3D asset generation (Shap-E, latent NeRFs) all pay the perceptual-compression-plus-latent-diffusion bill. The pattern is so dominant that it is easier to list the post-2022 generative image systems that do not operate in latent space than those that do.

What you should take away

  1. Latent diffusion separates perceptual compression from generative modelling. A pretrained autoencoder shrinks the image; the diffusion U-Net does its work in the small latent space; the decoder restores pixels exactly once at the end.
  2. Compute scales with spatial resolution squared. Compressing $512\times 512$ pixels to a $64\times 64$ latent yields a forty-eight-fold reduction in tensor size and roughly a fifty-fold reduction in training cost.
  3. The autoencoder caps quality. Stable Diffusion 1.x uses a KL-regularised continuous VAE trained with a VQGAN-style adversarial perceptual loss, not a vanilla VAE and not VQ-VAE. Improving the VAE has driven much of the quality gain across SD versions.
  4. Text conditioning is cross-attention. Frozen CLIP or T5 embeddings flow into every U-Net block via cross-attention; classifier-free guidance steers strength at inference.
  5. The pattern generalises. Latent video, latent audio, and latent 3D all use the same template. If you encounter a new modality and want to do diffusion at scale, the first question is which autoencoder you trust to compress it.
