14.14 Text-to-image and text-to-video
The early-to-mid 2020s saw a rapid succession of text-to-image and text-to-video systems. We sketch the main architectures and highlight the key design choices.
DALL·E 2
Ramesh et al. (2022). Two-stage architecture: a prior maps a CLIP text embedding to a CLIP image embedding (with a diffusion model), and a decoder maps the image embedding to pixels (with another diffusion model). The CLIP space provides a semantically rich intermediate representation. The two-stage design lets each stage specialise.
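A minimal sketch of the two-stage layout follows. Dimensions, module names, and the network bodies are illustrative assumptions, not the released architecture; the point is the data flow: text embedding → prior → image embedding → decoder → pixels.

```python
import torch
import torch.nn as nn

EMB = 512  # CLIP embedding width (assumed for illustration)

class DiffusionPrior(nn.Module):
    """Denoises a noisy CLIP *image* embedding, conditioned on the text embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * EMB + 1, 1024), nn.GELU(),
                                 nn.Linear(1024, EMB))
    def forward(self, noisy_img_emb, text_emb, t):
        # t: diffusion timestep, appended as a scalar feature
        h = torch.cat([noisy_img_emb, text_emb, t[:, None]], dim=-1)
        return self.net(h)                      # predicted clean image embedding

class Decoder(nn.Module):
    """Denoises noisy pixels, conditioned on the (predicted) CLIP image embedding."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(EMB, 64 * 64 * 3)  # broadcast embedding to image shape
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)
    def forward(self, noisy_pixels, img_emb):
        c = self.cond(img_emb).view(-1, 3, 64, 64)
        return self.net(torch.cat([noisy_pixels, c], dim=1))  # predicted clean pixels
```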
Imagen
Saharia et al. (2022). Three diffusion models in a cascade: a $64\times 64$ base model, a $64 \to 256$ super-resolution model, and a $256 \to 1024$ super-resolution model. The text encoder is a frozen T5-XXL language model, far larger than CLIP's text encoder. Imagen demonstrated that scaling the text encoder improves text-image alignment more than scaling the image-generation models themselves.
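The cascade at inference time, in schematic form. The sampling functions below are placeholders for three independently trained diffusion samplers; the 4096-dimensional embedding matches T5-XXL's hidden width.

```python
import torch
import torch.nn.functional as F

def sample_base(text_emb):                 # stage 1 placeholder -> (B, 3, 64, 64)
    return torch.randn(text_emb.shape[0], 3, 64, 64)

def sample_sr(low_res, text_emb, size):
    # A real super-resolution stage denoises from noise, conditioned on an
    # upsampled copy of the low-res image; here we only show the data flow.
    up = F.interpolate(low_res, size=(size, size), mode="bilinear",
                       align_corners=False)
    return up                              # stand-in for the SR diffusion model

text_emb = torch.randn(1, 4096)            # frozen T5-XXL text embedding
x64   = sample_base(text_emb)              # stage 1: 64x64 base model
x256  = sample_sr(x64,  text_emb, 256)     # stage 2: 64 -> 256
x1024 = sample_sr(x256, text_emb, 1024)    # stage 3: 256 -> 1024
```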
Stable Diffusion
Rombach et al. (2022). Latent diffusion as in §14.12, with a frozen CLIP text encoder, a U-Net with cross-attention, and a VAE to move between pixels and latents. Open-weights release in mid-2022; the field changed overnight.
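A schematic of the sampling loop, assuming pretrained vae, unet, and text_encoder components. The latent shape matches Stable Diffusion v1 ($4\times 64\times 64$ latents for $512\times 512$ images), but the update rule is a generic placeholder, not the actual noise scheduler.

```python
import torch

def sample(prompt_tokens, vae, unet, text_encoder, steps=50):
    text_emb = text_encoder(prompt_tokens)     # frozen CLIP text encoder
    z = torch.randn(1, 4, 64, 64)              # noise in latent space
    for t in reversed(range(steps)):
        eps = unet(z, t, text_emb)             # predict noise; text enters via cross-attention
        z = z - eps / steps                    # placeholder denoising update
    return vae.decode(z)                       # map latents back to pixels
```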
Sora and DiT
Peebles & Xie (2023) introduced the Diffusion Transformer (DiT), replacing the U-Net with a Transformer operating on patch tokens. OpenAI's Sora (2024) extends this to video: patches are spatio-temporal, conditioning is text via cross-attention, and the diffusion runs in latent space (a video VAE compresses both the spatial and temporal axes).
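A sketch of spatio-temporal patchification in the DiT style (Sora's exact scheme is unpublished; the patch sizes here are illustrative). Each token covers a small block of the latent video in time, height, and width.

```python
import torch

def patchify(video, pt=2, ph=4, pw=4):
    # video: (B, C, T, H, W) latent video -> (B, N, D) patch tokens
    B, C, T, H, W = video.shape
    x = video.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)       # group the three patch indices
    return x.reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)

tokens = patchify(torch.randn(1, 8, 16, 32, 32))  # -> (1, 512, 256)
```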
The shift from U-Net to Transformer parallels the shift in image classification from ConvNets to Vision Transformers, and is driven by the same scaling-law arguments: Transformers absorb data and parameters more efficiently than convolutional architectures at scale.
Cross-attention conditioning
In virtually every modern text-to-image model, the conditioning enters via cross-attention. Each Transformer (or U-Net) block applies self-attention over image tokens and cross-attention to text tokens: image tokens supply the queries; text tokens supply the keys and values. The result is that the image representation at every layer is informed by the text. This pattern of cross-attention as a universal conditioning mechanism has spread far beyond text-to-image; it now appears in robotics, biology, and audio models.
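A minimal, runnable instance of the pattern using PyTorch's built-in attention; the dimensions are illustrative (77 is a typical CLIP text sequence length).

```python
import torch
import torch.nn as nn

D = 256
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

img_tokens  = torch.randn(1, 1024, D)   # queries: image (or latent) tokens
text_tokens = torch.randn(1, 77, D)     # keys/values: text-encoder outputs

out, _ = attn(query=img_tokens, key=text_tokens, value=text_tokens)
img_tokens = img_tokens + out           # residual update informed by the text
```

Because the queries come from the image and the keys/values from the text, each image token can pull in whatever textual information is relevant to it, at every layer.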