Video diffusion models are generative models that produce video clips by adapting diffusion from images to spacetime. The current generation (2023-2025) includes OpenAI Sora, Google Veo 2 and Veo 3, Runway Gen-3 Alpha, Stable Video Diffusion, Kling (Kuaishou), Hailuo (MiniMax), and Mochi 1. They have collapsed the cost of producing short photorealistic clips from days of artist labour to seconds of inference.
Diffusion for video. A video $\mathbf{x} \in \mathbb{R}^{T \times H \times W \times 3}$ is corrupted by Gaussian noise over a forward process
$$\mathbf{x}_s = \sqrt{\bar{\alpha}_s}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_s}\, \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$
and a neural network learns the reverse process $p_\theta(\mathbf{x}_{s-1} \mid \mathbf{x}_s, \mathbf{y})$ conditional on a text prompt $\mathbf{y}$. The training loss is the standard noise-prediction MSE
$$\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, s} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_s, s, \mathbf{y}) \right\|_2^2.$$
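The training objective is simple enough to sketch directly. Below is a minimal PyTorch sketch of one training step, assuming a standard DDPM-style noise schedule: sample a timestep, apply the closed-form forward process, and regress the added noise. The names `eps_model` and `alphas_cumprod` are illustrative, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, y, alphas_cumprod):
    """One training step of noise-prediction diffusion on a video batch.

    x0:             clean videos (or video latents), shape (B, T, H, W, C)
    y:              text-conditioning embeddings, shape (B, L, d)
    alphas_cumprod: precomputed noise schedule \bar{alpha}_s, shape (S,)
    eps_model:      network predicting the added noise, called as eps_model(x_s, s, y)
                    (all four names are illustrative assumptions)
    """
    B = x0.shape[0]
    S = alphas_cumprod.shape[0]

    # Sample a random timestep s for each clip in the batch.
    s = torch.randint(0, S, (B,), device=x0.device)
    abar = alphas_cumprod[s].view(B, 1, 1, 1, 1)

    # Forward process: x_s = sqrt(abar) * x0 + sqrt(1 - abar) * eps.
    eps = torch.randn_like(x0)
    x_s = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # Noise-prediction MSE loss from the equation above.
    eps_hat = eps_model(x_s, s, y)
    return F.mse_loss(eps_hat, eps)
```

In practice the loss is computed on VAE latents rather than raw pixels, and the timestep sampling and loss weighting vary between models.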
Two architectural eras.
U-Net video diffusion (2022-2023). Stable Video Diffusion, AnimateDiff, and Make-A-Video extended a 2D image U-Net to video by inserting temporal attention layers between the spatial blocks. These models were typically limited to clips of roughly four seconds at 24 fps.
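A minimal PyTorch sketch of such a temporal layer: self-attention along the time axis only, applied independently at each spatial location, slotted between a pretrained image model's spatial blocks. The class name and block layout are illustrative; SVD, AnimateDiff, and Make-A-Video each arrange these layers differently.

```python
import torch
import torch.nn as nn
from einops import rearrange

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied per spatial location.
    Inserting blocks like this between the spatial layers of a 2D U-Net is
    the basic trick used to 'inflate' an image diffusion model to video."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) feature map from the preceding spatial block.
        B, C, T, H, W = x.shape
        # Fold space into the batch so attention sees a length-T sequence per pixel.
        seq = rearrange(x, 'b c t h w -> (b h w) t c')
        h_ = self.norm(seq)
        out, _ = self.attn(h_, h_, h_, need_weights=False)
        seq = seq + out  # residual connection around the temporal attention
        return rearrange(seq, '(b h w) t c -> b c t h w', b=B, h=H, w=W)
```

In practice the new temporal layers are often zero-initialised so that the inflated network initially behaves exactly like the pretrained image model.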
Diffusion transformers on spacetime patches (2024-). Sora introduced spacetime patches (Peebles & Xie 2022 for images, generalised to video): the video is encoded by a learned 3D VAE into a latent volume $\mathbf{z} \in \mathbb{R}^{T' \times H' \times W' \times d}$, then partitioned into non-overlapping spatiotemporal patches that are flattened into a sequence of tokens. A diffusion transformer (DiT) then operates on this sequence, much as an LLM operates on a sequence of text tokens. This unlocked variable-resolution, variable-aspect-ratio, variable-duration generation.
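To make the tokenisation concrete, here is a hedged PyTorch sketch of spacetime patchification: the latent volume from the video VAE is cut into non-overlapping 3D patches, each flattened into one token. The patch sizes and latent shape are illustrative assumptions; published systems do not disclose their exact values.

```python
import torch

def patchify_latent_video(z: torch.Tensor, pt: int = 2, ph: int = 2, pw: int = 2):
    """Partition a latent video volume into non-overlapping spacetime patches
    and flatten them into a token sequence (illustrative patch sizes).

    z: (B, T', H', W', d) latent volume from the video VAE encoder.
    Returns tokens of shape (B, N, pt*ph*pw*d), N = (T'/pt)*(H'/ph)*(W'/pw).
    """
    B, T, H, W, d = z.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide patch size"

    # Split each axis into (number of patches, patch size) ...
    z = z.view(B, T // pt, pt, H // ph, ph, W // pw, pw, d)
    # ... group the three patch-index axes together, then the patch contents.
    z = z.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    tokens = z.view(B, -1, pt * ph * pw * d)
    return tokens  # each token is one spacetime patch, analogous to a text token

# Example: an 8-frame, 32x32, 16-channel latent becomes 4*16*16 = 1024 tokens.
z = torch.randn(1, 8, 32, 32, 16)
print(patchify_latent_video(z).shape)  # torch.Size([1, 1024, 128])
```

In a full DiT these tokens are linearly projected to the model width and given positional embeddings before entering the transformer.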
Sora architecture. OpenAI's technical report describes Sora as a diffusion transformer over spacetime patches with a learned 3D VAE, capable of generating up to 60 seconds of video at 1080p. Sample quality improves predictably as training compute increases, a video scaling law analogous to LLM scaling.
Veo 3, Gen-3, Kling. These systems use the same spacetime-patch DiT recipe with proprietary improvements. Veo 3 (2025) introduced audio-aligned generation: the model emits synchronised dialogue, sound effects, and ambient audio jointly with the video, a notable departure from earlier, silent-only systems.
Limitations. Physical implausibility (objects appearing or disappearing, hands with the wrong number of fingers), failure to maintain object permanence over long clips, garbled or hallucinated on-screen text, and limited fine-grained control over camera and character motion all remain. Video diffusion is also extremely compute-intensive: a 5-second 1080p clip costs $\sim$10$\times$ a 1024$\times$1024 image.
Significance. Video diffusion has matured from research demo (2022) to consumer product (2024-2025), with Sora, Veo, and Kling now in commercial deployment. It is the engine behind nearly all "AI video" tools, and its outputs are appearing in advertising, social media, and (controversially) film.
Related terms: Diffusion Model, Sora, Stable Diffusion, Variational Autoencoder, Transformer, Vision Transformer
Discussed in:
- Chapter 11: CNNs, Video Generation