Glossary

Sora

Sora (OpenAI, February 2024) is a text-conditioned video generation system that produces up to one minute of 1080p video from a prompt, an image, or an existing video. Its central architectural choice is to apply the Diffusion Transformer (DiT) recipe, proven on images by Peebles and Xie (2023), to spacetime patches of a learned video latent.

Spacetime patch tokens. A video VAE (similar to LDM's image VAE but with 3-D causal convolutions) compresses raw video into a lower-dimensional latent of shape $C \times T' \times H' \times W'$, with both spatial and temporal compression. The latent is then partitioned into non-overlapping 3-D patches of size $p_t \times p_h \times p_w$ (e.g. $1 \times 2 \times 2$). Each patch is flattened and linearly embedded, producing a sequence of tokens whose length depends on video duration, resolution, and aspect ratio. This patches-as-tokens view (analogous to ViT for images) lets a single Transformer handle arbitrary video shapes by simply varying sequence length.
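A minimal sketch of this patchification in PyTorch, assuming a latent of shape $C \times T' \times H' \times W'$; the patch sizes, latent dimensions, and embedding width below are illustrative, not Sora's actual values:

```python
import torch

def patchify_latent(z, p_t=1, p_h=2, p_w=2, d_model=1024):
    """Split a video latent of shape (C, T, H, W) into non-overlapping
    3-D patches and embed each patch as one token.

    Patch sizes and d_model are illustrative, not Sora's actual values."""
    C, T, H, W = z.shape
    assert T % p_t == 0 and H % p_h == 0 and W % p_w == 0
    # (C, T, H, W) -> (T/p_t, p_t, H/p_h, p_h, W/p_w, p_w), then group patch dims
    z = z.reshape(C, T // p_t, p_t, H // p_h, p_h, W // p_w, p_w)
    z = z.permute(1, 3, 5, 0, 2, 4, 6)
    # flatten each 3-D patch into a vector, one row per token
    z = z.reshape((T // p_t) * (H // p_h) * (W // p_w), C * p_t * p_h * p_w)
    embed = torch.nn.Linear(C * p_t * p_h * p_w, d_model)
    return embed(z)  # (num_tokens, d_model)

tokens = patchify_latent(torch.randn(16, 8, 32, 32))
print(tokens.shape)  # torch.Size([2048, 1024])
```

Token count scales linearly with duration and spatial area, which is what lets a single model serve many video shapes.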

Diffusion Transformer backbone. A large Transformer (parameter count not officially disclosed; external estimates run to tens of billions) replaces the U-Net used in earlier video diffusion models such as Imagen Video. It receives noised latent tokens $z_\tau$, a diffusion timestep $\tau$, and text-derived conditioning, and predicts the noise $\epsilon$ added at step $\tau$. Training minimises the standard denoising objective

$$\mathcal{L}(\theta) = \mathbb{E}_{z_0, \epsilon, \tau, c}\Big[ \big\| \epsilon - \epsilon_\theta(z_\tau, \tau, c) \big\|_2^2 \Big], \quad z_\tau = \sqrt{\bar{\alpha}_\tau} z_0 + \sqrt{1 - \bar{\alpha}_\tau} \epsilon,$$

where $\bar{\alpha}_\tau$ is the cumulative noise schedule and $c$ is the text conditioning. Conditioning is injected via adaptive layer norm with zero-init (adaLN-Zero), as in the DiT paper.
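A sketch of one training step for this objective, assuming a PyTorch eps-prediction `model` (standing in for the DiT), a batch of clean latents `z0`, conditioning `c`, and a precomputed cumulative schedule `alpha_bar`; all of these names are placeholders rather than Sora's actual components:

```python
import torch

def diffusion_loss(model, z0, c, alpha_bar):
    """One denoising training step for the objective above.

    `model` is any eps-prediction network (e.g. a DiT over patch tokens);
    `z0` is a batch of clean latents, `c` the text conditioning, and
    `alpha_bar` a 1-D tensor holding the cumulative noise schedule."""
    B = z0.shape[0]
    tau = torch.randint(0, alpha_bar.shape[0], (B,))      # random timestep per sample
    a = alpha_bar[tau].view(B, *([1] * (z0.dim() - 1)))   # broadcast over latent dims
    eps = torch.randn_like(z0)                            # noise the model must predict
    z_tau = a.sqrt() * z0 + (1 - a).sqrt() * eps          # forward diffusion of z0
    eps_pred = model(z_tau, tau, c)                       # eps_theta(z_tau, tau, c)
    return torch.mean((eps - eps_pred) ** 2)              # ||eps - eps_theta||_2^2
```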

Re-captioning. Following DALL-E 3, OpenAI trained a captioner to write detailed video descriptions and used these synthetic captions for training. At inference, the user's short prompt is expanded by GPT into a longer descriptive caption that better matches the training distribution, a form of test-time prompt engineering hidden inside the API.
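A hypothetical sketch of this prompt expansion using the OpenAI chat API; the system prompt and model name are illustrative, since the rewriter Sora actually uses is not public:

```python
from openai import OpenAI

client = OpenAI()

def expand_prompt(user_prompt: str) -> str:
    """Rewrite a short user prompt as a detailed, caption-like video description."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the actual rewriter model is not disclosed
        messages=[
            {"role": "system",
             "content": "Expand the user's video idea into one detailed caption "
                        "describing subjects, setting, camera motion, and lighting, "
                        "as if describing an existing video."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```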

Variable resolution, duration, aspect ratio. Earlier video models trained on fixed resolution and duration; Sora trains on native sizes. This requires:

  • Padded sequence batching with attention masks excluding padding tokens.
  • 3-D positional encoding (separately for $t, h, w$), often factorised RoPE-style, that extrapolates to longer sequences than seen at training (a minimal sketch follows this list).
  • Bucketed batching so similar-length samples train together for efficiency.
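A minimal sketch of a factorised 3-D position embedding of the kind described above, using separate 1-D sinusoidal tables for $t$, $h$, $w$ concatenated per token; the even split of `d_model` across the three axes is an assumption made purely for illustration:

```python
import torch

def sincos_1d(n, dim):
    """Standard 1-D sinusoidal embedding table of shape (n, dim)."""
    pos = torch.arange(n, dtype=torch.float32)[:, None]
    freqs = torch.exp(-torch.log(torch.tensor(10000.0))
                      * torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)  # (n, dim)

def factorized_3d_pos(T, H, W, d_model):
    """Factorised 3-D position embedding: separate tables for t, h, w,
    concatenated per token. The even split of d_model is illustrative."""
    d = d_model // 3
    et = sincos_1d(T, d)[:, None, None, :].expand(T, H, W, d)
    eh = sincos_1d(H, d)[None, :, None, :].expand(T, H, W, d)
    ew = sincos_1d(W, d)[None, None, :, :].expand(T, H, W, d)
    pe = torch.cat([et, eh, ew], dim=-1)   # (T, H, W, 3*d)
    return pe.reshape(T * H * W, -1)       # one row per patch token

pe = factorized_3d_pos(T=8, H=16, W=16, d_model=768)
print(pe.shape)  # torch.Size([2048, 768])
```

Because each axis has its own table, a longer duration or larger resolution only extends one table rather than requiring a new joint embedding, which is what makes extrapolation beyond training lengths plausible.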

Empirically, training on variable shapes improves framing, composition, and temporal coherence more than training on fixed crops at the same FLOP budget; Sora samples exhibit noticeably cinematic framing.

Capabilities. Sora generates physically plausible motion, occlusion handling, multi-shot sequences, and consistent characters across cuts. OpenAI framed it as a "world simulator" since the model implicitly learns object permanence, physics, and 3-D consistency from data, though failure modes (objects spontaneously appearing, fluid-dynamics violations, implausible hand and finger counts) reveal the limits of this implicit understanding.

Inference. Sampling details are not officially disclosed; systems of this class typically use classifier-free guidance with $w \approx 4$–7 and DPM-Solver++ or flow-matching ODE samplers over tens of steps. Generating a 60-second 1080p clip is reported to require many GPU-minutes.
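A sketch of the classifier-free guidance step, assuming an eps-prediction `model` and a null-conditioning embedding `c_null`; the default guidance weight simply matches the (unofficial) range quoted above:

```python
def cfg_eps(model, z_tau, tau, c, c_null, w=5.0):
    """Classifier-free guidance: blend conditional and unconditional noise predictions."""
    eps_cond = model(z_tau, tau, c)         # prediction with text conditioning
    eps_uncond = model(z_tau, tau, c_null)  # prediction with null conditioning
    # push the estimate towards the conditional direction, scaled by w
    return eps_uncond + w * (eps_cond - eps_uncond)
```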

Successors and competitors. Sora-Turbo (December 2024) reduced cost and latency. Veo 2 (Google), Movie Gen (Meta), Kling, Runway Gen-3, and Wan (Alibaba) all adopt the diffusion-Transformer-on-spacetime-patches recipe.

Video

Related terms: Diffusion Model, Transformer, Attention Mechanism, Veo, Convolutional Neural Network

Discussed in:
