Veo (Google DeepMind, May 2024; Veo 2, December 2024; Veo 3, May 2025 with native audio) is DeepMind's flagship video generation model, the successor to Imagen Video and Phenaki, and Google's competitor to OpenAI's Sora. Veo 2 generates 1080p clips up to 8 seconds long (extendable) at frame rates including 24 fps, and Veo 3 added synchronised dialogue and ambient audio.
Architecture. Like Sora, Veo is a latent diffusion Transformer operating on spacetime patches, but Google has disclosed somewhat more detail:
Video tokeniser. A causal video autoencoder compresses 1080p input by roughly $8\times$ spatially and $4\times$ temporally to a continuous latent. Crucially, the encoder is causal in time so additional frames can be encoded incrementally, useful for image-to-video and video extension.
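A minimal PyTorch sketch of such a causal tokeniser encoder is below. Every channel width, kernel size, stride schedule, and the latent dimension are illustrative assumptions, not disclosed Veo values; the essential property is the one-sided time padding, which makes each output frame depend only on current and past input frames.

```python
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that is causal along time: output frame t sees
    only input frames <= t, so new frames can be encoded incrementally."""
    def __init__(self, c_in, c_out, k=3, stride=(1, 1, 1)):
        super().__init__()
        self.time_pad = k - 1                          # pad past frames only
        self.conv = nn.Conv3d(c_in, c_out, k, stride=stride,
                              padding=(0, k // 2, k // 2))

    def forward(self, x):                              # x: (B, C, T, H, W)
        x = nn.functional.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

class TokeniserEncoder(nn.Module):
    """Hypothetical ~8x spatial / 4x temporal compressor to a
    continuous latent, mirroring the ratios quoted above."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv3d(3, 64, stride=(1, 2, 2)),     # H, W: /2
            CausalConv3d(64, 128, stride=(2, 2, 2)),   # T: /2; H, W: /4
            CausalConv3d(128, 256, stride=(2, 2, 2)),  # T: /4; H, W: /8
            CausalConv3d(256, latent_dim),             # continuous latent
        )

    def forward(self, video):                          # (B, 3, T, H, W)
        return self.net(video)                         # (B, d, T/4, H/8, W/8)
```

Because no layer looks ahead in time, extending a clip only requires encoding the appended frames (plus a short cached context) rather than re-encoding the whole video.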
Backbone. A multi-billion-parameter Transformer with factorised spatiotemporal attention: alternating spatial-only and temporal-only self-attention layers. This factorisation reduces attention's $\mathcal{O}((THW)^2)$ cost to $\mathcal{O}(T (HW)^2 + HW T^2)$, which is critical for long sequences.
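The factorisation is straightforward to express in code. A hedged sketch (residual norms, MLP blocks, and positional embeddings omitted; the dimension and head count are placeholders):

```python
import torch.nn as nn

class FactorisedAttentionBlock(nn.Module):
    """Spatial-only attention within each frame, then temporal-only
    attention across frames at each spatial location: per layer,
    O(T*(HW)^2 + HW*T^2) instead of O((T*H*W)^2)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, T, HW, D) spacetime patches
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)        # treat frames as batch entries
        xs = xs + self.spatial(xs, xs, xs)[0]
        xt = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        xt = xt + self.temporal(xt, xt, xt)[0]
        return xt.reshape(B, S, T, D).transpose(1, 2)  # back to (B, T, HW, D)
```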
Conditioning. Text is embedded by a frozen Gemini-derived text encoder; image conditioning (image-to-video) uses the same VAE as the latent space; camera-control conditioning takes 6-DoF camera trajectories projected into a learned embedding, enabling cinematic prompts such as "dolly in, low-angle, track left".
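To make the camera pathway concrete, here is a hypothetical conditioning module: per-frame 6-DoF poses (3 translation + 3 rotation parameters) are projected by a small MLP into camera tokens that the backbone could cross-attend to alongside text tokens. The interface and shapes are assumptions; Google has not published this component.

```python
import torch
import torch.nn as nn

class CameraTrajectoryEmbedding(nn.Module):
    """Project per-frame 6-DoF camera poses into learned tokens."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(6, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, poses):              # poses: (B, T, 6)
        return self.proj(poses)            # (B, T, dim) camera tokens

# A "dolly in" corresponds to a trajectory that advances along the
# camera's forward axis frame by frame:
poses = torch.zeros(1, 48, 6)                   # 48 frames, fixed rotation
poses[0, :, 2] = torch.linspace(0.0, 1.0, 48)   # translate forward in z
tokens = CameraTrajectoryEmbedding()(poses)     # (1, 48, 512)
```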
Training objective. Standard v-prediction or flow-matching denoising loss:
$$\mathcal{L}(\theta) = \mathbb{E}_{z_0, \epsilon, \tau}\Big[ \big\| v_\theta(z_\tau, \tau, c) - v(z_0, \epsilon, \tau) \big\|_2^2 \Big],$$
with $v = \sqrt{\bar{\alpha}_\tau}\, \epsilon - \sqrt{1 - \bar{\alpha}_\tau}\, z_0$ and noised latent $z_\tau = \sqrt{\bar{\alpha}_\tau}\, z_0 + \sqrt{1 - \bar{\alpha}_\tau}\, \epsilon$. Veo 3 reportedly uses rectified-flow training, which gives straighter probability paths and hence fewer sampling steps.
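A compact sketch of both objectives, assuming a generic denoiser `model(z, tau, cond)` and a precomputed noise schedule `alpha_bar` (neither name is Veo's actual API):

```python
import torch

def v_prediction_loss(model, z0, cond, alpha_bar):
    """v-prediction loss exactly as in the equation above."""
    B = z0.shape[0]
    tau = torch.randint(0, len(alpha_bar), (B,))          # sample timesteps
    a = alpha_bar[tau].view(B, *([1] * (z0.dim() - 1)))   # broadcastable
    eps = torch.randn_like(z0)
    z_tau = a.sqrt() * z0 + (1 - a).sqrt() * eps          # noised latent
    v_target = a.sqrt() * eps - (1 - a).sqrt() * z0       # v as defined above
    return ((model(z_tau, tau, cond) - v_target) ** 2).mean()

def rectified_flow_loss(model, z0, cond):
    """Rectified-flow variant: a linear path from data to noise with a
    constant velocity target (eps - z0), hence straighter paths and
    fewer sampling steps at inference."""
    B = z0.shape[0]
    t = torch.rand(B).view(B, *([1] * (z0.dim() - 1)))    # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    z_t = (1 - t) * z0 + t * eps                          # linear interpolation
    return ((model(z_t, t.flatten(), cond) - (eps - z0)) ** 2).mean()
```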
Multi-stage cascade. DeepMind's earlier Imagen Video used a base model plus spatial and temporal super-resolution stages; Veo 2 reportedly retains a two-stage cascade: a base model at 480p that is upsampled by a separate diffusion super-resolution model to 1080p (sometimes 4K).
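Schematically, cascade inference then reads as below, where `sampler` stands in for any diffusion sampling loop and the frame count and resolutions are only illustrative:

```python
def cascade_sample(base_model, sr_model, prompt, sampler):
    """Two-stage cascade: sample a low-resolution clip, then run a
    separate diffusion super-resolution model conditioned on it."""
    low = sampler(base_model, cond=prompt, shape=(48, 3, 480, 854))
    # The SR stage is itself a diffusion model, conditioned on both the
    # prompt and the low-resolution clip it must upsample.
    return sampler(sr_model, cond=(prompt, low), shape=(48, 3, 1080, 1920))
```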
Joint audio-video (Veo 3). Veo 3 generates synchronised speech, sound effects, and music alongside video. The audio branch operates on EnCodec/SoundStream-like discrete tokens that are predicted jointly with video latents, likely via a shared Transformer backbone with separate decoder heads, with cross-attention between modalities ensuring lip sync and event sync.
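A speculative sketch of one joint layer with bidirectional cross-attention (layout and widths are assumptions, not disclosed Veo 3 details):

```python
import torch.nn as nn

class JointAVBlock(nn.Module):
    """Video latents and embedded audio tokens each self-attend, then
    cross-attend to the other modality, keeping lip movements and
    sound events aligned."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_self, self.a_self = mha(), mha()
        self.v_cross, self.a_cross = mha(), mha()

    def forward(self, v, a):       # v: (B, Nv, D) video; a: (B, Na, D) audio
        v = v + self.v_self(v, v, v)[0]
        a = a + self.a_self(a, a, a)[0]
        v = v + self.v_cross(v, a, a)[0]    # video attends to audio
        a = a + self.a_cross(a, v, v)[0]    # audio attends to video
        return v, a
```

Separate decoder heads would then map the video stream back through the VAE and the audio stream to discrete codec tokens for an EnCodec/SoundStream-style decoder.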
Distinctive capabilities.
- Physical realism. In head-to-head human evaluations, Veo 2 was reported to outperform Sora on motion coherence and physical plausibility.
- Long-shot consistency. Subject and background identity remain stable across a full 8-second clip, plausibly attributable to a large temporal context window.
- Camera control. Explicit camera-trajectory conditioning is a deliberate accommodation of filmmaking workflows, going beyond Sora's prompt-only control.
- SynthID watermarking. Every Veo output carries an imperceptible watermark detectable by Google's SynthID classifier, addressing deepfake concerns.
Distribution. Veo is available through Google's Vertex AI, the Gemini API, Google AI Studio, YouTube Shorts (Dream Screen), and the Flow filmmaking tool. Unlike Sora, which launched invitation-only, Veo 2 became broadly available to paying Google Cloud customers in early 2025.
Position in the landscape. Sora, Veo, Movie Gen (Meta), Kling (Kuaishou), Wan 2.1 (Alibaba), Runway Gen-3 Alpha, and Pika 2.0 are converging on a similar recipe: VAE + spacetime patches + DiT + flow matching + classifier-free guidance + re-captioned training data. Differentiation now happens at data scale, audio integration, controllability, and inference cost.
Related terms: Diffusion Model, Transformer, Attention Mechanism, Sora, EnCodec
Discussed in:
- Chapter 13: Attention & Transformers, Video Generation