Video understanding models are neural networks trained to extract general-purpose representations from video. Unlike video generation models (Sora, Veo), which synthesize new clips, understanding models encode clips into features useful for action recognition, retrieval, anomaly detection, and as visual front-ends for video-capable VLMs. The dominant 2022–2025 approaches are self-supervised, requiring no human labels.
VideoMAE (Tong et al. 2022). Adapts the Masked Autoencoder (MAE) recipe to video. A clip is divided into spatiotemporal tube patches; 90–95% are masked (VideoMAE uses tube masking, i.e. the same spatial mask applied across all frames); a ViT encoder processes only the visible patches; a lightweight decoder reconstructs the pixels of the masked patches:
$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{\mathbf{x}}_i - \mathbf{x}_i \right\|_2^2$$
where $\mathcal{M}$ indexes masked patches. The high masking ratio is the key insight: video has high temporal redundancy, so most patches are predictable from a few neighbours. VideoMAE achieves strong action recognition on Kinetics-400/600/700 and Something-Something v2.
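A minimal PyTorch sketch of this objective is below. The tensor shapes follow a 16-frame 224×224 clip with 2×16×16 tubes, but the tiny encoder/decoder and plain random masking (VideoMAE itself uses tube masking) are illustrative assumptions, not the paper's implementation.
```python
# Sketch of the VideoMAE-style objective: mask ~90% of spatiotemporal patches,
# encode only the visible ones, reconstruct the pixels of the masked ones.
import torch
import torch.nn as nn

B = 2                                # toy batch size
N = 1568                             # patches per clip: 8 temporal x 14 x 14 spatial
patch_dim = 2 * 16 * 16 * 3          # pixels in one 2-frame 16x16 RGB tube
embed_dim, mask_ratio = 384, 0.9

patch_embed = nn.Linear(patch_dim, embed_dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=4)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=2)
to_pixels = nn.Linear(embed_dim, patch_dim)
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

patches = torch.randn(B, N, patch_dim)       # flattened pixel patches of a clip batch

# High-ratio random masking (stand-in for tube masking, to keep the sketch short).
num_masked = int(N * mask_ratio)
perm = torch.randperm(N)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

# Encoder sees only the ~10% visible patches.
latent = encoder(patch_embed(patches[:, visible_idx]))

# Decoder sees encoded visible tokens plus a learnable mask token at each masked
# position (positional embeddings omitted for brevity).
dec_in = torch.cat([latent, mask_token.expand(B, num_masked, embed_dim)], dim=1)
pred = to_pixels(decoder(dec_in)[:, -num_masked:])   # pixel predictions for masked patches

loss = ((pred - patches[:, masked_idx]) ** 2).mean() # MSE over masked patches only
loss.backward()
```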
V-JEPA (Bardes, LeCun et al. 2024). Meta AI's Joint-Embedding Predictive Architecture for video. Instead of reconstructing masked pixels, V-JEPA predicts the embeddings of masked spatiotemporal regions, using a target encoder updated as an exponential moving average of the online encoder. The objective is
$$\mathcal{L} = \mathbb{E}_{\text{clip}} \left\| \hat{\mathbf{z}}_y - \text{stop-grad}\!\left( s_{\bar{\theta}}(\mathbf{x}_y) \right) \right\|_2^2$$
where $\hat{\mathbf{z}}_y$ is the online branch's prediction of the masked regions' embeddings (a predictor applied to the context embeddings $s_\theta(\mathbf{x}_{\text{ctx}})$ from the online encoder), and $s_{\bar{\theta}}$ is the EMA target encoder, whose output is not backpropagated through. JEPA avoids predicting pixels (which wastes capacity on fine texture) and avoids contrastive negatives (which require careful sampling). V-JEPA features outperform VideoMAE on many downstream tasks at lower compute, and underpin LeCun's broader argument for non-generative world models.
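A matching sketch of the latent-prediction objective, under the same toy shapes as above: the online encoder sees only the visible context, a predictor fills in embeddings at the masked positions, and the targets come from a frozen EMA copy under stop-gradient. The module sizes, transformer predictor, and squared-error form follow this entry, not the V-JEPA reference code.
```python
# Sketch of a JEPA-style objective: predict the *embeddings* of masked regions
# produced by an EMA target encoder, with a stop-gradient on the target branch.
import copy
import torch
import torch.nn as nn

B, N, patch_dim, embed_dim, mask_ratio = 2, 1568, 1536, 384, 0.9

patch_embed = nn.Linear(patch_dim, embed_dim)
online_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=4)
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True), num_layers=2)
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Target encoder: a frozen copy of the online encoder, updated by EMA (not by gradients).
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

patches = torch.randn(B, N, patch_dim)
num_masked = int(N * mask_ratio)
perm = torch.randperm(N)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

tokens = patch_embed(patches)

# Online branch: encode visible context, then predict embeddings at masked positions
# (positional information about the masked regions is omitted for brevity).
ctx = online_encoder(tokens[:, visible_idx])
pred_in = torch.cat([ctx, mask_token.expand(B, num_masked, embed_dim)], dim=1)
z_hat = predictor(pred_in)[:, -num_masked:]           # predicted target embeddings

# Target branch: EMA encoder embeds the clip; take masked positions under stop-grad.
with torch.no_grad():
    z_target = target_encoder(tokens)[:, masked_idx]

loss = ((z_hat - z_target) ** 2).mean()               # regression in embedding space
loss.backward()

# After each optimiser step the target encoder is refreshed as an EMA of the online
# encoder: theta_bar <- m * theta_bar + (1 - m) * theta, with m close to 1.
```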
InternVideo and InternVideo 2. Shanghai AI Lab's series of video foundation models, which combine masked video modelling, video-text contrastive learning, and supervised fine-tuning on web-scale video. InternVideo 2 (2024) is a 6B-parameter video encoder that serves as the vision backbone for several frontier video VLMs.
Use cases.
- Action recognition. Linear probing or fine-tuning the encoder for classification (see the sketch after this list).
- Video retrieval. Encode a query and a corpus into a shared embedding space (often combined with CLIP-style text encoders).
- VLM front-ends. Replace per-frame image encoders with a video encoder that captures temporal dependencies; used by VideoLLaMA, Video-ChatGPT, and others.
- World models. V-JEPA features are used as the observation encoder in model-based RL agents.
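For concreteness, here is a minimal linear-probe sketch for the action-recognition use case; the stand-in backbone, random data, and hyperparameters are placeholders for whatever pretrained encoder (VideoMAE, V-JEPA, InternVideo 2, ...) and dataset are actually used.
```python
# Linear probing: freeze a pretrained video encoder and train only a linear head.
import torch
import torch.nn as nn

num_classes, embed_dim = 400, 768                     # e.g. Kinetics-400, ViT-B width

# Stand-in for a pretrained video encoder mapping a clip to one feature vector.
video_encoder = nn.Sequential(
    nn.AdaptiveAvgPool3d((4, 7, 7)),                  # (B, 3, 4, 7, 7)
    nn.Flatten(start_dim=1),                          # (B, 588)
    nn.Linear(3 * 4 * 7 * 7, embed_dim),              # (B, embed_dim)
)
for p in video_encoder.parameters():
    p.requires_grad_(False)                           # backbone stays frozen in a probe

head = nn.Linear(embed_dim, num_classes)              # only trainable component
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(10):                                # toy loop over random clips/labels
    clips = torch.randn(8, 3, 16, 112, 112)           # (batch, channels, frames, H, W)
    labels = torch.randint(0, num_classes, (8,))
    with torch.no_grad():
        features = video_encoder(clips)               # (batch, embed_dim) clip features
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Fine-tuning differs only in unfreezing the backbone, typically with a smaller learning rate for the encoder than for the head.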
Open questions. It remains unclear whether self-supervised video understanding will scale to genuine 3D world understanding (object permanence, physical reasoning), or whether generative video models such as Sora will subsume both generation and understanding by virtue of having learned an implicit world model. As of 2025 the community remains divided.
Related terms: Video Diffusion Models, Sora, Vision Transformer, CLIP, Self-Supervised Learning
Discussed in:
- Chapter 11: CNNs, Video Understanding