Also known as: ViT
A Vision Transformer (ViT), introduced by Dosovitskiy et al. at Google Brain in 2020, is the application of the Transformer architecture directly to images. The image is split into fixed-size patches (typically 16×16 pixels), each patch is linearly embedded and treated as a "token", and the resulting sequence is processed by a standard Transformer encoder. A learned classification token is prepended for image-level outputs.
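The tokenisation step can be illustrated with a minimal NumPy sketch. It is not the reference implementation. The function names (`patchify`, `embed_patches`), the 224×224 input and the 768-dimensional embedding are illustrative assumptions, and the weights are random rather than learned. The sketch also adds position embeddings, which the original ViT uses but which are not described above.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, w_embed, cls_token, pos_embed, patch_size=16):
    """Linearly embed each patch, prepend the class token, add positions."""
    tokens = patchify(image, patch_size) @ w_embed                  # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed

# Illustrative sizes (assumed): 224x224 RGB image, 16x16 patches, D = 768.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
d_model = 768
w_embed = rng.standard_normal((16 * 16 * 3, d_model)) * 0.02  # stand-in for learned weights
cls_token = np.zeros(d_model)                                  # stand-in for the learned token
pos_embed = rng.standard_normal((197, d_model)) * 0.02         # stand-in for learned positions

tokens = embed_patches(image, w_embed, cls_token, pos_embed)
print(tokens.shape)  # (197, 768): 14 * 14 = 196 patch tokens plus 1 class token
```

With these sizes the encoder sees a sequence of 197 tokens. Halving the patch size to 8×8 would roughly quadruple the sequence length, which matters because self-attention cost grows quadratically with sequence length.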
Given sufficient pre-training data (the original paper used ImageNet-21k and JFT-300M), ViTs match or exceed CNN performance on ImageNet, suggesting that the Transformer architecture is broadly applicable across modalities. The result triggered a major shift in computer vision, and ViT-style models remain a dominant architectural paradigm in vision research as of 2025.
ViT keeps general-purpose components such as residual connections and layer normalisation but drops the vision-specific machinery of CNNs (convolutional locality and weight sharing, pooling hierarchies, careful kernel design). This simplicity is a notable methodological lesson: at sufficient data scale, the inductive biases that distinguish architectures matter less than their basic expressiveness and trainability. This generalisation has reshaped how the field thinks about architecture design.
ViT variants and refinements include: DeiT (Touvron et al., 2021), which trains efficiently without massive pre-training data; Swin Transformer (Liu et al., 2021), which uses hierarchical windowed attention for efficiency; MAE (He et al., 2021), which pre-trains as a masked autoencoder; and DINO / DINOv2 (Caron et al., 2021; Oquab et al., 2023), which use self-supervised pre-training to produce strong general-purpose features.
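As one illustration of these variants, the random patch masking behind MAE-style pre-training can be sketched as follows. This is a simplified sketch, not the authors' code. The function name `random_mask` is hypothetical, and the 75% mask ratio is the default reported in the MAE paper. The encoder processes only the visible tokens, and a lightweight decoder reconstructs the masked patches.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; return them with a boolean mask."""
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])  # indices of visible patches
    mask = np.ones(n, dtype=bool)                     # True marks a masked patch
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

# Illustrative input (assumed sizes): 196 patch tokens of dimension 768.
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 768))
visible, keep_idx, mask = random_mask(patch_tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))
```

Because the encoder only sees a quarter of the tokens, each pre-training step is substantially cheaper than a full forward pass over all patches.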
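ViTs are the standard backbone in modern multimodal models: CLIP and LLaVA use ViT image encoders, Stable Diffusion's image conditioning builds on ViT-style encoders, and proprietary systems such as GPT-4V, Gemini and Claude are generally understood to use ViT-style vision encoders, although their architectures are not fully published.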
Related terms: Transformer, Convolutional Neural Network, CLIP
Discussed in:
- Chapter 11: CNNs, CNNs in Vision