13.16 Vision Transformer (ViT)

For nearly a decade after AlexNet, computer vision was synonymous with convolutional neural networks. Inductive biases such as translation equivariance, local receptive fields, and hierarchical pooling were considered essential for image recognition. The Vision Transformer broke that assumption.

Patch embedding

Dosovitskiy et al.'s ViT (2020) does the most direct possible thing: cut an image into a grid of patches, flatten each patch into a vector, and feed the sequence of patch vectors into a standard Transformer encoder.

For a $224 \times 224$ image with $16 \times 16$ patches, you get $14 \times 14 = 196$ patches. Each patch is $16 \times 16 \times 3 = 768$ values, projected by a linear layer to $d_\text{model} = 768$. A learnable [CLS] token is prepended. Learned positional embeddings are added. Off it goes into a standard Transformer encoder; for ViT-Base, 12 layers with 12 heads each.
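
To make the arithmetic concrete, here is a minimal PyTorch sketch of the patch-embedding stage. Module and parameter names are mine, not the paper's; the stride-16 convolution is a standard equivalent of flattening non-overlapping patches and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and linearly project each one.
    A stride-16 conv with a 16x16 kernel is equivalent to
    flatten-each-patch-then-shared-linear-layer."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, d_model))  # +1 for [CLS]

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, 197, 768)
        return x + self.pos_embed               # learned positions added
```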

For classification, the final embedding of the [CLS] token is fed to a linear classifier.
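
Continuing the sketch, the encoder and classification head, using PyTorch's built-in encoder as a stand-in for the paper's pre-norm blocks, with ViT-Base sizes:

```python
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True)  # pre-LN, as in ViT
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12,
                                norm=nn.LayerNorm(768))
head = nn.Linear(768, 1000)  # e.g. ImageNet-1k classes

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))  # (2, 197, 768)
logits = head(encoder(tokens)[:, 0])                # classify from [CLS]
print(logits.shape)                                 # torch.Size([2, 1000])
```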

Why it worked

Three reasons.

First, scale. With ImageNet-1k (1.3M images), ViT underperformed good CNNs. With ImageNet-21k (14M images) or JFT-300M (300M images), ViT matched or beat CNNs. The Transformer's lack of inductive bias is a weakness on small data and a strength on large data: more capacity to learn whatever structure the data has, not just translation equivariance.

Second, transfer. Pretrained ViTs transfer extremely well to downstream visual tasks: object detection, segmentation, fine-grained classification.

Third, multimodality. Because a ViT embeds an image into a sequence of vectors, it composes naturally with text Transformers. CLIP, Flamingo, GPT-4V, and Gemini all use ViT-like image encoders that produce token sequences, which can be cross-attended to or concatenated with text tokens.
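
As a hedged illustration of why this composes (all names here are hypothetical, not any specific model's API): in a CLIP-style setup, the only interface the two modalities need is a shared embedding space. Reusing the sketch above:

```python
import torch.nn.functional as F

img_tokens = encoder(tokens)                        # (2, 197, 768), from the sketch above
img_emb = F.normalize(img_tokens[:, 0], dim=-1)     # pool the image via its [CLS] token
txt_emb = F.normalize(torch.randn(2, 768), dim=-1)  # random stand-in for a text encoder
similarity = img_emb @ txt_emb.T                    # (2, 2) image-text match scores
```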

Variants

  • Swin Transformer introduces a hierarchical structure with shifted local attention windows, restoring some of the multi-scale inductive bias of CNNs (see the sketch after this list).
  • DeiT adds a distillation token and a stronger training recipe, making ViT trainable on ImageNet-1k alone.
  • ConvNeXt redesigns CNNs with Transformer-inspired choices (LayerNorm, GELU, large kernels) and matches ViT, suggesting the gap is more about training recipe than architecture.
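
For a feel of the Swin idea, a minimal sketch of (shifted) window partitioning, reusing the imports above. Sizes are illustrative; the real model also masks attention across the rolled boundary and reverses the shift afterwards.

```python
def shifted_windows(x, window=7, shift=3):
    """Partition a (B, H, W, C) feature map into non-overlapping
    window*window attention windows, cyclically shifted so that
    successive layers mix information across window boundaries."""
    B, H, W, C = x.shape
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))  # cyclic shift
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

wins = shifted_windows(torch.randn(2, 56, 56, 96))  # (128, 49, 96): 64 windows per image
```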

ViT settled the architecture question for vision. As of 2026 the dominant image backbones in production frontier models are ViT-style, often with hierarchical refinements.
