Glossary

Vision Transformer

Also known as: ViT

The Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020, demonstrated that an image can be processed by a standard transformer encoder with minimal modification. The idea is simple: divide the image into a grid of fixed-size patches (e.g., 16×16 pixels), linearly embed each patch into a vector, add positional encodings, and feed the resulting sequence to a standard transformer encoder. A special [CLS] token (borrowed from BERT) is prepended for classification; its output representation is passed to a linear classifier head.
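The patch-and-embed pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized weights (the projection matrix, [CLS] token, and positional encodings would be learned in a real model; the shapes assume a 224×224 RGB image and ViT-Base's embedding width of 768):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    # Reorder so each row is one patch's pixels, flattened.
    return patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
d_model = 768                                   # ViT-Base embedding width

patches = patchify(image)                       # (196, 768): 14x14 patches of 16*16*3 values
W_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02
tokens = patches @ W_embed                      # linear patch embedding -> (196, 768)

cls = np.zeros((1, d_model))                    # [CLS] token (learned in practice)
seq = np.concatenate([cls, tokens], axis=0)     # prepend -> (197, 768)
pos = rng.standard_normal(seq.shape) * 0.02     # positional encodings (learned in practice)
seq = seq + pos                                 # sequence fed to the transformer encoder

print(seq.shape)                                # (197, 768)
```

After the encoder runs, the output vector at the [CLS] position (index 0) is what the linear classifier head consumes.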

ViT initially struggled to match CNNs on small datasets because it lacks the inductive biases of local connectivity and translation equivariance that convolutions encode. But when pretrained on very large datasets (e.g., ImageNet-21k or the proprietary JFT-300M), ViT matches or exceeds state-of-the-art CNNs on ImageNet classification, often at a lower pretraining compute budget. Contrastive pretraining on image-text pairs (as in CLIP, whose open reproductions use web-scale datasets such as LAION) yields ViTs that serve as general-purpose visual encoders for multimodal models.

ViT's success spawned a rich family of variants: Swin Transformer introduces a hierarchical structure with shifted-window attention; DeiT shows that ViTs can be trained on ImageNet-1k alone with better training recipes and distillation; MAE (Masked Autoencoder) pretrains ViTs by reconstructing masked image patches. ConvNeXt applies transformer-inspired design choices to a pure CNN, while hybrid architectures combine convolutional stems with transformer blocks. ViT has shown that self-attention, given sufficient data, can compensate for the inductive biases that convolution hard-codes, reshaping computer vision architecture design.

Related terms: Transformer, Convolutional Neural Network, Self-Attention, CLIP

Also defined in: Textbook of AI