The Vision Transformer (ViT) treats an image as a sequence of patches and processes it with a standard Transformer encoder. Given an image $X \in \mathbb{R}^{H \times W \times C}$, ViT:
(1) Patch embedding. Divide the image into a grid of non-overlapping $P \times P$ patches; flatten each patch and project it linearly to dimension $d$:
$$z_0 = [x_\mathrm{cls}; \, x_p^1 W_E; \, x_p^2 W_E; \, \ldots; \, x_p^N W_E] + E_\mathrm{pos}$$
where $N = HW/P^2$ is the number of patches, $W_E \in \mathbb{R}^{P^2 C \times d}$ is the patch projection, $E_\mathrm{pos} \in \mathbb{R}^{(N+1) \times d}$ are learned position embeddings, and $x_\mathrm{cls} \in \mathbb{R}^d$ is a learnable class token prepended to the sequence.
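A minimal sketch of this step in PyTorch (the class name `PatchEmbed` and the stride-$P$ convolution trick are illustrative choices, not prescribed by the paper; a stride-$P$ conv is equivalent to flattening each patch and applying $W_E$):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches, project each to dimension d,
    prepend a class token, and add learned position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, d, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, d) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend class token
        return x + self.pos_embed              # z_0
```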
(2) Transformer encoder. Apply $L$ standard Transformer encoder blocks (multi-head self-attention + FFN, with layer norm and residual connections):
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$ $$z_l = \mathrm{FFN}(\mathrm{LN}(z'_l)) + z'_l$$
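A pre-norm encoder block matching the two update equations above (a sketch with ViT-Base default sizes; dropout and stochastic depth omitted for brevity):

```python
class EncoderBlock(nn.Module):
    """Pre-norm Transformer block: z' = MSA(LN(z)) + z ; z = FFN(LN(z')) + z'."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MSA + residual
        z = z + self.ffn(self.ln2(z))                        # FFN + residual
        return z
```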
(3) Classification head. The class token's final-layer state $z_L^0$ serves as the image-level representation:
$$y = W_\mathrm{cls} \, \mathrm{LN}(z_L^0)$$
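Tying the pieces together, a sketch of the full model built from the two modules above (the head here is a plain linear layer on the class token, as in the fine-tuning setup):

```python
class ViTClassifier(nn.Module):
    """Patch embedding -> L encoder blocks -> linear head on the class token."""
    def __init__(self, num_classes=1000, depth=12, dim=768):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        self.blocks = nn.Sequential(*[EncoderBlock(dim=dim) for _ in range(depth)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)        # W_cls

    def forward(self, x):
        z = self.blocks(self.embed(x))                 # z_L
        return self.head(self.ln(z[:, 0]))             # logits from z_L^0
```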
Hyperparameters for ViT-Base/16 (the most common variant):
- Patch size $P = 16$, so a $224 \times 224$ image becomes 196 patch tokens (197 tokens including the class token).
- 12 layers, $d = 768$, 12 attention heads.
- 86M parameters.
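A quick sanity check of these numbers using the sketches above (the exact count depends on the head; ~86M corresponds to a 1000-class head):

```python
model = ViTClassifier(num_classes=1000, depth=12, dim=768)
print(model.embed.num_patches)                       # 196 patch tokens
print(sum(p.numel() for p in model.parameters()))    # roughly 86M parameters
```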
Pretraining matters: ViT needs substantially more pretraining data than CNNs to match their performance; the original paper used JFT-300M (300M images). With less data, the inductive biases of CNNs (locality, translation equivariance) give them an advantage.
Modern variants include DeiT (data-efficient ViT trained on ImageNet alone), Swin Transformer (hierarchical, windowed attention), MAE (masked autoencoder pretraining), DINO/DINOv2 (self-supervised pretraining producing strong general-purpose features), and the CLIP / SigLIP vision encoders that serve as the image backbone for most major vision-language models.
For most modern multimodal models (GPT-4V, Claude vision, Gemini, LLaVA), the vision component is a ViT with frozen or lightly fine-tuned weights, with its patch token embeddings projected into the language model's embedding space and prepended to the input sequence.
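A minimal sketch of this LLaVA-style wiring (the projector name and its two-layer MLP design are illustrative assumptions; real systems vary in projector architecture and in which ViT outputs they use):

```python
class VisionToLLMProjector(nn.Module):
    """Project frozen ViT patch-token embeddings into the LLM embedding space,
    then prepend them to the text token embeddings."""
    def __init__(self, vit_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens, text_embeds):
        # patch_tokens: (B, N, vit_dim) from a frozen ViT (class token usually dropped)
        # text_embeds:  (B, T, llm_dim) from the LLM's input embedding table
        image_embeds = self.proj(patch_tokens)                 # (B, N, llm_dim)
        return torch.cat([image_embeds, text_embeds], dim=1)   # image tokens prepended
```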
Related terms: Vision Transformer, Transformer, Convolutional Neural Network
Discussed in:
- Chapter 11: CNNs, CNNs in Vision