13.17 Multimodal Transformers: CLIP, Flamingo, GPT-4V, Gemini
The Transformer's ability to consume any sequence of vectors makes it a natural substrate for multimodal models. The recipe: encode each modality into a sequence of tokens, then either align those tokens in a shared embedding space (CLIP) or concatenate them and feed the combined sequence to a single Transformer (Gemini).
CLIP
CLIP (Radford et al., 2021) trains a text encoder and an image encoder jointly so that their embeddings align in a shared space. The training objective is contrastive: given a batch of $N$ image-text pairs, compute the $N \times N$ matrix of cosine similarities; the diagonal entries (correct pairs) should be high and the off-diagonal entries (mismatched pairs) low. The loss is a symmetric InfoNCE: cross-entropy over the rows of the similarity matrix (match each image to its text) averaged with cross-entropy over the columns (match each text to its image).
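A minimal sketch of this symmetric loss in PyTorch; the tensor names and the temperature value are illustrative rather than CLIP's exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: [N, d] batches of encoder outputs for N paired examples
    # L2-normalise so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) compares image i with text j
    logits = image_emb @ text_emb.T / temperature

    # The correct pairing is the diagonal: image i goes with text i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the right text for each image (rows)
    # and the right image for each text (columns), then average
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```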
The result is a model that, with no further training, can do zero-shot image classification: given a list of class names, embed each as text, embed the image, and pick the class with the highest similarity. CLIP also became the primary text-conditioning module for image generation (DALL-E 2, Stable Diffusion) and the visual encoder for many vision-language models.
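The zero-shot recipe is short enough to show directly. In this sketch, `encode_image` and `encode_text` stand in for the two trained encoders, and the prompt template is a common convention rather than a fixed requirement:

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    # Embed each class name as a short text prompt
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)   # [C, d]
    img_emb = F.normalize(encode_image(image), dim=-1)     # [1, d]

    # Cosine similarity between the image and every class prompt;
    # the predicted class is the most similar prompt
    sims = img_emb @ text_emb.T                             # [1, C]
    return class_names[sims.argmax(dim=-1).item()]
```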
Flamingo
Flamingo (Alayrac et al., 2022) interleaves a frozen language model (Chinchilla-70B) with a frozen vision encoder, connected by gated cross-attention layers inserted into the language model. Crucially, these cross-attention layers are additive and initialised with zero gating, so the language model behaves identically to its frozen self at the start of training. This trick makes it possible to bolt vision onto a strong frozen LM without disturbing its language abilities.
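A sketch of such a gated cross-attention block in PyTorch, assuming the tanh-gating scheme described above; the layer sizes and module layout are illustrative, not Flamingo's exact architecture:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Learnable gates initialised to zero: tanh(0) = 0, so at the start of
        # training the block adds nothing and the frozen LM is unchanged
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # Text hidden states (queries) attend to visual tokens (keys/values)
        attn_out, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x
```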
GPT-4V and Gemini
GPT-4V (2023) and Gemini (2023, with successors 1.5, 2, and 2.5) take a more integrated approach: train a single Transformer end-to-end on a mix of text and image tokens, where the image tokens come from a ViT-style encoder. The model consumes freely interleaved text and images in its input, and Gemini 1.5 extended this to native multimodality across text, image, audio, and video, with context windows of up to several million tokens.
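Conceptually, the integration step is just concatenation of token sequences. The sketch below assumes hypothetical `text_embed`, `vit_encoder`, and `backbone` components, standing in for whatever the model actually uses:

```python
import torch

def forward_multimodal(text_ids, image, text_embed, vit_encoder, backbone):
    text_tokens = text_embed(text_ids)     # [B, T, d] text token embeddings
    image_tokens = vit_encoder(image)      # [B, P, d] ViT patch tokens
    # One shared sequence; real models add position/modality markers and may
    # splice the image tokens in at the point where the image appears in the text
    tokens = torch.cat([image_tokens, text_tokens], dim=1)
    return backbone(tokens)                # same Transformer over both modalities
```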
The architectural lesson is consistent: the Transformer is modality-agnostic. Anything you can tokenise (text, image patches, audio frames, video frames, even protein sequences and chemical structures) can be fed into the same architecture. The differences between modalities live in the tokeniser and the loss function, not in the backbone.
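As a concrete example of such a tokeniser, here is a minimal ViT-style patch embedder: it splits an image into non-overlapping patches and projects each one into the model dimension. The patch size and dimensions are illustrative defaults:

```python
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        # A strided convolution patchifies and projects in one step
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):              # images: [B, 3, H, W]
        x = self.proj(images)                # [B, d_model, H/p, W/p]
        return x.flatten(2).transpose(1, 2)  # [B, num_patches, d_model]
```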