Glossary

Flamingo

Flamingo is a family of vision-language models introduced by Jean-Baptiste Alayrac and colleagues at DeepMind in 2022. It established the cross-attention bridge architecture for VLMs and demonstrated that strong few-shot multimodal capability can be achieved by attaching a small set of trained adapter layers to frozen pretrained components.

Architecture. Flamingo combines:

  1. Frozen vision encoder. An NFNet-F6 image encoder pretrained with a contrastive objective (similar to CLIP) produces dense patch features for each image or video frame.

  2. Perceiver Resampler. A learned module that compresses the variable-length sequence of patch features into a fixed number (typically $64$) of visual tokens per image. It cross-attends from a bank of learned queries to the image features: $$V_{\text{img}} = \text{Attention}(Q_{\text{learned}}, K_{\text{patches}}, V_{\text{patches}}),$$ which decouples downstream cost from image resolution (see the first sketch after this list).

  3. Frozen Chinchilla language model. A 70B-parameter pretrained LM whose weights are never updated.

  4. Gated cross-attention layers. Inserted between every block (or every $n$-th block) of the LM. Each layer applies cross-attention from the LM hidden states (queries) to the visual tokens (keys/values), gated by a learned scalar: $$h' = h + \tanh(\alpha) \cdot \text{CrossAttn}(h, V_{\text{img}}).$$ The gate $\alpha$ is initialised to $0$, so at the start of training Flamingo behaves exactly like the original LM. Only the cross-attention layers, the resampler, and the gate parameters are trained (see the second sketch after this list).
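
Below is a minimal single-layer sketch of a Perceiver-Resampler-style module in PyTorch. The hyperparameters (`dim`, `num_latents`, `num_heads`) are illustrative rather than Flamingo's published values, and the real module stacks several such layers and concatenates the latents to the keys/values; this sketch only shows the core idea of a fixed query bank attending to a variable-length patch sequence.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Sketch: compress variable-length patch features into a fixed set of tokens."""

    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned query bank: num_latents vectors, independent of input length.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, dim); n_patches varies with resolution.
        q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(query=q, key=patch_feats, value=patch_feats)
        out = q + out                 # residual on the latent queries
        return out + self.ff(out)     # (batch, num_latents, dim): fixed-length visual tokens
```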
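
And a sketch of the tanh-gated cross-attention insert. Flamingo's actual layer also contains a gated feed-forward sublayer with its own gate; this shows only the cross-attention path, with illustrative dimensions.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch: tanh-gated cross-attention from LM states to visual tokens."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate initialised to 0, so tanh(alpha) = 0 and the layer is an
        # identity at the start of training: the frozen LM is unchanged.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, h: torch.Tensor, v_img: torch.Tensor) -> torch.Tensor:
        # h:     (batch, seq, dim)            LM hidden states (queries)
        # v_img: (batch, n_visual_tokens, dim) resampled visual tokens (keys/values)
        attn_out, _ = self.attn(query=h, key=v_img, value=v_img)
        return h + torch.tanh(self.alpha) * attn_out
```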

Interleaved training. Flamingo is trained on text–image–video sequences scraped from the web in their natural interleaved order, plus image–caption pairs and video–caption pairs. This interleaved format is what gives Flamingo its few-shot capability: at inference time, prompt the model with examples in context (e.g. several "image: caption" pairs followed by a query image), and it produces a caption in the same style.
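
For illustration, a few-shot captioning prompt might be laid out as follows, assuming `<image>` marks where an image's visual tokens are injected; the exact special tokens and separators are implementation-specific.

```python
# Hypothetical few-shot prompt layout: two in-context examples, then a query.
prompt = (
    "<image> Output: A flamingo standing in shallow water.\n"
    "<image> Output: A red panda resting on a branch.\n"
    "<image> Output:"  # query image: the model completes in the same style
)
```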

Few-shot results. With as few as 4–32 in-context examples, the largest (80B-parameter) Flamingo model set a new few-shot state of the art on all 16 of the vision-language benchmarks it was evaluated on, spanning visual question answering, captioning, and video understanding, and on 6 of the 16 it surpassed models fine-tuned on far more task-specific data.

Legacy. Flamingo's gated cross-attention pattern was reused in IDEFICS (Hugging Face's open-source reproduction), OpenFlamingo, and Otter. Although the patch-as-token pattern popularised by LLaVA has since become dominant for compute reasons, cross-attention bridges remain attractive when the language model must stay frozen (e.g. when adapting a very large proprietary LM).

Related terms: CLIP, Vision-Language Model, Transformer, LLaVA
