References

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, & Karen Simonyan (2022)

Advances in Neural Information Processing Systems 35.

URL: https://arxiv.org/abs/2204.14198

Abstract. DeepMind's Flamingo bridges a frozen vision encoder and a frozen Chinchilla language model with newly inserted gated cross-attention layers, trained on interleaved image-text web data. The architecture handles arbitrary sequences of images and text and demonstrates strong few-shot performance across captioning, visual question answering, and visual dialogue. Flamingo established the template (frozen unimodal backbones plus learnable cross-modal adapters) that subsequent open multimodal models such as LLaVA, IDEFICS, and KOSMOS followed.
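The key idea, a tanh-gated cross-attention block whose gate is initialised to zero so the frozen language model's behaviour is preserved at the start of training, can be sketched minimally as follows. This is an illustrative single-head NumPy sketch, not the paper's implementation; the weight names `Wq`, `Wk`, `Wv` and the scalar gate `alpha` are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, visual_h, Wq, Wk, Wv, alpha):
    """Single-head sketch of a Flamingo-style gated cross-attention layer.

    text_h:   (T, d) hidden states from the frozen language model
    visual_h: (V, d) features from the frozen vision encoder
    alpha:    learnable scalar gate; tanh(0) = 0 means the block is an
              identity at initialisation, leaving the frozen LM intact.
    """
    q = text_h @ Wq                # text tokens form the queries
    k = visual_h @ Wk              # visual features form keys...
    v = visual_h @ Wv              # ...and values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v                 # visual information routed to text
    return text_h + np.tanh(alpha) * out  # gated residual connection

# With alpha = 0 the layer passes text states through unchanged.
```

In the actual model these blocks are interleaved between the frozen transformer layers and are the only newly trained parameters besides the Perceiver Resampler.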

Tags: multimodal vision-language few-shot
