CLIP (Contrastive Language-Image Pre-training), introduced by Radford et al. at OpenAI in 2021, is a foundational model for aligning visual and textual representations. It consists of an image encoder (a Vision Transformer or ResNet) and a text encoder (a Transformer), trained jointly on 400 million image-caption pairs scraped from the internet. The training objective is contrastive: given a batch of N image-text pairs, maximise the cosine similarity between the embeddings of the N matching pairs while minimising it for the N² − N mismatched pairs.
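The batch-level objective can be sketched as a symmetric cross-entropy over a cosine-similarity matrix. A minimal NumPy illustration follows; the function name and the fixed temperature value are illustrative choices (CLIP learns its temperature during training), not the exact implementation:

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to come from
    the same image-caption pair; all other rows are negatives.
    """
    # L2-normalise so dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature
    logits = img @ txt.T / temperature

    def cross_entropy(l):
        # Cross-entropy where the i-th column is the target for row i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero, while shuffling the pairing drives it up, which is exactly the pressure that pulls matching image and caption embeddings together.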
The resulting shared embedding space enables several useful capabilities. Zero-shot image classification works by computing text embeddings for class descriptions like "a photo of a dog" and "a photo of a cat", comparing each to the image embedding, and picking the closest match, all without any task-specific training. Cross-modal retrieval finds images matching text queries, or vice versa. CLIP embeddings also serve as the conditioning signal for text-to-image systems like Stable Diffusion and DALL·E 2, where they provide the semantic bridge between user prompts and generated images.
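The zero-shot classification step reduces to a softmax over cosine similarities between one image embedding and the prompt embeddings for each class. A small NumPy sketch, assuming the embeddings have already been produced by the two encoders (the function name and temperature are hypothetical):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names,
                       temperature=0.01):
    """Return the class whose prompt embedding best matches the image.

    image_emb: 1-D vector from the image encoder.
    class_text_embs: one row per class prompt, e.g. "a photo of a dog".
    """
    # Normalise so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs,
                                           axis=1, keepdims=True)

    # Similarity of the image to each class prompt, then softmax
    sims = txt @ img / temperature
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return class_names[int(np.argmax(probs))], probs
```

Swapping in a different list of class names requires no retraining, only new text embeddings, which is what makes the approach open-vocabulary.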
CLIP's success was influential in several ways. It showed that multimodal pretraining on internet-scale data with natural-language supervision could produce highly transferable representations. It established the contrastive paradigm that now underlies many multimodal systems. And it introduced the idea of zero-shot classification via natural language descriptions, which has become standard for flexible, open-vocabulary vision tasks. CLIP's descendants—including SigLIP, EVA-CLIP, and various domain-specific variants—continue to power a huge range of modern multimodal applications.
Related terms: Multimodal Model, Embedding, Self-Supervised Learning, Vision Transformer
Discussed in:
- Chapter 15: Modern AI — Multimodal Models
Also defined in: Textbook of AI