Contrastive learning is a family of self-supervised methods that learn representations by pulling similar pairs together and pushing dissimilar pairs apart in embedding space. It is a foundation of modern self-supervised learning for vision, audio, and multimodal models.
Loss: typically InfoNCE (van den Oord et al., 2018):
$$\mathcal{L} = -\log \frac{\exp(s(z, z^+) / \tau)}{\exp(s(z, z^+) / \tau) + \sum_k \exp(s(z, z_k^-) / \tau)}$$
with similarity $s$ (typically cosine), temperature $\tau$, anchor $z$, positive $z^+$, and negatives $\{z_k^-\}$.
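A minimal PyTorch sketch of this loss using in-batch negatives; the function name, batch layout, and temperature value are illustrative assumptions, not from the source:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, tau=0.07):
    """InfoNCE with in-batch negatives.

    z, z_pos: (N, D) anchor and positive embeddings. For anchor i, z_pos[i]
    is the positive and the remaining z_pos[j], j != i, serve as negatives.
    """
    z = F.normalize(z, dim=-1)            # L2-normalize so dot product = cosine similarity
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.t() / tau          # (N, N) similarity matrix scaled by temperature
    targets = torch.arange(z.size(0), device=z.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)               # -log softmax at the positive entry
```

Cross-entropy against the diagonal targets reproduces the formula above: the positive similarity sits in the numerator, and every other entry in the row acts as a negative in the denominator.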
Positive pairs:
- SimCLR: two random augmentations of the same image (see the sketch after this list).
- MoCo: two augmentations of the same image, with the key view encoded by a momentum-updated encoder.
- CLIP: image and its caption.
- SigLIP: image-text pairs, scored with a pairwise sigmoid loss rather than the softmax normalization of InfoNCE.
- wav2vec 2.0: a masked time step's context representation and its quantized latent speech features.
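As a concrete illustration of the SimCLR-style construction, the sketch below builds a positive pair from two independent augmentations of one image; the particular torchvision transforms and their strengths are illustrative, not the paper's exact recipe:

```python
from torchvision import transforms

# Two independently sampled augmentations of the same image form a positive pair
# (SimCLR-style). Augmentation choices and strengths here are illustrative.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def make_positive_pair(image):
    # Each call to `augment` samples new random parameters, giving two distinct views.
    return augment(image), augment(image)
```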
Negatives:
- In-batch: all other examples in the mini-batch.
- Memory queue (MoCo): a FIFO queue of key encodings from previous mini-batches, decoupling the number of negatives from the batch size (see the sketch after this list).
- Hard negatives: examples close to the anchor in embedding space, which provide a stronger learning signal.
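A sketch of a MoCo-style negative queue, as referenced in the list above; the queue size, embedding dimension, and temperature are assumed values for illustration:

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of past key encodings reused as negatives (MoCo-style).
    Queue size and embedding dimension here are illustrative, not MoCo's settings."""

    def __init__(self, dim=128, size=4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        keys = F.normalize(keys, dim=-1)
        idx = (self.ptr + torch.arange(keys.size(0))) % self.queue.size(0)
        self.queue[idx] = keys                                # overwrite the oldest entries
        self.ptr = int((self.ptr + keys.size(0)) % self.queue.size(0))

def queue_contrastive_loss(q, k, queue, tau=0.2):
    """q: query embeddings, k: their positive keys, queue.queue: stored negatives."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    l_pos = (q * k).sum(dim=-1, keepdim=True)                 # (N, 1) positive logits
    l_neg = q @ queue.queue.t()                               # (N, K) negative logits from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, targets)
```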
Key insight: contrastive learning provides a classification-like training signal without labelled data. The model must learn invariances (different augmentations of the same image should produce similar embeddings) and discriminations (different images should produce different embeddings).
Modern uses: CLIP-style multimodal foundation models, sentence embeddings (Sentence-BERT, E5, BGE), and self-supervised visual representation learning (SimCLR, MoCo; related methods such as DINO and MAE replace the contrastive loss with self-distillation and masked reconstruction, respectively). InfoNCE-style contrastive losses appear throughout modern self-supervised and multimodal systems.
Related terms: InfoNCE, CLIP, Triplet Loss
Discussed in:
- Chapter 14: Generative Models