Contrastive learning is a family of self-supervised methods that learn representations by pulling similar pairs together and pushing dissimilar pairs apart in embedding space. It is a foundation of modern self-supervised learning for vision, audio, and multimodal models.
Loss: typically InfoNCE (van den Oord et al., 2018):
$$\mathcal{L} = -\log \frac{\exp(s(z, z^+) / \tau)}{\exp(s(z, z^+) / \tau) + \sum_k \exp(s(z, z_k^-) / \tau)}$$
with similarity $s$ (typically cosine), temperature $\tau$, anchor $z$, positive $z^+$, and negatives $\{z_k^-\}$.
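A minimal PyTorch sketch of this loss using in-batch negatives; the function name, batch layout, and temperature value are illustrative assumptions, not from the source:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, tau=0.07):
    """InfoNCE with in-batch negatives.

    z, z_pos: (N, D) anchor and positive embeddings. For anchor i, z_pos[i]
    is the positive and the remaining z_pos[j], j != i, serve as negatives.
    """
    z = F.normalize(z, dim=-1)            # L2-normalize so dot product = cosine similarity
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.t() / tau          # (N, N) similarity matrix scaled by temperature
    targets = torch.arange(z.size(0), device=z.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)               # -log softmax at the positive entry
```

Cross-entropy against the diagonal targets reproduces the formula above: the positive similarity sits in the numerator, and every other entry in the row acts as a negative in the denominator.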
Positive pairs:
- SimCLR: two random augmentations of the same image (see the sketch after this list).
- MoCo: two augmentations of the same image, with the key view encoded by a momentum-updated encoder.
- CLIP: image and its caption.
- SigLIP: image-text pairs, scored with a pairwise sigmoid loss rather than the softmax normalization of InfoNCE.
- wav2vec 2.0: a masked time step's context representation and its quantized latent speech features.
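As a concrete illustration of the SimCLR-style construction, the sketch below builds a positive pair from two independent augmentations of one image; the particular torchvision transforms and their strengths are illustrative, not the paper's exact recipe:

```python
from torchvision import transforms

# Two independently sampled augmentations of the same image form a positive pair
# (SimCLR-style). Augmentation choices and strengths here are illustrative.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def make_positive_pair(image):
    # Each call to `augment` samples new random parameters, giving two distinct views.
    return augment(image), augment(image)
```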
Negatives:
- In-batch: all other examples in the mini-batch.
- Memory queue (MoCo): a FIFO queue of key encodings from previous mini-batches, decoupling the number of negatives from the batch size (see the sketch after this list).
- Hard negatives: examples close to the anchor in embedding space, which provide a stronger learning signal.
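A sketch of a MoCo-style negative queue, as referenced in the list above; the queue size, embedding dimension, and temperature are assumed values for illustration:

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of past key encodings reused as negatives (MoCo-style).
    Queue size and embedding dimension here are illustrative, not MoCo's settings."""

    def __init__(self, dim=128, size=4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        keys = F.normalize(keys, dim=-1)
        idx = (self.ptr + torch.arange(keys.size(0))) % self.queue.size(0)
        self.queue[idx] = keys                                # overwrite the oldest entries
        self.ptr = int((self.ptr + keys.size(0)) % self.queue.size(0))

def queue_contrastive_loss(q, k, queue, tau=0.2):
    """q: query embeddings, k: their positive keys, queue.queue: stored negatives."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    l_pos = (q * k).sum(dim=-1, keepdim=True)                 # (N, 1) positive logits
    l_neg = q @ queue.queue.t()                               # (N, K) negative logits from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, targets)
```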
Key insight: contrastive learning provides a classification-like training signal without labelled data. The model must learn invariances (different augmentations of the same image should produce similar embeddings) and discriminations (different images should produce different embeddings).
Modern uses: CLIP-style multimodal foundation models, sentence embeddings (Sentence-BERT, E5, BGE), and self-supervised visual representation learning (SimCLR, MoCo; related methods such as DINO and MAE replace the contrastive loss with self-distillation and masked reconstruction, respectively). InfoNCE-style contrastive losses appear throughout modern self-supervised and multimodal systems.
Related terms: InfoNCE, CLIP, Triplet Loss
Discussed in:
- Chapter 14: Generative Models