Glossary

Contrastive Learning

Contrastive learning is a family of self-supervised methods that learn representations by pulling similar ("positive") pairs together and pushing dissimilar ("negative") pairs apart in an embedding space. It is a foundation of modern self-supervised learning for vision, audio, and multimodal models.

Loss: typically InfoNCE (van den Oord et al., 2018):

$$\mathcal{L} = -\log \frac{\exp(s(z, z^+) / \tau)}{\exp(s(z, z^+) / \tau) + \sum_k \exp(s(z, z_k^-) / \tau)}$$

with similarity $s$ (typically cosine), temperature $\tau$, anchor $z$, positive $z^+$, and negatives $\{z_k^-\}$.
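The loss above can be computed directly for a single anchor. A minimal NumPy sketch (the function name and toy vectors are illustrative, not from any particular library):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor.

    anchor: (d,) embedding z
    positive: (d,) embedding z+
    negatives: (k, d) embeddings {z_k^-}
    tau: temperature; cosine similarity is used for s.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    # -log( exp(s(z,z+)/tau) / (exp(s(z,z+)/tau) + sum_k exp(s(z,z_k^-)/tau)) )
    return -np.log(pos / (pos + neg))
```

The loss is near zero when the anchor aligns with its positive and is far from all negatives, and grows when a negative looks more similar than the positive.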

Positive pairs:

  • SimCLR: two augmentations of the same image.
  • MoCo: two augmentations of an image, one passed through a slowly updated momentum encoder.
  • CLIP: image and its caption.
  • SigLIP: image-text pairs with sigmoid loss.
  • wav2vec 2.0: a masked timestep's context representation and its quantized latent speech features.
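The SimCLR-style recipe in the first bullet can be illustrated with a toy example. This sketch uses random scaling and noise on feature vectors as a stand-in for the crops and color jitter applied to real images; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    """Toy 'augmentation' of a feature vector: random scaling plus noise.
    (Stands in for the crop/jitter augmentations used on real images.)"""
    return x * rng.uniform(0.8, 1.2) + rng.normal(0.0, 0.05, size=x.shape)

batch = rng.normal(size=(4, 8))  # 4 toy "images" as 8-dim vectors
view_a = np.stack([augment(x, rng) for x in batch])
view_b = np.stack([augment(x, rng) for x in batch])
# view_a[i] and view_b[i] form a positive pair;
# every other row in the two views serves as a negative.
```

Because both views of row `i` derive from the same underlying example, their embeddings start out highly similar, which is exactly the signal the loss exploits.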

Negatives:

  • In-batch: all other examples in the mini-batch.
  • Queue / memory bank (MoCo): a first-in-first-out dictionary of encodings from previous mini-batches.
  • Hard negatives: examples the model currently confuses with the positive, which provide a stronger learning signal.

Key insight: contrastive learning provides a classification-like training signal without labelled data. The model must learn invariances (different augmentations of the same image should produce similar embeddings) and discriminations (different images, different embeddings).

Modern uses: CLIP-style multimodal foundation models, sentence embeddings (Sentence-BERT, E5, BGE), and self-supervised visual representation learning (MoCo, DINO). InfoNCE-style contrastive losses appear in most modern self-supervised systems.

Related terms: InfoNCE, CLIP, Triplet Loss
