Glossary

InfoNCE

InfoNCE (van den Oord, Li & Vinyals 2018) is the contrastive loss that powers modern self-supervised learning. For an anchor $x$, a single positive $x^+$ (a semantically similar example), and a set of $K$ negatives $\{x_k^-\}$:

$$L_\mathrm{InfoNCE} = -\log \frac{\exp(s(f(x), f(x^+)) / \tau)}{\exp(s(f(x), f(x^+)) / \tau) + \sum_k \exp(s(f(x), f(x_k^-)) / \tau)}$$

where $s$ is a similarity function (typically cosine similarity $s(u, v) = u^\top v / (\|u\| \|v\|)$), $\tau > 0$ is the temperature, and $f$ is the embedding network.

The loss is the categorical cross-entropy of identifying the positive among $K + 1$ candidates. Minimising it is equivalent to maximising a lower bound on the mutual information $I(f(x), f(x^+))$, hence "Info"-NCE.
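As a concrete illustration, here is a minimal PyTorch sketch of the loss above for a batch of anchors, each with one positive and $K$ negatives. The function name `info_nce` and the random toy embeddings are illustrative, not from any particular library.

```python
# Minimal InfoNCE sketch (PyTorch). Assumes the embeddings were already produced
# by some encoder f; `anchor`, `positive`, `negatives` are illustrative names.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor: (B, D); positive: (B, D); negatives: (B, K, D)."""
    # Cosine similarity = dot product of L2-normalised vectors.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / tau        # (B, 1)
    neg_logits = torch.einsum('bd,bkd->bk', anchor, negatives) / tau   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)                 # (B, 1 + K)

    # The positive sits at index 0, so InfoNCE is exactly categorical
    # cross-entropy with target class 0 for every anchor.
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
B, K, D = 4, 16, 128
loss = info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
```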

The temperature $\tau$ controls how sharply differences in similarity are weighted:

  • Small $\tau$: hard negatives dominate; gradient mostly from confused negatives.
  • Large $\tau$: softer; all negatives contribute roughly equally.

Typical values fall in $\tau \in [0.05, 0.5]$; the sketch below illustrates the effect.
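A tiny numerical illustration (the similarity values are made up): the softmax weight each negative receives, which is also its share of the InfoNCE gradient, concentrates on the hardest negative as $\tau$ shrinks.

```python
# Illustrative only: how temperature reshapes the softmax weights over negatives.
import torch

sims = torch.tensor([0.9, 0.5, 0.1, -0.3])   # one hard negative (0.9), three easy ones
for tau in (0.05, 0.5):
    weights = torch.softmax(sims / tau, dim=0)
    print(tau, [round(w, 3) for w in weights.tolist()])
# tau = 0.05: the hard negative takes essentially all of the weight.
# tau = 0.5 : the distribution is much flatter; every negative contributes.
```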

The number of negatives matters enormously for InfoNCE's effectiveness:

  • SimCLR uses all other examples in the mini-batch as negatives (effectively $K = 2N - 2$ for batch size $N$); see the in-batch sketch after this list.
  • MoCo maintains a queue of past mini-batch encodings as negatives ($K$ up to 65k+).
  • CLIP uses a 32k batch size during training, giving each anchor 32k - 1 negatives.
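Here is a simplified sketch of the in-batch scheme from the first bullet, in the spirit of SimCLR's NT-Xent loss (not the exact reference implementation): each of the $2N$ view embeddings is contrasted against the other view of the same image, with the remaining $2N - 2$ embeddings as negatives.

```python
# Simplified in-batch-negatives loss (SimCLR-style sketch, PyTorch).
import torch
import torch.nn.functional as F

def in_batch_info_nce(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = (z @ z.t()) / tau                        # (2N, 2N) all pairwise similarities
    # Mask self-similarity so an example is never its own candidate.
    mask = torch.eye(z.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))
    n = z1.size(0)
    # Row i's positive is the other view of the same image; the remaining
    # 2N - 2 rows act as negatives.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = in_batch_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```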

InfoNCE applications:

SimCLR (Chen et al. 2020): contrastive learning of image representations from augmented views of the same image. Two augmentations $(x_i, x_j)$ of the same image form the positive pair; all other examples in the batch serve as negatives. It was state-of-the-art self-supervised vision pretraining at the time.

MoCo (He, Fan, Wu, Xie, Girshick 2019): introduces a momentum encoder and queue of negatives, decoupling negative count from batch size.
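A hedged sketch of the two ingredients just described: a momentum (EMA) copy of the encoder and a queue of past keys used as negatives. Function names and hyperparameters are placeholders, and the step that enqueues new keys and dequeues the oldest ones is not shown.

```python
# MoCo-style ingredients (sketch, PyTorch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Key encoder = exponential moving average of the query encoder;
    # it is never updated by gradients.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_loss(q, k, queue, tau=0.07):
    """q: (B, D) query embeddings; k: (B, D) keys from the momentum encoder
    (detached); queue: (K, D) keys stored from earlier mini-batches."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    queue = F.normalize(queue, dim=-1)
    pos = (q * k).sum(dim=-1, keepdim=True)   # (B, 1) similarity to own key
    neg = q @ queue.t()                       # (B, K) similarity to queued keys
    logits = torch.cat([pos, neg], dim=1) / tau
    targets = torch.zeros(q.size(0), dtype=torch.long)   # positive is column 0
    return F.cross_entropy(logits, targets)
```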

CLIP (Radford et al. 2021): contrastive vision-language pretraining. For a batch of (image, caption) pairs, the image and caption from the same example are positives; all cross-pairings are negatives. Trained on 400M (image, text) pairs scraped from the web, CLIP became a foundation of modern multimodal AI: DALL-E, Stable Diffusion, and GPT-4V-era systems all build on CLIP-style joint embeddings.
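A sketch of the symmetric, in-batch InfoNCE objective used in CLIP-style training over a batch of $N$ (image, text) pairs. `image_emb` and `text_emb` stand in for the outputs of the two encoders, and the fixed temperature here replaces CLIP's learned one.

```python
# CLIP-style symmetric InfoNCE (sketch, PyTorch).
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = (image_emb @ text_emb.t()) / tau     # (N, N): image i vs every caption
    targets = torch.arange(logits.size(0))
    # Matched pairs lie on the diagonal; everything off-diagonal is a negative.
    loss_i = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets) # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(16, 512), torch.randn(16, 512))
```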

Word2vec's negative sampling is a precursor of InfoNCE: instead of a full softmax over the vocabulary, the model learns to distinguish the true context word from $K$ randomly sampled words, framed as $K + 1$ independent binary classifications rather than a single categorical cross-entropy.
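For comparison, a sketch of the skip-gram negative-sampling objective, one binary decision per candidate rather than one softmax over all of them; vector shapes and names are illustrative.

```python
# Skip-gram negative sampling (word2vec) sketch, PyTorch.
import torch
import torch.nn.functional as F

def sgns_loss(center_vec, context_vec, negative_vecs):
    """center_vec: (D,); context_vec: (D,); negative_vecs: (K, D)."""
    pos_score = context_vec @ center_vec          # logit for the true pair
    neg_scores = negative_vecs @ center_vec       # (K,) logits for sampled words
    # One independent binary "is this the true context word?" decision per
    # candidate, rather than a single softmax over all K + 1 candidates.
    pos_loss = F.binary_cross_entropy_with_logits(pos_score.unsqueeze(0), torch.ones(1))
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction='sum')
    return pos_loss + neg_loss

loss = sgns_loss(torch.randn(100), torch.randn(100), torch.randn(5, 100))
```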

DPO (Direct Preference Optimization, used for RLHF-style alignment) can be written as a special case of InfoNCE applied to preference data, with the chosen response as the positive and the rejected response as the single negative ($K = 1$).
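A hedged sketch of that correspondence: the DPO loss is a two-candidate softmax whose "scores" are scaled log-probability ratios between the policy being trained and a frozen reference policy. Variable names are illustrative.

```python
# DPO loss written as a two-candidate (K = 1) softmax (sketch, PyTorch).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs: (B,) summed log-probs of the chosen / rejected responses."""
    score_chosen = beta * (logp_chosen - ref_logp_chosen)
    score_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Softmax over {chosen, rejected} with the chosen response as the "positive":
    # -log sigma(a - b) is exactly the two-class InfoNCE / cross-entropy form.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```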

The lower-bound interpretation: minimising InfoNCE maximises a lower bound on the mutual information between anchor and positive,

$$I(f(x); f(x^+)) \geq \log(K + 1) - L_\mathrm{InfoNCE}.$$

Because the bound can never exceed $\log(K + 1)$, it becomes more informative (and the estimator less biased) as $K$ grows, which explains why more negatives tend to improve representation quality.
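The same relationship can be read as a (biased, capped) estimator of mutual information from an empirical InfoNCE loss; this helper is purely illustrative.

```python
import math

def mi_lower_bound(info_nce_loss, num_negatives):
    # I(x; x+) >= log(K + 1) - L_InfoNCE; the bound cannot exceed log(K + 1).
    return math.log(num_negatives + 1) - info_nce_loss
```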


Related terms: Contrastive Learning, CLIP, Mutual Information, Cross-Entropy Loss

