The triplet loss (Schroff, Kalenichenko & Philbin 2015, FaceNet) trains an embedding function $f$ such that an anchor $a$ is closer to a positive example $p$ (same class) than to a negative example $n$ (different class) by at least a margin $m$:
$$L(a, p, n) = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m)$$
The loss is zero when $\|f(a) - f(n)\|^2 \geq \|f(a) - f(p)\|^2 + m$, i.e. when the negative is farther from the anchor than the positive by at least the margin; otherwise it grows linearly with the size of the margin violation.
Trained by stochastic gradient descent on triplets $(a, p, n)$ sampled from the data.
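A minimal sketch of the loss in PyTorch (the function name and the margin value of 0.2 are illustrative; the squared Euclidean distance matches the formula above):

```python
import torch

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Squared-distance triplet loss over a batch.

    f_a, f_p, f_n: (batch, dim) embeddings of anchors, positives and
    negatives; margin=0.2 is an illustrative default.
    """
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # ||f(a) - f(p)||^2
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # ||f(a) - f(n)||^2
    return torch.relu(d_ap - d_an + margin).mean()
```

PyTorch also ships `nn.TripletMarginLoss`, which implements the same idea but with non-squared distances by default.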
Triplet mining: choosing which triplets to train on is critical.
- All triplets: the number of possible triplets grows as $O(N^3)$ with dataset size, so using all of them is usually infeasible.
- Random triplets: many are easy (already well separated, so zero loss and no gradient).
- Hard negatives: choose negatives that violate the margin most (closest non-class examples). Provides strong gradient signal but can lead to training collapse.
- Semi-hard negatives (FaceNet): negatives that are further than the positive but still within the margin. Stable training; the standard heuristic.
- Online triplet mining: select triplets dynamically within each mini-batch (a sketch follows this list).
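A sketch of online semi-hard mining within a mini-batch (the function, the explicit loops, and the fall-back to the farthest negative when no semi-hard negative exists are illustrative choices, not the FaceNet implementation):

```python
import torch

def semi_hard_triplet_loss(emb, labels, margin=0.2):
    """Online semi-hard negative mining within a batch (illustrative sketch).

    emb: (batch, dim) embeddings; labels: (batch,) integer class ids.
    For each anchor-positive pair, pick the hardest negative that is still
    farther away than the positive (semi-hard); if none exists, fall back
    to the farthest (easiest) negative.
    """
    n = emb.size(0)
    dist = torch.cdist(emb, emb).pow(2)                 # pairwise squared distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class mask
    not_self = ~torch.eye(n, dtype=torch.bool, device=emb.device)
    losses = []
    for a in range(n):
        pos_idx = torch.where(same[a] & not_self[a])[0]
        neg_dists = dist[a][~same[a]]
        if len(pos_idx) == 0 or len(neg_dists) == 0:
            continue
        for p in pos_idx:
            d_ap = dist[a, p]
            # semi-hard: farther than the positive but still within the margin
            semi = neg_dists[(neg_dists > d_ap) & (neg_dists < d_ap + margin)]
            d_an = semi.min() if len(semi) > 0 else neg_dists.max()
            losses.append(torch.relu(d_ap - d_an + margin))
    return torch.stack(losses).mean() if losses else emb.new_zeros(())
```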
Variants:
N-pair loss (Sohn 2016): generalises the triplet loss to one positive vs $N - 1$ negatives:
$$L = \log\!\left(1 + \sum_{j \neq i} \exp(s_{ij} - s_{ii})\right)$$
where $s_{ij}$ is the similarity between anchor $i$ and example $j$. Equivalent to multi-class cross-entropy on the similarity scores.
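Because of that equivalence, the N-pair loss can be computed as a cross-entropy over the similarity matrix; a sketch assuming one anchor and one positive per class in the batch, with dot-product similarities:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """N-pair loss in its cross-entropy form (illustrative sketch).

    anchors, positives: (N, dim) embeddings, where positives[i] is the
    positive for anchors[i] and every positives[j], j != i, acts as a
    negative. s[i, j] is the similarity between anchor i and example j.
    """
    s = anchors @ positives.T                           # (N, N) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)  # diagonal is the "correct" class
    # cross-entropy on row i equals log(1 + sum_{j != i} exp(s_ij - s_ii))
    return F.cross_entropy(s, targets)
```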
Contrastive loss (Hadsell, Chopra, LeCun 2006), pairs rather than triplets:
$$L(x_1, x_2, y) = y \|f(x_1) - f(x_2)\|^2 + (1 - y) \max(0, m - \|f(x_1) - f(x_2)\|)^2$$
where $y = 1$ if the pair are similar, $y = 0$ otherwise.
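A sketch of the pairwise contrastive loss under the same conventions (the function name and the margin value are illustrative):

```python
import torch

def contrastive_loss(f_x1, f_x2, y, margin=1.0):
    """Pairwise contrastive loss of Hadsell, Chopra & LeCun (2006), as a sketch.

    f_x1, f_x2: (batch, dim) embeddings; y: (batch,) with 1 for similar
    pairs and 0 for dissimilar ones; margin=1.0 is an illustrative default.
    """
    d = (f_x1 - f_x2).pow(2).sum(dim=1).sqrt()  # Euclidean distance
    loss = y * d.pow(2) + (1 - y) * torch.relu(margin - d).pow(2)
    return loss.mean()
```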
InfoNCE / contrastive cross-entropy: the dominant modern formulation, used in SimCLR, MoCo, and CLIP; see the InfoNCE entry.
Applications:
- Face recognition (FaceNet, ArcFace).
- Person re-identification.
- Image retrieval.
- Speaker verification.
- Metric learning more broadly.
- Contrastive learning for self-supervised pre-training (one of modern deep learning's most successful self-supervised paradigms).
Modern self-supervised methods (SimCLR, MoCo, DINO, CLIP) typically use InfoNCE rather than triplet loss because InfoNCE incorporates many negatives simultaneously and is more sample-efficient.
Related terms: Hinge Loss, InfoNCE, Contrastive Learning
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions