The triplet loss (Schroff, Kalenichenko & Philbin 2015, FaceNet) trains an embedding function $f$ such that an anchor $a$ is closer to a positive example $p$ (same class) than to a negative example $n$ (different class) by at least a margin $m$:
$$L(a, p, n) = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m)$$
The loss is zero when $\|f(a) - f(n)\|^2 \geq \|f(a) - f(p)\|^2 + m$, i.e. when the negative is farther from the anchor than the positive by at least the margin; otherwise it grows linearly with the size of the margin violation.
Trained by stochastic gradient descent on triplets $(a, p, n)$ sampled from the data.
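A minimal sketch of the loss in PyTorch (the function name and the margin value of 0.2 are illustrative; the squared Euclidean distance matches the formula above):

```python
import torch

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Squared-distance triplet loss over a batch.

    f_a, f_p, f_n: (batch, dim) embeddings of anchors, positives and
    negatives; margin=0.2 is an illustrative default.
    """
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # ||f(a) - f(p)||^2
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # ||f(a) - f(n)||^2
    return torch.relu(d_ap - d_an + margin).mean()
```

PyTorch also ships `nn.TripletMarginLoss`, which implements the same idea but with non-squared distances by default.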
Triplet mining: choosing which triplets to train on is critical.
- All triplets: the number of possible triplets grows as $O(N^3)$ with dataset size, so using all of them is usually infeasible.
- Random triplets: many are easy (already well separated, so zero loss and no gradient).
- Hard negatives: choose negatives that violate the margin most (closest non-class examples). Provides strong gradient signal but can lead to training collapse.
- Semi-hard negatives (FaceNet): negatives that are further than the positive but still within the margin. Stable training; the standard heuristic.
- Online triplet mining: select triplets dynamically within each mini-batch (a sketch follows this list).
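A sketch of online semi-hard mining within a mini-batch (the function, the explicit loops, and the fall-back to the farthest negative when no semi-hard negative exists are illustrative choices, not the FaceNet implementation):

```python
import torch

def semi_hard_triplet_loss(emb, labels, margin=0.2):
    """Online semi-hard negative mining within a batch (illustrative sketch).

    emb: (batch, dim) embeddings; labels: (batch,) integer class ids.
    For each anchor-positive pair, pick the hardest negative that is still
    farther away than the positive (semi-hard); if none exists, fall back
    to the farthest (easiest) negative.
    """
    n = emb.size(0)
    dist = torch.cdist(emb, emb).pow(2)                 # pairwise squared distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class mask
    not_self = ~torch.eye(n, dtype=torch.bool, device=emb.device)
    losses = []
    for a in range(n):
        pos_idx = torch.where(same[a] & not_self[a])[0]
        neg_dists = dist[a][~same[a]]
        if len(pos_idx) == 0 or len(neg_dists) == 0:
            continue
        for p in pos_idx:
            d_ap = dist[a, p]
            # semi-hard: farther than the positive but still within the margin
            semi = neg_dists[(neg_dists > d_ap) & (neg_dists < d_ap + margin)]
            d_an = semi.min() if len(semi) > 0 else neg_dists.max()
            losses.append(torch.relu(d_ap - d_an + margin))
    return torch.stack(losses).mean() if losses else emb.new_zeros(())
```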
Variants:
N-pair loss (Sohn 2016): generalises the triplet loss to one positive vs $N - 1$ negatives:
$$L = \log\!\left(1 + \sum_{j \neq i} \exp(s_{ij} - s_{ii})\right)$$
where $s_{ij}$ is the similarity between anchor $i$ and example $j$. Equivalent to multi-class cross-entropy on the similarity scores.
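Because of that equivalence, the N-pair loss can be computed as a cross-entropy over the similarity matrix; a sketch assuming one anchor and one positive per class in the batch, with dot-product similarities:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """N-pair loss in its cross-entropy form (illustrative sketch).

    anchors, positives: (N, dim) embeddings, where positives[i] is the
    positive for anchors[i] and every positives[j], j != i, acts as a
    negative. s[i, j] is the similarity between anchor i and example j.
    """
    s = anchors @ positives.T                           # (N, N) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)  # diagonal is the "correct" class
    # cross-entropy on row i equals log(1 + sum_{j != i} exp(s_ij - s_ii))
    return F.cross_entropy(s, targets)
```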
Contrastive loss (Hadsell, Chopra, LeCun 2006), pairs rather than triplets:
$$L(x_1, x_2, y) = y \|f(x_1) - f(x_2)\|^2 + (1 - y) \max(0, m - \|f(x_1) - f(x_2)\|)^2$$
where $y = 1$ if the pair are similar, $y = 0$ otherwise.
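A sketch of the pairwise contrastive loss under the same conventions (the function name and the margin value are illustrative):

```python
import torch

def contrastive_loss(f_x1, f_x2, y, margin=1.0):
    """Pairwise contrastive loss of Hadsell, Chopra & LeCun (2006), as a sketch.

    f_x1, f_x2: (batch, dim) embeddings; y: (batch,) with 1 for similar
    pairs and 0 for dissimilar ones; margin=1.0 is an illustrative default.
    """
    d = (f_x1 - f_x2).pow(2).sum(dim=1).sqrt()  # Euclidean distance
    loss = y * d.pow(2) + (1 - y) * torch.relu(margin - d).pow(2)
    return loss.mean()
```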
InfoNCE / contrastive cross-entropy: the dominant modern formulation, used in SimCLR, MoCo, and CLIP; see the InfoNCE entry.
Applications:
- Face recognition (FaceNet, ArcFace).
- Person re-identification.
- Image retrieval.
- Speaker verification.
- Metric learning more broadly.
- Contrastive learning for self-supervised pre-training (one of modern deep learning's most successful self-supervised paradigms).
Modern self-supervised methods (SimCLR, MoCo, DINO, CLIP) typically use InfoNCE rather than triplet loss because InfoNCE incorporates many negatives simultaneously and is more sample-efficient.
Related terms: Hinge Loss, InfoNCE, Contrastive Learning
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions