8.17 The shift to self-supervised learning

For most of this chapter we have done unsupervised learning the old way: cluster the data, reduce its dimension, fit a density to it, find the points that do not fit. These methods were the only game in town when labels were missing, and they still earn their keep on tabular data, single-cell biology, customer segmentation and quick-and-dirty exploration. But if you have looked at any modern AI system in the last few years (ChatGPT, Claude, BERT-powered search, CLIP-driven image retrieval, Whisper transcription, Stable Diffusion) and wondered how it learned so much without an army of human labellers, the answer is self-supervised learning (SSL).

The idea sounds almost like a trick when you first hear it. Self-supervised learning is unsupervised learning rebranded, with one crucial structural insight: instead of inventing a generic objective like "minimise within-cluster variance" or "maximise likelihood", you invent a task whose answer is hidden inside the input itself. You blank out part of a sentence and ask the model to fill it in. You take a picture, crop two patches from it, and ask the model to recognise that they came from the same image. You play the first half of an audio clip and ask for the next sample. The label is part of the input, so you have an infinite supply of labels for free. From that point on you train with ordinary supervised methods (cross-entropy, gradient descent, the standard machinery of Chapters 5 to 7), but at a scale that classical unsupervised methods could never dream of.

This bridge from §8.2--§8.16 to §8.17 is therefore not a technical break so much as a philosophical one. The classical methods optimise objectives that are mathematically clean but only loosely connected to anything you might actually want to do with the representation. SSL optimises objectives that are deliberately engineered to align with downstream usefulness. That alignment, plus the scale that web-sized data permits, is why the dominant pretraining paradigm of every foundation model from 2018 onwards is self-supervised. The remainder of this section gives you the principle, the major paradigms, a worked contrastive example, the CLIP case study, the economics, and a short list of takeaways.

The principle

Make labels from the input. That is the whole idea, but it deserves unpacking.

A pretext task is an artificial supervised problem whose target is computed deterministically from the raw input. Examples make this concrete. Take a sentence: "The cat sat on the mat." Hide the word cat with a special [MASK] token, hand the model "The [MASK] sat on the mat", and ask it to predict cat. The model never sees a human-supplied label; the label is just the word you removed. Take an image. Crop two random patches, apply different colour jitters and rotations, and ask the model to map both crops to similar internal vectors while mapping crops from a different image to dissimilar vectors. No human ever wrote down a label; the geometry of "same image vs different image" is the supervision. Take an audio waveform. Mask a 25 ms window and ask the model to predict it from context.
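To see just how cheap the labels are, here is a minimal sketch of generating a masked-language-modelling pretext pair in Python. The function name and the whitespace tokenisation are illustrative only; real systems use subword tokenisers, and BERT's actual recipe sometimes swaps a chosen token for a random one instead of [MASK], but the label-from-input mechanics are the same.

```python
import random

def make_masked_example(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Turn a token list into a (masked input, targets) pretext pair.

    The targets are simply the tokens we hid: no human annotation needed.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok           # remember what was hidden
            masked.append(mask_token)  # the model must reconstruct it
        else:
            masked.append(tok)
    return masked, targets

masked, targets = make_masked_example("the cat sat on the mat".split(),
                                      mask_rate=0.3)
print(masked)   # e.g. ['the', '[MASK]', 'sat', 'on', 'the', '[MASK]']
print(targets)  # e.g. {1: 'cat', 5: 'mat'}
```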

In each case the pretext loss is a perfectly ordinary supervised loss (cross-entropy, mean-squared error, a contrastive log-likelihood), but the labels were generated, not gathered. You can therefore train on the entire internet's worth of text, every photograph on Flickr, every public YouTube transcript, without paying anyone to annotate.

The pretext task is not the goal. Nobody actually cares whether the model can fill in masked words; people care whether it can answer questions, classify radiology images, or translate Welsh. The trick is that to solve the pretext task well, the model must construct representations (internal feature vectors) that capture the structure of language, vision or audio. Those representations transfer well to downstream tasks. A small labelled dataset for the real downstream task (a few thousand annotated chest X-rays, say) is then enough to fine-tune the pretrained model, or even to train a simple linear probe on top of frozen features, as sketched below. This two-stage recipe (pretrain on huge unlabelled data, fine-tune on a small labelled set) is the workflow behind nearly every modern AI product.
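As a concrete illustration of the linear-probe step, the sketch below trains one linear classifier on top of frozen embeddings. The embeddings here are random placeholders standing in for the output of a pretrained encoder, so the printed accuracy is meaningless; with real pretrained features, these few lines are often the whole of the downstream training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder for encoder(images): in practice these rows come from a
# frozen pretrained network, computed once and cached.
features = rng.normal(size=(2000, 512))   # 2000 examples, 512-dim embeddings
labels = rng.integers(0, 10, size=2000)   # the small labelled set, 10 classes

# The linear probe: a single linear classifier, encoder weights untouched.
probe = LogisticRegression(max_iter=1000)
probe.fit(features[:1500], labels[:1500])
print("probe accuracy:", probe.score(features[1500:], labels[1500:]))
```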

Major paradigms

Five families of pretext task account for most of what you will encounter. They are not mutually exclusive, many real systems combine two or more, but it helps to see them sorted.

Predictive (autoregressive). Predict the next token given everything before it. GPT models do this with text: $p(x_t \mid x_{<t})$, optimised by cross-entropy. PixelCNN and PixelRNN do it with images, scanning left to right, top to bottom. The pretext is left-to-right next-token prediction; the loss is the negative log-likelihood of the corpus. This single objective is enough, at scale, to produce a model that can write essays, draft code, explain proofs and roleplay characters. The reason it works is not magic: to predict the next word in "The mitochondrion is the ..." the model must encode chemistry, biology, syntax, world knowledge and discourse coherence, all from the gradient signal of a billion small prediction errors.
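The loss itself is one line of cross-entropy once logits and targets are aligned. The sketch below uses random logits in place of a real model's output; the only subtlety worth internalising is the shift: position $t$'s prediction is scored against token $t+1$.

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 50_000, 2, 16
logits = torch.randn(batch, seq, vocab)        # stand-in for model(tokens)
tokens = torch.randint(0, vocab, (batch, seq)) # the training sequence

# Shift by one: the logits at position t predict the token at position t+1,
# so drop the last logit and the first token before taking cross-entropy.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions for positions 0..seq-2
    tokens[:, 1:].reshape(-1),          # the tokens that actually came next
)
print(loss.item())  # average negative log-likelihood per predicted token
```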

Masked. Hide a part of the input and predict the hidden part from the visible part. BERT (2018) randomly masks 15% of tokens in a sentence and trains the model to reconstruct them. Masked autoencoders for vision (MAE, He et al. 2022) mask up to 75% of image patches and reconstruct pixels. Unlike autoregressive prediction, masked modelling is bidirectional: the model sees context on both sides of the gap, which is excellent for understanding tasks like classification, retrieval or named-entity recognition.
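The masking step itself fits in a few lines. This sketch shows MAE-style random patch masking only (no encoder or decoder); the shapes follow a standard ViT with a $14 \times 14$ patch grid, and the 75% ratio is the one the MAE paper reports.

```python
import torch

def random_patch_mask(patches, mask_ratio=0.75):
    """Split image patches into a visible set and a hidden set.

    patches: (num_patches, dim). The encoder sees only the visible 25%;
    the decoder is trained to reconstruct the hidden 75%.
    """
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)               # a fresh random mask each step
    visible_idx, hidden_idx = perm[:n_keep], perm[n_keep:]
    return patches[visible_idx], patches[hidden_idx]

patches = torch.randn(196, 768)            # a 14x14 grid of ViT patches
visible, hidden = random_patch_mask(patches)
print(visible.shape, hidden.shape)         # (49, 768) and (147, 768)
```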

Contrastive. Learn that two views of the same thing are close, and views of different things are far apart. SimCLR and MoCo (both 2020) apply this idea to images, generating pairs of views via random crops, colour jitter and Gaussian blur. SimCSE (2021) applies it to sentences, using dropout itself as a stochastic augmentation. The loss is InfoNCE, which we work through in detail below.

Distillation (non-contrastive). Train two networks, a "student" and a "teacher", to produce matching representations of two augmented views, without explicit negative pairs. BYOL (2020) and DINO (2021) showed, surprisingly, that you do not need negatives at all if you stop the gradient flowing through the teacher and update the teacher as an exponential moving average of the student. DINO trained on ImageNet without labels learns features so semantically clean that you can segment objects from attention maps alone. DINOv2 (2023) and DINOv3 (2025, ~7B-parameter ViT teacher trained on 1.7B images) extend the family and beat specialised state-of-the-art models on segmentation and detection without fine-tuning. Meta's JEPA family (I-JEPA 2023, V-JEPA 2024, LeJEPA 2025) is the leading non-generative SSL alternative.
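The exponential-moving-average update at the heart of these methods is only a few lines. This sketch shows the teacher update under the stop-gradient convention and nothing else; the momentum of 0.996 is in the range BYOL reports, and the two toy networks stand in for full student and teacher encoders.

```python
import torch

@torch.no_grad()  # stop-gradient: the teacher is never updated by backprop
def ema_update(teacher, student, momentum=0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

# Identical architectures; only the student receives gradients.
student = torch.nn.Linear(8, 4)
teacher = torch.nn.Linear(8, 4)
teacher.load_state_dict(student.state_dict())  # start from the same weights
ema_update(teacher, student)                   # called once per training step
```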

Cross-modal. Use one modality to supervise another. CLIP (Radford et al., OpenAI 2021) pairs images with the captions that happened to accompany them online and trains them to be close in a shared embedding space. ALIGN, Florence and Whisper (which pairs audio with subtitles) follow the same pattern. The pretext task here exploits the fact that the internet is already a giant, noisy, parallel corpus of (image, text), (audio, text) and (video, text) pairs. SigLIP and SigLIP 2 (Google, 2024--25) replace the softmax with a sigmoid loss and have largely displaced CLIP as the default vision-language encoder, reaching around 84--85% zero-shot top-1 accuracy on ImageNet. We come back to CLIP shortly.

Worked: contrastive InfoNCE

Let us walk slowly through what a contrastive trainer actually does, because the algorithm is short but the intuition repays attention.

Sample a mini-batch of $N$ images $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. For each image $\mathbf{x}_i$, generate two random augmentations: random crop, horizontal flip, colour jitter, Gaussian blur. Call them $\mathbf{x}_i$ and $\mathbf{x}_i^+$. Pass every augmented image through a shared encoder (a ResNet-50, say) to produce embeddings $\mathbf{z}_i$ and $\mathbf{z}_i^+$, each $\ell_2$-normalised so that they live on the unit sphere.

The two embeddings $(\mathbf{z}_i, \mathbf{z}_i^+)$ form a positive pair: they came from the same image. Every other pairing $(\mathbf{z}_i, \mathbf{z}_j)$ with $j \neq i$ is a negative pair. The model's job is to make positives close (high cosine similarity) and negatives far apart.

The InfoNCE loss for the $i$-th example is

$$ \mathcal{L}^{(i)}_{\text{InfoNCE}} = -\log\frac{\exp\bigl(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^+)/\tau\bigr)}{\exp\bigl(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_i^+)/\tau\bigr) + \sum_{j \neq i} \exp\bigl(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\bigr)}, $$

where $\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \mathbf{a}^\top \mathbf{b}$ is cosine similarity (after $\ell_2$ normalisation), and $\tau$ is a temperature hyperparameter (typically $0.07$ to $0.5$); the smaller $\tau$ is, the more sharply the softmax concentrates. The denominator runs over the positive plus all $N-1$ negatives in the batch.

Read the formula as an $N$-way classification problem. The numerator says: "you should match the positive". The denominator says: "and you should not match any of the negatives". As gradient descent reduces this loss, the encoder learns to map augmentations of the same image to nearby points and crops of unrelated images to distant points. After 200--800 epochs on ImageNet without using any class labels, the resulting ResNet, with a fresh linear classifier on top, reaches roughly 70--76% top-1 ImageNet accuracy depending on encoder width and training length, competitive with fully supervised training. Transferred to object detection and semantic segmentation, the same backbone often exceeds its supervised counterpart, because the contrastive features are less tuned to ImageNet's specific class boundaries.
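That reading translates directly into code: build an $N \times N$ similarity matrix and apply ordinary cross-entropy with the diagonal as the correct class. A minimal PyTorch sketch follows, with one simplification relative to SimCLR: SimCLR uses all $2N-2$ other augmented views as negatives, whereas this version uses only the $N-1$ other positives' views, matching the formula above.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, tau=0.1):
    """InfoNCE over L2-normalised embeddings.

    z, z_pos: (N, d); row i of z_pos is the positive view for row i of z.
    Row i of the logits is an N-way classification whose correct class is i.
    """
    logits = z @ z_pos.T / tau          # (N, N) matrix of scaled similarities
    labels = torch.arange(z.shape[0])   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

N, d = 256, 128
z = F.normalize(torch.randn(N, d), dim=1)
z_pos = F.normalize(z + 0.1 * torch.randn(N, d), dim=1)  # fake second views
print(info_nce(z, z_pos).item())
```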

Two practical notes that explain a lot of the literature. First, the loss craves negatives: bigger batch sizes (4096, 8192 or more) give more negatives per step and lift accuracy. MoCo sidesteps this by maintaining a queue of past negatives; SimCLR brute-forces it with TPU pods. Second, the augmentations matter enormously. Strip out colour jitter and accuracy collapses, because the model exploits the trivial shortcut "same colour histogram = same image".
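For reference, a SimCLR-style augmentation pipeline can be written with torchvision as below; the crop scale, jitter strengths and blur kernel here are illustrative rather than the exact published values. Applying the pipeline independently twice to the same image yields the two views of a positive pair.

```python
from torchvision import transforms

# Applied twice per image to create the two "views" of a positive pair.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),       # kills the colour-histogram shortcut
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
```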

CLIP

CLIP (Contrastive Language-Image Pretraining) is the cross-modal extension of InfoNCE, and it is worth a section of its own because of how much it changed downstream practice.

The training data is 400 million (image, caption) pairs scraped from the public web: product photos with alt text, Wikipedia infoboxes, news photos with captions, and so on. The architecture is two encoders: a vision transformer (or ResNet) for images, and a text transformer for captions. Both produce $d$-dimensional embeddings in a shared space.

Within a batch of, say, $32{,}768$ image-text pairs, every image's matching caption is its single positive, and every other caption in the batch is a negative. The same is true symmetrically: for each caption, the matching image is positive, all other images negative. The loss is the average of the image-to-text and text-to-image InfoNCE terms. Training takes thousands of GPU-days but happens once.
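Structurally, the loss is two InfoNCE terms sharing one similarity matrix. Here is a minimal sketch, assuming precomputed $\ell_2$-normalised embeddings; note that real CLIP learns the temperature as a parameter rather than fixing it.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE for a batch of matched (image, text) pairs.

    img_emb, txt_emb: (N, d), L2-normalised; row i is a matching pair.
    """
    logits = img_emb @ txt_emb.T / tau     # (N, N); pair (i, i) is positive
    labels = torch.arange(img_emb.shape[0])
    loss_i2t = F.cross_entropy(logits, labels)    # image -> caption
    loss_t2i = F.cross_entropy(logits.T, labels)  # caption -> image
    return (loss_i2t + loss_t2i) / 2
```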

What you get afterwards is the headline trick: zero-shot classification. Suppose you want to classify a photograph as cat, dog or pigeon, and you have no labelled training set for that task. You write the candidate class names as natural-language prompts ("a photo of a cat", "a photo of a dog", "a photo of a pigeon"), encode each prompt with the text encoder, encode the image, and pick the class whose text embedding is most similar to the image embedding. CLIP achieves around 76% top-1 accuracy on ImageNet this way, with no ImageNet training data, competitive with a ResNet-50 trained on the full labelled ImageNet. It transfers similarly well to satellite imagery, medical images, paintings and OCR.
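The mechanics of zero-shot classification fit in a dozen lines. In this sketch the two encoders are untrained placeholders and the inputs are random, so the printed prediction is meaningless; with real CLIP towers, the identical arithmetic is the entire classifier.

```python
import torch
import torch.nn.functional as F

# Placeholders for CLIP's two towers (untrained, for shape only).
image_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(300, 512)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a pigeon"]
txt_in = torch.randn(len(prompts), 300)   # stand-in for tokenised prompts
img_in = torch.randn(1, 2048)             # stand-in for one image

t = F.normalize(text_encoder(txt_in), dim=1)   # one embedding per class
v = F.normalize(image_encoder(img_in), dim=1)  # one embedding for the image
scores = (v @ t.T).squeeze(0)                  # cosine similarity per class
print(prompts[scores.argmax()])                # the closest prompt wins
```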

CLIP also became the conditioning signal for almost every text-to-image generator that followed, including Stable Diffusion and DALL-E 2 (Imagen instead conditions on a frozen T5 text encoder, but follows the same pattern of reusing a pretrained text representation). The shared text-image space, learned purely from web pairs, turned out to be the lingua franca of multimodal generation.

Why it matters

The economic argument is brutal and clarifying. A high-quality human label (a radiologist annotation, a translation pair, a hand-segmented road scene) costs on the order of one US dollar. Web-scraped unlabelled data costs effectively nothing per item. If a labelled dataset of one million examples costs roughly a million dollars to assemble, the same budget yields a billion or more unlabelled examples: three orders of magnitude more.

Self-supervised learning lets you spend that thousand-fold imbalance where it has the highest leverage: pretrain on the cheap unlabelled mountain, then fine-tune with the expensive labelled molehill. For most practical problems the labelled set you fine-tune on can be a thousand examples instead of a million, because the model already understands the input modality from pretraining. This shift is why a hospital with only a few thousand annotated MRIs can now build a working diagnostic model on a foundation pretrained on a billion natural images, and why a startup can ship a usable language product fine-tuned from an open-weights base for the cost of a laptop and an afternoon. Every modern foundation model (GPT, BERT, CLIP, Whisper, Stable Diffusion, DINO, MAE, LLaMA) begins life with a self-supervised pretraining run. Classical unsupervised methods still matter for exploration, visualisation and tabular data, but for representation learning at scale, self-supervised learning is the paradigm that won.

What you should take away

  1. Self-supervised learning is unsupervised learning with a structural twist. You invent a pretext task whose label is computed from the input itself, then train with ordinary supervised machinery.
  2. Five paradigms cover almost everything you will meet. Predictive (GPT), masked (BERT, MAE), contrastive (SimCLR, MoCo, CLIP), distillation (BYOL, DINO) and cross-modal (CLIP, Whisper).
  3. InfoNCE is the workhorse contrastive loss. Two augmentations of the same input are positives; all other batch elements are negatives; the softmax over similarities, scaled by a temperature, is the cross-entropy you minimise.
  4. CLIP showed that web-scale (image, caption) pairs are enough for zero-shot classification and for grounding generative models. A shared text-image embedding space, learned contrastively, transfers well across tasks.
  5. The economics are decisive. Labelled data is dollars per item; unlabelled data is essentially free. Pretrain large on the cheap and fine-tune small on the expensive: this is the workflow behind every foundation model from 2018 onwards.
