Self-Supervised Learning, Glossary, Textbook of AI

Self-supervised learning (SSL) is a paradigm in which a model derives its own supervisory signal from the structure of unlabelled data, removing the need for human-annotated labels. The canonical example is next-token prediction in language modelling: given the start of a sentence, the "label" is simply the token that comes next, freely available in any text corpus. The model is trained exactly as in supervised learning, minimising a cross-entropy loss against ground-truth targets, but those targets come from the data itself rather than from annotators. SSL sits between classical unsupervised learning (no labels, no supervised loss) and supervised learning (human-curated labels), and in modern practice it has blurred the boundary between the two.
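
As a minimal sketch of how the labels "come for free", the snippet below turns a raw token sequence into (context, target) pairs and scores a model with cross-entropy. The whitespace tokeniser and the uniform "model" are assumptions made purely for illustration; a real language model would supply learned probabilities.

```python
import math

def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true next token."""
    return -math.log(probs[target])

# Toy corpus, split on whitespace (illustrative tokenisation only).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

# Hypothetical stand-in for a model: a uniform distribution over the vocabulary.
vocab = sorted(set(tokens))
uniform = {word: 1.0 / len(vocab) for word in vocab}

# Average loss over all positions; no human labelling was needed to compute it.
loss = sum(cross_entropy(uniform, target) for _, target in pairs) / len(pairs)
```

With a uniform prediction the average loss is just the log of the vocabulary size; training would aim to push it below that baseline.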

Pretext tasks

A self-supervised method is defined by its pretext task: an auxiliary objective whose solution requires learning useful representations.

Language:
- Causal (autoregressive) language modelling: model $p(x_t \mid x_{<t})$, the distribution of each token given the tokens before it. Used by GPT, LLaMA, PaLM.
- Masked language modelling (MLM): hide a random subset of tokens (about 15% in BERT's original recipe) and predict the originals from the surrounding context on both sides. Used by BERT, RoBERTa, DeBERTa.
- Span-corruption / denoising: as in T5, BART.
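
The masking step of MLM can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: every selected token is replaced by a `[MASK]` placeholder, whereas BERT replaces only most of the selected tokens with the mask and leaves or randomises the rest. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, {position: original token}) for one sequence."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    corrupted, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)  # hide the token from the model
            labels[i] = tok         # the hidden token becomes the training target
        else:
            corrupted.append(tok)
    return corrupted, labels

sentence = "self supervised models learn from unlabelled text".split()
corrupted, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
```

The loss is then computed only at the positions recorded in `labels`, so the model is never rewarded for copying visible tokens.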
Vision:
- Contrastive learning: pull together two augmented views of an image and push apart views of different images (SimCLR, MoCo).
- Non-contrastive self-distillation: predict a teacher's output without negatives (BYOL, DINO).
- Masked image modelling: reconstruct masked patches (MAE, BEiT, SimMIM).
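
The "pull together, push apart" idea behind contrastive learning can be written as an InfoNCE-style loss. The sketch below is a one-directional simplification in plain Python: row `i` of `views_a` and row `i` of `views_b` are assumed to be embeddings of two augmentations of the same image, and the other rows of `views_b` act as negatives. The temperature value is an arbitrary assumption, and published methods such as SimCLR and MoCo differ in how they gather negatives.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(views_a, views_b, temperature=0.1):
    """Mean cross-entropy of picking the matching view among all candidates."""
    a = [normalize(v) for v in views_a]
    b = [normalize(v) for v in views_b]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax at the positive pair
    return total / len(a)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each embedding is closest to its own partner (`aligned`) and large when it is closer to another image's view (`mismatched`).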
Speech: wav2vec 2.0 and HuBERT mask spans of latent audio features and train the model to identify or predict discrete (quantised) targets at the masked positions.
Multimodal: CLIP, ALIGN, SigLIP train image and text encoders jointly to align image–caption pairs via a contrastive loss.
Time series, graphs, code: analogous masking, contrasting and prediction objectives.

Why it works

The scaling laws identified by Kaplan et al. (2020) and Hoffmann et al. (2022) show that, within the ranges studied, loss falls approximately as a power law in model size, dataset size and compute. Self-supervision is what makes it practical to supply training signal at that scale: the internet contains trillions of words and billions of images, almost all without task labels. By extracting supervision from the data itself, SSL turns the open web into a training set far larger than any hand-labelled corpus.
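
A power law $L = a N^{-\alpha}$ becomes a straight line on log-log axes, which is how such exponents are typically estimated. The sketch below fits that line by least squares. The data are synthetic and the constants `true_a` and `true_alpha` are made up for illustration; they are not values reported in the cited papers, and the fit ignores any irreducible-loss offset.

```python
import math

def fit_power_law(sizes, losses):
    """Fit log L = log a - alpha * log N by least squares; return (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic losses generated from a known power law (illustrative values only).
true_a, true_alpha = 10.0, 0.08
sizes = [10 ** k for k in range(6, 11)]  # hypothetical model sizes
losses = [true_a * n ** -true_alpha for n in sizes]

a, alpha = fit_power_law(sizes, losses)
```

Because the synthetic data follow the power law exactly, the fit recovers the generating constants up to floating-point error; real training curves are noisier.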

Pretrain-then-adapt

Self-supervised pretraining followed by fine-tuning, prompting or in-context learning has become the dominant recipe across modern AI. The pretrained foundation model captures general statistical structure of the data; lightweight downstream adaptation specialises it to a task with orders of magnitude less labelled data.
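
One simple form of adaptation is a linear probe: the pretrained encoder is frozen and only a small classifier on top of its features is trained. The sketch below illustrates the division of labour under heavy assumptions: `frozen_encoder` is a hand-written stand-in for a real pretrained model, and the four labelled examples are invented toy data.

```python
import math

def frozen_encoder(x):
    """Stand-in for a pretrained encoder; nothing inside it is updated here."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(examples, lr=0.1, epochs=200):
    """Fit a logistic-regression head on frozen features with plain SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            f = frozen_encoder(x)
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            # Gradient of binary cross-entropy with respect to the logit is (p - y).
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
            b -= lr * (p - y)
    return w, b

def predict(x, w, b):
    f = frozen_encoder(x)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

# A handful of labelled examples is enough to train the tiny head.
labelled = [((1, 2), 1), ((-1, -2), 0), ((2, 0.5), 1), ((-0.5, -3), 0)]
w, b = train_linear_probe(labelled)
```

Only `w` and `b` are learned from labelled data; in a real system the encoder would contribute the bulk of the parameters, all learned during self-supervised pretraining.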

History and significance

Earlier instances include autoencoders (Rumelhart, Hinton and Williams 1986), word2vec (Mikolov et al. 2013), skip-thought vectors (Kiros et al. 2015) and language-model pretraining for NLP transfer (Dai & Le 2015). The term "self-supervised" was popularised around 2018 by Yann LeCun, who argued it was the missing ingredient for human-like learning. The GPT and BERT breakthroughs of 2018–2020 turned SSL into the engine of modern AI. It has largely collapsed the practical distinction between "unsupervised" and "supervised" learning and underpins nearly every contemporary foundation model.

Related terms: Foundation Model, Language Model, BERT, GPT, CLIP, Contrastive Learning

Discussed in:

Chapter 12: Sequence Models, Foundation Models and Self-Supervision
