6.2 The three classical paradigms (and a fourth)

Imagine you are trying to teach a child. There are two very different ways you might do it. You could sit beside them with a pile of flashcards, hold each one up, and tell them the answer: "this is a cat, this is a dog, this is a fox". Or you could empty a bag of pebbles onto the floor and say, "see what you can make of these". The first style assumes someone has done the work of preparing a labelled answer for every example. The second style assumes nothing of the sort, and asks the learner to find patterns, similarities, and structure on their own.

This is the great divide of machine learning. On one side sits supervised learning, where every training example arrives with a human-supplied label. On the other sits unsupervised learning, where the data come bare and the algorithm must discover whatever structure it can. Modern AI does not occupy one camp or the other; it sits on a continuum between them. The most influential technique of the past decade, self-supervised learning, manufactures its own labels from the raw input and so steals the cheap data of the unsupervised world while keeping the clean training signal of the supervised one. A fourth, distinct paradigm, reinforcement learning, pulls in a different direction altogether: instead of labels, an agent receives rewards as it acts, and must figure out which behaviours pay off.

This section develops the paradigms with concrete examples and shows where each lives in the modern landscape. Chapters 7 and 8 dive into supervised and unsupervised methods.

Symbols Used Here

  • $\mathbf{x}$: input
  • $y$: label or output
  • $\mathbf{X}, \mathbf{y}$: feature matrix and label vector

Supervised learning

In supervised learning we are given $n$ pairs $(\mathbf{x}_i, y_i)$ and asked to learn a function $f$ such that $f(\mathbf{x}) \approx y$ for inputs we have never seen before. The vector $\mathbf{x}_i$ is the input: the pixels of an image, the words of a sentence, the readings of a sensor, the results of a blood test. The scalar (or vector, or tree, or mask) $y_i$ is the label that a human, or some equally trustworthy source, has supplied. The whole job of the algorithm is to find the pattern that links them.

Three flavours of supervised learning recur throughout this book:

  • Regression, in which $y$ is continuous. Predicting house prices from postcode and floor area, the temperature tomorrow afternoon, a patient's age from a chest X-ray, or the next position of a robot arm are all regression problems. The earliest method, linear regression, was published by Legendre in 1805 and is still routinely used today.
  • Classification, in which $y$ takes one of a finite set of values. Spam vs ham, melanoma vs benign mole, the digit 0 to 9 in an MNIST image, the sentiment of a film review. Frank Rosenblatt's perceptron in 1958 was an early classifier; support vector machines dominated in the 1990s; modern convolutional networks and transformers have set state-of-the-art accuracy since the 2010s.
  • Structured prediction, in which $y$ is itself a structured object: a translated sentence, a syntactic parse tree, a segmentation map that labels every pixel of an image. Here the output space is not a scalar but combinatorial, and we usually need specialised loss functions and decoders.
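The regression case is simple enough to sketch in full. The snippet below, on synthetic data invented for illustration, fits Legendre's least-squares solution in closed form using plain NumPy:

```python
import numpy as np

# A minimal regression sketch: ordinary least squares on synthetic data.
# X is the feature matrix (n examples, d features), y the label vector.
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -1.5])               # the hidden labelling rule
y = X @ true_w + 0.1 * rng.normal(size=n)    # labels = linear signal + noise

# Closed-form least-squares fit: solve (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The learned function f(x) = x . w_hat approximates the labelling rule.
print(np.round(w_hat, 2))  # close to [3.0, -1.5]
```

With enough examples and modest noise, the recovered weights sit very close to the rule that generated the labels; this is supervised learning at its most transparent.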

Supervised learning is the most thoroughly understood and the most widely deployed paradigm. We have decades of theory describing when and why it works, an enormous toolbox of algorithms, and clear ways to measure success: hold out some labelled data, see how well your function predicts it, report the error. The catch is the cost of the labels. Every example in the training set has to be annotated by someone who already knows the right answer: a radiologist circling tumours, a translator producing parallel sentences, a moderator marking comments as toxic or benign. Labels are slow, expensive, and limited by the patience of the people producing them.

Much of the history of computer vision and natural language processing since 2010 is, in effect, the history of finding ways to scale or sidestep human labelling. ImageNet famously paid Mechanical Turk workers to annotate fourteen million images. Self-supervised pretraining, which we meet shortly, took the opposite route: avoid labels almost entirely.

Unsupervised learning

In unsupervised learning we are given only the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$. There are no labels at all. The algorithm has to find structure in the data on its own, with no oracle to tell it whether its answer is right or wrong.

Several different kinds of structure are typically sought:

  • Clustering partitions the data into groups of similar examples. Customers with similar purchase histories, galaxies with similar spectra, news articles about the same event, patients with similar trajectories. Classical algorithms include $k$-means, Gaussian mixture models, hierarchical agglomerative clustering, and DBSCAN.
  • Dimensionality reduction projects high-dimensional data into a low-dimensional space that preserves as much of the original variation as possible. Principal component analysis (PCA), t-SNE, UMAP, and autoencoders all do this. The outputs are useful for visualisation, for compression, for denoising, and as inputs to downstream supervised tasks.
  • Density estimation fits a probability distribution $p(\mathbf{x})$ to the data. Once fitted, $p(\mathbf{x})$ can be evaluated to spot unusual points (low probability means anomaly), used to draw new samples, or plugged into a larger probabilistic system. Kernel density estimation, mixture models, and normalising flows all fit here.
  • Anomaly detection is closely related: given a corpus of "normal" examples, flag the ones that differ. Useful for fraud, fault monitoring, intrusion detection, and quality control.
  • Generative modelling is the most ambitious unsupervised task. Train on a corpus of images, audio, or text, and learn to produce new samples that look as if they came from the same source. Modern image generators and large language models are, at heart, generative models, though the methods used to train them blur into the next paradigm.
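As one concrete example, $k$-means clustering can be written in a dozen lines. The sketch below runs Lloyd's algorithm on synthetic, well-separated blobs; note that no labels appear anywhere, only the inputs themselves.

```python
import numpy as np

# Synthetic unlabelled data: three well-separated Gaussian blobs.
rng = np.random.default_rng(2)
blob_means = [(0, 0), (5, 0), (0, 5)]
X = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in blob_means])

k = 3
# Simple deterministic seeding (farthest-first): start anywhere, then
# repeatedly add the point farthest from all centres chosen so far.
centers = [X[0]]
for _ in range(k - 1):
    d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
    centers.append(X[d2.argmax()])
centers = np.array(centers)

for _ in range(20):
    # Assignment step: each point joins its nearest centre.
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    # Update step: each centre moves to the mean of its assigned points.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centers, 2))  # close to the three blob means, in some order
```

The algorithm recovers the blob structure without ever being told it exists, which is the whole point of the paradigm; whether three was the "right" number of clusters is exactly the kind of question unsupervised learning cannot answer on its own.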

Unsupervised learning is harder than supervised learning, not because the algorithms are more complicated, but because we cannot easily say whether their answers are correct. Two people clustering the same dataset may legitimately produce different clusterings. Two people reducing the same dataset to two dimensions may preserve entirely different aspects of it. Without a held-out label set we are reduced to indirect tests: does the clustering help with downstream tasks, does the compression preserve information we care about, do the generated samples look plausible? Honest evaluation in unsupervised learning is genuinely difficult.

Self-supervised learning

Self-supervised learning is the modern hybrid. Formally it is a special case of supervised learning, but the labels are not supplied by humans: they are extracted automatically from the input itself. Hide a piece of the input from the model, ask the model to predict it from the rest, and the hidden piece becomes a free label. Because the supervision comes from the data itself, the supply of training examples is essentially unlimited. The scale problem of supervised learning vanishes.

The recipes that have shaped modern AI all follow this template:

  • Language modelling (the GPT family). Given the first $n$ words of a sentence, predict the next word. Every position in every text document on the internet becomes one training example. A trillion words of text yields roughly a trillion training examples, all generated for free.
  • Masked language modelling (BERT). Hide fifteen per cent of the words in a sentence at random and ask the model to fill them back in from the surrounding context. The labels are simply the original words.
  • Image inpainting and masked autoencoders (MAE). Cover up patches of an image and ask the model to paint them back in.
  • Contrastive learning (SimCLR, MoCo, CLIP). Take an image, produce two augmented views of it (a crop, a colour jitter, a flip), and train the model to place those two views close together in its embedding space while pushing apart the views of different images. The "label" is just the identity of the source image.
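The "labels for free" trick is easy to make concrete. The sketch below manufactures next-word-prediction pairs from a single sentence, in the style of the language-modelling recipe above: every position yields one (context, label) example at zero annotation cost.

```python
# Manufacturing supervised examples from raw text: the word at each
# position becomes the label for the words that precede it.
text = "the cat sat on the mat"
words = text.split()

# (context, label) pairs: predict each word from everything before it.
examples = [(words[:i], words[i]) for i in range(1, len(words))]
for context, label in examples:
    print(" ".join(context), "->", label)
# "the" -> "cat", "the cat" -> "sat", ... five examples from six words
```

Scale the same loop up to a trillion words of web text and you have the training set of a GPT-class model, with no human labeller anywhere in the pipeline.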

This single shift, labels for free, is the engine behind foundation models. GPT, BERT, CLIP, Whisper, the Llama family, and almost every modern large model are first pretrained on enormous unlabelled corpora using a self-supervised objective, then fine-tuned on a much smaller labelled dataset for a specific downstream task. The pretraining absorbs the broad statistical regularities of language, images, or audio; the fine-tuning specialises that knowledge cheaply. A few thousand labelled medical scans can produce a useful diagnostic model when bolted onto a vision backbone that has already digested a billion ordinary photographs.

Self-supervised pretraining has, since around 2018, become the dominant paradigm in modern AI. It is the reason a single model can answer questions, write code, and translate between languages: the underlying representation was learned without a single hand-supplied label.

Reinforcement learning

Reinforcement learning is a different beast. Instead of a static dataset, an agent acts in an environment over time. At each step it observes the current state, takes an action, receives a numerical reward, and finds itself in a new state. It must learn a policy, a rule that maps states to actions, so as to maximise the long-run total reward.

Rewards are sparse and often delayed: a chess-playing agent learns only at the end of a game whether it won or lost. The agent's own actions change the data it sees next, creating an exploration–exploitation tradeoff: try something new, or stick with what already works? Reinforcement learning powers the great game-playing successes (AlphaGo, AlphaZero, OpenAI Five), drives much of robotics and continuous control, supplies the alignment step in RLHF for language models, and is the natural framework for any decision-making problem that unfolds in time.
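The full loop of states, actions, delayed rewards, and exploration can be seen in miniature with tabular Q-learning. The environment below, a five-state corridor rewarded only at the right-hand end, and all hyperparameters are made up for illustration:

```python
import random

# A tiny tabular Q-learning sketch: a 5-state corridor. The agent starts
# at the left end and receives reward only on reaching the right end.
rng = random.Random(0)
n_states, actions = 5, (-1, +1)             # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action index]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(500):                        # episodes
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: occasionally explore, otherwise exploit Q.
        if rng.random() < epsilon:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # sparse, delayed reward
        # Q-learning update: bootstrap from the best action at the next state.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# The learned greedy policy chooses "right" (index 1) from every state.
print([max((0, 1), key=lambda i: Q[s][i]) for s in range(n_states - 1)])
```

Even in this toy, the reward arrives only at the end of an episode, and the agent must stumble right through exploration before the value of doing so propagates back along the corridor: delayed reward and the exploration-exploitation tradeoff in their simplest form.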

Where each paradigm fits in modern AI

It is tempting to treat the four paradigms as competing schools, but in production they nest neatly inside one another. A modern AI system typically uses all of them.

  • Supervised learning powers the everyday workhorses: spam filters, fraud detectors, medical-image classifiers, recommendation rankers, and the fine-tuning step that turns a general foundation model into a task-specific one.
  • Unsupervised learning is mostly used early in the pipeline: as exploratory data analysis on a fresh dataset, as dimensionality reduction before a downstream model, and as anomaly detection in production monitoring.
  • Self-supervised learning is now the dominant paradigm for training large models from scratch. Every GPT-class language model, every CLIP-style vision-language system, and every modern speech model is built on a self-supervised pretraining stage. It is the single most important paradigm shift of the past decade.
  • Reinforcement learning supplies agentic behaviour where decisions unfold in time, the alignment phase in RLHF that gives large language models their helpful, polite tone, and the planning step in robotics, autonomous driving, and game-playing systems.

A typical modern foundation model is therefore self-supervised in pretraining, supervised in fine-tuning, and reinforcement-tuned in alignment, three of the four paradigms in a single training pipeline. Unsupervised methods slot in around it for exploration and monitoring. Knowing the paradigm of any given step is the quickest way to understand which problems it can solve and which it cannot.

What you should take away

  1. Machine learning is divided by what kind of feedback signal the algorithm sees during training: labels, no labels, self-extracted labels, or rewards.
  2. Supervised learning is the most mature paradigm and powers most deployed classifiers and regressors, but it is bottlenecked by the cost of human labelling.
  3. Unsupervised learning finds structure (clusters, low-dimensional projections, densities) in unlabelled data. It is harder to evaluate but indispensable for exploration, compression, and anomaly detection.
  4. Self-supervised learning is supervised learning where the labels are manufactured from the input itself. Since around 2018 it has been the dominant paradigm for pretraining foundation models such as GPT, BERT, CLIP, and Whisper.
  5. Reinforcement learning is a separate framework for learning by trial, error, and reward, and supplies both agentic behaviours and the alignment step (RLHF) that polishes modern language models.
