Supervised learning is the machine-learning paradigm in which an algorithm learns a function $f: \mathcal{X} \to \mathcal{Y}$ from a training set of labelled input–output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$. The "supervision" lies in the fact that the correct answer (the label or target) is provided for every training example, allowing the model to measure its errors and adjust its parameters accordingly. Once trained, $f$ can produce predictions for new, unseen inputs drawn from the same distribution.
Formal setting
Assume training and test data are drawn i.i.d. from an unknown joint distribution $p(\mathbf{x}, y)$ over $\mathcal{X} \times \mathcal{Y}$. The learner picks $f$ from a hypothesis class $\mathcal{F}$ to minimise the expected risk
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p}\bigl[\ell(f(\mathbf{x}), y)\bigr],$$
using only the empirical risk $\hat R(f) = \tfrac{1}{n} \sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i)$ on the training set as a proxy. The difference $R(f) - \hat R(f)$ is the generalisation gap (often loosely called the generalisation error). It is controlled by the complexity of $\mathcal{F}$ relative to $n$, as formalised by VC dimension, Rademacher complexity, PAC-Bayes, or norm-based bounds.
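The relationship between empirical and expected risk can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the toy distribution ($y = 2x$ plus Gaussian noise), the one-parameter hypothesis class $f(x) = wx$, the plain squared loss, and all function names are hypothetical choices, not part of the formal setting above. Because $p$ is known here, a large fresh sample gives a Monte Carlo estimate of $R(f)$.

```python
import random

def squared_loss(prediction, target):
    """Squared-error loss for a single example."""
    return (prediction - target) ** 2

def empirical_risk(f, data, loss=squared_loss):
    """Average loss of predictor f over a list of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

def sample(n, rng):
    """Draw n i.i.d. pairs from a toy distribution: y = 2x + Gaussian noise."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.5)
        pairs.append((x, y))
    return pairs

def fit_least_squares_slope(data):
    """Minimise the empirical squared loss over the class f(x) = w * x."""
    numerator = sum(x * y for x, y in data)
    denominator = sum(x * x for x, _ in data)
    return numerator / denominator

rng = random.Random(0)
train = sample(20, rng)
w = fit_least_squares_slope(train)

def f(x):
    return w * x

train_risk = empirical_risk(f, train)
# A large fresh sample approximates the expected risk R(f).
held_out_risk = empirical_risk(f, sample(100_000, rng))
gap = held_out_risk - train_risk
print(f"w={w:.3f} train={train_risk:.3f} held-out={held_out_risk:.3f} gap={gap:.3f}")
```

With only 20 training points the training risk tends to be optimistic, but the sign and size of the gap vary from sample to sample; increasing the training-set size should shrink it.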
Two main categories
Supervised learning problems divide into two main categories:
- Classification. The output $y$ is a discrete category. Is this email spam or not (binary)? Which of ten digits does this image show (multi-class)? What diseases does this X-ray indicate (multi-label)? The standard loss is cross-entropy: $\ell(\hat p, y) = -\log \hat p_y$, where $\hat p_y$ is the predicted probability of the true class; multi-label problems typically apply a binary cross-entropy to each label separately. Common metrics: accuracy, precision, recall, F1, AUROC, and calibration measures.
- Regression. The output $y$ is a continuous value. Examples: tomorrow's temperature, a house's selling price, a patient's blood-pressure response to a drug. Typical losses: squared error $\tfrac{1}{2}(y - \hat y)^2$ (the maximum-likelihood choice under Gaussian noise), absolute error $|y - \hat y|$ (more robust to outliers), Huber loss (quadratic for small residuals, linear for large ones), and quantile loss for prediction intervals.
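The per-example losses named above can be written down in a few lines. This is a minimal sketch, assuming scalar targets for regression and a list of class probabilities for classification; the function names and the default `delta` and `tau` values are illustrative assumptions rather than fixed conventions.

```python
import math

def cross_entropy(probs, label):
    """Negative log of the probability assigned to the true class index."""
    return -math.log(probs[label])

def squared_error(y, y_hat):
    """Half squared error, matching the 1/2 (y - y_hat)^2 convention."""
    return 0.5 * (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute residual; grows linearly, so outliers weigh less."""
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    """Quadratic for residuals up to delta, linear beyond it."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)

def quantile_loss(y, y_hat, tau=0.9):
    """Pinball loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y - y_hat
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

# Compare how each regression loss reacts as the residual grows.
for residual in (0.5, 2.0, 10.0):
    print(
        residual,
        squared_error(residual, 0.0),
        absolute_error(residual, 0.0),
        huber(residual, 0.0),
    )
```

For a residual of 10 the half squared error is 50 while the Huber loss (with `delta=1.0`) is 9.5, which is the sense in which Huber and absolute error are less sensitive to outliers.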
Several variants extend these basic settings: structured prediction (output is a sequence, tree or graph), ordinal regression (ordered categories), multi-output regression, and survival analysis (regression with censored times).
Algorithms
Supervised learning algorithms range from simple linear models to very large neural networks:
- Linear models: linear regression, logistic regression, ridge, lasso, elastic-net.
- Generalised linear models: Poisson regression, negative-binomial regression for count data.
- Kernel methods: SVMs, kernel ridge regression, Gaussian processes.
- Tree-based: decision trees, random forests, gradient-boosted decision trees (GBDTs; XGBoost, LightGBM, CatBoost). On structured tabular data, GBDTs typically match or beat deep networks.
- Neural networks: MLPs, CNNs, RNNs, Transformers. These are the dominant approach for unstructured data (images, text, audio).
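As a concrete instance of the simplest family in the list, the sketch below trains a logistic regression classifier by full-batch gradient descent on the mean binary cross-entropy. It is a dependency-free illustration, not a production implementation: the toy data (two Gaussian blobs), the learning rate, the number of epochs, and all function names are assumptions chosen for the example. In practice a library such as scikit-learn would normally be used instead.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Predicted probability of class 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean binary cross-entropy."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: two Gaussian blobs centred at (-2, -2) and (2, 2).
rng = random.Random(0)
X = [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(50)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

w, b = train_logistic_regression(X, y)
correct = sum((predict_proba(w, b, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y))
accuracy = correct / len(X)
print(f"weights={w}, bias={b:.3f}, training accuracy={accuracy:.2f}")
```

The same loop structure (predict, compare with the label, follow the gradient of the loss) underlies the larger neural-network models, which differ mainly in the form of $f$ and in how the gradient is computed.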
Practical successes
Most of the commercially successful AI systems deployed today (spam filters, fraud detection, recommendation systems, medical image analysis, speech recognition, machine translation, search ranking, content moderation) are built on supervised learning. Even modern large language models combine pretraining by next-token prediction (formally a supervised objective in which the label is the next token, although the labels come from the text itself rather than from annotators) with supervised fine-tuning (SFT) on instruction-following data, before alignment by reinforcement learning from human feedback.
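To make the point about next-token prediction concrete, the sketch below turns a raw token sequence into (context, label) training pairs. It is a simplified illustration: the whitespace tokenizer, the `context_size` parameter, and the function name are assumptions for the example, and real language models work on subword tokens with far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn a token sequence into (context, next-token label) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        # The input is the window of tokens before position i; the label is token i.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the cat sat on the mat".split()  # whitespace tokenizer, an assumption
for context, label in next_token_pairs(tokens, context_size=3):
    print(context, "->", label)
```

Every position in the text yields a labelled example without any human annotation, which is why this objective scales to very large corpora.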
Supervised learning's great strength is that, given enough high-quality labelled data, even relatively simple algorithms can produce highly accurate predictions. Its great weakness is precisely this dependence on labelled data, which is often expensive, time-consuming, or impossible to obtain at the scale modern models demand. This has driven sustained interest in semi-supervised learning (mixing labelled and unlabelled data), self-supervised learning (creating labels from the structure of the data itself, as in masked language modelling), active learning (querying labels selectively), and weak supervision (programmatically generated noisy labels).
Related terms: Unsupervised Learning, Reinforcement Learning, Self-Supervised Learning, Cross-Entropy Loss, Generalisation
Discussed in:
- Chapter 4: Probability, Machine Learning Paradigms