7.1 The supervised setup, formally
Supervised learning is the most-used flavour of machine learning in production. Anywhere a system learns to map an input $\mathbf{x}$ to a label $y$ from worked examples, you are looking at a supervised learner. The spam classifier filtering your inbox, the credit-scoring model that decided your last loan application, the radiology triage tool that flags a chest X-ray for urgent review, the recommender that picked the next song, the optical character recognition behind every cheque deposit: all of them sit on top of supervised learning. Before deep learning swallowed the headlines around 2012, almost every piece of production machine learning in the world was a supervised model from a small zoo of classical algorithms: linear regression, logistic regression, $k$-nearest neighbours, decision trees, random forests, gradient boosting, support vector machines, and naive Bayes. This chapter is a tour of that zoo, plus the evaluation and deployment habits that make any of them useful in the real world.
This section pins down notation and goals so the rest of the chapter can refer back to a single, consistent vocabulary. The mathematical scaffolding (input space, output space, loss, risk, train/validation/test discipline) is identical whether the algorithms are classical (this chapter) or neural (Chapters 9 onwards).
What this chapter covers
The remaining sections of Chapter 7 form a curated tour of pre-deep-learning supervised methods, presented in roughly the order a practitioner first meets them. §7.2 covers linear regression, the workhorse for continuous-valued prediction and the model whose machinery (design matrix, normal equations, regularisation) recurs in nearly every later method. §7.3 introduces logistic regression, the standard baseline for classification and the conceptual ancestor of the output layer of every modern neural classifier. §7.4 generalises both into the family of generalised linear models, the unified statistical framework that links regression and classification through a chosen link function and exponential-family likelihood.
§7.5 turns to $k$-nearest neighbours, the simplest non-parametric method and the cleanest illustration of how local geometry can substitute for an explicit model. §7.6 develops decision trees, which split the input space recursively and form the building blocks of the ensemble methods that dominate modern tabular prediction. §7.7 covers naive Bayes, the model behind the original spam filters and a useful reminder that strong assumptions, when honest, can deliver very strong baselines very cheaply. §7.8 introduces support vector machines and kernel methods, the dominant classifiers of the late 1990s and 2000s, and the apparatus that allowed linear models to behave non-linearly without computing the non-linear features explicitly.
§7.9 zooms out to model evaluation: cross-validation, confusion matrices, ROC and precision–recall curves, calibration, and the difference between metric and loss. §7.10 covers ensemble methods (bagging, random forests and gradient boosting), which combine many weak learners into the most reliable tabular performers we know. §7.11 handles the multi-class and multi-label extensions that real applications demand. §7.12 collects a practical workflow for moving from raw data to a trustworthy model. §7.13 is a cheat sheet for picking an algorithm, §7.14 a small comparative experiment, and §7.15 and §7.16 wrap up with summary and exercises.
By the end of the chapter you should be able to read a description of a tabular machine-learning problem and say, almost without thinking, which one or two algorithms you would try first, what baseline you would compare against, what metric you would optimise, and how you would split the data to get an honest estimate of generalisation.
When to use which algorithm
Most of the algorithms in this chapter overlap in capability, and a working data scientist learns to pick between them by reflex. The choice depends mainly on the kind of data, the amount of data, and how much the user needs to explain the resulting model. A small set of rules of thumb covers most situations.
For tabular regression (predicting a continuous number from a fixed set of named features), gradient boosting (§7.10), as implemented in libraries such as XGBoost, LightGBM or CatBoost, is the default winner. It handles non-linear interactions, missing values and mixed feature types out of the box, and almost always outperforms a single tree, a random forest or a neural network on tabular data of moderate size. Always fit a plain linear regression as a baseline alongside it; if the linear model is within a hair's breadth of the boosted model, that is itself a useful finding.
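As a minimal sketch of that habit, the comparison can be a few lines of scikit-learn (assumed here; the built-in California housing data stands in for whatever tabular dataset you actually have):

```python
# Minimal sketch of "boosted model plus linear baseline" for tabular regression.
# scikit-learn is assumed; the California housing data is only a stand-in.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)

for name, model in [("linear baseline", LinearRegression()),
                    ("gradient boosting", HistGradientBoostingRegressor())]:
    # 5-fold cross-validated mean absolute error (scikit-learn negates the score).
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
    print(f"{name}: MAE = {-scores.mean():.3f}")
```

If the two errors land within a hair's breadth of each other, that is exactly the finding described above.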
For tabular classification, the same advice holds: gradient boosting first, logistic regression as the baseline. The boosting model usually gives the strongest accuracy; the logistic regression provides interpretable coefficients and a sanity check.
For small datasets and simple problems, plain linear or logistic regression, $k$-nearest neighbours and a single decision tree are often sufficient. With under a few thousand examples, an elaborate model is more likely to overfit than to discover a real pattern, and a transparent model is much easier to debug.
For mixed data types where interactions matter (say, age combined with smoking status and blood pressure category), tree-based ensembles dominate, because the tree structure naturally encodes interactions without the analyst having to specify them.
For high-dimensional perceptual data with structure (pixels in an image, samples in an audio waveform, characters in text), the classical methods are usually outclassed. Convolutional networks (Chapter 11), recurrent networks and transformers (Chapters 12–13) and the modern generative architectures (Chapters 14–15) take over. A naive Bayes spam filter is fine for keyword-level email triage but cannot compete with a transformer on document classification.
For interpretability above all else (for example, a clinical risk score that has to be defended in a regulatory submission or explained at the bedside), a single decision tree pruned to a handful of leaves, or a logistic regression with a small number of carefully chosen features, will outperform a black-box model on the criterion that matters: convincing a clinician or an auditor. Accuracy is only useful in proportion to the trust the user is willing to place in the prediction.
The 80% rule
A useful working principle is what practitioners sometimes call the 80% rule: for most production machine-learning problems, a well-tuned classical algorithm will deliver around 80% of the optimal performance in a fraction of the engineering time and at a fraction of the compute cost. Deep learning's wins are real, but they are concentrated in three areas: high-dimensional perceptual data (images, audio, raw text), structured-output tasks (translation, speech recognition, code generation) and problems with abundant labelled data, ideally in the millions of examples. The bread-and-butter of industrial machine learning (invoice line items, transaction logs, sensor readings, hospital admission records, supermarket basket data) remains overwhelmingly the territory of gradient boosting and logistic regression.
The practical rule of thumb is therefore simple. Tabular data: gradient boosting. Perceptual data: deep learning. Memorise this distinction; it will save you weeks of misdirected effort. If your features are columns in a spreadsheet, start with a linear or logistic regression and a gradient-boosted tree, and only consider a neural network once those have been tuned and a clear performance gap remains. If your features are pixels, waveforms, character sequences or graph nodes, start with a deep model that respects the structure of that data.
The same 80% rule applies to time. The classical pipeline (fit, evaluate, deploy) runs end to end in an afternoon on a single laptop. A deep learning project realistically requires a GPU, a curated dataset of meaningful size, an experiment-tracking discipline, and several iterations before the first plausible model emerges. For an internal dashboard or a one-off analysis, that overhead is rarely justified. For a flagship product feature shipping to millions of users, it often is. Match the tool to the stakes.
The pre-deep-learning era as foundation
The algorithms in this chapter are not historical curiosities. In 2026 they remain the workhorses of applied machine learning for several distinct reasons.
Baselines. Every new project begins by establishing how hard the problem actually is. A linear or logistic regression, fitted in seconds, sets the floor. A gradient-boosted tree, fitted in minutes, sets a strong reference point. A neural network is then justified only if it materially exceeds these baselines.
Production deployment when state-of-the-art is unnecessary. Many problems do not need the last percent of accuracy. A credit-risk model that beats a logistic regression by half a percentage point is rarely worth the operational and regulatory complexity of maintaining a deep learning pipeline.
Explanation. A logistic-regression coefficient is directly interpretable: a one-unit increase in feature $x_j$ multiplies the odds of the positive class by $\exp(\beta_j)$. A decision tree can be drawn on a single page and read by a non-specialist. A transformer's attention weights, by contrast, are notoriously hard to translate into a faithful explanation of why a particular prediction was made. In regulated domains (credit, insurance, medicine), interpretability is often a hard constraint, not a preference.
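A small synthetic illustration of that coefficient reading (the features, coefficients and data below are invented for the example; scikit-learn is assumed):

```python
# Hypothetical example: recover odds ratios from logistic-regression coefficients.
# A one-unit increase in feature x_j multiplies the odds of y = 1 by exp(beta_j).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
age = rng.normal(50, 10, n)                    # hypothetical continuous feature
smoker = rng.integers(0, 2, n)                 # hypothetical binary feature
logits = -4.0 + 0.04 * age + 1.1 * smoker      # true log-odds used to simulate labels
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X = np.column_stack([age, smoker])
model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)  # large C: effectively no penalty
print(np.exp(model.coef_[0]))                  # roughly [exp(0.04), exp(1.1)] = [1.04, 3.00]
```

Under these made-up numbers, smoking multiplies the odds of the positive class by about three, all else equal, which is the kind of statement a clinician or an auditor can actually check.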
Speed. A gradient-boosted tree predicts in microseconds on a commodity CPU. A 70-billion-parameter language model takes seconds and requires a GPU. For high-throughput scoring (fraud detection on every credit-card swipe, ranking the next item in a search results page), the classical method is often the only viable choice.
Robustness. Decades of practice have mapped out exactly how these algorithms fail: where they overfit, what they do under covariate shift, how they degrade under missing inputs. The failure modes of deep models are still being charted.
Setup for the chapter
Throughout the rest of the chapter we use a single, consistent notation. The training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ has $n$ examples, each with $d$ input features. We collect the inputs into a matrix $\mathbf{X}$ of shape $(n, d)$, where row $i$ is the feature vector $\mathbf{x}_i^\top$, and the labels into a vector $\mathbf{y}$ of shape $(n,)$. A model is a function $f_\theta$ with parameters $\theta$ that produces a prediction $\hat y = f_\theta(\mathbf{x})$.
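In code, the same conventions look like the following sketch (NumPy and scikit-learn are assumed; the numbers are arbitrary):

```python
# The chapter's notation in array form: X is (n, d), y is (n,), and a fitted
# model f_theta maps feature vectors to predictions y_hat.
import numpy as np
from sklearn.linear_model import LinearRegression

n, d = 200, 5
rng = np.random.default_rng(42)
X = rng.normal(size=(n, d))                    # design matrix, one row per example
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)  # labels, shape (n,)

model = LinearRegression().fit(X, y)           # estimates the parameters theta
y_hat = model.predict(X)                       # predictions, shape (n,)
print(X.shape, y.shape, y_hat.shape)           # (200, 5) (200,) (200,)
```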
To estimate how well a fitted model will perform on data it has not seen, we either set aside a single hold-out test set at the very start of the project and touch it only once, or we use $k$-fold cross-validation, which rotates the held-out fold through the data so that every example is used for training in some folds and held out for evaluation in exactly one.
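Both habits are one call each in scikit-learn (assumed here; $\mathbf{X}$ and $\mathbf{y}$ are the arrays from the sketch above):

```python
# Two ways to estimate out-of-sample performance.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# 1. A single hold-out split, made once at the start and scored once at the end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# 2. 5-fold cross-validation: each example is held out in exactly one fold.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5)

print(f"hold-out R^2: {holdout_r2:.3f}, 5-fold mean R^2: {cv_r2.mean():.3f}")
```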
We measure quality with a metric appropriate to the task. For regression we use mean squared error (MSE) or mean absolute error (MAE). For classification we use accuracy when classes are balanced and costs are symmetric, the F1 score when the positive class is rare or the cost of false positives differs from the cost of false negatives, and the area under the ROC curve (AUC) when we care about the model's ranking of examples rather than a single decision threshold. The metric is the quantity we ultimately care about; the loss is the (often differentiable) proxy we optimise during training. They need not coincide, and §7.9 examines that gap in detail.
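All of these metrics are available off the shelf; a quick sketch with tiny made-up arrays (scikit-learn assumed):

```python
# The chapter's headline metrics on small, made-up arrays (illustration only).
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, f1_score, roc_auc_score)

# Regression: compare continuous predictions against continuous targets.
y_true_reg = np.array([3.0, 1.5, 2.2])
y_pred_reg = np.array([2.8, 1.9, 2.0])
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))

# Classification: hard labels for accuracy and F1, scores (e.g. predicted
# probabilities of the positive class) for AUC.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.2, 0.6, 0.9, 0.7, 0.4, 0.1])
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```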
A few further pieces of vocabulary recur throughout the chapter. The true or population risk of a model $f$ under loss $\ell$ is the expected loss over the underlying data distribution: $R(f) = \mathbb{E}_{(\mathbf{x},y)}[\ell(y, f(\mathbf{x}))]$. We cannot compute it directly because we do not know the distribution. We compute instead the empirical risk $\hat R(f) = \frac{1}{n} \sum_i \ell(y_i, f(\mathbf{x}_i))$ and rely on it as a proxy. The generalisation gap $R(f) - \hat R(f)$ is the central object of statistical learning theory and the reason validation and test sets exist. The Bayes optimal predictor $f^*$ is the model that achieves the lowest possible population risk; for squared error it is the conditional mean $\mathbb{E}[y \mid \mathbf{x}]$ and for zero-one loss it is the most-probable class given the input. No algorithm, classical or deep, can do better than $f^*$.
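A tiny simulation makes these definitions concrete (the data-generating process below is invented for illustration; NumPy is assumed). Because the distribution is known, the Bayes optimal predictor is known too, and its risk equals the noise variance:

```python
# Simulated example: y = sin(x) + noise with noise variance 0.25, so under
# squared error the Bayes optimal predictor is E[y | x] = sin(x) and the
# lowest achievable risk is 0.25. No model, classical or deep, can go lower.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(0.0, 0.5, n)

bayes_risk = np.mean((y - np.sin(x)) ** 2)   # empirical risk of f*, close to 0.25
constant_risk = np.mean((y - 0.0) ** 2)      # empirical risk of always predicting 0
print(f"Bayes predictor: {bayes_risk:.3f}   constant zero: {constant_risk:.3f}")
```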
What you should take away
- Supervised learning means learning a mapping $\mathbf{x} \to y$ from labelled examples, and it is the dominant paradigm of production machine learning.
- The empirical risk on the training set is what we can compute; the population risk on unseen data is what we actually care about, and the gap between them is the whole story of generalisation.
- The risk of the Bayes optimal predictor sets a floor, determined by the irreducible noise in the problem, that no algorithm can beat.
- The clean separation of training, validation and test sets is the single most important habit in applied machine learning: violate it and your performance estimates are fiction.
- Tabular data: start with gradient boosting. Perceptual data: start with deep learning. This rule alone resolves the algorithm choice for the majority of real-world projects.