Chapter Seven

Supervised Learning

Learning Objectives
  1. Fit linear and logistic regression models and interpret their coefficients
  2. Build and prune decision trees and explain the information-gain splitting criterion
  3. Use support vector machines and kernels to separate classes with maximum-margin boundaries
  4. Apply k-nearest-neighbour classification and discuss the bias–variance implications of the choice of k
  5. Combine weak learners into ensembles (bagging, boosting, random forests) to improve generalisation

You have a dataset of houses with their prices. You want to predict the price of a new house you have never seen. That is supervised learning: you give the algorithm labelled examples — inputs paired with correct answers — and it learns to predict the answer for new inputs.

More formally, you start with training pairs {(x1, y1), …, (xn, yn)} and learn a function f that maps inputs to outputs. The "supervision" is the label y that comes with every example. This contrasts with unsupervised learning (no labels) and reinforcement learning (only delayed, scalar rewards).

The algorithms in this chapter look very different from each other — closed-form linear algebra, geometric margins, tree-based combinatorics. But they all follow the same workflow: pick a model family, choose a loss function that measures error, and optimise to find the model that minimises that loss while still generalising to new data. The shared challenges — overfitting, underfitting, and the bias–variance trade-off — cut across every method.

7.1   Linear Regression

Linear regression is the natural starting point for predictive modelling. You model a continuous target y as a linear function of the features: y = w^T^x + b. The vector w holds the weights; b is the bias (intercept). The model assumes that the expected value of y is a weighted sum of the input features. This is both its strength — interpretable, analytically tractable — and its limitation, since many real relationships are nonlinear.

Fitting with Least Squares

You fit the model by minimising the sum of squared errors: L(w, b) = Σi(yi − w^T^xi − b)^2^. Because this loss is a convex quadratic, you can solve for the optimum in closed form using the normal equations: w* = (X^T^X)^−1^X^T^y, with the intercept absorbed into X as a column of ones. Forming and inverting X^T^X takes O(nd^2^ + d^3^) time, where d is the number of features. For large-scale problems, gradient descent reaches the same answer without forming the full matrix.
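
As a sanity check, here is a minimal plain-Python sketch of least squares for a single feature, where the normal equations reduce to w = cov(x, y)/var(x) and b = ȳ − w·x̄ (the data below is illustrative):

```python
# Least-squares fit for one feature, a minimal sketch.
# For a single feature the normal equations reduce to the
# closed form w = cov(x, y) / var(x), b = mean(y) - w * mean(x).

def fit_least_squares(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope estimate
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    w = cov / var
    b = y_mean - w * x_mean
    return w, b

# Noise-free points on the line y = 2x + 1 are recovered exactly
w, b = fit_least_squares([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```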

Regularisation

When you have many features relative to examples, OLS overfits. Regularisation fixes this by penalising large weights:

  • Ridge regression (L2): adds λ‖w‖2^2^ to the loss. Shrinks all coefficients toward zero.
  • Lasso (L1; Tibshirani, 1996): adds λ‖w‖1 to the loss. Drives some coefficients exactly to zero, performing automatic feature selection.
  • Elastic net: combines both penalties.
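
To see the shrinkage concretely, here is a hedged sketch of ridge regression for one centred feature, where the penalised closed form (X^T^X + λI)^−1^X^T^y collapses to a scalar (the data is illustrative):

```python
# Ridge regression for one centred feature, a minimal sketch.
# The closed form w* = (X^T X + lam*I)^{-1} X^T y reduces to
# w = sum(x*y) / (sum(x*x) + lam): the penalty inflates the
# denominator, shrinking the coefficient toward zero.

def fit_ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-4.0, -2.0, 2.0, 4.0]            # exactly y = 2x
w_ols   = fit_ridge_1d(xs, ys, 0.0)    # lam = 0 recovers OLS: w = 2
w_ridge = fit_ridge_1d(xs, ys, 10.0)   # lam > 0 shrinks w below 2
```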

Why Linear Regression Endures

Under the Gauss–Markov assumptions — linearity, independence, constant variance, zero-mean errors — OLS is the best linear unbiased estimator (BLUE). When errors are normal, you get exact confidence intervals and hypothesis tests from the t- and F-distributions. Each coefficient tells you the marginal effect of its feature. This transparency is why linear regression remains a workhorse in science, economics, and engineering.

Practical Considerations

Check your assumptions. Residual plots reveal nonlinearity. QQ plots test normality. Cook's distance flags influential outliers. If the relationship is nonlinear, you can add polynomial terms, interaction terms, or spline features — extending the model's flexibility while keeping the linear algebra machinery. The "linear" in linear regression means linear in the parameters, not necessarily in the features.

7.2   Logistic Regression

Despite the name, logistic regression is a classifier, not a regression method. It models the probability that an input belongs to a class by passing a linear combination through the sigmoid function: P(y = 1 | x) = σ(w^T^x + b) = 1 / (1 + e^−(w^T^x + b)^). The sigmoid squashes the output into (0, 1), giving you a probability. The decision boundary — where the probability equals 0.5 — is a hyperplane.

Training

You maximise the log-likelihood of the observed labels, which is the same as minimising the binary cross-entropy: L(w) = −Σi[yi log pi + (1 − yi) log(1 − pi)], where pi = σ(w^T^xi + b). There is no closed-form solution, but the loss is convex, so gradient-based methods find the global minimum. Newton's method (IRLS) converges fast on small datasets; stochastic gradient descent handles large ones.
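
A minimal plain-Python sketch of this training loop for one feature, using batch gradient descent (learning rate, step count, and data are illustrative):

```python
import math

# Logistic regression by batch gradient descent, a minimal sketch.
# The gradient of the cross-entropy with respect to w is
# sum_i (p_i - y_i) * x_i, and with respect to b is sum_i (p_i - y_i).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# 1-D toy data: negatives below zero, positives above. (Separable
# data lets the weights grow without bound; the fixed step budget
# stops the run long before that matters.)
w, b = fit_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
p_neg = sigmoid(w * -2.0 + b)   # near 0 for a negative example
p_pos = sigmoid(w * 2.0 + b)    # near 1 for a positive example
```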

Why Probabilities Matter

Unlike a hard classifier that just says "yes" or "no," logistic regression tells you how confident it is. This is essential in applications like medical diagnosis and fraud detection, where the cost of a false negative is very different from the cost of a false positive. You can adjust the threshold away from 0.5 to trade off precision against recall.

The coefficients have a clean interpretation: a unit increase in feature xj multiplies the odds of the positive class by exp(wj). This makes logistic regression the standard tool in epidemiology and the social sciences.

Multi-Class Extension

For K > 2 classes, you use the softmax function: P(y = k | x) = exp(wk^T^x) / Σj exp(wj^T^x). Each class gets its own weight vector. The loss becomes categorical cross-entropy, and the optimisation stays convex. Softmax reappears throughout deep learning as the standard output layer for multi-class problems.
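
A minimal sketch of a numerically stable softmax: subtracting the maximum score before exponentiating avoids overflow and leaves the result unchanged, because softmax is shift-invariant.

```python
import math

# Softmax over K class scores, a minimal sketch.

def softmax(scores):
    m = max(scores)                          # shift for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sum to 1 and preserve the ordering of the scores
```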

Regularisation (L2 or L1) is just as important here. Without it, if the classes are perfectly separable, the weights can grow without bound as the sigmoid saturates.

7.3   Decision Trees

A decision tree splits the data into rectangular regions, one question at a time. At each node, it picks a feature and a threshold that best separate the targets. The result is a tree of simple rules: "if age > 50 and cholesterol > 200, predict high risk." Every path from root to leaf is a conjunction of human-readable conditions. This transparency makes trees invaluable in healthcare and finance, where you need to explain your decisions.

Splitting Criteria

The key question is: which split is best? For classification, the two standard measures are:

  • Gini impurity: G = Σk pk(1 − pk). Zero when the node is pure.
  • Information gain: the reduction in Shannon entropy from the split.

For regression trees, you maximise the reduction in variance.
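
Both classification impurity measures are a few lines of plain Python (the labels below are illustrative):

```python
import math
from collections import Counter

# Gini impurity and entropy for a node's class labels, a minimal sketch.

def class_proportions(labels):
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    # G = sum_k p_k (1 - p_k); zero for a pure node
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    # H = -sum_k p_k log2 p_k; also zero for a pure node
    return -sum(p * math.log2(p) for p in class_proportions(labels))

g_pure  = gini(["a", "a", "a", "a"])        # pure node: 0
g_mixed = gini(["a", "a", "b", "b"])        # 50/50 split: 0.5
h_mixed = entropy(["a", "a", "b", "b"])     # 50/50 split: 1 bit
```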

Pruning

An unpruned tree can grow until every leaf has a single example — zero training error, but massive overfitting. You control this with:

  • Pre-pruning: limit depth, require a minimum number of samples per leaf, or stop when the gain is too small.
  • Post-pruning (cost-complexity pruning): grow the full tree, then iteratively remove subtrees that hurt validation performance the least.

Strengths and Weaknesses

Trees need no feature scaling. They handle mixed types (numerical and categorical) naturally. They ignore irrelevant features — if a feature never appears in a split, it has no effect. Missing values can be handled with surrogate splits.

The main weaknesses are high variance — a small change in the data can produce a completely different tree — and axis-aligned boundaries, which need many splits to approximate diagonal or curved decision surfaces. Ensembles (Section 7.6) address both problems.

7.4   Support Vector Machines

Support vector machines (SVMs; Cortes & Vapnik, 1995) find the hyperplane that separates two classes with the widest margin. The margin is the gap between the boundary and the nearest points on each side. Those nearest points are the support vectors — they alone define the boundary. Statistical learning theory shows that wider margins lead to better generalisation.

Hard and Soft Margins

The hard-margin SVM solves: minimise ½‖w‖^2^ subject to yi(w^T^xi + b) ≥ 1. The margin is 2/‖w‖, so minimising ‖w‖ maximises it.

Real data is rarely perfectly separable. The soft-margin version adds slack variables ξi that let some points violate the margin, penalised by a parameter C. Large C means narrow margin, complex boundary. Small C means wide margin, simpler model.
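
One way to make the role of C concrete is the unconstrained form of the soft-margin objective, ½‖w‖^2^ + C Σi max(0, 1 − yi(w^T^xi + b)), minimised below by subgradient descent on illustrative 1-D data — a sketch, not a proper QP solver:

```python
# Soft-margin linear SVM via subgradient descent on the hinge-loss
# objective 0.5*w^2 + C * sum_i max(0, 1 - y_i*(w*x_i + b)).
# A 1-D illustrative sketch only.

def fit_svm_1d(xs, ys, C=1.0, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = w, 0.0                    # gradient of the 0.5*w^2 term
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:        # margin violation
                gw -= C * y * x            # hinge subgradient
                gb -= C * y
        w -= lr * gw
        b -= lr * gb
    return w, b

# Labels in {-1, +1}; negatives left of zero, positives right.
# The margin-defining points sit at x = -1 and x = +1.
w, b = fit_svm_1d([-3.0, -1.0, 1.0, 3.0], [-1, -1, 1, 1], C=1.0)
```

With this data the optimum is w ≈ 1, b ≈ 0, which puts the inner points exactly on the margin; shrinking C would let them slide inside it.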

The Kernel Trick

The real power of SVMs comes from kernels. The dual formulation depends only on pairwise dot products xi^T^xj. Replace these with a kernel function K(xi, xj) and you implicitly map the data into a higher-dimensional space where a linear separator exists — without ever computing that mapping.

Common kernels:

  • Polynomial: K(x, z) = (x^T^z + c)^d^
  • RBF (Gaussian): K(x, z) = exp(−γ‖x − z‖^2^). Maps to infinite dimensions.
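
The RBF kernel is a one-liner; a plain-Python sketch for points given as lists of floats:

```python
import math

# RBF kernel K(x, z) = exp(-gamma * ||x - z||^2), a minimal sketch.

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points: 1.0
k_far  = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # decays toward 0
```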

Computational Cost

Training an SVM means solving a quadratic programme — O(n^3^) in general. The SMO algorithm (Platt, 1998) broke this into small subproblems solvable in closed form, making SVMs practical. But kernel SVMs still scale poorly to very large datasets, which is one reason deep learning has taken over for many tasks.

Where SVMs Still Shine

SVMs work exceptionally well on small-to-medium datasets in high dimensions — text classification, bioinformatics, and genomics. The max-margin objective provides built-in regularisation. Multi-class problems are handled by one-vs-one or one-vs-rest decomposition. Probability estimates come from Platt scaling (fitting a sigmoid to the decision values).

7.5   K-Nearest Neighbours

KNN is the simplest supervised algorithm: to classify a new point, find the k closest training examples and let them vote. For regression, average their values. There is no training phase — you just store the data and do all the work at prediction time. This makes KNN a lazy learner, in contrast to eager learners like linear regression that build a compact model upfront.
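
The whole algorithm fits in a few lines; a minimal plain-Python sketch (the data and k are illustrative):

```python
from collections import Counter

# K-nearest-neighbour classification, a minimal sketch. "Training"
# is just storing the data; all the work happens at query time.

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort by squared Euclidean distance to the query.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    # Majority vote among the k closest examples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "blue"), ([0.1, 0.2], "blue"),
         ([1.0, 1.0], "red"),  ([0.9, 1.1], "red")]
label = knn_predict(train, [0.2, 0.1], k=3)   # two blue neighbours outvote one red
```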

The Role of k

The choice of k controls the bias–variance trade-off directly:

  • k = 1: the boundary perfectly memorises the training data. Low bias, high variance — overfitting.
  • Large k: the boundary smooths out. Higher bias, lower variance.
  • k = n: always predicts the majority class. Maximum bias.

A common heuristic is k ≈ √n, but cross-validation is the reliable way to choose.

Distance Metrics

The default is Euclidean distance, but it treats all features equally and is sensitive to scale — normalise your features first. Manhattan distance (L1) is more robust to outliers. In high dimensions, the distances to the nearest and farthest neighbours become nearly indistinguishable (the curse of dimensionality), and KNN starts to struggle. Dimensionality reduction or learned metrics can help.

Speed

The naïve approach computes n × d distances per query. Data structures can help:

  • KD-trees: O(d log n) in low dimensions, but degrade in high dimensions.
  • Ball trees: better for moderate dimensions.
  • Approximate methods (locality-sensitive hashing): trade exactness for speed.

Theory

Cover and Hart (1967) proved a beautiful result: as n → ∞, the 1-nearest-neighbour error is at most twice the Bayes optimal error rate. So with enough data, KNN is competitive with the best possible classifier. In practice, class imbalance, irrelevant features, and computational cost limit its usefulness for large-scale problems.

7.6   Ensemble Methods

A single decision tree is unstable — small data changes produce different trees. But combine many trees and the noise cancels out. That is the core insight behind ensembles: a collection of individually weak models can, when combined, produce predictions that are far more accurate and robust than any single member.

The formal justification comes from the bias–variance decomposition. If the base learners' errors are sufficiently uncorrelated, averaging reduces variance without increasing bias.

Bagging and Random Forests

Bagging (bootstrap aggregating; Breiman, 1996) trains each model on a different bootstrap sample — a random draw with replacement from the training set. Predictions are averaged (regression) or decided by majority vote (classification).
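
A minimal sketch of the bagging loop, using a deliberately crude one-feature threshold stump as the base learner (the stump and the data are illustrative, not any library's API):

```python
import random
from collections import Counter

# Bagging, a minimal sketch: each base model is trained on a bootstrap
# sample (drawn with replacement), and predictions are combined by
# majority vote.

def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    # Predict 1 iff x exceeds the midpoint between the class means.
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:                 # degenerate bootstrap sample
        majority = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: majority
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

def bagged_predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)                     # fixed seed for reproducibility
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
pred = bagged_predict(models, 1.8)
```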

Random forests (Breiman, 2001) add a second layer of randomness: at each split, only a random subset of features is considered. This decorrelates the trees further. The default is m = √d features for classification and m = d/3 for regression.

Random forests are remarkably practical:

  • Adding more trees never hurts (assuming you can afford the compute)
  • They handle mixed feature types
  • They provide free feature-importance estimates
  • Out-of-bag error gives you a validation estimate without a separate holdout set

Boosting

Boosting builds the ensemble sequentially. Each new model focuses on the examples the current ensemble gets wrong.

AdaBoost (Freund & Schapire, 1997) upweights misclassified examples so the next learner concentrates on the hard cases. The final prediction is a weighted vote. As long as each base learner beats random guessing, the training error drops exponentially.

Gradient boosting (Friedman, 2001) generalises this by fitting each new tree to the negative gradient of the loss — the "pseudo-residuals." For squared-error loss, these are just the residuals themselves. For other losses (log-loss, quantile loss), the framework adapts automatically.
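
A plain-Python sketch of this loop for squared-error loss, with a brute-force regression stump as the base learner (all names and data here are illustrative):

```python
# Gradient boosting for squared-error loss, a minimal sketch. Each
# stage fits a regression stump to the current residuals (the negative
# gradient of the squared error) and adds it, scaled by a learning
# rate, to the running ensemble.

def fit_stump(xs, residuals):
    # Best single split minimising squared error, found by brute force.
    best = None
    for t in xs:
        left  = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_stages=50, lr=0.3):
    f0 = sum(ys) / len(ys)                 # initial constant model
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)   # fit the pseudo-residuals
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]                  # target: y = x
model = gradient_boost(xs, ys)             # training error shrinks per stage
```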

Modern implementations — XGBoost (Chen & Guestrin, 2016), LightGBM, CatBoost — add second-order gradients, histogram-based binning, and regularisation. They dominate machine learning competitions on tabular data.

Stacking

Stacking feeds the outputs of several diverse base models as features into a meta-learner that learns how to combine them. To avoid overfitting, the base-level predictions are generated via cross-validation: each base model predicts the held-out fold, and these out-of-fold predictions form the meta-training set. This lets the meta-learner see honest predictions rather than ones the base models memorised.
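
A minimal sketch of generating those out-of-fold meta-features, with a deliberately trivial base learner (everything here is illustrative):

```python
# Out-of-fold meta-features for stacking, a minimal sketch. Each base
# model is refit K times, predicting only the fold it never saw, so
# the meta-learner trains on "honest" predictions.

def out_of_fold_predictions(fit, data, k=2):
    # fit(train) -> predict(x); data is a list of (x, y) pairs.
    folds = [data[i::k] for i in range(k)]     # simple interleaved folds
    meta = []
    for i, fold in enumerate(folds):
        # Train on every fold except the held-out one
        train = [pair for j, f in enumerate(folds) if j != i for pair in f]
        predict = fit(train)
        meta.extend((predict(x), y) for x, y in fold)
    return meta    # (base prediction, true label) pairs for the meta-learner

# Hypothetical base learner: always predict the training-set mean
def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

meta = out_of_fold_predictions(
    fit_mean, [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
)
```

A real stack would run this once per base model and fit the meta-learner on the collected columns of out-of-fold predictions.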