Chapter Seven

Supervised Learning

Learning Objectives
  1. Fit linear and logistic regression models and interpret their coefficients
  2. Build and prune decision trees and explain the information-gain splitting criterion
  3. Use support vector machines and kernels to separate classes with maximum-margin boundaries
  4. Apply k-nearest-neighbour classification and discuss the bias–variance implications of the choice of k
  5. Combine weak learners into ensembles (bagging, boosting, random forests) to improve generalisation

You have a dataset of houses with their prices. You want to predict the price of a new house you have never seen. That is supervised learning: you give the algorithm labelled examples — inputs paired with correct answers — and it learns to predict the answer for new inputs.

More formally, you start with training pairs {(x1, y1), …, (xn, yn)} and learn a function f that maps inputs to outputs. The "supervision" is the label y that comes with every example. This contrasts with unsupervised learning (no labels) and reinforcement learning (only delayed, scalar rewards).

The algorithms in this chapter look very different from each other — closed-form linear algebra, geometric margins, tree-based combinatorics. But they all follow the same workflow: pick a model family, choose a loss function that measures error, and optimise to find the model that minimises that loss while still generalising to new data. The shared challenges — overfitting, underfitting, and the bias–variance trade-off — cut across every method.

7.1   Linear Regression

Linear regression is the natural starting point for predictive modelling. You model a continuous target y as a linear function of the features: y = w^T^x + b. The vector w holds the weights; b is the bias (intercept). The model assumes that the expected value of y is a weighted sum of the input features. This is both its strength — interpretable, analytically tractable — and its limitation, since many real relationships are nonlinear.

Fitting with Least Squares

You fit the model by minimising the sum of squared errors: L(w, b) = Σi(yi − w^T^xi − b)^2^. Because this loss is a convex quadratic, you can solve for the optimum in closed form using the normal equations: w* = (X^T^X)^−1^X^T^y, with the intercept absorbed into X as a column of ones. Forming and inverting X^T^X takes O(nd^2^ + d^3^) time, where d is the number of features. For large-scale problems, gradient descent reaches the same answer without forming the full matrix.
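
As a sanity check, here is a minimal plain-Python sketch of least squares for a single feature, where the normal equations reduce to w = cov(x, y)/var(x) and b = ȳ − w·x̄ (the data below is illustrative):

```python
# Least-squares fit for one feature, a minimal sketch.
# For a single feature the normal equations reduce to the
# closed form w = cov(x, y) / var(x), b = mean(y) - w * mean(x).

def fit_least_squares(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope estimate
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    w = cov / var
    b = y_mean - w * x_mean
    return w, b

# Noise-free points on the line y = 2x + 1 are recovered exactly
w, b = fit_least_squares([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```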

Regularisation

When you have many features relative to examples, OLS overfits. Regularisation fixes this by penalising large weights:

  • Ridge regression (L2): adds λ‖w‖2^2^ to the loss. Shrinks all coefficients toward zero.
  • Lasso (L1; Tibshirani, 1996): adds λ‖w‖1 to the loss. Drives some coefficients exactly to zero, performing automatic feature selection.
  • Elastic net: combines both penalties.
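
To see the shrinkage concretely, here is a hedged sketch of ridge regression for one centred feature, where the penalised closed form (X^T^X + λI)^−1^X^T^y collapses to a scalar (the data is illustrative):

```python
# Ridge regression for one centred feature, a minimal sketch.
# The closed form w* = (X^T X + lam*I)^{-1} X^T y reduces to
# w = sum(x*y) / (sum(x*x) + lam): the penalty inflates the
# denominator, shrinking the coefficient toward zero.

def fit_ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-4.0, -2.0, 2.0, 4.0]            # exactly y = 2x
w_ols   = fit_ridge_1d(xs, ys, 0.0)    # lam = 0 recovers OLS: w = 2
w_ridge = fit_ridge_1d(xs, ys, 10.0)   # lam > 0 shrinks w below 2
```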

Why Linear Regression Endures

Under the Gauss–Markov assumptions — linearity, independence, constant variance, zero-mean errors — OLS is the best linear unbiased estimator (BLUE). When errors are normal, you get exact confidence intervals and hypothesis tests from the t- and F-distributions. Each coefficient tells you the marginal effect of its feature. This transparency is why linear regression remains a workhorse in science, economics, and engineering.

Practical Considerations

Check your assumptions. Residual plots reveal nonlinearity. QQ plots test normality. Cook's distance flags influential outliers. If the relationship is nonlinear, you can add polynomial terms, interaction terms, or spline features — extending the model's flexibility while keeping the linear algebra machinery. The "linear" in linear regression means linear in the parameters, not necessarily in the features.

7.2   Logistic Regression

Despite the name, logistic regression is a classifier, not a regression method. It models the probability that an input belongs to a class by passing a linear combination through the sigmoid function: P(y = 1 | x) = σ(w^T^x + b) = 1 / (1 + e^−(w^T^x + b)^). The sigmoid squashes the output into (0, 1), giving you a probability. The decision boundary — where the probability equals 0.5 — is a hyperplane.

Training

You maximise the log-likelihood of the observed labels, which is the same as minimising the binary cross-entropy: L(w) = −Σi[yi log pi + (1 − yi) log(1 − pi)], where pi = σ(w^T^xi + b). There is no closed-form solution, but the loss is convex, so gradient-based methods find the global minimum. Newton's method (IRLS) converges fast on small datasets; stochastic gradient descent handles large ones.
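
A minimal plain-Python sketch of this training loop for one feature, using batch gradient descent (learning rate, step count, and data are illustrative):

```python
import math

# Logistic regression by batch gradient descent, a minimal sketch.
# The gradient of the cross-entropy with respect to w is
# sum_i (p_i - y_i) * x_i, and with respect to b is sum_i (p_i - y_i).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# 1-D toy data: negatives below zero, positives above. (Separable
# data lets the weights grow without bound; the fixed step budget
# stops the run long before that matters.)
w, b = fit_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
p_neg = sigmoid(w * -2.0 + b)   # near 0 for a negative example
p_pos = sigmoid(w * 2.0 + b)    # near 1 for a positive example
```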

Why Probabilities Matter

Unlike a hard classifier that just says "yes" or "no," logistic regression tells you how confident it is. This is essential in applications like medical diagnosis and fraud detection, where the cost of a false negative is very different from the cost of a false positive. You can adjust the threshold away from 0.5 to trade off precision against recall.

The coefficients have a clean interpretation: a unit increase in feature xj multiplies the odds of the positive class by exp(wj). This makes logistic regression the standard tool in epidemiology and the social sciences.

Multi-Class Extension

For K > 2 classes, you use the softmax function: P(y = k | x) = exp(wk^T^x) / Σj exp(wj^T^x). Each class gets its own weight vector. The loss becomes categorical cross-entropy, and the optimisation stays convex. Softmax reappears throughout deep learning as the standard output layer for multi-class problems.
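
A minimal sketch of a numerically stable softmax: subtracting the maximum score before exponentiating avoids overflow and leaves the result unchanged, because softmax is shift-invariant.

```python
import math

# Softmax over K class scores, a minimal sketch.

def softmax(scores):
    m = max(scores)                          # shift for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sum to 1 and preserve the ordering of the scores
```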

Regularisation (L2 or L1) is just as important here. Without it, if the classes are perfectly separable, the weights can grow without bound as the sigmoid saturates.

7.3   Decision Trees

A decision tree splits the data into rectangular regions, one question at a time. At each node, it picks a feature and a threshold that best separate the targets. The result is a tree of simple rules: "if age > 50 and cholesterol > 200, predict high risk." Every path from root to leaf is a conjunction of human-readable conditions. This transparency makes trees invaluable in healthcare and finance, where you need to explain your decisions.

Splitting Criteria

The key question is: which split is best? For classification, the two standard measures are:

  • Gini impurity: G = Σk pk(1 − pk). Zero when the node is pure.
  • Information gain: the reduction in Shannon entropy from the split.

For regression trees, you maximise the reduction in variance.
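
Both classification impurity measures are a few lines of plain Python (the labels below are illustrative):

```python
import math
from collections import Counter

# Gini impurity and entropy for a node's class labels, a minimal sketch.

def class_proportions(labels):
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    # G = sum_k p_k (1 - p_k); zero for a pure node
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    # H = -sum_k p_k log2 p_k; also zero for a pure node
    return -sum(p * math.log2(p) for p in class_proportions(labels))

g_pure  = gini(["a", "a", "a", "a"])        # pure node: 0
g_mixed = gini(["a", "a", "b", "b"])        # 50/50 split: 0.5
h_mixed = entropy(["a", "a", "b", "b"])     # 50/50 split: 1 bit
```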

Pruning

An unpruned tree can grow until every leaf has a single example — zero training error, but massive overfitting. You control this with:

  • Pre-pruning: limit depth, require a minimum number of samples per leaf, or stop when the gain is too small.
  • Post-pruning (cost-complexity pruning): grow the full tree, then iteratively remove subtrees that hurt validation performance the least.

Strengths and Weaknesses

Trees need no feature scaling. They handle mixed types (numerical and categorical) naturally. They ignore irrelevant features — if a feature never appears in a split, it has no effect. Missing values can be handled with surrogate splits.

The main weaknesses are high variance — a small change in the data can produce a completely different tree — and axis-aligned boundaries, which need many splits to approximate diagonal or curved decision surfaces. Ensembles (Section 7.6) address both problems.

7.4   Support Vector Machines

Support vector machines (SVMs; Cortes & Vapnik, 1995) find the hyperplane that separates two classes with the widest margin. The margin is the gap between the boundary and the nearest points on each side. Those nearest points are the support vectors — they alone define the boundary. Statistical learning theory shows that wider margins lead to better generalisation.

Hard and Soft Margins

The hard-margin SVM solves: minimise ½‖w‖^2^ subject to yi(w^T^xi + b) ≥ 1. The margin is 2/‖w‖, so minimising ‖w‖ maximises it.

Real data is rarely perfectly separable. The soft-margin version adds slack variables ξi that let some points violate the margin, penalised by a parameter C. Large C means narrow margin, complex boundary. Small C means wide margin, simpler model.
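
One way to make the role of C concrete is the unconstrained form of the soft-margin objective, ½‖w‖^2^ + C Σi max(0, 1 − yi(w^T^xi + b)), minimised below by subgradient descent on illustrative 1-D data — a sketch, not a proper QP solver:

```python
# Soft-margin linear SVM via subgradient descent on the hinge-loss
# objective 0.5*w^2 + C * sum_i max(0, 1 - y_i*(w*x_i + b)).
# A 1-D illustrative sketch only.

def fit_svm_1d(xs, ys, C=1.0, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = w, 0.0                    # gradient of the 0.5*w^2 term
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:        # margin violation
                gw -= C * y * x            # hinge subgradient
                gb -= C * y
        w -= lr * gw
        b -= lr * gb
    return w, b

# Labels in {-1, +1}; negatives left of zero, positives right.
# The margin-defining points sit at x = -1 and x = +1.
w, b = fit_svm_1d([-3.0, -1.0, 1.0, 3.0], [-1, -1, 1, 1], C=1.0)
```

With this data the optimum is w ≈ 1, b ≈ 0, which puts the inner points exactly on the margin; shrinking C would let them slide inside it.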

The Kernel Trick

The real power of SVMs comes from kernels. The dual formulation depends only on pairwise dot products xi^T^xj. Replace these with a kernel function K(xi, xj) and you implicitly map the data into a higher-dimensional space where a linear separator exists — without ever computing that mapping.

Common kernels:

  • Polynomial: K(x, z) = (x^T^z + c)^d^
  • RBF (Gaussian): K(x, z) = exp(−γ‖x − z‖^2^). Maps to infinite dimensions.
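
The RBF kernel is a one-liner; a plain-Python sketch for points given as lists of floats:

```python
import math

# RBF kernel K(x, z) = exp(-gamma * ||x - z||^2), a minimal sketch.

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points: 1.0
k_far  = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # decays toward 0
```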

Computational Cost

Training an SVM means solving a quadratic programme — O(n^3^) in general. The SMO algorithm (Platt, 1998) broke this into small subproblems solvable in closed form, making SVMs practical. But kernel SVMs still scale poorly to very large datasets, which is one reason deep learning has taken over for many tasks.

Where SVMs Still Shine

SVMs work exceptionally well on small-to-medium datasets in high dimensions — text classification, bioinformatics, and genomics. The max-margin objective provides built-in regularisation. Multi-class problems are handled by one-vs-one or one-vs-rest decomposition. Probability estimates come from Platt scaling (fitting a sigmoid to the decision values).

7.5   K-Nearest Neighbours

KNN is the simplest supervised algorithm: to classify a new point, find the k closest training examples and let them vote. For regression, average their values. There is no training phase — you just store the data and do all the work at prediction time. This makes KNN a lazy learner, in contrast to eager learners like linear regression that build a compact model upfront.
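
The whole algorithm fits in a few lines; a minimal plain-Python sketch (the data and k are illustrative):

```python
from collections import Counter

# K-nearest-neighbour classification, a minimal sketch. "Training"
# is just storing the data; all the work happens at query time.

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort by squared Euclidean distance to the query.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    # Majority vote among the k closest examples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "blue"), ([0.1, 0.2], "blue"),
         ([1.0, 1.0], "red"),  ([0.9, 1.1], "red")]
label = knn_predict(train, [0.2, 0.1], k=3)   # two blue neighbours outvote one red
```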

The Role of k

The choice of k controls the bias–variance trade-off directly:

  • k = 1: the boundary perfectly memorises the training data. Low bias, high variance — overfitting.
  • Large k: the boundary smooths out. Higher bias, lower variance.
  • k = n: always predicts the majority class. Maximum bias.

A common heuristic is k ≈ √n, but cross-validation is the reliable way to choose.

Distance Metrics

The default is Euclidean distance, but it treats all features equally and is sensitive to scale — normalise your features first. Manhattan distance (L1) is more robust to outliers. In high dimensions, the distances to the nearest and farthest neighbours become nearly indistinguishable (the curse of dimensionality), and KNN starts to struggle. Dimensionality reduction or learned metrics can help.

Speed

The naïve approach computes n × d distances per query. Data structures can help:

  • KD-trees: O(d log n) in low dimensions, but degrade in high dimensions.
  • Ball trees: better for moderate dimensions.
  • Approximate methods (locality-sensitive hashing): trade exactness for speed.

Theory

Cover and Hart (1967) proved a beautiful result: as n → ∞, the 1-nearest-neighbour error is at most twice the Bayes optimal error rate. So with enough data, KNN is competitive with the best possible classifier. In practice, class imbalance, irrelevant features, and computational cost limit its usefulness for large-scale problems.

7.6   Ensemble Methods

A single decision tree is unstable — small data changes produce different trees. But combine many trees and the noise cancels out. That is the core insight behind ensembles: a collection of individually weak models can, when combined, produce predictions that are far more accurate and robust than any single member.

The formal justification comes from the bias–variance decomposition. If the base learners' errors are sufficiently uncorrelated, averaging reduces variance without increasing bias.

Bagging and Random Forests

Bagging (bootstrap aggregating; Breiman, 1996) trains each model on a different bootstrap sample — a random draw with replacement from the training set. Predictions are averaged (regression) or decided by majority vote (classification).
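
A minimal sketch of the bagging loop, using a deliberately crude one-feature threshold stump as the base learner (the stump and the data are illustrative, not any library's API):

```python
import random
from collections import Counter

# Bagging, a minimal sketch: each base model is trained on a bootstrap
# sample (drawn with replacement), and predictions are combined by
# majority vote.

def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    # Predict 1 iff x exceeds the midpoint between the class means.
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:                 # degenerate bootstrap sample
        majority = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: majority
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

def bagged_predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)                     # fixed seed for reproducibility
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
pred = bagged_predict(models, 1.8)
```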

Random forests (Breiman, 2001) add a second layer of randomness: at each split, only a random subset of features is considered. This decorrelates the trees further. The default is m = √d features for classification and m = d/3 for regression.

Random forests are remarkably practical:

  • Adding more trees never hurts (assuming you can afford the compute)
  • They handle mixed feature types
  • They provide free feature-importance estimates
  • Out-of-bag error gives you a validation estimate without a separate holdout set

Boosting

Boosting builds the ensemble sequentially. Each new model focuses on the examples the current ensemble gets wrong.

AdaBoost (Freund & Schapire, 1997) upweights misclassified examples so the next learner concentrates on the hard cases. The final prediction is a weighted vote. As long as each base learner beats random guessing, the training error drops exponentially.

Gradient boosting (Friedman, 2001) generalises this by fitting each new tree to the negative gradient of the loss — the "pseudo-residuals." For squared-error loss, these are just the residuals themselves. For other losses (log-loss, quantile loss), the framework adapts automatically.
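
A plain-Python sketch of this loop for squared-error loss, with a brute-force regression stump as the base learner (all names and data here are illustrative):

```python
# Gradient boosting for squared-error loss, a minimal sketch. Each
# stage fits a regression stump to the current residuals (the negative
# gradient of the squared error) and adds it, scaled by a learning
# rate, to the running ensemble.

def fit_stump(xs, residuals):
    # Best single split minimising squared error, found by brute force.
    best = None
    for t in xs:
        left  = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_stages=50, lr=0.3):
    f0 = sum(ys) / len(ys)                 # initial constant model
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)   # fit the pseudo-residuals
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]                  # target: y = x
model = gradient_boost(xs, ys)             # training error shrinks per stage
```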

Modern implementations — XGBoost (Chen & Guestrin, 2016), LightGBM, CatBoost — add second-order gradients, histogram-based binning, and regularisation. They dominate machine learning competitions on tabular data.

Stacking

Stacking feeds the outputs of several diverse base models as features into a meta-learner that learns how to combine them. To avoid overfitting, the base-level predictions are generated via cross-validation: each base model predicts the held-out fold, and these out-of-fold predictions form the meta-training set. This lets the meta-learner see honest predictions rather than ones the base models memorised.
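
A minimal sketch of generating those out-of-fold meta-features, with a deliberately trivial base learner (everything here is illustrative):

```python
# Out-of-fold meta-features for stacking, a minimal sketch. Each base
# model is refit K times, predicting only the fold it never saw, so
# the meta-learner trains on "honest" predictions.

def out_of_fold_predictions(fit, data, k=2):
    # fit(train) -> predict(x); data is a list of (x, y) pairs.
    folds = [data[i::k] for i in range(k)]     # simple interleaved folds
    meta = []
    for i, fold in enumerate(folds):
        # Train on every fold except the held-out one
        train = [pair for j, f in enumerate(folds) if j != i for pair in f]
        predict = fit(train)
        meta.extend((predict(x), y) for x, y in fold)
    return meta    # (base prediction, true label) pairs for the meta-learner

# Hypothetical base learner: always predict the training-set mean
def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

meta = out_of_fold_predictions(
    fit_mean, [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
)
```

A real stack would run this once per base model and fit the meta-learner on the collected columns of out-of-fold predictions.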