6.1 What is machine learning?

Machine learning is the discipline of building computer systems that improve with data. Rather than being told, step by step, exactly what to do, an ML system is given examples and a target, and it adjusts its own internal settings so that, over time, it gets better at the job. A spam filter is shown thousands of past emails marked "spam" or "not spam"; afterwards it can sort fresh emails it has never seen before. A camera is shown millions of labelled photographs; afterwards it can recognise a cat in a picture taken last night. The thing that changes from one program to the next is not the code; it is the data. Software written this way is, in a real sense, grown rather than authored.

Tom Mitchell's 1997 textbook captured this idea in three letters: a program learns from experience $E$ with respect to a class of tasks $T$ and performance measure $P$ if its performance at $T$, as measured by $P$, improves with $E$. We will use the $(T, P, E)$ triple as shorthand for framing a concrete project; the worked example at the end of this chapter (§6.17) opens by writing it down explicitly.

The first five chapters of this book built the foundations on which that growing process rests. Linear algebra gave us the language of vectors and matrices, the way to bundle many numbers into one tidy object and push them through a function in a single step. Calculus gave us derivatives, the engine that lets a program ask, "If I nudge this parameter slightly, does my error go up or down?" Probability gave us a vocabulary for uncertainty: distributions, expectations, conditional probabilities, the idea that the world is noisy and that a good model is one that assigns sensible probabilities to what it might see next. Statistics gave us the tools for inferring quantities from samples, distinguishing signal from noise, and quantifying our confidence. Optimisation tied them together by formalising what it means to find the best values for a model's parameters, given a precise definition of "best".

This chapter is the bridge from that machinery to the practical discipline that uses it. It introduces the working vocabulary — instances, features, labels, hypothesis classes, loss functions, learning algorithms, generalisation — that every ML practitioner has internalised. The framework covers linear regression, decision trees, and gigantic deep networks under one roof; later chapters specialise.

Symbols used here

Symbol | Meaning
$\mathbf{x}$ | input feature vector
$y$ | label or output
$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ | training data
$\mathcal{H}$ | hypothesis class (set of candidate models)
$h$ or $f$ | a candidate hypothesis (a function from inputs to predictions)
$\mathcal{L}$ | loss function

Three paradigms

Machine learning is most easily understood as three loosely separate paradigms, distinguished by what kind of data the system is given and what it is asked to do.

Supervised learning is the most familiar. The system is shown pairs $(\mathbf{x}, y)$, an input and its correct answer. The input $\mathbf{x}$ might be the pixels of a photograph; the answer $y$ might be the label "cat". Or $\mathbf{x}$ might be the words of an email and $y$ might be "spam" or "not spam". Or $\mathbf{x}$ might be the vital signs of a patient and $y$ might be "discharged within 24 hours". The goal is to learn a function that, given a fresh input it has never seen before, predicts the right $y$. Supervised learning is what most people mean when they say "machine learning" without further qualification, and it is the workhorse of nearly every commercial AI product.

Unsupervised learning is the case where you have only inputs, no answers. You give the system a stack of $\mathbf{x}$ values and ask it to discover something interesting about their structure on its own. Common tasks include grouping the inputs into clusters that resemble each other; finding a smaller set of underlying factors that explain most of the variation in the data; or estimating the underlying probability distribution from which the inputs were drawn. Unsupervised learning is essential whenever labels are expensive, scarce or impossible to obtain, which, in real-world settings, is most of the time.
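
To make the contrast with supervised learning concrete, here is a minimal clustering sketch in Python. It assumes scikit-learn is available; the two blobs of toy data are invented for illustration. Note that the algorithm receives only the inputs $\mathbf{x}$; no $y$ appears anywhere.

```python
# A minimal clustering sketch: the algorithm sees only inputs, no labels.
# Assumes scikit-learn is installed; the two blobs of toy data are invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),    # blob around (0, 0)
               rng.normal(5, 1, size=(50, 2))])   # blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the two discovered group centres
print(km.labels_[:5])        # cluster assignments for the first five points
```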

Reinforcement learning is the most ambitious of the three. Here the system is an agent embedded in an environment. The agent takes actions, the environment changes, and from time to time the environment hands back a reward, a single number that says "good move" or "bad move". The agent's job is to learn a strategy that, over many actions, accumulates as much reward as possible. Reinforcement learning is how programs learn to play chess and Go, how robots learn to walk, and how the latest large language models are nudged towards being more helpful and less harmful.

Modern practice has added a fourth category that does not fit neatly into the original three: self-supervised learning. Self-supervision is a clever trick. The data themselves contain labels if you know where to look. Take a sentence, hide the last word, and ask the model to guess it: you have just turned an unlabelled corpus into a supervised problem with billions of free training examples. Take an image, mask out a square, and ask the model to fill it in. Take a video, scramble the frames, and ask the model to put them back in order. Self-supervised learning is the engine behind today's foundation models, including the large language models discussed later in this book, and it is arguably the most consequential idea of the past decade.
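
The trick is easy to see in code. Here is a toy Python illustration of the hide-the-last-word construction; the two-sentence corpus is invented, and real systems operate on billions of sentences rather than two, but the mechanics are the same: unlabelled text in, supervised (context, target) pairs out.

```python
# The hide-the-last-word trick: an unlabelled corpus becomes supervised
# (context, target) pairs for free. The two-sentence corpus is invented.
corpus = [
    "the cat sat on the mat",
    "machine learning needs data",
]

pairs = []
for sentence in corpus:
    words = sentence.split()
    context, target = " ".join(words[:-1]), words[-1]   # hide the last word
    pairs.append((context, target))

print(pairs)
# [('the cat sat on the', 'mat'), ('machine learning needs', 'data')]
```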

The training problem, formally

Stripped to its mathematical essentials, supervised learning poses a single optimisation question. We will state it carefully, because it is the formal core of the subject, and because every algorithm later in the book is, at heart, a strategy for solving some version of it.

You are given:

  • A sample of training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, $n$ input–output pairs, drawn (we assume) independently from some unknown underlying distribution.
  • A hypothesis class $\mathcal{H}$, the set of candidate models you are willing to consider. Examples include "all straight lines", "all decision trees of depth at most 5", "all neural networks with this particular architecture", or, in modern practice, "all settings of the 70 billion parameters of this transformer".
  • A loss function $\mathcal{L}(h(\mathbf{x}), y)$, a precise rule for how unhappy you are when the model predicts $h(\mathbf{x})$ but the truth was $y$. For regression, the squared error $(h(\mathbf{x}) - y)^2$ is canonical. For classification, the cross-entropy is standard.

Your task is to choose:

$$\hat h \;=\; \arg\min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(h(\mathbf{x}_i), y_i).$$

This is empirical risk minimisation (ERM). Read in plain English: search through every candidate hypothesis you are willing to consider; for each one, compute the average loss over the training data; pick whichever scores lowest. The chosen hypothesis $\hat h$ is the one that fits the seen data best, in the sense defined by your loss function.
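
When the hypothesis class is small enough, the definition can be executed literally. The following Python sketch enumerates a grid of lines through the origin as $\mathcal{H}$, uses squared error as $\mathcal{L}$, and picks the candidate with the lowest average loss on an invented training sample. Realistic learners replace the exhaustive search with gradient methods, but the objective being solved is exactly the one above.

```python
# Empirical risk minimisation, executed literally on a tiny example.
# Hypothesis class H: lines through the origin h_w(x) = w*x, w on a grid.
# Loss L: squared error. Data: an invented noisy sample with true slope 2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
y = 2.0 * x + rng.normal(0, 0.1, size=30)

candidates = np.linspace(-5, 5, 1001)            # the whole of H, enumerated

def empirical_risk(w):
    return np.mean((w * x - y) ** 2)             # average loss over D

w_hat = min(candidates, key=empirical_risk)      # the arg-min in the formula
print(f"chosen slope: {w_hat:.2f}, empirical risk: {empirical_risk(w_hat):.4f}")
```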

The trouble is that fitting the seen data is not the goal. The goal is to do well on data the model has not seen. Formally, we care about the true risk

$$R(h) \;=\; \mathbb{E}_{(\mathbf{x}, y)}\bigl[\mathcal{L}(h(\mathbf{x}), y)\bigr],$$

the expected loss under the underlying data distribution, the loss the model would incur, on average, if it were deployed forever. The true risk is what we want to minimise; the empirical risk is what we can minimise, because we only have a finite sample. The whole of generalisation theory, which we shall meet later in this chapter, is about the gap between the two: when do they agree, when do they diverge, and what kind of guarantees can we put on the difference?

A hypothesis with low empirical risk and low true risk is said to generalise. A hypothesis with low empirical risk but high true risk has overfitted: it has memorised quirks of the training set that are not features of the world. A hypothesis with high empirical risk has underfitted: its hypothesis class is too restrictive to capture the pattern in the data at all. Most of the practical art of machine learning consists of steering between these two failure modes.
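
Both failure modes are easy to reproduce in miniature. In the following Python sketch (all data invented), polynomials of growing degree are fitted to twenty noisy points; the loss on the training sample stands in for the empirical risk, and the loss on a large fresh sample from the same distribution stands in for the true risk. The low-degree fit underfits both, while the high-degree fit typically drives the training loss towards zero as the fresh loss climbs.

```python
# Overfitting and underfitting in miniature. Polynomials of growing degree
# are fitted to 20 noisy points; loss on the training sample stands in for
# the empirical risk, loss on a large fresh sample for the true risk.
# All data are invented.
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

x_train, y_train = sample(20)
x_fresh, y_fresh = sample(1000)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    fresh_risk = np.mean((np.polyval(coeffs, x_fresh) - y_fresh) ** 2)
    print(f"degree {degree:2d}: train {train_risk:.3f}, fresh {fresh_risk:.3f}")
```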

It is worth pausing on each of the four ingredients of ERM, because beginners often conflate them. The hypothesis class $\mathcal{H}$ is a modelling decision, made by the human: do you believe the relationship between input and output is linear, tree-shaped, neural, or something else? The loss $\mathcal{L}$ is a value judgement: it encodes what you mean by "wrong", and different choices give materially different models from the same data. The training set $\mathcal{D}$ is a sampling decision: how representative your data are of the world you intend to deploy in determines almost everything that follows. Only the optimisation step, actually finding the $\hat h$ that minimises the average loss, is mechanical, and even there the algorithm matters because most realistic problems cannot be solved exactly and have to be approximated by gradient methods.

Hypothesis classes through the chapter

As you read on through the book, you will meet hypothesis classes of steadily growing capacity. Capacity is, loosely, the variety of functions a class can express; richer classes can model more intricate patterns, but at the cost of being easier to overfit and more demanding of data and compute.

Class | Where in book | Capacity
Linear models | §6, §7 | Low
Trees and ensembles | §7 | Medium
Kernel methods | §7 | High
Shallow neural networks | §9 | High
Deep networks | §9–15 | Very high

The progression in this textbook follows roughly historical and conceptual order. Linear models came first historically and are still the right starting point pedagogically: they are simple enough to analyse exactly, cheap enough to fit on a laptop, and surprisingly hard to beat in many settings. Trees and ensembles are the workhorses of tabular data and structured prediction. Kernel methods bridge linear models and the rich nonlinear world. Neural networks then take over, first shallow, then deep, ending with the gigantic transformer-based systems that power today's frontier AI. Each step trades simplicity and theoretical clarity for raw expressive power.

The ML workflow

Real machine-learning projects are not isolated mathematical exercises. They are, at their best, disciplined workflows, repeated many times, with each pass learning from the last. The canonical sequence runs as follows (a code sketch of steps 3 to 7 appears after the list):

  1. Frame the problem. Decide whether you are doing regression, classification, clustering, ranking, or something else, and state the success criterion in unambiguous terms.
  2. Collect data. Gather, clean, label, deduplicate, and document your dataset. Most of the work lives here.
  3. Split into train, validation, and test sets. Set the test set aside and do not look at it again until the very end.
  4. Choose a model class and a loss. Begin with a sensible baseline before reaching for anything sophisticated.
  5. Fit on the training data. Run your optimiser; monitor convergence; sanity-check the result.
  6. Tune on the validation data. Adjust hyper-parameters, regularisation strength, feature selection, architecture choices, all guided by validation performance.
  7. Evaluate on the test data. A single, clean number, reported once.
  8. Deploy and monitor. Watch for distribution drift, edge cases, fairness failures, and changes in the world that erode performance.
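
Steps 3 to 7 take only a few lines to express in code. The following is a minimal Python sketch assuming scikit-learn; the dataset, the choice of ridge regression as the model class, and the hyper-parameter grid are all invented for illustration. The structural point is the discipline around the splits: the test set is carved off first and touched exactly once, at the very end.

```python
# A minimal sketch of steps 3 to 7, assuming scikit-learn. The dataset,
# the choice of ridge regression, and the hyper-parameter grid are all
# invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, size=500)

# Step 3: split. The test set is put aside and not touched until the end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Steps 4-6: choose a model class and loss, fit on train, tune on validation.
best_alpha, best_val_mse = None, np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# Step 7: a single, clean number on the untouched test set, reported once.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```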

It is a near-universal observation that more than half of any production machine-learning project is spent on steps 1 to 3 and 8. Model fitting itself, the bit that gets the headlines, is rarely the bottleneck. The bottlenecks are getting hold of clean and representative data, framing the problem so that improvements in the metric correspond to improvements in the world, and keeping the system on-target after it ships. Newcomers to the field often arrive with the reverse intuition, expecting the modelling step to dominate; one of the most useful early adjustments is to flip that mental picture and budget your time accordingly.

The workflow is also iterative rather than linear. Step 7 frequently sends you back to step 1, because evaluation reveals that the original framing was wrong; step 8 sends you back to step 2, because the world has changed and your data are now stale; step 6 sends you back to step 4, because no choice of hyper-parameters rescues a model class that was not up to the task. Treating the eight steps as a one-way pipeline is one of the commonest causes of disappointing projects.

What this chapter covers

The rest of the chapter walks through paradigms (§6.2), the supervised setup (§6.3), generalisation (§6.4), optimisation and capacity (§§6.5–6.6), inductive bias (§6.7), model selection (§6.8), the curse of dimensionality (§6.9), pipeline discipline (§6.10), feature engineering versus representation learning (§6.11), modern phenomena (double descent, implicit regularisation; §§6.12–6.13), evaluation (§§6.14–6.16), and a worked end-to-end project (§6.17).

What you should take away

  1. Machine learning is the discipline of building systems whose behaviour is determined more by data than by hand-written rules, programs that get better at a task by being exposed to examples of it.
  2. The four working paradigms are supervised, unsupervised, reinforcement, and self-supervised learning, distinguished by what data the system sees and what kind of feedback it gets.
  3. The formal core of supervised learning is empirical risk minimisation: choose, from a stated hypothesis class, the function that minimises average loss on the training data.
  4. The hard problem is not minimising training loss; it is generalising to data the model has not seen, and the rest of the chapter is largely about how to do that honestly.
  5. Most of the work in real projects lives in framing, data, splits and deployment, not in the modelling step itself, however sophisticated the algorithm.
