- State the supervised ML framework in terms of data, hypothesis class, loss function, and optimisation
- Engineer features and representations appropriate to the data modality and model
- Evaluate models using metrics such as accuracy, precision, recall, F1 score, and ROC/AUC
- Apply regularisation (L1, L2, early stopping, dropout) to control overfitting and improve generalisation
- Use train/validation/test splits and k-fold cross-validation to estimate generalisation error honestly
You want to build a spam filter. The traditional approach is to write rules by hand: if the email contains "lottery" and "click here," flag it. But spammers adapt. Your rules go stale within weeks.
Machine learning takes a different approach. You show the algorithm thousands of emails, each labelled "spam" or "not spam," and it figures out the rules on its own. When spammers change tactics, you retrain on fresh examples. The filter adapts because the algorithm learns from data, not from your guesses about what spam looks like.
This chapter covers the ideas that make this work. You will learn the formal framework behind all ML algorithms, how to prepare features for a model, how to measure whether a model is actually useful, how to prevent overfitting, and how to estimate performance honestly with cross-validation. For deeper treatment, see Hastie, Tibshirani, and Friedman (2009), Bishop (2006), Murphy (2022), and Goodfellow, Bengio, and Courville (2016).
6.1 The ML Framework
Three Paradigms
Machine learning divides into three paradigms based on the feedback the algorithm gets:
- Supervised learning: you provide input–output pairs {(x_i, y_i)}. The algorithm learns a function f that maps inputs to outputs. If the output is a category (spam or not spam), it is classification. If the output is a number (house price), it is regression.
- Unsupervised learning: no labels are provided. The algorithm finds structure on its own — clusters of similar data points, or compact representations.
- Reinforcement learning: an agent takes actions in an environment and receives reward signals. It learns a policy that maximises cumulative reward over time.
Many modern systems combine elements of all three.
Hypothesis Space
Within any paradigm, you must choose a hypothesis space H — the set of candidate functions the model can learn. For linear models, H contains all functions f(x) = w^T^x + b. For decision trees, it contains all axis-aligned partitions. For a neural network, it contains all functions achievable by varying the weights.
The choice of hypothesis space is your inductive bias — your assumption about the shape of the true function. Too restrictive and the model cannot capture real patterns (underfitting). Too flexible and it fits noise (overfitting). The art of applied ML is finding the right balance.
Loss Functions
Learning works by minimising a loss function that measures how far the model's predictions are from the truth. Common choices:
- Mean squared error (regression): L = (1/n) Σ_i (y_i − f(x_i))^2^
- Cross-entropy (classification): L = −(1/n) Σ_i [y_i log f(x_i) + (1 − y_i) log(1 − f(x_i))]
The choice of loss encodes what you mean by "good." Squared error penalises large mistakes heavily. Absolute error is more robust to outliers. Hinge loss encourages wide-margin classifiers.
Optimisation
Gradient descent is the main optimisation tool. Starting from initial parameters, it iteratively updates: w ← w − η ∇L(w), where η is the learning rate.
Stochastic gradient descent (SGD) (Robbins, 1951) estimates the gradient from a random mini-batch instead of the full dataset. This cuts the cost per step from O(n) to O(b), where b is the batch size. Adaptive methods like Adam (Kingma, 2014) adjust the learning rate per parameter, speeding convergence on difficult loss surfaces.
For convex problems (like logistic regression), gradient descent finds the global optimum. For non-convex problems (like neural networks), it finds a local minimum or saddle point — but this is usually good enough in practice.
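The update rule above can be sketched in a few lines. This is a toy one-dimensional example (the function names are illustrative, not from any library): we minimise L(w) = (w − 3)², whose gradient is 2(w − 3), by repeated steps of w ← w − η ∇L(w).

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Follow the negative gradient from w0: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimise L(w) = (w - 3)^2; its gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With a well-chosen learning rate, `w_star` lands close to the minimiser w = 3; with a learning rate that is too large the iterates diverge, which is why η is a hyperparameter worth tuning.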
Training Error vs Generalisation Error
The goal of ML is to perform well on unseen data, not just on the training set. A model that memorises every training example achieves zero training error but may generalise poorly. That is overfitting.
Statistical learning theory makes this precise. The VC dimension measures the capacity of a hypothesis space. PAC (Probably Approximately Correct) bounds relate generalisation error to training error, model complexity, and sample size. These guarantees say: with enough data and an appropriately constrained model, learning will work.
The Full Pipeline
Model fitting is just one step. A real ML project involves:
- Data collection and cleaning
- Exploratory analysis
- Feature engineering
- Model selection and hyperparameter tuning
- Evaluation on a held-out test set
- Deployment with monitoring
Pitfalls lurk at every stage. Data leakage means accidentally including test-set information during training. Distribution shift means the deployment data looks different from the training data. Feedback loops mean the deployed model influences the data it later trains on. Understanding the full framework is your best defence.
6.2 Features & Representations
Models do not operate on raw observations. They operate on features — numbers extracted from the data. The quality of your features often matters more than the choice of model. Good features can make a simple linear model highly effective. Bad features can doom even the most powerful neural network.
Raw Data Types
Data comes in many forms:
- Tabular: rows of numbers and categories
- Images: pixel arrays
- Text: sequences of characters or tokens
- Audio: waveforms
- Graphs: adjacency matrices
Each needs a different transformation into numbers.
Encoding
For categorical variables, one-hot encoding creates a binary indicator per category. Ordinal encoding assigns integers that respect a natural order.
For text, classical approaches include bag-of-words (counting word frequencies) and TF-IDF (down-weighting common words). Modern transformer models learn their own representations directly from raw text.
For images, early systems used hand-crafted features like histograms of oriented gradients (HOG) and SIFT descriptors. Deep convolutional networks like AlexNet (Krizhevsky, 2012) largely replaced these with learned features.
The goal is always the same: produce a fixed-length numerical vector that captures what matters for the task and discards what does not.
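Two of the encodings above fit in a few lines each. This is a minimal sketch (the helper names are illustrative; real projects would use a library's vectorisers):

```python
def one_hot(value, categories):
    """Binary indicator vector: 1 at the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

one_hot("red", ["red", "green", "blue"])   # [1, 0, 0]
bag_of_words("free lottery click here click",
             ["lottery", "click", "meeting"])  # [1, 2, 0]
```

Both produce the fixed-length numerical vectors that downstream models require; TF-IDF would additionally divide each count by a measure of how common the word is across documents.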
Feature Scaling
Features on different scales (age in years vs income in pounds) cause problems. Gradient-based optimisers converge slowly when the loss surface is elongated.
Common scaling methods:
- Standardisation: subtract the mean, divide by the standard deviation. Result: zero mean, unit variance.
- Min–max normalisation: rescale to [0, 1].
- Robust scaling: use the median and IQR instead of mean and standard deviation. Less sensitive to outliers.
Scaling matters most for distance-based methods (KNN, SVMs) and for neural networks, where batch normalisation has become standard.
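The first two scaling methods can be sketched directly (a minimal illustration using the standard library; the function names are my own):

```python
import statistics

def standardise(values):
    """Zero mean, unit variance: (x - mean) / std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]

def min_max(values):
    """Rescale linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]
```

Note that in practice the mean, standard deviation, minimum, and maximum must be computed on the training set only and then reused on validation and test data, or the scaling itself leaks information.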
Feature Selection
Not all features help. Some are irrelevant. Some are redundant. Removing them improves interpretability, reduces compute, and can improve generalisation by cutting noise.
Three approaches:
- Filter methods: rank features by a statistical criterion (mutual information, chi-squared, correlation with target) and keep the top ones.
- Wrapper methods: try different feature subsets, train a model on each, and measure performance. Forward selection, backward elimination, and genetic algorithms search the space.
- Embedded methods: the model does selection during training. L1 regularisation drives uninformative weights to zero. Tree-based models score features by how much they reduce impurity.
Feature Extraction
Feature extraction creates new features from old ones, often reducing dimensionality:
- PCA (principal component analysis): projects data onto directions of maximal variance. Produces uncorrelated features.
- LDA (linear discriminant analysis): projects data to maximise class separation. Supervised alternative to PCA.
- t-SNE (Maaten, 2008) and UMAP (McInnes, 2018): non-linear methods that preserve local structure. Great for visualising high-dimensional data.
- Autoencoders: neural networks trained to reconstruct their input through a bottleneck. The bottleneck layer gives a compressed representation.
Representation Learning
Deep learning has reduced the need for manual feature engineering. CNNs learn visual features from pixels. Transformers learn word representations from text. Graph neural networks learn node features from adjacency structure.
This shift is called representation learning. The model jointly learns the features and the prediction function, end-to-end. But understanding feature engineering still matters. It guides architecture choices, informs data augmentation strategies, and remains essential in domains where data is scarce.
6.3 Model Evaluation
A model is only useful if it generalises to unseen data. Evaluation measures that ability.
Train/Validation/Test Split
The simplest approach splits data into three parts:
- Training set: used to fit the model.
- Validation set: used to tune hyperparameters and choose between models.
- Test set: used once, at the end, for a final performance estimate.
The test set must be strictly held out during all development. Using it for any decision during training is data leakage — the most common and most damaging mistake in applied ML.
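The three-way split can be sketched as index bookkeeping (a minimal illustration; the function name and 60/20/20 fractions are arbitrary choices for the example):

```python
import random

def three_way_split(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle example indices once, then carve out the three partitions.
    The returned test indices must stay untouched until the final estimate."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed makes the split reproducible
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Shuffling before splitting matters: if the data file is ordered (by date, by class, by source), an unshuffled split gives training and test sets drawn from different distributions.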
Classification Metrics
Accuracy — the fraction of correct predictions — is intuitive but misleading with imbalanced classes. If 99% of emails are legitimate, a model that always says "not spam" gets 99% accuracy while being useless.
Better metrics:
- Precision: of all positive predictions, how many are correct?
- Recall: of all actual positives, how many did the model catch?
- F1 score: harmonic mean of precision and recall.
- Confusion matrix: tabulates true positives, true negatives, false positives, and false negatives.
For multi-class problems, you can macro-average (compute per-class, then average) or micro-average (pool all predictions first).
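The definitions above reduce to counting the confusion-matrix cells. A minimal binary-classification sketch (function name is my own):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, with three actual positives of which the model catches two, plus one false alarm, both precision and recall come out at 2/3, and so does their harmonic mean.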
Threshold-Independent Metrics
The ROC curve plots true-positive rate against false-positive rate as the classification threshold varies. The AUC-ROC (area under the ROC curve) is 0.5 for a random classifier and 1.0 for a perfect one.
Under severe class imbalance, the precision–recall curve and its AUC give a more honest picture.
Calibration curves check whether predicted probabilities match reality. A model that predicts 70% probability of rain should see rain about 70% of the time. Good calibration is essential when probabilities feed into decisions, as in medicine and risk assessment.
Regression Metrics
- MSE and RMSE: average squared deviation. Sensitive to outliers.
- MAE: average absolute deviation. More robust.
- R² = 1 − (SS_res / SS_tot): proportion of variance explained. R² = 1 is perfect; R² = 0 means no better than predicting the mean; negative means worse.
- MAPE: percentage error. Scale-independent but undefined when true values are zero.
Choose the metric that reflects the real-world cost of different error types.
Statistical Rigour
A single train–test split can give misleading results. Cross-validation (Section 6.5) addresses this. When comparing two models, a paired t-test or Wilcoxon signed-rank test on per-fold scores tests whether the difference is statistically significant. Bootstrap confidence intervals provide another route. Always report both a point estimate and an interval.
6.4 Regularisation
Without regularisation, a flexible model fits both the real patterns and the noise in the training data. Training error is low, but test error is high. That is overfitting. Regularisation constrains the model to prefer simpler solutions, pushing the bias–variance trade-off toward lower total error.
L2 Regularisation (Ridge / Weight Decay)
Add a penalty proportional to the squared norm of the weights:
Lreg = L + λ ‖w‖^2^
This discourages large weights, shrinking them toward zero without setting any exactly to zero. The Bayesian interpretation (Chapter 4) is a Gaussian prior on the parameters: the penalty is the negative log-prior. The hyperparameter λ controls the trade-off. Large λ means more bias, less variance. Small λ means less bias, more variance. Choose λ by cross-validation.
L1 Regularisation (Lasso)
Replace the squared norm with the absolute-value norm:
Lreg = L + λ ‖w‖1
The geometry of the L1 ball (a diamond in 2D) means the optimum often sits at a corner where some weights are exactly zero. L1 regularisation performs feature selection, producing sparse models.
The elastic net combines both penalties, blending sparsity with the grouping effect (correlated features are selected or dropped together). Both L1 and L2 keep the objective convex when the original loss is convex.
Deep Learning Regularisation
Neural networks use additional techniques beyond explicit penalties:
- Dropout (Srivastava, 2014): randomly sets each neuron's output to zero with probability p during training. The network must develop redundant representations. At test time, outputs are scaled by (1 − p). This approximates Bayesian inference over an ensemble of sub-networks.
- Batch normalisation (Ioffe, 2015): normalises activations to zero mean and unit variance within each mini-batch. Reduces internal covariate shift and acts as an implicit regulariser.
- Data augmentation: applies label-preserving transformations (crops, flips, rotations, colour jitter) to training images. This effectively increases the training set and discourages reliance on superficial features.
Early Stopping
The simplest regulariser: stop training when validation loss starts rising, even though training loss keeps falling. The number of training iterations acts as a complexity control. Formally, early stopping on a quadratic loss is approximately equivalent to L2 regularisation, with strength depending on the learning rate and iteration count.
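The stopping rule is usually implemented with a patience counter. This is a minimal sketch (the function and the simulated loss curve are illustrative, not from any framework):

```python
def train_with_early_stopping(step, val_loss, patience=3, max_steps=1000):
    """Run training steps; stop once validation loss has failed to
    improve for `patience` consecutive evaluations."""
    best = float("inf")
    since_best = 0
    for _ in range(max_steps):
        step()                    # one unit of training (epoch or mini-batch)
        loss = val_loss()         # evaluate on the held-out validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break             # validation loss has stopped improving
    return best

# Simulated validation-loss curve: improves, then overfits and rises.
losses = iter([0.9, 0.7, 0.5, 0.55, 0.6, 0.65, 0.1])
state = {}
def fake_step(): state["loss"] = next(losses)
best = train_with_early_stopping(fake_step, lambda: state["loss"])
```

In the simulation, training halts after three non-improving evaluations and reports the best loss of 0.5; real implementations would also checkpoint and restore the weights from that best step.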
Modern Techniques
- Label smoothing (Szegedy, 2016): replaces hard one-hot targets with a softened mix, preventing overconfidence.
- Mixup (Zhang, 2017): creates synthetic examples by blending pairs of inputs and labels, encouraging linear boundaries between classes.
- Weight tying: shares parameters between parts of a network (e.g., encoder and decoder embeddings), reducing free parameters.
- Spectral normalisation: constrains the spectral norm of weight matrices, controlling the network's Lipschitz constant and stabilising GAN training.
All of these reflect the same principle: restrict the functions the model can represent, so it favours solutions that generalise.
6.5 Cross-Validation
When data is limited, a single train–test split is unreliable. Cross-validation gives you a more robust estimate by using every data point for both training and testing.
K-Fold Cross-Validation
Split the data into k equal folds. For each fold:
- Hold out that fold as the validation set.
- Train on the remaining k − 1 folds.
- Record the validation performance.
The final estimate is the average across all k folds. Common choices are k = 5 or k = 10. The extreme case k = n is leave-one-out (LOOCV): nearly unbiased but can have high variance and is expensive for large datasets.
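The fold loop above can be sketched as index bookkeeping. This illustration takes a caller-supplied `fit_score(train_idx, val_idx)` callback rather than a real model (all names are my own):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(fit_score, n, k=5):
    """Hold out each fold in turn, train on the rest, average the scores."""
    folds = k_fold_indices(n, k)
    scores = []
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        scores.append(fit_score(train_idx, val_idx))
    return sum(scores) / k
```

In practice you would shuffle (or stratify) the indices before forming folds; the contiguous folds here assume the data arrive in random order.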
Specialised Variants
Standard k-fold does not always work:
- Stratified k-fold: preserves the class distribution in each fold. Essential for imbalanced datasets.
- Time-series CV: trains on data up to a cutoff and tests on the next window, advancing the cutoff each round. Respects temporal ordering.
- Group k-fold: keeps all observations from one group (e.g., one patient) in the same fold. Prevents leakage from within-group correlation.
Choosing the wrong CV scheme produces optimistic estimates that do not reflect real-world performance.
Model Selection vs Performance Estimation
Cross-validation serves two purposes. Model selection uses it to pick the best model or hyperparameters. Performance estimation uses it to report how well the chosen model generalises. Using the same CV procedure for both biases the estimate upward.
Nested cross-validation solves this. An inner loop selects the best model. An outer loop evaluates it on held-out data. The outer-fold averages give an approximately unbiased generalisation estimate.
Computational Constraints
Full k-fold CV for every hyperparameter of a deep network may be infeasible. Practical strategies:
- Use a single validation split for initial exploration. Reserve CV for final comparison.
- Apply early stopping within each fold.
- Use smaller models or subsampled data for hyperparameter search, then validate the final choice on full data.
- Bayesian optimisation and Hyperband allocate compute to the most promising candidates.
Limitations
Cross-validation is not perfect. Variance can be substantial with small datasets. Fold-level scores are not independent (they share training data), so confidence intervals are approximate. Methods like the corrected resampled t-test of Nadeau and Bengio adjust for this but are not exact.
Cross-validation estimates the performance of the model-fitting procedure, not the specific model you will train on all the data. Treat CV scores as informative guides. Complement them with held-out test evaluation and, where possible, domain-specific validation — such as prospective clinical trials for medical AI.