Exercises
The following exercises range from short conceptual checks to multi-hour mini-projects. Solutions to twelve of them follow.
Conceptual.
- State Mitchell's definition of learning. Apply it to: (a) a chess engine that improves by self-play; (b) a fixed look-up table of past chess positions and outcomes used to play new games. Which is "learning"?
- Give a concrete example, drawn from medicine, of a problem better handled by unsupervised than by supervised learning.
- Why is the i.i.d. assumption usually violated in time-series problems? Name two consequences.
- Distinguish between a model's hypothesis class, its parameters, and its hyperparameters. Give an example of each for a random forest.
- State the bias–variance decomposition for squared loss. What are the three terms? Which can be reduced by collecting more data? (The identity is written out after this list for reference.)
- State the No Free Lunch theorem informally. Why does this not contradict the existence of useful learning algorithms?
- Define VC dimension. What is the VC dimension of: (a) thresholds on the real line; (b) linear classifiers in $\mathbb{R}^d$; (c) intervals on the real line?
- Explain the difference between weight decay and L2 regularisation in the loss when using Adam. Which is implemented by `torch.optim.AdamW`? (A sketch of the two update rules follows this list.)
- What is the manifold hypothesis? Give one piece of empirical evidence for it.
- Explain why random search often outperforms grid search for hyperparameter tuning when only a few hyperparameters matter.
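For reference in the bias–variance exercise, the identity for squared loss, assuming $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2$, and writing $\hat f_{\mathcal{D}}$ for the model fit on a random training set $\mathcal{D}$:

$$
\mathbb{E}_{\mathcal{D},\epsilon}\!\left[\big(y - \hat f_{\mathcal{D}}(x)\big)^2\right]
= \underbrace{\sigma^2}_{\text{irreducible noise}}
+ \underbrace{\big(f(x) - \mathbb{E}_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}_{\mathcal{D}}\!\big(\hat f_{\mathcal{D}}(x)\big)}_{\text{variance}}.
$$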
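For the weight-decay question, a minimal scalar sketch contrasting the two variants. This is illustrative only, not a faithful re-implementation of `torch.optim.AdamW`; the function name and default values are assumptions.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=1e-2, decoupled=True):
    """One scalar Adam step, with either L2-in-the-loss or decoupled decay."""
    if not decoupled:
        grad = grad + lam * w              # L2: decay enters the gradient, so it is
                                           # rescaled by the adaptive denominator below
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * lam * w               # decoupled: decay bypasses the adaptive scaling
    return w, m, v
```

Running both variants from the same state shows that the in-loss version's effective decay is attenuated wherever $v$ is large, which is the crux of the exercise.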
Computational.
- Implement linear regression in NumPy via the normal equations. On a synthetic dataset with $n = 200$, $d = 5$, and noise variance $\sigma^2 = 0.5^2$, plot training MSE and test MSE as a function of training-set size. (Starter sketches for this and several of the following exercises appear after the list.)
- Implement ridge regression and demonstrate, using the SVD, that it shrinks the components of $\hat w$ along directions of small singular value more than those along large ones. (Sketch below.)
- Fit polynomial regressions of degree $d = 1, 2, \ldots, 15$ on a small synthetic dataset (e.g. 30 points sampled from $y = \sin(x) + \epsilon$). Plot training MSE and test MSE against degree on the same axes; the gap widening beyond some sweet-spot degree is the canonical overfitting signature. (Sketch below.)
- Using `make_classification` from scikit-learn with `n_classes=2` and `weights=[0.99, 0.01]`, build a classifier and report accuracy, F1, AUC-ROC, and AUC-PR. Comment on the discrepancies between the metrics. (Sketch below.)
- On the same imbalanced dataset, compare three approaches to imbalance: class-weighted loss, SMOTE oversampling, and post-hoc threshold tuning. Report AUC-PR and ECE.
- Implement 5-fold cross-validation from scratch (no scikit-learn shortcuts) for logistic regression on the breast-cancer dataset. Report the mean and standard deviation of the AUC. (Sketch below.)
- Implement nested cross-validation over the regularisation strength of an SVM. Report nested-CV AUC.
- Demonstrate target leakage on a synthetic dataset by including a feature computed from the label. Show that AUC drops to chance once the leak is removed. (Sketch below.)
- Reproduce the double-descent phenomenon for a fully connected MLP on a small subset of MNIST. Plot test error against width.
- Implement temperature scaling on a deep network's logits. Compare ECE before and after. (Sketch below.)
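Starter sketch for the normal-equations exercise, using the setup the exercise specifies; the plotting step is omitted and the particular training sizes swept are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + sigma * rng.standard_normal(n)

X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true + sigma * rng.standard_normal(1000)

for m in range(d + 5, n + 1, 20):            # training-set sizes to sweep
    Xm, ym = X[:m], y[:m]
    # Normal equations: w = (X^T X)^{-1} X^T y, via a linear solve for stability.
    w_hat = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)
    train_mse = np.mean((Xm @ w_hat - ym) ** 2)
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"m={m:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```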
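For the ridge exercise, the identity to verify is that with $X = U \Sigma V^\top$, ridge rescales the $i$-th component of the least-squares solution in the $V$ basis by $\sigma_i^2 / (\sigma_i^2 + \lambda)$, so small-$\sigma_i$ directions are shrunk hardest. A minimal sketch, with an arbitrary spectrum and $\lambda = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)) @ np.diag([5.0, 3.0, 1.0, 0.3, 0.1])  # spread spectrum
y = X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = 1.0
w_ols   = Vt.T @ np.diag(1 / s) @ U.T @ y               # least squares
w_ridge = Vt.T @ np.diag(s / (s**2 + lam)) @ U.T @ y    # ridge in the SVD basis

# In the V basis, ridge components equal OLS components times s^2 / (s^2 + lam).
print("singular values:", np.round(s, 2))
print("shrinkage      :", np.round(s**2 / (s**2 + lam), 3))
print("|V^T w_ols|    :", np.round(np.abs(Vt @ w_ols), 3))
print("|V^T w_ridge|  :", np.round(np.abs(Vt @ w_ridge), 3))
```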
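Starter sketch for the polynomial-degree sweep, using `numpy.polyfit`; the noise level and interval are arbitrary, plotting is again left to the reader, and high degrees may emit a conditioning warning.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 30))
y = np.sin(x) + 0.3 * rng.standard_normal(30)
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test) + 0.3 * rng.standard_normal(200)

for deg in range(1, 16):
    coeffs = np.polyfit(x, y, deg)            # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={deg:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```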
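Starter sketch for the imbalanced-metrics exercise; logistic regression is an arbitrary choice of classifier, and `average_precision_score` is used as the usual stand-in for AUC-PR.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, average_precision_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_classes=2,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

print("accuracy:", accuracy_score(y_te, pred))   # inflated by the majority class
print("F1      :", f1_score(y_te, pred))
print("AUC-ROC :", roc_auc_score(y_te, prob))
print("AUC-PR  :", average_precision_score(y_te, prob))
```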
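Starter sketch for the cross-validation exercise; only the fold construction is hand-rolled here, with scikit-learn still supplying the dataset, the model, and the AUC computation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
folds = np.array_split(idx, 5)        # 5 roughly equal folds, no sklearn splitter

aucs = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    clf = LogisticRegression(max_iter=5000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```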
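Starter sketch for the leakage demonstration; the legitimate features are deliberately pure noise, so all predictive power comes from the leaked column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
X = rng.standard_normal((n, 5))            # genuinely uninformative features
y = rng.integers(0, 2, n)                  # labels independent of X
leak = y + 0.1 * rng.standard_normal(n)    # "feature" computed from the label

for name, feats in [("with leak", np.column_stack([X, leak])),
                    ("without leak", X)]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name:12s} AUC = {auc:.3f}")   # ~1.0 with the leak, ~0.5 without
```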
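Starter sketch for temperature scaling, in PyTorch. `fit_temperature` follows the standard recipe of minimising validation NLL over a single scalar $T$; the `ece` helper and the bin count are illustrative choices, and the commented lines at the end show the intended usage with your own validation logits.

```python
import torch

def ece(probs, labels, n_bins=15):
    """Expected calibration error of top-class confidences."""
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    bins = torch.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            total += mask.float().mean() * (correct[mask].mean() - conf[mask].mean()).abs()
    return total

def fit_temperature(logits, labels):
    """Scalar T minimising NLL on a held-out set (standard temperature scaling)."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss
    opt.step(closure)
    return log_t.exp().item()

# logits, labels = ...  (validation-set logits and targets from your network)
# T = fit_temperature(logits, labels)
# print(ece(logits.softmax(1), labels), ece((logits / T).softmax(1), labels))
```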
Applied.
- Take a tabular Kaggle dataset of your choice. Document a pipeline that audits for: target leakage, group leakage, temporal leakage, and preprocessing leakage. Make any necessary corrections.
- Choose a clinical-prediction task from the MIMIC-III or eICU databases. Build a model. Report AUC, AUC-PR, ECE. Compare against the published benchmark for that task.
- Train an autoencoder on a low-resolution image dataset. Visualise the bottleneck representation with PCA and t-SNE (a visualisation sketch follows this list). Discuss what is preserved and what is lost.
- Evaluate the calibration of a publicly available open-source LLM on a multiple-choice benchmark. Apply temperature scaling. Compare ECE before and after.
- Take any benchmark from §6.16 and quantify the contamination by querying a model with the question and asking it to recall the answer verbatim. Report the recall rate.
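For the autoencoder exercise, the visualisation step is short once the bottleneck activations are in hand. This sketch assumes hypothetical arrays `codes` of shape `(n_samples, bottleneck_dim)` and integer `labels` used only for colouring.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# codes: (n_samples, bottleneck_dim) bottleneck activations; labels: class ids.
for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, init="pca"))]:
    z = reducer.fit_transform(codes)
    plt.figure()
    plt.scatter(z[:, 0], z[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(f"Bottleneck representation, {name}")
plt.show()
```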
Theoretical.
- Prove that the conditional mean $\mathbb{E}[y \mid x]$ minimises the expected squared error over all predictors of $y$ from $x$. (Hint: tower property.)
- Show that ridge regression with regulariser $\lambda$ is equivalent to MAP estimation under a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ for some $\tau$. Express $\tau$ in terms of $\lambda$ and the noise variance. (A sketch of the key step follows this list.)
- Show that early stopping for gradient descent on a quadratic loss is equivalent to L2 regularisation. What is the equivalent regularisation strength after $t$ steps?
- Show that maximising AUC-ROC is equivalent to minimising the Wilcoxon–Mann–Whitney rank loss.
- Prove that for a binary classifier, log-loss is a strict proper scoring rule (i.e., its expectation is uniquely minimised by the true posterior probability).
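For the ridge-as-MAP exercise, a sketch of the key step, assuming the likelihood $y_i \mid x_i, w \sim \mathcal{N}(w^\top x_i, \sigma^2)$ and the stated prior:

$$
\hat w_{\text{MAP}}
= \arg\max_w \,\big[\log p(y \mid X, w) + \log p(w)\big]
= \arg\min_w \,\frac{1}{2\sigma^2}\lVert y - Xw \rVert^2 + \frac{1}{2\tau^2}\lVert w \rVert^2 ,
$$

and rescaling by $2\sigma^2$ yields the ridge objective; matching coefficients then gives the requested relation between $\tau$, $\lambda$, and the noise variance.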