Exercises
The following exercises range from short conceptual checks to multi-hour mini-projects. Solutions to twelve of them follow.
Conceptual.
- State Mitchell's definition of learning. Apply it to: (a) a chess engine that improves by self-play; (b) a fixed look-up table of past chess positions and outcomes used to play new games. Which is "learning"?
- Give a concrete example, drawn from medicine, of a problem better handled by unsupervised than by supervised learning.
- Why is the i.i.d. assumption usually violated in time-series problems? Name two consequences.
- Distinguish between a model's hypothesis class, its parameters, and its hyperparameters. Give an example of each for a random forest.
- State the bias–variance decomposition for squared loss. What are the three terms? Which can be reduced by collecting more data? (The identity is written out after this list for reference.)
- State the No Free Lunch theorem informally. Why does this not contradict the existence of useful learning algorithms?
- Define VC dimension. What is the VC dimension of: (a) thresholds on the real line; (b) linear classifiers in $\mathbb{R}^d$; (c) intervals on the real line?
- Explain the difference between weight decay and L2 regularisation in the loss when using Adam. Which is implemented by `torch.optim.AdamW`? (A sketch of the two update rules follows this list.)
- What is the manifold hypothesis? Give one piece of empirical evidence for it.
- Explain why random search often outperforms grid search for hyperparameter tuning when only a few hyperparameters matter.
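For reference in the bias–variance exercise, the identity for squared loss, assuming $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2$, and writing $\hat f_{\mathcal{D}}$ for the model fit on a random training set $\mathcal{D}$:

$$
\mathbb{E}_{\mathcal{D},\epsilon}\!\left[\big(y - \hat f_{\mathcal{D}}(x)\big)^2\right]
= \underbrace{\sigma^2}_{\text{irreducible noise}}
+ \underbrace{\big(f(x) - \mathbb{E}_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}_{\mathcal{D}}\!\big(\hat f_{\mathcal{D}}(x)\big)}_{\text{variance}}.
$$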
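For the weight-decay question, a minimal scalar sketch contrasting the two variants. This is illustrative only, not a faithful re-implementation of `torch.optim.AdamW`; the function name and default values are assumptions.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=1e-2, decoupled=True):
    """One scalar Adam step, with either L2-in-the-loss or decoupled decay."""
    if not decoupled:
        grad = grad + lam * w              # L2: decay enters the gradient, so it is
                                           # rescaled by the adaptive denominator below
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * lam * w               # decoupled: decay bypasses the adaptive scaling
    return w, m, v
```

Running both variants from the same state shows that the in-loss version's effective decay is attenuated wherever $v$ is large, which is the crux of the exercise.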
Computational.
- Implement linear regression in NumPy via the normal equations. On a synthetic dataset with $n = 200$, $d = 5$, and noise variance $\sigma^2 = 0.5^2$, plot training MSE and test MSE as a function of training-set size. (Starter sketches for this and several of the following exercises appear after the list.)
- Implement ridge regression and demonstrate, using the SVD, that it shrinks the components of $\hat w$ along directions of small singular value more than those along large ones. (Sketch below.)
- Fit polynomial regressions of degree $d = 1, 2, \ldots, 15$ on a small synthetic dataset (e.g. 30 points sampled from $y = \sin(x) + \epsilon$). Plot training MSE and test MSE against degree on the same axes; the gap widening beyond some sweet-spot degree is the canonical overfitting signature. (Sketch below.)
- Using `make_classification` from scikit-learn with `n_classes=2` and `weights=[0.99, 0.01]`, build a classifier and report accuracy, F1, AUC-ROC, and AUC-PR. Comment on the discrepancies between the metrics. (Sketch below.)
- On the same imbalanced dataset, compare three approaches to imbalance: class-weighted loss, SMOTE oversampling, and post-hoc threshold tuning. Report AUC-PR and ECE.
- Implement 5-fold cross-validation from scratch (no scikit-learn shortcuts) for logistic regression on the breast-cancer dataset. Report the mean and standard deviation of the AUC. (Sketch below.)
- Implement nested cross-validation over the regularisation strength of an SVM. Report nested-CV AUC.
- Demonstrate target leakage on a synthetic dataset by including a feature computed from the label. Show that AUC drops to chance once the leak is removed. (Sketch below.)
- Reproduce the double-descent phenomenon for a fully connected MLP on a small subset of MNIST. Plot test error against width.
- Implement temperature scaling on a deep network's logits. Compare ECE before and after. (Sketch below.)
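Starter sketch for the normal-equations exercise, using the setup the exercise specifies; the plotting step is omitted and the particular training sizes swept are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + sigma * rng.standard_normal(n)

X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true + sigma * rng.standard_normal(1000)

for m in range(d + 5, n + 1, 20):            # training-set sizes to sweep
    Xm, ym = X[:m], y[:m]
    # Normal equations: w = (X^T X)^{-1} X^T y, via a linear solve for stability.
    w_hat = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)
    train_mse = np.mean((Xm @ w_hat - ym) ** 2)
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"m={m:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```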
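For the ridge exercise, the identity to verify is that with $X = U \Sigma V^\top$, ridge rescales the $i$-th component of the least-squares solution in the $V$ basis by $\sigma_i^2 / (\sigma_i^2 + \lambda)$, so small-$\sigma_i$ directions are shrunk hardest. A minimal sketch, with an arbitrary spectrum and $\lambda = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)) @ np.diag([5.0, 3.0, 1.0, 0.3, 0.1])  # spread spectrum
y = X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = 1.0
w_ols   = Vt.T @ np.diag(1 / s) @ U.T @ y               # least squares
w_ridge = Vt.T @ np.diag(s / (s**2 + lam)) @ U.T @ y    # ridge in the SVD basis

# In the V basis, ridge components equal OLS components times s^2 / (s^2 + lam).
print("singular values:", np.round(s, 2))
print("shrinkage      :", np.round(s**2 / (s**2 + lam), 3))
print("|V^T w_ols|    :", np.round(np.abs(Vt @ w_ols), 3))
print("|V^T w_ridge|  :", np.round(np.abs(Vt @ w_ridge), 3))
```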
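Starter sketch for the polynomial-degree sweep, using `numpy.polyfit`; the noise level and interval are arbitrary, plotting is again left to the reader, and high degrees may emit a conditioning warning.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 30))
y = np.sin(x) + 0.3 * rng.standard_normal(30)
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test) + 0.3 * rng.standard_normal(200)

for deg in range(1, 16):
    coeffs = np.polyfit(x, y, deg)            # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={deg:2d}  train={train_mse:.3f}  test={test_mse:.3f}")
```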
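Starter sketch for the imbalanced-metrics exercise; logistic regression is an arbitrary choice of classifier, and `average_precision_score` is used as the usual stand-in for AUC-PR.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, average_precision_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_classes=2,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

print("accuracy:", accuracy_score(y_te, pred))   # inflated by the majority class
print("F1      :", f1_score(y_te, pred))
print("AUC-ROC :", roc_auc_score(y_te, prob))
print("AUC-PR  :", average_precision_score(y_te, prob))
```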
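Starter sketch for the cross-validation exercise; only the fold construction is hand-rolled here, with scikit-learn still supplying the dataset, the model, and the AUC computation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
folds = np.array_split(idx, 5)        # 5 roughly equal folds, no sklearn splitter

aucs = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    clf = LogisticRegression(max_iter=5000).fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))

print(f"AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```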
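Starter sketch for the leakage demonstration; the legitimate features are deliberately pure noise, so all predictive power comes from the leaked column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
X = rng.standard_normal((n, 5))            # genuinely uninformative features
y = rng.integers(0, 2, n)                  # labels independent of X
leak = y + 0.1 * rng.standard_normal(n)    # "feature" computed from the label

for name, feats in [("with leak", np.column_stack([X, leak])),
                    ("without leak", X)]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name:12s} AUC = {auc:.3f}")   # ~1.0 with the leak, ~0.5 without
```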
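Starter sketch for temperature scaling, in PyTorch. `fit_temperature` follows the standard recipe of minimising validation NLL over a single scalar $T$; the `ece` helper and the bin count are illustrative choices, and the commented lines at the end show the intended usage with your own validation logits.

```python
import torch

def ece(probs, labels, n_bins=15):
    """Expected calibration error of top-class confidences."""
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    bins = torch.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            total += mask.float().mean() * (correct[mask].mean() - conf[mask].mean()).abs()
    return total

def fit_temperature(logits, labels):
    """Scalar T minimising NLL on a held-out set (standard temperature scaling)."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss
    opt.step(closure)
    return log_t.exp().item()

# logits, labels = ...  (validation-set logits and targets from your network)
# T = fit_temperature(logits, labels)
# print(ece(logits.softmax(1), labels), ece((logits / T).softmax(1), labels))
```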
Applied.
- Take a tabular Kaggle dataset of your choice. Document a pipeline that audits for: target leakage, group leakage, temporal leakage, and preprocessing leakage. Make any necessary corrections.
- Choose a clinical-prediction task from the MIMIC-III or eICU databases. Build a model. Report AUC, AUC-PR, ECE. Compare against the published benchmark for that task.
- Train an autoencoder on a low-resolution image dataset. Visualise the bottleneck representation with PCA and t-SNE (a visualisation sketch follows this list). Discuss what is preserved and what is lost.
- Evaluate the calibration of a publicly available open-source LLM on a multiple-choice benchmark. Apply temperature scaling. Compare ECE before and after.
- Take any benchmark from §6.16 and quantify the contamination by querying a model with the question and asking it to recall the answer verbatim. Report the recall rate.
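For the autoencoder exercise, the visualisation step is short once the bottleneck activations are in hand. This sketch assumes hypothetical arrays `codes` of shape `(n_samples, bottleneck_dim)` and integer `labels` used only for colouring.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# codes: (n_samples, bottleneck_dim) bottleneck activations; labels: class ids.
for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, init="pca"))]:
    z = reducer.fit_transform(codes)
    plt.figure()
    plt.scatter(z[:, 0], z[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(f"Bottleneck representation, {name}")
plt.show()
```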
Theoretical.
- Prove that the conditional mean $\mathbb{E}[y \mid x]$ minimises the expected squared error over all predictors of $y$ from $x$. (Hint: tower property.)
- Show that ridge regression with regulariser $\lambda$ is equivalent to MAP estimation under a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ for some $\tau$. Express $\tau$ in terms of $\lambda$ and the noise variance. (A sketch of the key step follows this list.)
- Show that early stopping for gradient descent on a quadratic loss is equivalent to L2 regularisation. What is the equivalent regularisation strength after $t$ steps?
- Show that maximising AUC-ROC is equivalent to minimising the Wilcoxon–Mann–Whitney rank loss.
- Prove that for a binary classifier, log-loss is a strict proper scoring rule (i.e., its expectation is uniquely minimised by the true posterior probability).
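For the ridge-as-MAP exercise, a sketch of the key step, assuming the likelihood $y_i \mid x_i, w \sim \mathcal{N}(w^\top x_i, \sigma^2)$ and the stated prior:

$$
\hat w_{\text{MAP}}
= \arg\max_w \,\big[\log p(y \mid X, w) + \log p(w)\big]
= \arg\min_w \,\frac{1}{2\sigma^2}\lVert y - Xw \rVert^2 + \frac{1}{2\tau^2}\lVert w \rVert^2 ,
$$

and rescaling by $2\sigma^2$ yields the ridge objective; matching coefficients then gives the requested relation between $\tau$, $\lambda$, and the noise variance.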