Selected solutions

Solution to Exercise 1. Mitchell's definition: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$." (a) The chess engine that improves by self-play: $T$ = playing chess, $P$ = win rate, $E$ = self-play games. Performance improves with experience. This is learning. (b) The look-up table: $T$ = playing chess, $P$ = win rate, $E$ = past games. The table does not generalise. When given a position not in the table, it has no answer. While its coverage may grow with experience, this is rote memorisation, not learning in Mitchell's sense, because $P$ does not improve on novel positions. The question highlights the point that learning requires generalisation, not just storage.

Solution to Exercise 3. In a time-series problem, the value of the target at time $t$ is typically correlated with values at $t-1$, $t-2$, and so on. The samples are not independent. Two consequences: (a) Random k-fold cross-validation gives an over-optimistic estimate of generalisation, because future-fold validation points are well-predicted by training-fold points that happen to be a few timesteps away. Use forward-chaining (walk-forward) CV instead. (b) Confidence intervals computed assuming independent samples are too narrow. Block bootstrapping or model-based variance estimation is required.
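
A minimal sketch of forward-chaining CV with scikit-learn's TimeSeriesSplit, shown on a synthetic AR(1) series (the series and the one-lag linear model are illustrative assumptions, not part of the exercise):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
T = 500
y = np.zeros(T)
for t in range(1, T):                      # AR(1) series: consecutive samples are correlated
    y[t] = 0.9 * y[t - 1] + rng.normal(scale=0.1)

X, target = y[:-1, None], y[1:]            # predict y[t] from y[t-1]
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # every validation index comes strictly after every training index
    model = LinearRegression().fit(X[train_idx], target[train_idx])
    mse = np.mean((model.predict(X[val_idx]) - target[val_idx]) ** 2)
    print(f"train up to t={train_idx[-1]}, validate t={val_idx[0]}-{val_idx[-1]}, MSE={mse:.4f}")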

Solution to Exercise 5. The bias–variance decomposition for squared loss at a fixed input $x$ is $$ \mathbb{E}_{\mathcal{D}, \varepsilon}\big[(y - \hat f_{\mathcal{D}}(x))^2\big] = \text{bias}^2 + \text{variance} + \sigma^2. $$ Bias measures the gap between the average predictor and the true function. Variance measures the predictor's sensitivity to the training set. $\sigma^2$ is irreducible noise. Collecting more (i.i.d.) data shrinks variance but does not change bias or noise; only choosing a more flexible model can reduce bias.
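
Explicitly, writing $f$ for the true function, $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and $\bar f(x) = \mathbb{E}_{\mathcal{D}}[\hat f_{\mathcal{D}}(x)]$ for the average predictor, the first two terms are $$ \text{bias}^2 = \big(\bar f(x) - f(x)\big)^2, \qquad \text{variance} = \mathbb{E}_{\mathcal{D}}\big[\big(\hat f_{\mathcal{D}}(x) - \bar f(x)\big)^2\big]. $$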

Solution to Exercise 7. The VC dimension of: (a) Thresholds on the real line, $h_t(x) = \mathbb{1}[x > t]$, is 1. We can shatter one point but not two: given two points $a < b$, we cannot achieve the labelling $(1, 0)$, because every threshold function is non-decreasing, so it cannot output 1 at $a$ and 0 at the larger point $b$. (b) Linear classifiers in $\mathbb{R}^d$ have VC dimension $d + 1$, achieved by $d + 1$ points in general position; any $d + 2$ points admit a Radon partition into two sets whose convex hulls intersect, and that labelling cannot be realised by a linear classifier. (c) Intervals $[a, b]$ on the real line have VC dimension 2: shatter $\{0, 1\}$ with intervals $[2, 3]$, $[-1, 0.5]$, $[0.5, 2]$, $[-1, 2]$ to get all four labellings. But three points $a < b < c$ cannot be labelled $(1, 0, 1)$ by any single interval.
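
As an illustrative sanity check of part (c), a brute-force enumeration over a grid of interval endpoints (the specific grid and point sets are arbitrary choices):

import itertools
import numpy as np

def achievable_labellings(points, grid):
    # Labellings of `points` realisable by some interval [a, b] with endpoints on `grid`.
    labellings = set()
    for a, b in itertools.product(grid, repeat=2):
        if a <= b:
            labellings.add(tuple(int(a <= x <= b) for x in points))
    return labellings

grid = np.linspace(-1, 3, 41)
print(sorted(achievable_labellings([0, 1], grid)))          # all four labellings of two points
print((1, 0, 1) in achievable_labellings([0, 1, 2], grid))  # False: (1, 0, 1) is unachievable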

Solution to Exercise 8. With Adam, the gradient of a parameter $w_i$ is divided by an estimate of its standard deviation, $\hat\sigma_i$. Adding $\frac{\lambda}{2} w_i^2$ to the loss (the §6.6 convention) adds $\lambda w_i$ to the gradient, so the effective decay step is $\eta \cdot \lambda w_i / \hat\sigma_i$, which differs from parameter to parameter. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient: it applies the multiplicative shrinkage $w_i \leftarrow w_i (1 - \eta \lambda)$ directly, irrespective of the gradient. The two are equivalent only for plain SGD; they are not equivalent for Adam. torch.optim.AdamW implements decoupled decay.
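
A minimal PyTorch sketch contrasting the two conventions (the toy linear layers, data, and hyperparameter values are arbitrary illustrative choices):

import torch

torch.manual_seed(0)
model_l2, model_dec = torch.nn.Linear(10, 1), torch.nn.Linear(10, 1)

# Convention 1: Adam's weight_decay adds lambda * w to the gradient, so the decay
# is then divided by the per-parameter denominator (the L2-in-the-loss behaviour).
opt_l2 = torch.optim.Adam(model_l2.parameters(), lr=1e-3, weight_decay=1e-2)

# Convention 2: AdamW applies w <- w * (1 - lr * lambda) outside the adaptive update,
# the same relative shrinkage for every parameter.
opt_dec = torch.optim.AdamW(model_dec.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(64, 10), torch.randn(64, 1)
for model, opt in [(model_l2, opt_l2), (model_dec, opt_dec)]:
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()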

Solution to Exercise 11.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d = 5
true_w = rng.normal(size=d)

def fit_predict(n_train):
    # Fresh training set and a large test set from the same linear model, noise sd 0.5.
    X_tr = rng.normal(size=(n_train, d))
    y_tr = X_tr @ true_w + rng.normal(scale=0.5, size=n_train)
    X_te = rng.normal(size=(2000, d))
    y_te = X_te @ true_w + rng.normal(scale=0.5, size=2000)
    # Ordinary least squares via the normal equations.
    w = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)
    train_mse = np.mean((X_tr @ w - y_tr)**2)
    test_mse  = np.mean((X_te @ w - y_te)**2)
    return train_mse, test_mse

ns = np.arange(10, 500, 10)
trains, tests = zip(*[fit_predict(n) for n in ns])
plt.plot(ns, trains, label="train")
plt.plot(ns, tests,  label="test")
plt.axhline(0.25, color="grey", linestyle="--", label=r"$\sigma^2$")
plt.xlabel("training-set size"); plt.ylabel("MSE"); plt.legend()
plt.show()

The training MSE rises with $n$ (a small training set can be fit almost perfectly, a large one cannot) and converges to $\sigma^2 = 0.25$. Test MSE falls and converges to the same value.

Solution to Exercise 12.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=1.0, size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

def ridge(lam):
    # Ridge solution in the SVD basis: w = V diag(s / (s^2 + lam)) U^T y.
    return Vt.T @ np.diag(s / (s**2 + lam)) @ U.T @ y

w_ols   = ridge(0.0)   # lam = 0 recovers ordinary least squares
w_ridge = ridge(5.0)

print("singular value and ridge shrinkage factor:")
for i in range(d):
    print(f"s[{i}] = {s[i]:.3f}  shrinkage = {s[i]**2 / (s[i]**2 + 5.0):.3f}")

The shrinkage factor $s_i^2 / (s_i^2 + \lambda)$ approaches 1 for large $s_i$ and 0 for small $s_i$. Ridge therefore shrinks low-variance directions (which carry mostly noise) more aggressively than high-variance ones, exactly the right behaviour.
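
For reference, the identity behind the code's one-liner: with $X = U\,\mathrm{diag}(s)\,V^\top$, $$ \hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y = V\,\mathrm{diag}\!\Big(\frac{s_i}{s_i^2 + \lambda}\Big)\,U^\top y, $$ so along each right-singular direction the least-squares coefficient $\tfrac{1}{s_i} u_i^\top y$ is multiplied by the factor $\tfrac{s_i^2}{s_i^2 + \lambda}$ printed above.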

Solution to Exercise 14.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

print(f"Accuracy: {accuracy_score(y_te, pred):.4f}")
print(f"F1:       {f1_score(y_te, pred):.4f}")
print(f"AUC-ROC:  {roc_auc_score(y_te, proba):.4f}")
print(f"AUC-PR:   {average_precision_score(y_te, proba):.4f}")

Accuracy is around 0.99, but a constant predictor of 0 also achieves 0.99. F1 and AUC-PR are far more informative. AUC-ROC will still look high (0.95+): the false-positive rate is computed over the large negative class, so even a substantial number of false positives barely moves it. AUC-PR, whose precision term is computed over the predicted positives, is the more revealing summary under this imbalance.

Solution to Exercise 16.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
order = rng.permutation(len(y))
X, y = X[order], y[order]
folds = np.array_split(np.arange(len(y)), 5)

scores = []
for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    sc = StandardScaler().fit(X[train_idx])
    Xtr = sc.transform(X[train_idx]); Xva = sc.transform(X[val_idx])
    clf = LogisticRegression(max_iter=2000).fit(Xtr, y[train_idx])
    scores.append(roc_auc_score(y[val_idx], clf.predict_proba(Xva)[:, 1]))

print(f"AUC: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")

Note that the scaler is refit on each training fold; fitting it on the full dataset before splitting would leak validation-fold statistics into training. A typical run gives AUC around 0.997 on this easy dataset.
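
The same procedure can be written more compactly; a sketch using scikit-learn's Pipeline and cross_val_score, which refit the scaler inside each training fold automatically (cv=5 here uses unshuffled stratified folds, so the numbers differ slightly from the manual shuffled split above):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# The scaler is a pipeline step, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.4f} +/- {scores.std():.4f}")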

Solution to Exercise 18.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Add a feature that is the label plus tiny noise (deliberate leakage)
leak = y + 0.05 * rng.normal(size=n)
X_leaky = np.hstack([X, leak[:, None]])

for name, X_use in [("clean", X), ("leaky", X_leaky)]:
    Xtr, Xte, ytr, yte = train_test_split(X_use, y, random_state=0)
    auc = roc_auc_score(yte, LogisticRegression(max_iter=1000)
                              .fit(Xtr, ytr).predict_proba(Xte)[:, 1])
    print(f"{name} AUC: {auc:.4f}")

The leaky version reports near-perfect AUC. Removing the leak drops the AUC to its actual level. The lesson: extraordinary performance from a feature you did not design for that purpose is grounds for suspicion.

Solution to Exercise 26. Let $f(x) = \mathbb{E}[y \mid x]$. For any predictor $g(x)$, $$ \mathbb{E}[(y - g(x))^2] = \mathbb{E}[(y - f(x) + f(x) - g(x))^2]. $$ Expand: $$ = \mathbb{E}[(y - f(x))^2] + \mathbb{E}[(f(x) - g(x))^2] + 2\mathbb{E}[(y - f(x))(f(x) - g(x))]. $$ The cross term is zero by the tower property: conditioning on $x$, $\mathbb{E}[y - f(x) \mid x] = 0$ by definition of $f$, and $f(x) - g(x)$ is $x$-measurable. The second term is non-negative and zero iff $g = f$. Hence $$ \mathbb{E}[(y - g(x))^2] \ge \mathbb{E}[(y - f(x))^2] $$ with equality iff $g = f$ almost surely. $\square$

Solution to Exercise 27. The likelihood $p(y \mid X, w) = \mathcal{N}(y; Xw, \sigma^2 I)$ gives the negative log-likelihood $$ -\log p = \frac{1}{2\sigma^2} \|y - Xw\|^2 + \text{const}. $$ The Gaussian prior $p(w) = \mathcal{N}(0, \tau^2 I)$ gives $$ -\log p(w) = \frac{1}{2\tau^2} \|w\|^2 + \text{const}. $$ The MAP objective is the sum: $$ \frac{1}{2\sigma^2}\|y - Xw\|^2 + \frac{1}{2\tau^2}\|w\|^2. $$ Multiplying by $2\sigma^2$ gives $\|y - Xw\|^2 + \frac{\sigma^2}{\tau^2}\|w\|^2$, identifying $\lambda = \sigma^2/\tau^2$. Thus a strong prior (small $\tau$) is equivalent to strong regularisation (large $\lambda$). $\square$

Solution to Exercise 28. Consider gradient descent on a quadratic loss $L(w) = \frac{1}{2}(w - w^*)^\top H (w - w^*)$ with positive-definite $H$. Diagonalising in the eigenbasis of $H$ ($H = U \Lambda U^\top$, $\tilde w = U^\top w$), each component evolves independently as $$ \tilde w^{(t)}_i = (1 - \eta \lambda_i)^t \tilde w^{(0)}_i + \big(1 - (1 - \eta \lambda_i)^t\big) \tilde w^*_i. $$ Starting from $w^{(0)} = 0$ and comparing to the L2-regularised solution $\tilde w_i^{\text{ridge}} = \frac{\lambda_i}{\lambda_i + \mu} \tilde w^*_i$ of $\min_w L(w) + \frac{\mu}{2}\|w\|^2$, the two coincide when $$ 1 - (1 - \eta \lambda_i)^t = \frac{\lambda_i}{\lambda_i + \mu}. $$ For small $\eta\lambda_i$, $(1 - \eta\lambda_i)^t \approx \exp(-t\eta\lambda_i)$, giving $\mu \approx \frac{1}{t\eta} - \lambda_i$ for each eigendirection. Although the equivalence is only approximate (the regularisation is not isotropic across directions), it captures the principle: longer training $\Leftrightarrow$ smaller effective $\mu$ $\Leftrightarrow$ less regularisation. $\square$
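
A small numerical illustration of the correspondence, working directly in the eigenbasis (the eigenvalues, step size, and iteration count are arbitrary choices for illustration):

import numpy as np

eigvals = np.array([10.0, 1.0, 0.1])          # eigenvalues of H, one per eigendirection
eta, t = 0.01, 200                            # step size and number of GD steps
mu = 1 / (t * eta)                            # heuristic equivalent ridge strength

gd_shrink    = 1 - (1 - eta * eigvals) ** t   # fraction of the target reached after t steps from 0
ridge_shrink = eigvals / (eigvals + mu)       # ridge shrinkage factor per direction

print("GD after t steps:       ", np.round(gd_shrink, 3))
print("ridge with mu=1/(t*eta):", np.round(ridge_shrink, 3))
# Both factors are near 1 for large eigenvalues and near 0 for small ones;
# the match is only approximate in between, as the solution notes.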

Solution to Exercise 29. The Wilcoxon–Mann–Whitney $U$ statistic (in its normalised form) for a scoring function $s$ on positives $\{x^+_i\}$ ($n_+$ of them) and negatives $\{x^-_j\}$ ($n_-$ of them) is $$ U = \frac{1}{n_+ n_-} \sum_{i,j} \Big( \mathbb{1}[s(x^+_i) > s(x^-_j)] + \tfrac{1}{2}\, \mathbb{1}[s(x^+_i) = s(x^-_j)] \Big). $$ This is precisely an empirical estimate of $\Pr[s(X^+) > s(X^-)]$ (with ties counted as one half) for randomly drawn positive and negative examples, which is the probabilistic interpretation of AUC-ROC. Therefore minimising the rank loss $1 - U$ over a function class is equivalent to maximising AUC. $\square$
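
A quick check that the pairwise statistic matches scikit-learn's AUC on synthetic scores (the score distribution is an arbitrary illustrative choice):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
scores = rng.normal(size=200) + y            # positives score higher on average

pos, neg = scores[y == 1], scores[y == 0]
# all positive-negative pairs, with ties counted as 1/2
U = np.mean((pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :]))

print(U, roc_auc_score(y, scores))           # the two numbers agree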

Solution to Exercise 30. A scoring rule $S(\hat p, y)$ is strictly proper if $\mathbb{E}_{y \sim p^*}[S(\hat p, y)]$ is uniquely minimised in $\hat p$ by $\hat p = p^*$. For log-loss, $$ \mathbb{E}_{y \sim p^*}[-y \log \hat p - (1-y)\log(1-\hat p)] = -p^* \log \hat p - (1-p^*)\log(1-\hat p). $$ Differentiating with respect to $\hat p$: $$ \frac{\partial}{\partial \hat p} = -\frac{p^*}{\hat p} + \frac{1 - p^*}{1 - \hat p} = \frac{\hat p - p^*}{\hat p (1 - \hat p)}. $$ This is zero iff $\hat p = p^*$, negative for $\hat p < p^*$, and positive for $\hat p > p^*$, so there is a unique minimum at $\hat p = p^*$. The expected log-loss is therefore strictly minimised at the truth, and log-loss is strictly proper. $\square$
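
A one-line numerical check of the claim (the grid resolution and $p^* = 0.3$ are arbitrary choices):

import numpy as np

p_star = 0.3
p_hat = np.linspace(0.01, 0.99, 99)           # candidate predicted probabilities
expected_logloss = -(p_star * np.log(p_hat) + (1 - p_star) * np.log(1 - p_hat))
print(p_hat[np.argmin(expected_logloss)])     # ~0.30: minimised at p_hat = p_star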
