7.16 Exercises

A mix of analytical, computational, and conceptual problems. Solutions to a representative selection are given at the end.

Linear regression.

  1. Show that the OLS estimator is unbiased: $\mathbb{E}[\hat{\mathbf{w}}] = \mathbf{w}$ under the linear model with zero-mean errors.
  2. Derive the variance of $\hat{\mathbf{w}}$ in OLS in the homoskedastic Gaussian model.
  3. Show that ridge regression is the MAP estimator under a zero-mean Gaussian prior on $\mathbf{w}$. State the relationship between $\lambda$ and the prior variance.
  4. Explain geometrically why the lasso produces sparse solutions while ridge does not. Hint: think about the level sets of the penalties and their intersections with the elliptical level sets of the squared-error loss.
  5. Implement coordinate descent for the lasso and verify it on a small simulated dataset against sklearn.linear_model.Lasso.
  6. Construct a synthetic dataset where two features are perfectly correlated. Compute and compare $\hat{\mathbf{w}}$ for OLS, ridge, and lasso. What do you observe?

Logistic regression and GLMs.

  7. Derive the gradient and Hessian of the binary cross-entropy with $L_2$ penalty. Confirm convexity.
  8. Show that the softmax output is invariant to adding a constant to every logit. What is the practical implication for numerical implementation?
  9. Derive the log-likelihood, gradient, and IRLS update for Poisson regression with log link.
  10. Show that under a Bernoulli model the canonical (logit) link makes the Hessian positive semi-definite for any $\mathbf{X}$.
  11. Implement multinomial logistic regression from scratch with mini-batch SGD and compare against scikit-learn on the iris dataset.

KNN and decision trees.

  12. Prove that as $n\to\infty$, the 1-NN error rate is at most twice the Bayes error rate (state any required regularity conditions).
  13. Show that, for a node with class proportions $(p, 1-p)$, the Gini impurity is $2p(1-p)$ and the Shannon entropy is $-p\log_2 p - (1-p)\log_2(1-p)$. Plot both as functions of $p$.
  14. Build a decision tree by hand on the dataset $\{(1,0), (2,0), (3,1), (4,1), (5,0), (6,1)\}$ using information gain. Show every step.
  15. Implement cost-complexity pruning on top of the tree code in this chapter and verify on sklearn.datasets.load_breast_cancer.

Naive Bayes.

  16. Show that, under conditional independence, the posterior log-odds $\log P(y=1\mid \mathbf{x})/P(y=0\mid \mathbf{x})$ is a sum of per-feature log-likelihood ratios. What does this imply about the decision boundary for Gaussian Naive Bayes with shared variance?
  17. Implement multinomial naive Bayes with Laplace smoothing and apply it to the 20 Newsgroups dataset. Compare against logistic regression with TF-IDF features.

SVMs.

  18. Derive the dual of the soft-margin SVM and show the constraints become $0\leq \alpha_i\leq C$.
  19. Show that the RBF kernel $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ corresponds to an inner product in an infinite-dimensional feature space. Hint: expand the exponential as a power series.
  20. Implement an SVM dual solver via SMO on a small dataset and compare against sklearn.svm.SVC.
  21. Show that the hinge-loss formulation is equivalent to the constrained primal of the soft-margin SVM.

Ensembles.

  22. Show that averaging $M$ uncorrelated unbiased estimators reduces variance by a factor of $M$, and write down the corresponding result when correlations are non-zero.
  23. Show that AdaBoost is forward stagewise additive modelling with exponential loss.
  24. Show that, for log-loss, gradient boosting's pseudo-residual is $y_i - p_i$ where $p_i = \sigma(F(\mathbf{x}_i))$.
  25. Implement bagging on top of your decision-tree code. Verify on the breast-cancer dataset that variance drops with the number of bags.

Evaluation.

  26. Construct an example of two classifiers with the same accuracy but very different F1, precision, and recall. Discuss when each metric should drive the decision.
  27. Show that AUC-ROC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative.
  28. A spam filter has prevalence 0.001, recall 0.95, and specificity 0.99. Compute precision, F1, and posterior odds of spam given a positive prediction. Comment.

Practical.

  29. Take the California housing dataset. Fit linear, ridge, lasso, random forest, and gradient boosting regressions with proper preprocessing and 5-fold CV. Report MSE, MAE, $R^2$. Which model wins, and why?
  30. Take any imbalanced dataset (class ratio at least 1:20). Compare class-weighting, SMOTE, and threshold tuning on the same model. Plot ROC and precision–recall curves.
  31. Build a multi-class one-vs-rest pipeline by hand using only a binary classifier. Verify it matches scikit-learn's behaviour on iris.
  32. Set up a calibration experiment: train a random forest, a logistic regression, and an SVM. Plot reliability diagrams. Apply Platt scaling and isotonic regression and re-plot.

Selected solution sketches

1. Unbiasedness of OLS. $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\mathbf{w}+\boldsymbol\varepsilon) = \mathbf{w} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\varepsilon$. Taking expectations and using $\mathbb{E}[\boldsymbol\varepsilon]=\mathbf{0}$ gives $\mathbb{E}[\hat{\mathbf{w}}]=\mathbf{w}$.
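
A minimal Monte Carlo check of this result, assuming a fixed design and zero-mean Gaussian noise; the sizes, seed, and coefficient values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))          # fixed design across replications
w_true = np.array([1.0, -2.0, 0.5])

# Average the OLS estimate over many independent noise draws
estimates = [
    np.linalg.lstsq(X, X @ w_true + rng.normal(scale=0.5, size=n), rcond=None)[0]
    for _ in range(5000)
]
print(np.mean(estimates, axis=0))    # close to w_true, as unbiasedness predicts
```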

3. Ridge as MAP. With prior $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$ and Gaussian noise $\sigma^2$, the log-posterior is $-\frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 - \frac{1}{2\tau^2}\|\mathbf{w}\|^2 + \text{const}$. Maximising is equivalent to minimising $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\tau^2}\|\mathbf{w}\|^2$, ridge with $\lambda = \sigma^2/\tau^2$.
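
The correspondence is easy to verify numerically. A sketch, assuming scikit-learn's Ridge (which minimises $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \alpha\|\mathbf{w}\|^2$ when fit_intercept=False); all values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)

sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2                  # lambda = sigma^2 / tau^2 from the MAP derivation
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_ridge, atol=1e-6))   # True
```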

4. Why lasso is sparse. The $L_1$ ball $\{\mathbf{w}:\|\mathbf{w}\|_1\leq t\}$ has corners on the coordinate axes. The squared-error loss has elliptical level sets; as these expand away from the OLS solution, they typically first touch the $L_1$ ball at a corner, where at least one coordinate is exactly zero. The $L_2$ ball is spherical with no corners, so the tangent point is generically off-axis and no coordinate is forced to zero.
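
A quick empirical companion to the geometric picture (the data and regularisation strengths are arbitrary; compare exercise 6): lasso drives the irrelevant coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]             # only three features matter
y = X @ w_true + rng.normal(scale=0.5, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # irrelevant coefficients exactly 0
print(Ridge(alpha=10.0).fit(X, y).coef_)  # all coefficients small but non-zero
```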

7. Cross-entropy convexity. Gradient: $\nabla L = \mathbf{X}^\top(\mathbf{p}-\mathbf{y}) + \lambda\mathbf{w}$. Hessian: $\mathbf{H} = \mathbf{X}^\top\mathbf{S}\mathbf{X} + \lambda\mathbf{I}$ with $\mathbf{S}=\text{diag}(p_i(1-p_i))$. Both terms are PSD, so $\mathbf{H}\succeq \lambda\mathbf{I}\succeq \mathbf{0}$, strictly convex when $\lambda>0$.
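
A finite-difference sanity check of the gradient formula; a sketch assuming the penalty $\tfrac{\lambda}{2}\|\mathbf{w}\|^2$, with arbitrary data and seed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y, lam):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + 0.5 * lam * w @ w

def grad(w, X, y, lam):
    return X.T @ (sigmoid(X @ w) - y) + lam * w   # X^T (p - y) + lambda w

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = (rng.random(50) < 0.5).astype(float)
w, lam, eps = rng.normal(size=4), 0.5, 1e-6

# Central differences along each coordinate direction
num = np.array([(loss(w + eps * e, X, y, lam) - loss(w - eps * e, X, y, lam)) / (2 * eps)
                for e in np.eye(4)])
print(np.allclose(num, grad(w, X, y, lam), atol=1e-4))  # True
```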

8. Softmax shift invariance. $\text{softmax}(\mathbf{z}+c\mathbf{1}) = \text{softmax}(\mathbf{z})$ because the constant cancels in numerator and denominator. Numerically, subtract $\max_k z_k$ before exponentiating to avoid overflow.
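
In code, the invariance justifies the standard max-subtraction trick; a minimal sketch:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift invariance: output unchanged, overflow avoided
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))          # finite and sums to 1; naive np.exp(z) would overflow
```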

12. Cover–Hart sketch. Condition on the query $\mathbf{x}$. The probability that 1-NN errs is $\sum_k p_k(\mathbf{x})(1-p_k(\mathbf{x}^*))$ where $\mathbf{x}^*$ is the nearest neighbour and $p_k$ is the conditional class probability. As $n\to\infty$, $\mathbf{x}^*\to\mathbf{x}$ almost surely (under mild assumptions) and the error becomes $1 - \sum_k p_k(\mathbf{x})^2 \leq 2(1-\max_k p_k(\mathbf{x}))$. Integrating over $\mathbf{x}$ gives $R_{1\text{NN}}\leq 2R^*$.

13. Gini and entropy formulas. Gini: $G = p(1-p)+(1-p)p = 2p(1-p)$. Entropy: $H = -p\log_2 p - (1-p)\log_2(1-p)$. Both are concave with maximum at $p=0.5$ (Gini = 0.5, entropy = 1 bit); both vanish at $p\in\{0,1\}$.
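
A short plotting sketch for the comparison (matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)   # avoid log(0) at the endpoints
gini = 2 * p * (1 - p)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, gini, label="Gini $2p(1-p)$")
plt.plot(p, entropy, label="Entropy (bits)")
plt.xlabel("p")
plt.legend()
plt.show()
```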

18. Soft-margin dual. Lagrangian $\mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum\xi_i - \sum\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i+b)-1+\xi_i] - \sum \mu_i \xi_i$ with $\alpha_i,\mu_i\geq 0$. Stationarity in $\xi_i$: $C - \alpha_i - \mu_i = 0$, so $\alpha_i = C - \mu_i \in [0, C]$. The remaining steps mirror the hard-margin derivation.
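
The box constraint can be checked empirically: scikit-learn's SVC exposes $y_i\alpha_i$ for the support vectors as dual_coef_, so its absolute values should never exceed $C$. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
C = 0.7
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()   # = alpha_i for the support vectors
print(alphas.max() <= C + 1e-8)           # True: 0 <= alpha_i <= C
```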

22. Variance of an average. $\text{Var}\!\big(\frac{1}{M}\sum_m \hat f_m\big) = \frac{1}{M^2}[M\sigma_f^2 + M(M-1)\rho\sigma_f^2] = \frac{(1-\rho)}{M}\sigma_f^2 + \rho\sigma_f^2$. Setting $\rho=0$ recovers the classical $1/M$ rate.
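
A Monte Carlo illustration, assuming jointly Gaussian estimators with common variance $\sigma_f^2$ and pairwise correlation $\rho$ (all settings illustrative):

```python
import numpy as np

M, sigma2, rho, trials = 10, 4.0, 0.3, 200_000
cov = sigma2 * (rho * np.ones((M, M)) + (1 - rho) * np.eye(M))

rng = np.random.default_rng(4)
draws = rng.multivariate_normal(np.zeros(M), cov, size=trials)
avg_var = draws.mean(axis=1).var()       # empirical variance of the average

predicted = (1 - rho) / M * sigma2 + rho * sigma2
print(avg_var, predicted)                # should agree closely (both ~1.48)
```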

23. AdaBoost from exponential loss. At round $t$ with margin $F_{t-1}$, fix the loss as the exponential and minimise over $(\alpha, h)$. The expression splits into correctly and incorrectly classified pieces with weights $w_i^{(t)} = e^{-y_iF_{t-1}(\mathbf{x}_i)}$. Choosing $h$ to minimise weighted error and then $\alpha$ to minimise the resulting one-dimensional convex function reproduces the AdaBoost weight $\alpha_t = \tfrac12 \log\frac{1-\varepsilon_t}{\varepsilon_t}$ exactly.
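
A one-dimensional numerical check that the closed-form $\alpha_t$ minimises the weighted exponential loss, using scipy purely as a probe (the weights and margins below are simulated):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
w = rng.random(100); w /= w.sum()                           # example weights
margins = rng.choice([-1.0, 1.0], size=100, p=[0.3, 0.7])   # y_i * h(x_i)

eps = w[margins < 0].sum()                                  # weighted error of h
alpha_closed = 0.5 * np.log((1 - eps) / eps)
alpha_num = minimize_scalar(lambda a: np.sum(w * np.exp(-a * margins))).x
print(np.isclose(alpha_closed, alpha_num, atol=1e-6))       # True
```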

24. Log-loss pseudo-residuals. $\ell(y, F) = -y\log\sigma(F) - (1-y)\log(1-\sigma(F))$. Differentiating, $\partial \ell/\partial F = \sigma(F) - y$. The negative gradient (the pseudo-residual) is $y - \sigma(F)$, exactly the "logistic residual" we have seen throughout.
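
A two-line finite-difference confirmation:

```python
import numpy as np

sigmoid = lambda F: 1.0 / (1.0 + np.exp(-F))
loss = lambda y, F: -y * np.log(sigmoid(F)) - (1 - y) * np.log(1 - sigmoid(F))

y, F, eps = 1.0, 0.3, 1e-6
num_grad = (loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)
print(np.isclose(num_grad, sigmoid(F) - y))   # True: negative gradient is y - sigma(F)
```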

26. Precision–recall trade-off. Two classifiers are each correct on 90 % of a balanced 1000-example dataset (500 positives, 500 negatives). One errs uniformly: 50 false positives and 50 false negatives give precision = recall = F1 = 0.9. The other concentrates its errors on the positives: 25 false positives and 75 false negatives give precision ≈ 0.94 but recall = 0.85 and F1 ≈ 0.89. Same accuracy, different operating points; choose by which class's errors are costlier.
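
The two confusion matrices, computed explicitly:

```python
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

print(metrics(tp=450, fp=50, fn=50, tn=450))  # (0.9, 0.9, 0.9, 0.9)
print(metrics(tp=425, fp=25, fn=75, tn=475))  # (0.9, ~0.944, 0.85, ~0.894)
```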

27. AUC interpretation. Pick a positive $X^+$ and a negative $X^-$ independently. $P(\hat s(X^+) > \hat s(X^-))$ is, by definition, the area under the ROC curve, where ROC is parameterised by sweeping the threshold from $\infty$ to $-\infty$ and tracking $(FPR, TPR)$.
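
An empirical check of this reading, comparing roc_auc_score against the fraction of correctly ranked (positive, negative) pairs, with ties counted as half (data simulated):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
y = rng.integers(0, 2, size=500)
s = rng.normal(size=500) + y          # positives score higher on average

pos, neg = s[y == 1], s[y == 0]
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())
print(np.isclose(pairwise, roc_auc_score(y, s)))   # True
```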

28. Bayes' theorem on imbalanced screening. With prevalence $\pi=0.001$, recall $0.95$, specificity $0.99$ (so FPR $=0.01$): $P(+\mid y=1)\pi = 0.001 \cdot 0.95 = 0.00095$; $P(+\mid y=0)(1-\pi) = 0.999\cdot 0.01 = 0.00999$. Precision $= 0.00095/(0.00095+0.00999) \approx 0.087$. F1 $\approx 2\cdot(0.087\cdot 0.95)/(0.087+0.95)\approx 0.16$. Posterior odds of true spam given a flag $= 0.00095/0.00999 \approx 0.095$, i.e., a flagged email is true spam less than 9 % of the time despite the 99 % specificity. The lesson, that extreme imbalance dominates apparently strong test characteristics, is the core of the Bayesian critique of medical screening.
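
The same arithmetic, spelled out in code:

```python
prevalence, recall, specificity = 0.001, 0.95, 0.99

p_flag_spam = recall * prevalence                     # 0.00095
p_flag_ham = (1 - specificity) * (1 - prevalence)     # 0.00999
precision = p_flag_spam / (p_flag_spam + p_flag_ham)  # ~0.087
f1 = 2 * precision * recall / (precision + recall)    # ~0.16
posterior_odds = p_flag_spam / p_flag_ham             # ~0.095

print(precision, f1, posterior_odds)
```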

These exercises and sketches are deliberately broad. Anyone who can comfortably tackle two-thirds of them has the working knowledge of supervised learning that the remainder of this book (neural networks, sequence models, generative models) will assume.
