Exercises

  1. (Perceptron geometry.) Plot the four points $(0,0), (0,1), (1,0), (1,1)$ in the plane and indicate the decision boundary $w_1 x_1 + w_2 x_2 + b = 0$ for the OR function with $w_1 = w_2 = 1$, $b = -0.5$. Verify that all four points are correctly classified.

  2. (Convergence theorem.) Suppose $R = 1$ and $\gamma = 0.01$. What is the maximum number of mistakes the perceptron can make? Suppose the data is rescaled so that $R = 100$ but $\gamma$ doubles. How does the bound change?

  3. (XOR by hand.) Construct a two-layer perceptron with hard-threshold activations that computes XOR. Specify all weights and biases.

  4. (XOR with sigmoids.) Repeat exercise 3 using sigmoid activations. Why does a hard-threshold solution not directly translate?

  5. (Forward pass.) For the network of Section 9.3, compute the forward pass on $\mathbf{x} = (0, 1)^\top$.

  6. (Vanishing sigmoids.) Consider a 20-layer network with sigmoid activations. Approximate the typical magnitude of $\sigma'(z)$ at saturation and bound the gradient at the input layer in terms of the gradient at the output. Why is this a problem?

  7. (ReLU derivative at zero.) What is $\mathrm{ReLU}'(0)$? Discuss why the non-differentiability at zero is harmless in practice.

  8. (Softmax invariance.) Show that $\mathrm{softmax}(\mathbf{z} + c \mathbf{1}) = \mathrm{softmax}(\mathbf{z})$ for any scalar $c$. Why does this matter for numerical stability?

  9. (Cross-entropy gradient.) Verify by direct calculation that for $\mathbf{p} = \mathrm{softmax}(\mathbf{z})$ and one-hot $\mathbf{y}$, $\partial \mathcal{L}_{\text{CE}} / \partial z_i = p_i - y_i$.

  10. (Sigmoid + BCE gradient.) Show that for $p = \sigma(z)$ and binary cross-entropy, $\partial \mathcal{L}_{\text{BCE}} / \partial z = p - y$.

  11. (Backprop by hand.) Repeat the worked example of Section 9.7 with $\mathbf{x} = (0, 1)^\top$ and $y = 0$. Compute every gradient, then a single SGD step with $\eta = 0.1$.

  12. (Universal approximation construction.) Sketch a one-hidden-layer ReLU network with two hidden units that computes $f(x) = |x|$ exactly on $[-1, 1]$. Generalise to $K$ hidden units representing any continuous piecewise linear function.

  13. (Depth efficiency.) Construct a function on $[0, 1]$ that requires $\Omega(2^L)$ width at depth 1 but is computable by a depth-$L$ ReLU network of width 2. (Hint: use the triangle wave by composing reflections.)

  14. (Xavier derivation.) Derive the Glorot (Xavier) variance condition by demanding that the variance of pre-activations and gradients be preserved at initialisation. State your assumptions.

  15. (He derivation.) Repeat for ReLU. Why does the variance double?

  16. (Dying ReLU.) Construct a small example (specific weights, biases, and data) in which a ReLU unit has zero output and zero gradient on every training point.

  17. (Batch norm Jacobian.) Verify the batch-norm backward formula given in Section 9.13 by differentiating through $\hat{z}_b = (z_b - \mu_B) / \sqrt{\sigma_B^2 + \varepsilon}$.

  18. (Layer norm vs batch norm.) Explain why layer norm is preferred for small batch sizes and for sequence models with variable-length inputs.

  19. (Dropout as ensemble.) For a network with $K$ hidden units and dropout rate $p$, how many distinct subnetworks can dropout sample?

  20. (L2 and weight decay.) Prove that for plain SGD, adding $\lambda \|\mathbf{W}\|^2 / 2$ to the loss is exactly equivalent to a weight-decay rule $\mathbf{W} \leftarrow (1-\eta\lambda)\mathbf{W} - \eta\mathbf{g}$. Show that for Adam they are not equivalent.

  21. (Gradient clipping.) Implement gradient clipping by global norm in NumPy. Test it on the MNIST MLP. (A starter sketch is given after this exercise list.)

  22. (Cosine schedule.) Plot a cosine annealing schedule with warmup over 10% of training. Why is warmup needed for Transformer-scale learning rates? (See the schedule sketch after the list.)

  23. (Mini-batch noise.) Explain why SGD with batch size 32 often generalises better than full-batch GD. Cite at least two mechanisms.

  24. (Numerical gradient check.) Use finite differences to verify the gradient of a small MLP. Why is this only practical for small networks? (See the finite-difference sketch after the list.)

  25. (MNIST baseline.) Train the from-scratch MLP of Section 9.16 with three different learning rates. Report training and test accuracy. Plot the curves.

  26. (MNIST PyTorch.) Train the PyTorch MLP of Section 9.17 with and without dropout. Compare test accuracy.

  27. (Activation comparison.) Train an MLP with sigmoid, tanh, ReLU, and GELU on MNIST. Report the wall-clock time to reach 95% test accuracy for each.

  28. (Initialisation sweep.) Train a 10-layer MLP on MNIST with all-zero, $\mathcal{N}(0, 1)$, Xavier, and He initialisation. Explain the results.

  29. (Autograd extension.) Extend the autograd engine of Section 9.15 with `__pow__`, `exp`, and `log`. Verify that gradients agree with finite differences.

  30. (Transformer activation.) Replace ReLU with GELU in the PyTorch MLP. Does it help on MNIST? Why might the answer differ on a Transformer language model?

  31. (Label smoothing.) Implement label smoothing with $\varepsilon = 0.1$. Train on MNIST and report the calibration improvement (expected calibration error). (See the label-smoothing sketch after the list.)

  32. (Mixup.) Implement mixup with $\alpha = 0.2$ on MNIST. Report training and test accuracy with and without mixup. (See the mixup sketch after the list.)

  33. (Symmetric initialisation.) Prove that if every weight in a layer is initialised to the same value (and all biases are zero), then every unit in that layer computes the same function and continues to do so throughout training. Conclude that breaking symmetry requires randomness.

  34. (Backprop computational cost.) Show that for an MLP with sizes $(d_0, d_1, \ldots, d_L)$, the backward pass costs the same number of FLOPs as the forward pass to within a factor of two. Identify the precise factor.

  35. (AdamW vs Adam.) Show analytically that for a parameter on which the running second moment $\hat{v}_t$ is large, Adam with L2 regularisation applies less effective weight decay than AdamW with the same nominal $\lambda$. Construct a synthetic example where this matters.

  36. (Saturating ReLU on negative-mean input.) Suppose a hidden unit's pre-activation has distribution $\mathcal{N}(\mu, 1)$ with $\mu = -3$. What is the probability that the unit is active on a random input? What does this imply for He initialisation when input distributions are not zero-mean?
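
Starter sketches for some of the implementation exercises follow. They are minimal illustrations under stated assumptions, not reference solutions.

For exercise 21, a sketch of clipping by global norm in NumPy. The function name `clip_by_global_norm` and the toy self-check are our choices, not code from the chapter.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    # The global norm pools squared entries across *all* parameter gradients.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale every gradient by the same factor when the threshold is exceeded;
    # otherwise the scale is exactly 1 and the gradients pass through unchanged.
    scale = max_norm / max(global_norm, max_norm)
    return [g * scale for g in grads], global_norm

# Toy check: a deliberately huge gradient is rescaled to global norm 5.
grads = [np.full((3, 3), 100.0), np.array([50.0, -50.0])]
clipped, before = clip_by_global_norm(grads, max_norm=5.0)
after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"global norm before: {before:.1f}, after: {after:.1f}")
```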
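
For exercise 22, one way to generate the schedule before plotting it. The linear warmup over the first 10% of steps matches the exercise; the function name and defaults (`base_lr`, `min_lr`) are assumptions.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3, warmup_frac=0.1, min_lr=0.0):
    """Linear warmup for the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear ramp up to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
# Plot with matplotlib, e.g. plt.plot(schedule), to see the ramp followed by the cosine decay.
```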
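
For exercise 24, a central-difference checker. It assumes the parameters are NumPy arrays that `loss_fn` reads in place, and it costs two forward passes per parameter entry, which is precisely why the method only scales to small networks.

```python
import numpy as np

def numerical_grad(loss_fn, params, eps=1e-5):
    """Central finite differences: perturb each parameter entry up and down by eps."""
    grads = [np.zeros_like(p) for p in params]
    for p, g in zip(params, grads):
        it = np.nditer(p, flags=["multi_index"])
        while not it.finished:
            idx = it.multi_index
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss_fn()
            p[idx] = old - eps
            loss_minus = loss_fn()
            p[idx] = old                                   # restore the entry
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
            it.iternext()
    return grads

# Compare against backprop with a relative error such as
#   |g_num - g_bp| / (|g_num| + |g_bp| + 1e-12);
# values around 1e-7 or smaller (in float64) indicate agreement.
```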
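
For exercise 31, a smoothed cross-entropy in NumPy. Label smoothing has more than one convention; this sketch puts $1 - \varepsilon$ on the true class and spreads $\varepsilon$ evenly over the remaining $K - 1$ classes, and the function name is ours.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, eps=0.1):
    """Cross-entropy against targets with 1 - eps on the true class, eps/(K-1) elsewhere."""
    n, k = logits.shape
    # Numerically stable log-softmax (subtract the row-wise max first).
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Build the smoothed target distribution.
    targets = np.full((n, k), eps / (k - 1))
    targets[np.arange(n), labels] = 1.0 - eps
    return -(targets * log_probs).sum(axis=1).mean()

# With eps = 0 this reduces to the ordinary cross-entropy.
```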
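
For exercise 32, a per-batch mixup sketch in NumPy, following the common implementation that draws one $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ per batch and mixes every example with a random partner from the same batch; names and defaults are ours.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # one mixing coefficient per batch
    perm = rng.permutation(len(x))            # random partner for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Train on (x_mix, y_mix) with cross-entropy against the soft targets;
# evaluate on unmixed test data as usual.
```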
