Exercises
E10.1. State the descent lemma for a $\beta$-smooth function and use it to derive the convergence rate of gradient descent on a convex $\beta$-smooth function.
E10.2. Derive the variance of a mini-batch gradient estimate for batch size $B$ when sampling without replacement from a dataset of size $N$. Show that the variance vanishes as $B \to N$.
E10.3. Show that for a quadratic loss $L(\theta) = \tfrac12 \theta^\top A \theta$ with eigenvalues in $[\alpha, \beta]$, the optimal step size for gradient descent is $\eta = 2/(\alpha + \beta)$ and the resulting contraction is $(\kappa - 1)/(\kappa + 1)$ where $\kappa = \beta/\alpha$.
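A quick numerical check for E10.3, with illustrative values $\alpha = 1$, $\beta = 100$ (the step size and contraction rate are those claimed in the exercise):

```python
import numpy as np

alpha, beta = 1.0, 100.0
A = np.diag([alpha, beta])
eta = 2 / (alpha + beta)
kappa = beta / alpha
M = np.eye(2) - eta * A                      # one GD step: theta -> M @ theta
rate = np.max(np.abs(np.linalg.eigvals(M)))
print(rate, (kappa - 1) / (kappa + 1))       # both equal 99/101 ≈ 0.9802
```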
E10.4. Prove that Polyak's heavy-ball method on a quadratic loss with optimally chosen $\eta, \mu$ contracts at rate $(\sqrt\kappa - 1)/(\sqrt\kappa + 1)$.
E10.5. Explain why bias correction is necessary in Adam. Compute $\mathbb{E}[m_1]$ and $\mathbb{E}[m_t]$ assuming gradients are i.i.d. with mean $\mu_g$, and verify that the correction recovers $\mu_g$.
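Hint for the E10.5 computation: unrolling $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ with $m_0 = 0$ gives, under the i.i.d. assumption,

$$\mathbb{E}[m_t] = (1 - \beta_1) \sum_{j=1}^{t} \beta_1^{t-j}\, \mathbb{E}[g_j] = (1 - \beta_1^t)\, \mu_g,$$

so $\mathbb{E}[m_1] = (1 - \beta_1)\mu_g$, and dividing by $1 - \beta_1^t$ recovers $\mu_g$.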
E10.6. Derive the AdamW update rule and explain why it differs from naive Adam-with-L2 in practice. For Adam with parameter $\theta$ and a quadratic L2 penalty $\tfrac{\lambda}{2}\|\theta\|^2$, write the per-parameter shrinkage and show that it depends on the second moment.
E10.7. Show that for the cosine annealing schedule $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t/T))$, the average learning rate over a full cycle is $(\eta_{\max} + \eta_{\min})/2$.
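A numerical sanity check for E10.7 (the discrete average matches the continuous integral here because the cosine samples over $t = 0, \dots, T$ cancel in pairs):

```python
import numpy as np

T, eta_max, eta_min = 10_000, 3e-4, 3e-5
t = np.arange(T + 1)
eta = eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))
print(eta.mean(), (eta_max + eta_min) / 2)   # both 1.65e-4
```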
E10.8. Derive the linear scaling rule. Start from $k$ steps of mini-batch SGD with batch $B$ and show that under the assumption $g_{t+j} \approx g_t$ for $j < k$, the result is equivalent to a single step at batch $kB$ with learning rate $k\eta$.
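The core of the E10.8 derivation in one line: with $\theta_{t+1} = \theta_t - \eta\, g_t^{(B)}$,

$$\theta_{t+k} = \theta_t - \eta \sum_{j=0}^{k-1} g_{t+j}^{(B)} \;\overset{g_{t+j} \approx g_t}{\approx}\; \theta_t - (k\eta)\,\frac{1}{k}\sum_{j=0}^{k-1} g_{t+j}^{(B)},$$

and the average of $k$ disjoint batch-$B$ gradient estimates is exactly a batch-$kB$ estimate.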
E10.9. A network has 70 billion parameters trained with AdamW in FP32. Compute the memory required for the parameters, gradients, and the optimizer's first and second moments. Now compute the same with ZeRO-3 sharding across 1024 GPUs.
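A back-of-the-envelope sketch for E10.9 (FP32 is 4 bytes per value; ZeRO-3 shards all four tensors evenly, ignoring activations and communication buffers):

```python
P = 70e9                                   # parameters
per_tensor = 4 * P                         # bytes for one FP32 tensor
total = 4 * per_tensor                     # params + grads + Adam m + Adam v
print(total / 1e9, "GB total")             # 4 x 280 GB = 1120 GB
print(total / 1024 / 1e9, "GB per GPU")    # ZeRO-3 across 1024 GPUs: ~1.09 GB
```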
E10.10. Pipeline parallelism with $K = 8$ stages and $M = 64$ micro-batches. Compute the bubble fraction. How many micro-batches are needed to keep the bubble below $5\%$?
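A quick check for E10.10, assuming the standard GPipe-style bubble fraction $(K-1)/(M+K-1)$:

```python
K, M = 8, 64
print((K - 1) / (M + K - 1))   # ≈ 0.0986, i.e. ~9.9% idle
# Below 5%: (K-1)/(M+K-1) < 0.05  =>  M > 19*(K-1) = 133, so M >= 134.
```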
E10.11. Loss scaling in FP16. The smallest representable positive normal FP16 value is $\approx 6.1 \times 10^{-5}$ and the largest finite value is $65504$. If your gradient values are typically in the range $[10^{-7}, 10^{-3}]$, what range of loss scales $S$ keeps them all representable without overflow?
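A range check for E10.11, using only the two FP16 bounds stated above:

```python
g_min, g_max = 1e-7, 1e-3
S_min = 6.1e-5 / g_min    # ≈ 610: below this, the smallest gradients flush to zero
S_max = 65504 / g_max     # ≈ 6.55e7: above this, the largest gradients overflow
print(S_min, S_max)       # any S in between works, e.g. the power of two 1024
```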
E10.12. Using the descent lemma, derive a bound on the noise floor that SGD achieves with constant step $\eta$ on a $\beta$-smooth $L$ with gradient noise variance $\sigma^2/B$. Show that the asymptotic loss has a floor proportional to $\eta \sigma^2 / B$.
E10.13. For a logistic regression with $d = 1000$ parameters and $N = 10000$ training examples, the loss is convex and approximately $\beta$-smooth with $\beta \approx 1$. Estimate how many SGD steps are needed to reach $\epsilon = 10^{-3}$ training loss starting from $\theta_0 = 0$ with $\|\theta^\star\| \approx 5$.
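A starting point for E10.13, using the full-batch gradient descent bound from E10.1 as a proxy for SGD (this ignores the stochastic noise floor, so treat the result as a lower estimate):

$$L(\theta_k) - L(\theta^\star) \;\le\; \frac{\beta\,\|\theta_0 - \theta^\star\|^2}{2k} \;=\; \frac{25}{2k} \;\le\; 10^{-3} \quad\Longrightarrow\quad k \;\ge\; 12\,500.$$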
E10.14. Implement Polyak heavy-ball momentum, Nesterov accelerated gradient, AdaGrad, RMSProp, and Adam from scratch in NumPy. Test all five on the convex quadratic $L(\theta) = \tfrac12 \theta^\top A \theta$ with $A = \mathrm{diag}(1, 10, 100)$ and report convergence rates.
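A compact NumPy scaffold for E10.14; the step sizes below are illustrative, not tuned, and the adaptive methods may stall near the optimum at a scale set by $\eta$ unless it is decayed:

```python
import numpy as np

A = np.diag([1.0, 10.0, 100.0])
grad = lambda th: A @ th
loss = lambda th: 0.5 * th @ A @ th

def heavy_ball(th, s, eta, t, mu=0.9):
    s["v"] = mu * s.get("v", 0.0) - eta * grad(th)
    return th + s["v"]

def nesterov(th, s, eta, t, mu=0.9):
    g = grad(th + mu * s.get("v", np.zeros_like(th)))   # look-ahead gradient
    s["v"] = mu * s.get("v", 0.0) - eta * g
    return th + s["v"]

def adagrad(th, s, eta, t, eps=1e-8):
    g = grad(th)
    s["G"] = s.get("G", 0.0) + g * g                    # accumulated squares
    return th - eta * g / (np.sqrt(s["G"]) + eps)

def rmsprop(th, s, eta, t, rho=0.9, eps=1e-8):
    g = grad(th)
    s["v"] = rho * s.get("v", 0.0) + (1 - rho) * g * g  # EMA of squares
    return th - eta * g / (np.sqrt(s["v"]) + eps)

def adam(th, s, eta, t, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(th)
    s["m"] = b1 * s.get("m", 0.0) + (1 - b1) * g
    s["v"] = b2 * s.get("v", 0.0) + (1 - b2) * g * g
    m_hat = s["m"] / (1 - b1 ** t)                      # bias correction
    v_hat = s["v"] / (1 - b2 ** t)
    return th - eta * m_hat / (np.sqrt(v_hat) + eps)

for name, step, eta in [("heavy-ball", heavy_ball, 0.01),
                        ("nesterov", nesterov, 0.01),
                        ("adagrad", adagrad, 0.5),
                        ("rmsprop", rmsprop, 0.05),
                        ("adam", adam, 0.1)]:
    th, state = np.ones(3), {}
    for t in range(1, 501):
        th = step(th, state, eta, t)
    print(f"{name:10s} loss after 500 steps: {loss(th):.3e}")
```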
E10.15. A Transformer pretraining run shows the loss curve dropping smoothly for 10 000 steps, then suddenly spiking by 5× at step 10 437 before recovering. Diagnose three plausible causes and propose a fix for each.
E10.16. Suppose you scale a successful training run from batch size $256$ to batch size $4096$ on the same dataset. State the linear scaling rule prescription. Name two effects that might cause the scaling rule to fail and how you would mitigate each.
E10.17. Mixed precision: a network trains stably in FP32 but diverges in FP16 with naive (no loss scaling) implementation. Diagnose the failure mode. If you switch to BF16, do you still need loss scaling? Why or why not?
E10.18. Saddle points: construct a 2D function with a saddle point at the origin. Show that exact gradient descent starting from a point on the unstable axis stalls at the origin, whereas adding any small noise eventually escapes.
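A minimal demonstration sketch for E10.18, using the saddle $f(x,y) = \tfrac12(x^2 - y^2)$ with gradient $(x, -y)$; the noise magnitude is an arbitrary small choice:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1
for noise in (0.0, 1e-6):
    x, y = 1.0, 0.0                  # start exactly on the stable axis y = 0
    for _ in range(200):
        gx = x
        gy = -y + noise * rng.standard_normal()   # noisy gradient in y
        x -= eta * gx                # contracts toward 0
        y -= eta * gy                # expands by (1 + eta) once perturbed
    print(f"noise={noise:g}: x={x:.1e}, y={y:.1e}")
# Without noise the iterate stalls at the origin; with any noise, |y| grows
# geometrically and the iterate escapes along the descent direction.
```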
E10.19. ZeRO-3 communication. For a 70B parameter model with batch size 1M tokens trained on 1024 GPUs at 80 GB each: how often does a parameter all-gather happen per training step? Estimate the total cross-GPU traffic per step in TB.
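A rough accounting sketch for E10.19, assuming the standard ZeRO-3 communication volume of $3\Psi$ elements per step per GPU (parameter all-gather for the forward pass, another for the backward pass, and a gradient reduce-scatter) and 2-byte BF16 communication; both are assumptions to state in your answer:

```python
psi = 70e9                               # model parameters
bytes_per_elem = 2                       # BF16 assumption
per_gpu = 3 * psi * bytes_per_elem       # forward AG + backward AG + grad RS
print(per_gpu / 1e12, "TB per GPU per step")        # ≈ 0.42 TB
print(1024 * per_gpu / 1e12, "TB total per step")   # ≈ 430 TB across the cluster
```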
E10.20. The "warmup" phase in Adam. Argue why warmup is more critical for Adam than for SGD-with-momentum, by considering the bias-corrected second moment in the first 100 steps with $\beta_2 = 0.999$.
E10.21. Implement the cosine-with-warmup schedule from §10.17 in Python without using PyTorch's built-in scheduler. Plot it for total_steps = 100000 and warmup_steps = 1000 with $\eta_{\max} = 3 \times 10^{-4}$, $\eta_{\min} = 3 \times 10^{-5}$.
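One possible implementation for E10.21 (linear warmup from 0, then cosine decay; the exact form in §10.17 may differ in details such as the warmup starting value):

```python
import numpy as np
import matplotlib.pyplot as plt

def lr_at(t, total_steps=100_000, warmup_steps=1_000,
          eta_max=3e-4, eta_min=3e-5):
    if t < warmup_steps:
        return eta_max * t / warmup_steps                 # linear warmup
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * progress))

steps = np.arange(100_000)
plt.plot(steps, [lr_at(t) for t in steps])
plt.xlabel("step"); plt.ylabel("learning rate")
plt.show()
```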
E10.22. Gradient clipping: write Python code (NumPy) that takes a list of gradient tensors and clips by global L2 norm to threshold $c$. Verify that the resulting global norm is exactly $c$ when the input norm exceeds $c$ and is unchanged otherwise.
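A minimal NumPy sketch for E10.22:

```python
import numpy as np

def clip_by_global_norm(grads, c):
    """Scale a list of gradient arrays so their global L2 norm is at most c."""
    global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if global_norm > c:
        grads = [g * (c / global_norm) for g in grads]
    return grads, global_norm

# Verification: the norm equals c after clipping iff it exceeded c before.
rng = np.random.default_rng(0)
grads = [rng.standard_normal((10, 10)), rng.standard_normal(5)]
clipped, before = clip_by_global_norm(grads, c=1.0)
after = np.sqrt(sum(np.sum(g * g) for g in clipped))
print(before, after)    # after == 1.0 up to float rounding
```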
E10.23. Random search vs Bayesian optimisation: design a simple synthetic 5-D hyperparameter search problem where the validation accuracy is $g(h) = 1 - \|h - h^\star\|^2 + \mathcal{N}(0, 0.05^2)$. Run 50 random samples and 50 BO iterations using a Gaussian process surrogate; compare the best-found values.
E10.24. Double descent: train a polynomial regression with degree $d$ on $n = 50$ points sampled from a noisy degree-3 ground truth. Plot test MSE as $d$ ranges from 1 to 100. Locate the interpolation threshold and observe whether test MSE drops again past it.
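A starting scaffold for E10.24; `np.linalg.lstsq` returns the minimum-norm solution in the overparameterised regime, which is what allows a second descent, and a Legendre basis (`np.polynomial.legendre.legvander`) is better conditioned than raw monomials on $[-1, 1]$:

```python
import numpy as np
from numpy.polynomial.legendre import legvander

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1, 1, n)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(n)   # noisy degree-3 truth
x_test = np.linspace(-1, 1, 500)
y_test = x_test**3 - 0.5 * x_test

for d in [1, 3, 10, 30, 49, 60, 100]:
    Phi = legvander(x, d)                            # n x (d+1) design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # min-norm least squares
    mse = np.mean((legvander(x_test, d) @ w - y_test) ** 2)
    print(f"degree {d:3d}: test MSE {mse:.4f}")
# Expect the MSE to peak near the interpolation threshold d + 1 ≈ n = 50.
```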
E10.25. Implement label smoothing for cross-entropy loss in PyTorch (do not use the built-in). Compare with hard-target cross-entropy on a small classifier and observe the calibration difference (expected calibration error).
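A minimal PyTorch sketch for E10.25, using the target distribution $(1-\varepsilon)\cdot\text{one-hot} + \varepsilon\cdot\text{uniform}$:

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)        # cross-entropy against uniform
    return ((1 - eps) * nll + eps * uniform).mean()

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
label_smoothing_ce(logits, targets).backward()
```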
E10.26. AdaGrad's "learning rate stall": train a 2-layer MLP on MNIST with AdaGrad for 100 epochs; plot $\eta_{\mathrm{eff}} = \eta/\sqrt{G_t}$ over time for a representative parameter. At what epoch does the effective learning rate fall below $10^{-5}$?
E10.27. Critical batch size: train ResNet-18 on CIFAR-10 with batch sizes $\{32, 128, 512, 2048, 8192\}$ for the same total compute (steps × batch). Plot final test accuracy against batch size and locate the critical batch size where the linear scaling rule starts to fail.
E10.28. Implement a tiny pipeline-parallel training step: split a 4-layer MLP across 2 GPUs (or 2 simulated devices), pass micro-batches forward and backward, and verify that the gradient agrees with single-device training to within numerical error.
E10.29. Lion vs AdamW. Run AdamW with $\eta = 3 \times 10^{-4}$ and Lion with $\eta = 3 \times 10^{-5}$ on a small Transformer language model. Plot training and validation loss; comment on memory usage.
E10.30. Stochastic gradient noise as regularisation: take a small CNN trained on MNIST. Train with batch size 256 and again with batch size 4 (64× more steps for the same number of epochs). Compare the sharpness of the two minima (the largest eigenvalue of the Hessian at the trained point, estimated via power iteration) and the test accuracy. Test the prediction that the smaller batch finds a flatter minimum.
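A sketch of the sharpness estimate for E10.30 via Hessian-vector products; it assumes `params` is a list of leaf tensors and `loss_fn()` recomputes the training loss on a fixed batch:

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=50):
    v = [torch.randn_like(p) for p in params]        # random start vector
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        Hv = torch.autograd.grad(gv, params)         # Hessian-vector product
        eig = sum((h * x).sum() for h, x in zip(Hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in Hv]
    return eig
```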
E10.31. Show that for the inverse-square-root Transformer schedule $\eta_t = d^{-1/2} \min(t^{-1/2}, t \cdot t_{\mathrm{warm}}^{-3/2})$, the schedule is continuous at $t = t_{\mathrm{warm}}$ and the peak learning rate is $d^{-1/2} \cdot t_{\mathrm{warm}}^{-1/2}$.
E10.32. A common bug: training works on one GPU, fails on multi-GPU DDP with the loss being identical across all GPUs. What is wrong and how do you fix it?