Solution sketches
S10.1. The descent lemma with $g = \nabla L(\theta)$ says $L(\theta - \eta g) \le L(\theta) - \eta \|g\|^2 + \tfrac{\beta\eta^2}{2}\|g\|^2$. With $\eta = 1/\beta$, this gives $L(\theta_{t+1}) \le L(\theta_t) - \tfrac{1}{2\beta}\|\nabla L(\theta_t)\|^2$. For convex $L$, $L(\theta_t) - L^\star \le \nabla L(\theta_t)^\top (\theta_t - \theta^\star)$; combining this with the descent inequality and a potential-function argument with $\Phi_t = \|\theta_t - \theta^\star\|^2$ gives $L(\theta_T) - L^\star \le \beta \|\theta_0 - \theta^\star\|^2 / (2T)$, the standard $O(1/T)$ rate.
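A quick numerical check of the rate; a sketch in Python, with an arbitrary quadratic loss so the constants $\beta$, $L^\star$ are explicit (the dimension, Hessian, and step count are illustrative):

```python
import numpy as np

# Gradient descent on L(theta) = 0.5 * theta^T A theta, which is beta-smooth
# with beta = lambda_max(A), minimiser theta* = 0, L* = 0.
rng = np.random.default_rng(0)
d = 20
Q = rng.standard_normal((d, d))
A = Q.T @ Q / d                          # symmetric PSD Hessian
beta = np.linalg.eigvalsh(A)[-1]         # smoothness constant

theta0 = rng.standard_normal(d)
theta = theta0.copy()
T = 500
for _ in range(T):
    theta -= (1.0 / beta) * (A @ theta)  # step size eta = 1/beta

L_T = 0.5 * theta @ A @ theta
bound = beta * np.linalg.norm(theta0) ** 2 / (2 * T)
print(f"L(theta_T) - L* = {L_T:.3e}  <=  bound = {bound:.3e}")
assert L_T <= bound
```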
S10.2. With sampling without replacement, $\hat g = \tfrac{1}{B}\sum_{i \in \mathcal B} g_i$ where the population is $\{g_1, \ldots, g_N\}$ with mean $\bar g = \nabla L$. Standard finite-population statistics give $\operatorname{Var}(\hat g) = \tfrac{\sigma^2}{B} \cdot \tfrac{N - B}{N - 1}$. As $B \to N$ the second factor goes to zero, so the variance vanishes (we recover the full-batch deterministic gradient).
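A Monte-Carlo check of the finite-population correction; a sketch in which the per-example gradients are stand-in scalars:

```python
import numpy as np

# Compare the empirical variance of the mini-batch mean (sampling without
# replacement) to sigma^2/B * (N - B)/(N - 1).
rng = np.random.default_rng(0)
N, B = 1000, 64
g = rng.standard_normal(N)          # scalar per-example "gradients"
sigma2 = g.var()                    # population variance

trials = 50_000
means = np.array([rng.choice(g, size=B, replace=False).mean() for _ in range(trials)])
empirical = means.var()
predicted = sigma2 / B * (N - B) / (N - 1)
print(f"empirical {empirical:.5f}  vs  predicted {predicted:.5f}")
```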
S10.3. Gradient descent on $L = \tfrac12 \theta^\top A \theta$ gives $\theta_{t+1} = (I - \eta A)\theta_t$. The error in eigenbasis components contracts by $|1 - \eta\lambda|$. The maximum over $\lambda \in [\alpha,\beta]$ is minimised when $|1 - \eta\alpha| = |1 - \eta\beta|$, giving $\eta = 2/(\alpha+\beta)$ and worst-case contraction $(\beta-\alpha)/(\beta+\alpha) = (\kappa-1)/(\kappa+1)$.
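A few lines verifying the contraction factor over a dense grid of eigenvalues (the values of $\alpha$ and $\beta$ are illustrative):

```python
import numpy as np

# Check that eta = 2/(alpha + beta) equalises |1 - eta*lambda| at the two
# extremes and gives worst-case contraction (kappa - 1)/(kappa + 1).
alpha, beta = 1.0, 100.0
kappa = beta / alpha
lams = np.linspace(alpha, beta, 10_001)

eta = 2.0 / (alpha + beta)
worst = np.max(np.abs(1 - eta * lams))
print(worst, (kappa - 1) / (kappa + 1))   # both ~0.9802
```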
S10.4. In the eigenbasis, Polyak's iteration is $\theta_{t+1}^{(\lambda)} = (1 + \mu - \eta\lambda)\theta_t^{(\lambda)} - \mu\theta_{t-1}^{(\lambda)}$. The characteristic polynomial is $z^2 - (1+\mu-\eta\lambda)z + \mu = 0$ with roots $r(\lambda)$, and the contraction rate is $\max_\lambda |r(\lambda)|$. Setting $\eta = 4/(\sqrt\beta+\sqrt\alpha)^2$ and $\mu = ((\sqrt\kappa-1)/(\sqrt\kappa+1))^2$ makes the discriminant vanish at the extreme eigenvalues (a repeated root of modulus $\sqrt\mu$) and gives complex roots of modulus $\sqrt\mu$ for every $\lambda$ in between, so $\max_\lambda |r(\lambda)| = \sqrt\mu = (\sqrt\kappa-1)/(\sqrt\kappa+1)$, the claimed rate.
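The same kind of check for the heavy-ball roots, scanning the spectrum numerically (again with illustrative $\alpha$, $\beta$):

```python
import numpy as np

# Root moduli of z^2 - (1 + mu - eta*lambda) z + mu over lambda in [alpha, beta],
# with the tuned eta and mu from the sketch above.
alpha, beta = 1.0, 100.0
kappa = beta / alpha
eta = 4.0 / (np.sqrt(beta) + np.sqrt(alpha)) ** 2
mu = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

moduli = [np.max(np.abs(np.roots([1.0, -(1 + mu - eta * lam), mu])))
          for lam in np.linspace(alpha, beta, 1001)]
print(max(moduli), np.sqrt(mu))   # worst-case contraction vs (sqrt(kappa)-1)/(sqrt(kappa)+1)
```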
S10.5. $m_1 = (1-\beta_1) g_1$, so $\mathbb{E}[m_1] = (1-\beta_1)\mu_g \neq \mu_g$. In general $m_t = (1-\beta_1)\sum_{i=1}^t \beta_1^{t-i} g_i$, with $\mathbb{E}[m_t] = (1-\beta_1)\mu_g \sum_{i=1}^t \beta_1^{t-i} = (1-\beta_1^t)\mu_g$. Dividing by $1-\beta_1^t$ gives the unbiased estimate $\hat m_t = m_t/(1-\beta_1^t)$ with $\mathbb{E}[\hat m_t] = \mu_g$.
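A simulation of the bias and its correction; a sketch in which Gaussian gradient noise is assumed purely for illustration:

```python
import numpy as np

# Run the EMA m_t = beta1*m_{t-1} + (1-beta1)*g_t over many independent runs
# and compare the mean of m_T and of the bias-corrected m_T / (1 - beta1^T).
rng = np.random.default_rng(0)
beta1, mu_g, T, runs = 0.9, 3.0, 10, 100_000

m_T = np.zeros(runs)
for _ in range(T):
    g = mu_g + rng.standard_normal(runs)        # noisy gradients with mean mu_g
    m_T = beta1 * m_T + (1 - beta1) * g

print(m_T.mean(), (1 - beta1 ** T) * mu_g)      # biased: matches (1 - beta1^T) mu_g
print((m_T / (1 - beta1 ** T)).mean(), mu_g)    # corrected: matches mu_g
```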
S10.6. For naive Adam-with-L2, the decay is folded into the gradient, $g + \lambda\theta$, so it passes through the preconditioner: the update is roughly $\eta(\hat m + \lambda\theta)/(\sqrt{\hat v}+\epsilon)$, and the effective per-parameter shrinkage $\eta\lambda/(\sqrt{\hat v}+\epsilon)$ scales with the inverse root second moment. AdamW decouples the decay: $\theta \leftarrow (1-\eta\lambda)\theta - \eta\hat m/(\sqrt{\hat v}+\epsilon)$. Now the shrinkage is a uniform $\eta\lambda$ regardless of $\hat v$.
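A sketch of the shrinkage contrast, using the simplification above that the decay term in naive Adam-with-L2 is divided by $\sqrt{\hat v}+\epsilon$; the learning rate, decay, and second-moment values are illustrative:

```python
import numpy as np

# Effective weight decay per parameter: L2-in-the-gradient vs decoupled decay.
eta, lam, eps = 1e-3, 0.1, 1e-8
v_hat = np.array([1e-6, 1e-2, 1.0])            # per-parameter second moments

shrink_l2 = eta * lam / (np.sqrt(v_hat) + eps) # depends strongly on v_hat
shrink_adamw = np.full_like(v_hat, eta * lam)  # uniform across parameters
print(shrink_l2)     # ~[1e-1, 1e-3, 1e-4]: spans three orders of magnitude
print(shrink_adamw)  # [1e-4, 1e-4, 1e-4]
```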
S10.7. $\bar\eta = \tfrac{1}{T}\int_0^T \eta_t\, dt = \eta_{\min} + \tfrac12(\eta_{\max}-\eta_{\min}) \cdot \tfrac{1}{T}\int_0^T (1+\cos(\pi t/T))\,dt$. The integral of $\cos$ over a full half-cycle is zero, so $\bar\eta = \eta_{\min} + \tfrac12(\eta_{\max} - \eta_{\min}) = (\eta_{\max} + \eta_{\min})/2$.
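Numerically averaging the schedule confirms the value (illustrative $\eta_{\min}$, $\eta_{\max}$, $T$):

```python
import numpy as np

# Mean of eta_t = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi t / T)) over t.
eta_min, eta_max, T = 1e-5, 1e-3, 10_000
t = np.arange(T)
eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))
print(eta_t.mean(), (eta_max + eta_min) / 2)   # both ~5.05e-4
```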
S10.9. FP32 AdamW for 70B parameters: 4 bytes × 4 tensors (param, grad, $m$, $v$) × $7 \times 10^{10}$ = 1.12 TB. Plus activation memory (typically another factor of 2–3 for an LLM), so effectively 2–3 TB. Way too much for any single GPU. With ZeRO-3 across 1024 GPUs: divide by 1024, $\approx 1.1$ GB per GPU plus a layer's worth of activations during forward pass. Fits in 80 GB with substantial headroom.
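The arithmetic spelled out; a sketch that covers only the optimiser state, with activations layered on top as in the estimate above:

```python
# FP32 AdamW state for a 70B-parameter model, and the ZeRO-3 per-GPU share.
params = 70e9
bytes_per_value = 4                       # FP32
tensors = 4                               # parameters, gradients, m, v
total_bytes = params * bytes_per_value * tensors
per_gpu_bytes = total_bytes / 1024        # ZeRO-3 sharded over 1024 GPUs
print(f"{total_bytes / 1e12:.2f} TB total, {per_gpu_bytes / 1e9:.2f} GB per GPU")
# 1.12 TB total, 1.09 GB per GPU (before activations)
```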
S10.10. With $K = 8$ stages and $M = 64$ micro-batches, bubble fraction = $(K-1)/(K + M - 1) = 7/71 \approx 9.9\%$. To get under 5%: $7/(M + 7) < 0.05 \Rightarrow M > 133$. Need at least 134 micro-batches.
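The same numbers as a short search (a sketch with $K = 8$ as above):

```python
# Pipeline bubble fraction and the smallest M that brings it under 5%.
K = 8

def bubble(K, M):
    return (K - 1) / (K + M - 1)

print(bubble(K, 64))                               # 7/71 ~ 0.099
M = next(m for m in range(1, 10_000) if bubble(K, m) < 0.05)
print(M, bubble(K, M))                             # 134, ~0.0496
```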
S10.13. Logistic regression is $\beta$-smooth with $\beta \approx \|X\|^2/(4N)$; for normalised features and $N = 10^4$ this gives $\beta \approx 1$. Convergence to accuracy $\epsilon = 10^{-3}$ requires $T = O(\beta D^2/\epsilon) = O(1 \cdot 25 / 10^{-3}) = O(2.5 \times 10^4)$ steps, or roughly 250 epochs at batch size 100.
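The step-count arithmetic as code; a sketch under the assumptions stated above, with $D = 5$ behind the $D^2 = 25$:

```python
# T = beta * D^2 / epsilon gradient steps, converted to epochs at batch size 100.
beta, D, eps = 1.0, 5.0, 1e-3
N, B = 1e4, 100
T = beta * D ** 2 / eps          # ~2.5e4 steps
epochs = T / (N / B)             # ~250 epochs
print(T, epochs)
```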
S10.16. The linear scaling rule says LR $\to$ $16\eta$ for a 16$\times$ larger batch. Two failure modes: (a) at the start of training, gradients are large and the dynamics are non-linear; the fix is a longer warmup (e.g. scale the warmup steps by 16$\times$ too). (b) The batch size has exceeded the critical batch size for this problem; there is no fix at the algorithmic level, so you either switch to an optimiser designed for large batches, such as LAMB, or accept sub-linear scaling.
S10.17. FP16's normal range is roughly $6 \times 10^{-5}$ to $6.5 \times 10^4$. Gradients commonly fall below this range and underflow to zero. Loss scaling multiplies the loss by $S$, which scales every gradient by $S$ and moves it into the representable range; the optimiser unscales before stepping. BF16 has the same exponent range as FP32, so gradients almost never underflow and no loss scaling is needed.
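A minimal demonstration of the underflow and of loss scaling, using NumPy's float16 (the gradient value and the scale $S = 2^{14}$ are illustrative):

```python
import numpy as np

# A tiny gradient underflows to zero in FP16, but survives if it is scaled up
# before the cast and unscaled afterwards in FP32.
g = 1e-8                                 # typical small gradient value
scale = 2.0 ** 14                        # loss scale S

print(np.float16(g))                     # 0.0  -- underflow
scaled = np.float16(g * scale)           # now within FP16's representable range
print(np.float32(scaled) / scale)        # ~1e-8 recovered after unscaling in FP32
```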
S10.18. $L(x, y) = x^2 - y^2$. Gradient $(2x, -2y)$, zero at the origin. Hessian eigenvalues $(2, -2)$: a saddle. Starting from $(x_0, 0)$, the $y$-coordinate stays zero forever while $x$ contracts geometrically: gradient descent converges to the saddle and never escapes, because it never acquires a component along the negative-curvature direction. In SGD, noise in $y$ pushes us off the stable axis and we then descend rapidly along the negative-curvature direction.
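A short simulation contrasting the two behaviours; a sketch in which the step size and noise scale are illustrative:

```python
import numpy as np

# Full-batch GD started on the stable axis of L = x^2 - y^2 converges to the
# saddle; adding a little gradient noise (SGD-like) escapes along the y-axis.
def grad(x, y):
    return np.array([2 * x, -2 * y])

eta, steps = 0.1, 200
rng = np.random.default_rng(0)

gd = np.array([1.0, 0.0])                 # start on the stable manifold
sgd = np.array([1.0, 0.0])
for _ in range(steps):
    gd -= eta * grad(*gd)
    sgd -= eta * (grad(*sgd) + 0.01 * rng.standard_normal(2))   # noisy gradient

print(gd)    # ~[0, 0]: stuck at the saddle
print(sgd)   # |y| has blown up: escaped along the negative-curvature direction
```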
This concludes the chapter. Optimisation is where deep learning meets the long tradition of applied mathematics, and the engineering details we have surveyed, from $\beta$-smoothness through ZeRO-3, are the price of admission to working at the frontier. Most of what is in this chapter will still be true in ten years; some will not. The instinct to verify, derive, and debug is the part that lasts.