3.16 Exercises
Exercises marked with a star (★) go beyond the chapter as written; a double star (★★) indicates that a small piece of computation is needed. Solution sketches for a representative selection follow at the end of the section.
Limits, derivatives, and Taylor
1. Using the limit definition, compute $f'(x)$ for (a) $f(x) = 1/x$, (b) $f(x) = \sqrt{x}$, and (c) $f(x) = e^{2x}$.
2. Show that the function $f(x) = x^2 \sin(1/x)$ for $x \neq 0$, $f(0) = 0$, is differentiable everywhere, but $f'$ is not continuous at $0$.
3. Use Taylor's theorem to compute $\sqrt{1.1}$ to three decimal places, and bound the error of your approximation using the Lagrange remainder.
4. State the second-order Taylor expansion of $\log(1 + x)$ around $x = 0$ and use it to estimate $\log(1.05)$.
5. Prove that if $f$ is differentiable on $(a, b)$ with $|f'(x)| \le M$ everywhere, then $|f(x) - f(y)| \le M|x - y|$ for all $x, y \in (a, b)$.
Gradients, Jacobians, chain rule
6. Let $f(x_1, x_2) = x_1^2 e^{x_2}$. Compute $\nabla f$ and the Hessian $H f$. Find all critical points.
7. Compute the Jacobian of $F(x, y) = (x^2 + y^2, \, x y, \, \sin(x + y))$. Evaluate at $(1, 0)$.
8. ★ Let $g(u) = u^\top A u$ for a symmetric matrix $A \in \mathbb{R}^{m \times m}$ and $f(x) = Bx$ for $B \in \mathbb{R}^{m \times n}$. Use the chain rule to derive $\nabla_x (g \circ f)(x)$.
9. Compute $\nabla_x \log(1 + e^{a^\top x})$ for a constant vector $a$.
10. Show that the level sets of $f(x, y) = x^2 + 4 y^2$ are ellipses, and that the gradient at $(1, 1)$ is perpendicular to the ellipse passing through that point.
11. ★★ Verify symbolically (or in Python with `sympy`) that for $F(x, y) = (\sin(x) \cos(y), \, \cos(x) \sin(y))$, the chain rule $J_{F \circ F} = J_F(F(\cdot)) \, J_F(\cdot)$ holds at $(\pi/4, \pi/3)$.
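For exercise 11, a minimal `sympy` sketch of the check might look as follows; the variable names, the way the composition is formed, and the numerical comparison at the end are our choices, not the chapter's.

```python
import sympy as sp

x, y = sp.symbols('x y')

def F(a, b):
    # F(a, b) = (sin(a) cos(b), cos(a) sin(b)) as a column vector
    return sp.Matrix([sp.sin(a) * sp.cos(b), sp.cos(a) * sp.sin(b)])

Fxy = F(x, y)
J_F = Fxy.jacobian([x, y])            # Jacobian of F as a symbolic 2x2 matrix

FF = F(Fxy[0], Fxy[1])                # the composition F(F(x, y))
J_FF = FF.jacobian([x, y])            # its Jacobian, differentiated directly

point = {x: sp.pi / 4, y: sp.pi / 3}
Fp = Fxy.subs(point)                  # value of F at the point (no free symbols left)

lhs = J_FF.subs(point)                            # J_{F o F}(p)
rhs = J_F.subs({x: Fp[0], y: Fp[1]}) * J_F.subs(point)   # J_F(F(p)) J_F(p)

print((lhs - rhs).evalf(8))           # entries should all be numerically ~0
```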
Computational graphs and reverse-mode AD
12. Draw the computational graph for $f(x_1, x_2, x_3) = (x_1 + x_2)(x_2 + x_3)(x_1 + x_3)$. Compute the forward and backward passes by hand at $(x_1, x_2, x_3) = (1, 2, -1)$, and verify the gradient by direct differentiation.
13. ★ Extend the `Value` engine of §3.7.1 to support `log` (natural logarithm). Use it to compute $\partial f / \partial x$ for $f(x) = \log(\sigma(2 x + 3))$ at $x = 0$, and verify against PyTorch.
14. ★ Modify the `Value` engine so that $a \cdot a$ correctly accumulates a gradient of $2a$ at the input rather than relying on the same node being reached twice. Trace through how the `+=` accumulation in `_backward` already does the right thing.
15. Implement, using only the engine of §3.7.1, a function `mse_loss(preds, targets)` for two equal-length lists of `Value`s, and use it to fit a one-input, one-output linear regression by gradient descent. Compare to the closed-form least-squares solution.
16. Estimate the wall-clock cost ratio of a backward pass to a forward pass for an MLP of your choice (e.g. 2-1024-1024-10) using PyTorch. Does the ratio match the rule-of-thumb 2× to 3×?
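For exercise 16, a rough timing sketch along these lines could be used; the batch size, the squared-output loss (there only to give something to backpropagate), and the timing protocol are arbitrary choices of ours.

```python
import time
import torch

# The 2-1024-1024-10 MLP suggested in the exercise.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)
x = torch.randn(4096, 2)

def timed(fn, reps=20):
    fn()                              # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_fwd = timed(lambda: model(x))       # forward only

def fwd_bwd():
    loss = model(x).square().mean()   # arbitrary scalar loss
    loss.backward()

t_both = timed(fwd_bwd)               # forward + backward
print(f"forward: {t_fwd*1e3:.2f} ms, forward+backward: {t_both*1e3:.2f} ms, "
      f"backward/forward ~ {(t_both - t_fwd) / t_fwd:.1f}x")
```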
Forward vs reverse mode
17. ★ For $F(x) = \sin(\sin(\sin(x)))$ in one variable, compute $F'(x)$ both via forward-mode AD (propagating a tangent) and via reverse-mode AD (propagating an adjoint). Show numerically that both give the same answer.
18. Suppose $F : \mathbb{R}^{1000} \to \mathbb{R}^{2}$ is a smooth function. Which mode of AD computes its full Jacobian more cheaply? Estimate the speed-up ratio.
19. Suppose $F : \mathbb{R}^{2} \to \mathbb{R}^{1000}$ is a smooth function. Same question.
Gradient identities
20. Derive $\nabla_x (\|x - a\|^2 + \lambda \|x\|^2)$ from first principles. Find the minimiser.
21. Verify the softmax-cross-entropy gradient $\nabla_z \mathrm{CE}(\mathrm{softmax}(z), y) = p - y$ for $z = (1, 2, 3)$, $y = (0, 1, 0)$.
22. Compute $\nabla_w \log(1 + e^{-y w^\top x})$ for $y \in \{-1, +1\}$, $x \in \mathbb{R}^n$, $w \in \mathbb{R}^n$. (This is the binary logistic loss in $\pm 1$ encoding.)
23. ★ Show that for a symmetric positive-definite $A$, $\nabla_x \tfrac{1}{2} (x - x_0)^\top A (x - x_0) = A(x - x_0)$, and hence the Newton step at any point lands on $x_0$ in one iteration.
24. ★ Derive $\nabla_A \mathrm{tr}(A^\top B A)$ for $A \in \mathbb{R}^{n \times m}$, $B \in \mathbb{R}^{n \times n}$ symmetric. (Hint: $\mathrm{d} \, \mathrm{tr}(A^\top B A) = \mathrm{tr}((\mathrm{d} A)^\top B A) + \mathrm{tr}(A^\top B \, \mathrm{d} A) = 2 \, \mathrm{tr}(A^\top B \, \mathrm{d} A)$.)
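A quick numerical spot-check of exercise 24's identity (for symmetric $B$, the gradient should be $2BA$) can be done with PyTorch autograd; the dimensions below are arbitrary choices of ours.

```python
import torch

n, m = 5, 3
B = torch.randn(n, n)
B = B + B.T                              # symmetrise B
A = torch.randn(n, m, requires_grad=True)

loss = torch.trace(A.T @ B @ A)          # tr(A^T B A)
loss.backward()

# Autograd gradient vs the closed-form 2 B A.
print(torch.allclose(A.grad, 2 * (B @ A.detach()), atol=1e-5))   # expect True
```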
Gradient descent and curvature
25. For $L(w) = \tfrac{1}{2} w^\top A w$ with $A = \mathrm{diag}(1, 100)$, what is the largest stable learning rate? Sketch the trajectory of gradient descent from $(1, 1)$ with $\eta = 0.01$.
26. Implement gradient descent on $L(\theta) = \theta_1^2 + 10 \theta_2^2$ from $(1, 1)$ with $\eta = 0.05$. Plot the trajectory. Repeat with momentum $\mu = 0.9$ and compare convergence speed. (A minimal sketch follows after this group of exercises.)
27. Identify the critical points of $f(x, y) = x^4 - 2 x^2 + y^2$ and classify each using the Hessian.
28. ★ Show that one step of Newton's method on a positive-definite quadratic finds the global minimum exactly. Why does this not immediately give us a fast algorithm for deep networks?
29. ★ Saddle-point exercise. Show that for $f(x, y) = x^2 - y^2$, plain gradient descent from $(1, 0.0001)$ approaches the saddle $(0, 0)$ along the $x$-axis but escapes along the $y$-axis only because of the small initial perturbation. What does this tell you about the role of stochastic noise near saddles?
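A minimal NumPy sketch for exercise 26, assuming heavy-ball momentum and leaving the plotting to matplotlib; the step count is an arbitrary choice.

```python
import numpy as np

def loss_grad(theta):
    # Gradient of L(theta) = theta_1^2 + 10 * theta_2^2.
    return np.array([2 * theta[0], 20 * theta[1]])

def run(eta=0.05, mu=0.0, steps=50):
    theta = np.array([1.0, 1.0])
    velocity = np.zeros(2)
    path = [theta.copy()]
    for _ in range(steps):
        velocity = mu * velocity - eta * loss_grad(theta)   # heavy-ball update
        theta = theta + velocity
        path.append(theta.copy())
    return np.array(path)

plain = run(mu=0.0)
momentum = run(mu=0.9)
print("plain GD, final point:   ", plain[-1])
print("momentum, final point:   ", momentum[-1])
# Plot with e.g. matplotlib: plt.plot(plain[:, 0], plain[:, 1], marker='o').
```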
Numerical pitfalls
30. Use the central difference with $h = 10^{-3}, 10^{-5}, 10^{-7}, 10^{-10}$ to estimate $f'(1)$ for $f(x) = e^x$. At which $h$ is the error smallest? Why?
31. Implement a numerically stable softmax and verify that it agrees with `torch.softmax` on inputs ranging over $z = (-1000, -999, \ldots, -990)$. What does the naive formula give? (A sketch follows after this group.)
32. ★ Find a small PyTorch program that appears to train a network but in fact has zero gradient because of an autograd bug (e.g. a forgotten `requires_grad`, an in-place operation, an `argmax` somewhere in the loss). Diagnose and fix it.
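A sketch for exercise 31; subtracting the maximum logit before exponentiating is the standard stabilisation trick, and the function name `stable_softmax` is ours.

```python
import numpy as np

def stable_softmax(z):
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max()          # largest entry becomes 0, so exp never overflows
    e = np.exp(shifted)
    return e / e.sum()

z = np.arange(-1000, -989, dtype=np.float64)   # -1000, -999, ..., -990
print(stable_softmax(z))           # well-behaved probabilities (compare with torch.softmax)

naive = np.exp(z) / np.exp(z).sum()   # every exp underflows to 0, giving 0/0 = nan
print(naive)
```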
Bonus
33. ★★ Reimplement micrograd (the engine of §3.7) using only NumPy, but extend it to support tensors of arbitrary shape. Train a 2-layer MLP on the MNIST 4 vs 9 problem with cross-entropy loss. Verify that gradients agree with PyTorch to within $10^{-4}$ relative error.
34. ★ Show that for a feed-forward network with $L$ identical linear layers and tanh non-linearities, the gradient of the loss with respect to the input scales roughly as $\prod_{l=1}^L \|W_l\| \cdot \prod_{l=1}^L \max_x |\tanh'(x)| \le \prod_l \|W_l\|$. Explain why this leads to vanishing gradients when each $\|W_l\| < 1$ and exploding gradients when each $\|W_l\| > 1$.
35. ★ Look up the checkpointing trick (Chen et al., 2016) and explain how it trades compute for memory in long backward passes. Estimate the memory saving for a 100-layer transformer if checkpointing is applied every $\sqrt{L}$ layers.
Solution sketches
Exercise 1. (a) $f(x+h) - f(x) = -h / [x(x+h)]$, so $f'(x) = -1/x^2$. (b) Rationalise: $[\sqrt{x+h} - \sqrt{x}]/h = h / \bigl[h(\sqrt{x+h} + \sqrt{x})\bigr] \to 1/(2\sqrt{x})$. (c) $[e^{2(x+h)} - e^{2x}]/h = e^{2x} (e^{2h} - 1)/h \to 2 e^{2x}$.
Exercise 3. $\sqrt{1.1} = (1 + 0.1)^{1/2} \approx 1 + 0.05 - 0.00125 + \cdots$. Up to second order, $\sqrt{1.1} \approx 1.04875$. With $f(x) = x^{1/2}$, $f'''(x) = (3/8) x^{-5/2}$, bounded above on $(1, 1.1)$ by $3/8$, so the Lagrange remainder for the third-order term is at most $3/8 \cdot (0.1)^3 / 6 = 6.25 \times 10^{-5}$. The true value is $1.04881\ldots$, error $\approx 6 \times 10^{-5}$, agreeing.
Exercise 6. Gradient: $\nabla f = (2 x_1 e^{x_2}, \, x_1^2 e^{x_2})$. Setting it to zero: $2 x_1 e^{x_2} = 0 \Rightarrow x_1 = 0$, then $x_1^2 e^{x_2} = 0$ automatically. So the critical points form the line $\{(0, x_2) : x_2 \in \mathbb{R}\}$. The Hessian $$ H f = \begin{pmatrix} 2 e^{x_2} & 2 x_1 e^{x_2} \\ 2 x_1 e^{x_2} & x_1^2 e^{x_2} \end{pmatrix} $$ at $x_1 = 0$ is $\mathrm{diag}(2 e^{x_2}, 0)$, singular, so the second-derivative test is inconclusive. Along the line $x_1 = 0$, $f \equiv 0$, while perturbing $x_1$ off the line gives $f > 0$, so each critical point is a non-strict minimum.
Exercise 8. $\nabla_x (g \circ f)(x) = J_f^\top \, \nabla g(f(x)) = B^\top \cdot 2 A f(x) = 2 B^\top A B x$, since $A$ is symmetric. (For non-symmetric $A$, the result is $B^\top (A + A^\top) B x$.)
Exercise 9. Let $u = a^\top x$, so $\nabla_x \log(1 + e^u) = \mathrm{softplus}'(u) \cdot \nabla_x u = \sigma(u) \cdot a = \sigma(a^\top x) \, a$.
Exercise 12. Let $a = x_1 + x_2$, $b = x_2 + x_3$, $c = x_1 + x_3$, $f = a b c$. Forward at $(1, 2, -1)$: $a = 3$, $b = 1$, $c = 0$, $f = 0$. Backward: $\bar{f} = 1$, $\bar{a} = b c = 0$, $\bar{b} = a c = 0$, $\bar{c} = a b = 3$. Then $\bar{x}_1 = \bar{a} + \bar{c} = 3$, $\bar{x}_2 = \bar{a} + \bar{b} = 0$, $\bar{x}_3 = \bar{b} + \bar{c} = 3$. Direct differentiation confirms: $\partial f / \partial x_1 = (x_2 + x_3)(x_1 + x_3) + (x_1 + x_2)(x_2 + x_3) = 0 + 3 = 3$, similarly for the others. ✓
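The hand computation can be cross-checked with PyTorch autograd (our own verification, not part of the original sketch):

```python
import torch

x = torch.tensor([1.0, 2.0, -1.0], requires_grad=True)
f = (x[0] + x[1]) * (x[1] + x[2]) * (x[0] + x[2])
f.backward()
print(x.grad)   # expect tensor([3., 0., 3.])
```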
Exercise 18. $F : \mathbb{R}^{1000} \to \mathbb{R}^{2}$. Reverse mode produces a row of the Jacobian per backward pass, so two backward passes suffice for the full Jacobian. Forward mode produces a column per forward pass and requires 1000 of them. Reverse mode is roughly 500× cheaper.
Exercise 19. $F : \mathbb{R}^2 \to \mathbb{R}^{1000}$. Now forward mode wins: two forward passes give the full Jacobian, vs 1000 backward passes for reverse mode.
Exercise 21. $z = (1, 2, 3)$. Compute $e^z = (e, e^2, e^3) \approx (2.718, 7.389, 20.086)$, sum $\approx 30.193$. So $p \approx (0.0900, 0.2447, 0.6652)$. With $y = (0, 1, 0)$, $p - y \approx (0.0900, -0.7553, 0.6652)$. Sanity check: the components sum to zero, as they must, since both $p$ and $y$ sum to one; equivalently, shifting all logits by a constant leaves the softmax unchanged, so the gradient has no component along the all-ones direction.
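A numeric check of the $p - y$ formula, assuming a one-hot $y$ and writing the cross-entropy directly in terms of the log-softmax:

```python
import torch

z = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0])

loss = -(y * torch.log_softmax(z, dim=0)).sum()   # cross-entropy with one-hot target
loss.backward()

p = torch.softmax(z, dim=0)
print(z.grad)             # ~ tensor([ 0.0900, -0.7553,  0.6652])
print((p - y).detach())   # the same vector
```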
Exercise 22. Let $u = -y w^\top x$, $\ell = \log(1 + e^u)$. Then $\partial \ell / \partial u = \sigma(u)$, and $\nabla_w u = -y x$. So $\nabla_w \ell = -y x \, \sigma(-y w^\top x) = -y x \, \bigl[1 - \sigma(y w^\top x)\bigr]$.
Exercise 25. $A = \mathrm{diag}(1, 100)$, so $\lambda_{\max} = 100$, and the largest stable learning rate is $2/100 = 0.02$. With $\eta = 0.01$ the trajectory takes $\theta_2$ to zero in a single step (multiplier $1 - 100 \cdot 0.01 = 0$) and converges geometrically in $\theta_1$ at rate $1 - 0.01 = 0.99$ per step.
Exercise 28. A positive-definite quadratic $f(\theta) = \tfrac{1}{2} (\theta - \theta^*)^\top A (\theta - \theta^*)$ has $\nabla f = A (\theta - \theta^*)$ and $H f = A$. The Newton update is $\theta - A^{-1} \nabla f = \theta - (\theta - \theta^*) = \theta^*$: done in one step. For deep networks the Hessian is too large to invert and is rarely positive definite anyway, so the construction does not generalise directly; practical methods approximate $H^{-1}$ instead.
Exercise 30. For $f(x) = e^x$, $f'(1) = e \approx 2.71828$. Central difference at $h = 10^{-3}$: error $\sim h^2 f''' / 6 \approx 4.5 \times 10^{-7}$. At $h = 10^{-5}$: round-off begins to compete; expect error around $10^{-11}$. At $h = 10^{-7}$: round-off dominates. At $h = 10^{-10}$: the difference $f(x+h) - f(x-h)$ is in the noise of double precision, so the error is $O(\varepsilon_{\text{mach}}/h) \sim 10^{-6}$. The optimum lies near $h \sim 10^{-5}$ to $10^{-6}$.
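The error sweep itself is a few lines of NumPy (the step sizes are those given in the exercise):

```python
import numpy as np

f = np.exp
true = np.exp(1.0)        # f'(1) = e

for h in [1e-3, 1e-5, 1e-7, 1e-10]:
    approx = (f(1.0 + h) - f(1.0 - h)) / (2 * h)   # central difference
    print(f"h = {h:.0e}, error = {abs(approx - true):.2e}")
```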
Exercise 33. Sketch: replace the scalar `Value` storage by a NumPy array; for each op, define a forward (a NumPy expression) and a backward closure that performs the relevant VJP using NumPy broadcasting. For a 2-layer MLP the ops you need are matrix multiplication, broadcasted addition, ReLU, softmax (or a fused softmax-cross-entropy), and sum. With careful handling of broadcasting the backward pass is straightforward: about 200 lines of NumPy in total. Train with mini-batch SGD for a few epochs.
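To make the VJP idea concrete, here is a bare-bones sketch of one tensor-valued op; the names `Tensor`, `matmul`, and `backward` are ours, only matrix multiplication is shown, and broadcasting is left out.

```python
import numpy as np

class Tensor:
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._parents = parents
        self._backward = lambda: None

def matmul(a, b):
    out = Tensor(a.data @ b.data, parents=(a, b))
    def _backward():
        # VJPs of C = A B: dA = dC B^T, dB = A^T dC.
        a.grad += out.grad @ b.data.T
        b.grad += a.data.T @ out.grad
    out._backward = _backward
    return out

def backward(root):
    # Topological order, then accumulate adjoints from the output back to the leaves.
    order, seen = [], set()
    def visit(t):
        if id(t) not in seen:
            seen.add(id(t))
            for p in t._parents:
                visit(p)
            order.append(t)
    visit(root)
    root.grad = np.ones_like(root.data)   # seed with ones, i.e. differentiate sum(root)
    for t in reversed(order):
        t._backward()

# Tiny shape check of the VJPs.
A = Tensor(np.random.randn(4, 3))
B = Tensor(np.random.randn(3, 2))
C = matmul(A, B)
backward(C)
print(A.grad.shape, B.grad.shape)   # (4, 3) (3, 2)
```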
Exercise 34. With identical linear layers $W$ and tanh non-linearities, the Jacobian of layer $l$'s output with respect to its input is $\mathrm{diag}(\tanh'(z_l)) \, W$. The product over $L$ layers has spectral norm bounded by $\|W\|^L \cdot (\max_x \tanh'(x))^L \le \|W\|^L$. If $\|W\| < 1$ this shrinks geometrically (vanishing gradients); if $\|W\| > 1$ it grows geometrically (exploding gradients). The standard remedies (identity-preserving initialisation, residual connections, normalisation layers, gradient clipping) each break this multiplication in some way.