Solution sketches
S2. Mistake bound is $(R/\gamma)^2 = 10{,}000$, i.e. $R/\gamma = 100$. After rescaling, $R = 100$ and $\gamma = 0.02$, giving $(100/0.02)^2 = 25{,}000{,}000$. The bound depends only on the ratio $R/\gamma$, so rescaling the data uniformly leaves it unchanged; here, however, the question scales $R$ by 100 but $\gamma$ by only 2, so $R/\gamma$ grows by a factor of 50 and the bound by $50^2 = 2500$.
S3. Hidden layer: $h_1 = \mathbb{1}[x_1 + x_2 \ge 1]$, $h_2 = \mathbb{1}[x_1 + x_2 \ge 2]$. Output: $y = h_1 - h_2$. Equivalently $h_1$ fires for OR, $h_2$ fires only for AND, and their difference is XOR.
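A minimal check of the construction (plain Python; the threshold units and weights are exactly those above):

```python
def step(z, theta):
    # Threshold unit: fires iff z >= theta.
    return 1 if z >= theta else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 1)   # OR-like unit
    h2 = step(x1 + x2, 2)   # AND-like unit
    return h1 - h2          # output weights (+1, -1)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # last column: 0, 1, 1, 0
```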
S5. Using §9.3's 2-2-1 sigmoid network with $\mathbf{W}^{(1)} = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.8 \end{pmatrix}$, $\mathbf{b}^{(1)} = (0.1, -0.2)^\top$, $\mathbf{W}^{(2)} = (0.7, -0.5)$, $b^{(2)} = 0.05$ and $\mathbf{x} = (0, 1)^\top$: $\mathbf{z}^{(1)} = (-0.2, 0.6)^\top$, $\mathbf{a}^{(1)} \approx (0.4502, 0.6457)^\top$, $z^{(2)} \approx 0.0423$, $\hat y = \sigma(0.0423) \approx 0.5106$.
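The arithmetic is easy to verify with NumPy (same weights and input as above):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

W1 = np.array([[0.5, -0.3], [0.2, 0.8]]); b1 = np.array([0.1, -0.2])
W2 = np.array([0.7, -0.5]);               b2 = 0.05
x  = np.array([0.0, 1.0])

z1 = W1 @ x + b1        # [-0.2, 0.6]
a1 = sigmoid(z1)        # [0.4502, 0.6457]
z2 = W2 @ a1 + b2       # 0.0423
print(z1, a1, z2, sigmoid(z2))   # y_hat ~ 0.5106
```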
S6. $\sigma'(z) \le 0.25$ everywhere, with equality at $z = 0$. Assuming weight magnitudes of at most 1, the gradient at the input of a 20-layer network is at most $0.25^{20} \approx 9 \times 10^{-13}$ times the gradient at the output, which is computationally indistinguishable from zero. This is why deep sigmoid networks (more than a handful of layers) were notoriously hard to train before normalisation and skip connections.
S7. $\mathrm{ReLU}'(0)$ is undefined as a classical derivative (the left and right derivatives are 0 and 1 respectively). In practice frameworks define it as 0 (both PyTorch and TensorFlow do). Because a continuous input distribution hits exactly $z = 0$ with probability zero, and floating-point inputs do so only vanishingly rarely, the choice has no measurable impact on training.
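A quick demonstration of the convention, as a sketch relying on PyTorch's behaviour (subgradient 0 at the origin):

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1.]) -- the gradient at exactly 0 is taken as 0
```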
S8. $\mathrm{softmax}_i(\mathbf{z} + c\mathbf{1}) = e^{z_i + c} / \sum_j e^{z_j + c} = e^c e^{z_i} / (e^c \sum_j e^{z_j}) = \mathrm{softmax}_i(\mathbf{z})$. Subtracting $\max_i z_i$ before exponentiation ensures all exponents are non-positive, avoiding overflow.
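A minimal NumPy implementation of the max-subtraction trick:

```python
import numpy as np

def softmax(z):
    # Shift by the max so every exponent is <= 0; by the invariance
    # proved above, this changes nothing mathematically.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))   # fine; a naive exp(1000) would overflow to inf
```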
S9. See §9.9.
S10. Let $p = \sigma(z) = 1/(1+e^{-z})$. Then $\partial p / \partial z = p(1-p)$. The BCE loss is $-y\log p - (1-y)\log(1-p)$, with derivative w.r.t. $p$ equal to $-y/p + (1-y)/(1-p) = (p - y) / [p(1-p)]$. Multiplying by $\partial p/\partial z = p(1-p)$ gives $\partial \mathcal{L}/\partial z = p - y$. $\square$
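A finite-difference spot-check of the identity (NumPy sketch; the test point $z = 0.7$, $y = 1$ is arbitrary):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def bce(z, y):
    p = sigmoid(z)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

z, y, eps = 0.7, 1.0, 1e-6
numeric  = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # central difference
analytic = sigmoid(z) - y                                     # p - y
print(numeric, analytic)   # agree to ~1e-10
```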
S11. Forward pass on $\mathbf{x} = (0, 1)^\top$ using the §9.3 / §9.7 sigmoid 2-2-1 network: from S5, $\hat y \approx 0.5106$. Loss with $y = 0$: $\mathcal{L} = \tfrac{1}{2}(0 - 0.5106)^2 \approx 0.1303$. Output gradient: $\partial\mathcal{L}/\partial\hat y = \hat y - y = 0.5106$. Output delta with $\sigma'(z^{(2)}) = \hat y(1-\hat y) \approx 0.2499$: $\delta^{(2)} \approx 0.5106 \cdot 0.2499 \approx 0.1276$. Output weight gradient: $\delta^{(2)} \mathbf{a}^{(1)\top} \approx (0.0574, 0.0824)$; output bias gradient: $0.1276$. Hidden delta with $\sigma'(z^{(1)}) = (0.2475, 0.2288)$ and $\mathbf{W}^{(2)\top}\delta^{(2)} \approx (0.0893, -0.0638)$: $\boldsymbol{\delta}^{(1)} \approx (0.0221, -0.0146)$. Hidden weight gradient: $\boldsymbol{\delta}^{(1)} \mathbf{x}^\top = \begin{pmatrix} 0 & 0.0221 \\ 0 & -0.0146 \end{pmatrix}$; hidden bias gradient: $\boldsymbol{\delta}^{(1)}$. SGD step with $\eta = 0.1$ subtracts each gradient from the corresponding parameter.
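The whole pass reproduced in NumPy (same network and input as S5; last-digit differences are rounding):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

W1 = np.array([[0.5, -0.3], [0.2, 0.8]]); b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, -0.5]]);             b2 = np.array([0.05])
x, y = np.array([0.0, 1.0]), 0.0

# Forward
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)

# Backward, squared loss 0.5 * (y_hat - y)^2
d2  = (y_hat - y) * y_hat * (1 - y_hat)    # output delta, ~0.1276
gW2 = np.outer(d2, a1)                     # ~[0.0574, 0.0824]
d1  = (W2.T @ d2) * a1 * (1 - a1)          # ~[0.0221, -0.0146]
gW1 = np.outer(d1, x)                      # first column zero since x1 = 0

# SGD step
eta = 0.1
W1 -= eta * gW1; b1 -= eta * d1
W2 -= eta * gW2; b2 -= eta * d2
```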
S12. $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$. With weights $w_1 = 1, w_2 = -1$, biases zero, and output weights $1, 1$, the network computes $|x|$ exactly. Generalising to a piecewise linear function $f$ with breakpoints $a_1 < \ldots < a_K$, use one ReLU per breakpoint plus a linear term for the slope left of $a_1$: $f(x) = c_0 + c_{\text{lin}} x + \sum_k c_k \mathrm{ReLU}(x - a_k)$, where $c_{\text{lin}}$ is the leftmost slope and each $c_k$ is the change in slope at $a_k$.
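A sketch of the general construction (NumPy; the helper `piecewise` and its test on $|x|$ are illustrative):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def piecewise(x, c0, c_lin, breaks, changes):
    # f(x) = c0 + c_lin*x + sum_k changes[k] * ReLU(x - breaks[k]).
    # c_lin is the slope left of the first breakpoint; changes[k] is
    # the jump in slope at breaks[k].
    y = c0 + c_lin * x
    for a, c in zip(breaks, changes):
        y = y + c * relu(x - a)
    return y

x = np.linspace(-2, 2, 9)
# |x|: slope -1 on the left, slope change +2 at the single breakpoint 0.
print(np.allclose(piecewise(x, 0.0, -1.0, [0.0], [2.0]), np.abs(x)))   # True
```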
S13. The triangle wave $T(x)$ on $[0, 1]$ defined by $T(x) = 2x$ for $x \in [0, 1/2]$ and $T(x) = 2(1-x)$ for $x \in [1/2, 1]$ is realised by a width-2 ReLU layer: $T(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x - 1/2)$. The composition $T^{\circ L}$ is a sawtooth with $2^L$ linear pieces, hence roughly $2^L$ breakpoints. A shallow ReLU network needs one unit per breakpoint to match, i.e. width exponential in $L$, whereas the deep composition uses only $2L$ units.
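An empirical count of the pieces (NumPy sketch; kinks are detected as changes in the discrete slope):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
T = lambda x: 2 * relu(x) - 4 * relu(x - 0.5)   # triangle wave on [0, 1]

x = np.linspace(0, 1, 1_000_001)
for L in range(1, 5):
    y = x
    for _ in range(L):
        y = T(y)
    slopes = np.diff(y)
    kinks = np.count_nonzero(np.abs(np.diff(slopes)) > 1e-9)
    print(L, kinks + 1)   # linear pieces: 2, 4, 8, 16
```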
S14. Assume zero-mean inputs and weights, weights independent of inputs. Forward: $\mathrm{Var}(z_i) = d_{\text{in}} \mathrm{Var}(W) \mathrm{Var}(x)$, so for $\mathrm{Var}(z) = \mathrm{Var}(x)$ require $\mathrm{Var}(W) = 1/d_{\text{in}}$. Backward similarly gives $\mathrm{Var}(W) = 1/d_{\text{out}}$. Glorot averages: $\mathrm{Var}(W) = 2/(d_{\text{in}} + d_{\text{out}})$.
S15. ReLU zeroes the negative half of a symmetric zero-mean pre-activation, so $\mathbb{E}[\mathrm{ReLU}(z)^2] = \tfrac{1}{2}\mathbb{E}[z^2]$: the second moment is halved at every layer. To compensate, $\mathrm{Var}(W)$ must double relative to the linear case: $\mathrm{Var}(W) = 2/d_{\text{in}}$ (He initialisation).
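Both rules are easy to check by Monte Carlo (NumPy; the widths and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in = 1000

# For z ~ N(0, 1), the second moment of ReLU(z) is half that of z.
z = rng.standard_normal(200_000)
print(np.mean(np.maximum(z, 0) ** 2))                      # ~0.5

# Linear layer with Glorot-style fan-in scaling preserves variance.
x = rng.standard_normal((2000, d_in))
W = rng.normal(0.0, np.sqrt(1.0 / d_in), (d_in,))
print(np.var(x @ W))                                       # ~1.0

# ReLU layer with He scaling preserves the activation second moment.
a_in = np.maximum(rng.standard_normal((2000, d_in)), 0)    # ReLU-activated inputs
W = rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in,))
a_out = np.maximum(a_in @ W, 0)
print(np.mean(a_in ** 2), np.mean(a_out ** 2))             # both ~0.5
```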
S16. Take $W = (-1, -1)$, $b = -1$, and inputs $\mathbf{x} \in [0, 1]^2$. Pre-activation is at most $-1$, so the ReLU is always zero and the gradient w.r.t. $W$ and $b$ is zero. The unit is dead.
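A minimal PyTorch demonstration (inputs drawn uniformly from $[0,1]^2$ as in the exercise):

```python
import torch

W = torch.tensor([[-1.0, -1.0]], requires_grad=True)
b = torch.tensor([-1.0], requires_grad=True)

x = torch.rand(1000, 2)               # inputs in [0, 1]^2
out = torch.relu(x @ W.T + b).sum()   # every pre-activation is <= -1
out.backward()
print(out.item(), W.grad, b.grad)     # 0.0 and all-zero gradients: a dead unit
```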
S19. Each unit is independently kept or dropped, so $2^K$ distinct masks. Dropout averages over an exponentially large family of subnetworks.
S20. For SGD, both reduce to $\mathbf{W} \leftarrow (1 - \eta\lambda)\mathbf{W} - \eta \mathbf{g}$. For Adam, the L2 gradient $\lambda \mathbf{W}$ is divided by $\sqrt{\hat{v}_t}$ (the running RMS estimate), giving an effective per-parameter weight decay scaled by $1/\sqrt{\hat{v}_t}$. This is generally not what you want; AdamW restores decoupled decay.
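A sketch contrasting the two updates, using a simplified second-moment-only optimiser (no momentum or bias correction, so RMSProp-like rather than full Adam):

```python
import numpy as np

def coupled_step(w, g, v, lr=1e-3, lam=1e-2, eps=1e-8):
    # L2 folded into the gradient: the decay term gets divided by sqrt(v).
    g = g + lam * w
    v = 0.999 * v + 0.001 * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

def decoupled_step(w, g, v, lr=1e-3, lam=1e-2, eps=1e-8):
    # AdamW-style: decay applied directly to the weights, unscaled.
    v = 0.999 * v + 0.001 * g ** 2
    return w - lr * g / (np.sqrt(v) + eps) - lr * lam * w, v
```

Under the coupled update, a parameter with a large gradient history (large $v$) receives almost no decay; the decoupled update decays every parameter at the same rate.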
S23. (i) Mini-batch noise acts as an implicit regulariser, biasing solutions toward flat minima associated with better generalisation (Keskar et al., 2017). (ii) Mini-batch SGD performs more updates per epoch than full-batch GD with the same compute, and these noisy updates can escape sharp local minima and saddle points more readily.
S29. For `__pow__(self, k)` with constant $k$: `out` is $a^k$, the local gradient is $k a^{k-1}$, so `self.grad += k * self.data**(k-1) * out.grad`. For `exp`: `out` is $e^a$, the local gradient is $e^a$ itself, so `self.grad += out.data * out.grad`. For `log`: `out` is $\log a$, the local gradient is $1/a$, so `self.grad += (1.0 / self.data) * out.grad`.
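A runnable sketch of the three ops in a micrograd-style scalar class (the surrounding `Value` scaffolding is a minimal stand-in for the chapter's; only what the backward rules need is included):

```python
import math

class Value:
    # Minimal scalar autodiff node (micrograd-style sketch).
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._backward, self._prev = (lambda: None), set(children)

    def __pow__(self, k):                        # k is a plain constant
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,))
        def _backward():
            self.grad += out.data * out.grad     # d(e^a)/da = e^a = out.data
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _backward():
            self.grad += (1.0 / self.data) * out.grad
        out._backward = _backward
        return out

# Spot-check: d(log a)/da at a = 2 is 0.5.
a = Value(2.0); out = a.log(); out.grad = 1.0; out._backward()
print(a.grad)   # 0.5
```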
S33. All units in a symmetrically initialised layer compute the same function of the input, so they receive identical gradients during backprop and update by the same amount. Hence they remain identical for all time, and the layer is effectively rank-one regardless of how many units it nominally contains. Random initialisation is required to break this symmetry.
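The symmetry is easy to exhibit numerically (PyTorch sketch; constant initialisation stands in for any permutation-symmetric scheme):

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(3, 4), torch.nn.Tanh(),
                          torch.nn.Linear(4, 1))
for p in net.parameters():            # symmetric init: every parameter identical
    torch.nn.init.constant_(p, 0.5)

x = torch.randn(8, 3)
net(x).sum().backward()
print(net[0].weight.grad)             # all four rows identical: the units are clones
```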
S34. A forward pass through layer $l$ does $d_l d_{l-1}$ multiplies and $d_l d_{l-1}$ adds (one matmul). The backward pass does two matmuls of the same shape: $\boldsymbol{\delta}^{(l-1)} = \mathbf{W}^{(l)\top} \boldsymbol{\delta}^{(l)}$ and $\partial \mathcal{L}/\partial \mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^\top$. Total backward FLOPs are roughly twice forward FLOPs.
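Concretely, with illustrative layer widths:

```python
d_in, d_out = 256, 512      # illustrative dimensions
fwd = 2 * d_out * d_in      # one matvec: d_out*d_in multiplies plus as many adds
bwd = 2 * fwd               # two matvecs of the same shape
print(fwd, bwd)             # 262144 524288 -- backward is ~2x forward
```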
S36. The probability that a $\mathcal{N}(-3, 1)$ random variable is positive is $\Phi(-3) \approx 0.0013$. The unit is active on barely 1 in 1000 inputs, and gradient flow through it is correspondingly weak. He initialisation assumes zero-mean, symmetric pre-activations; if a layer's pre-activations drift negative, its units die off disproportionately. Hence the standard advice to apply normalisation (or at least input centring) to prevent activation drift.
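A Monte Carlo check (NumPy, $10^6$ draws):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(-3.0, 1.0, 1_000_000)   # pre-activations of the shifted unit
print(np.mean(z > 0))                   # ~0.00135, matching Phi(-3)
```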