9.7 A backpropagation worked example by hand

The previous section developed backpropagation as a piece of mathematics: we wrote down the loss, applied the chain rule layer by layer, and arrived at a recurrence that lets us compute every gradient by passing one error signal backwards through the network. That derivation can feel abstract on first reading, because the indices and superscripts pile up faster than any intuition can keep pace with. The remedy, as so often in mathematics, is to do the calculation on a concrete example. In this section we take a tiny network with just six weights, three biases, and one training example, and we run the entire algorithm using nothing but pen and paper. By the end you will have computed the gradient of the loss with respect to every parameter, applied a single step of stochastic gradient descent, and verified that the loss has gone down. If you can follow the arithmetic in this section, you understand backpropagation. The remainder of the chapter, and indeed the bulk of modern deep learning, is just the same algorithm running on bigger matrices.

This section is the bridge between Section 9.6, where backpropagation was derived from first principles, and Section 9.8, where the same calculation is implemented in roughly eighty lines of NumPy and scaled up to handle thousands of training examples per second on a real dataset. The arithmetic here is identical to what NumPy will be doing inside the training loop; the only difference is the size of the matrices and the speed of the processor. Working through the small case once, by hand, is the surest way to demystify what an autograd library is doing on your behalf. The exercise is also the best preparation for the day that will eventually come, when a model trains badly and you need to inspect intermediate values to figure out which layer is misbehaving. People who have done this calculation know what to look for; people who have not must take the framework's word for everything.

Symbols used here

$\mathbf{x}$: input vector
$y$: target output (scalar in this example)
$\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$: weights and bias of the hidden layer
$\mathbf{W}^{(2)}, b^{(2)}$: weights and bias of the output layer
$\mathbf{z}^{(\ell)}$: pre-activation at layer $\ell$
$\mathbf{a}^{(\ell)}$: activation at layer $\ell$
$\hat y$: predicted output, $\hat y = a^{(2)}$
$\sigma(z) = 1/(1+e^{-z})$: sigmoid activation
$\sigma'(z) = \sigma(z)(1 - \sigma(z))$: derivative of the sigmoid
$\mathcal{L}$: loss; here $\mathcal{L} = \tfrac{1}{2}(y - \hat y)^2$ (squared error)
$\boldsymbol{\delta}^{(\ell)}$: error signal at layer $\ell$, shape $(d_\ell,)$
$\nabla_{\mathbf{W}^{(\ell)}} \mathcal{L}$: gradient of $\mathcal{L}$ with respect to $\mathbf{W}^{(\ell)}$
$\odot$: element-wise (Hadamard) product

The network we will train by hand

The network has two inputs, two hidden neurons with sigmoid activations, and a single output neuron with a sigmoid activation. This 2-2-1 architecture is the smallest network that exhibits every feature of a deeper one: a non-linear hidden layer, a non-linear output layer, two coupled weight matrices and two bias vectors, and a non-trivial chain rule that has to thread its way back through both layers. Anything smaller than this, such as a 2-1 network with no hidden layer, would degenerate into a logistic regression and would not exercise the recurrence at the heart of backpropagation. Once you can backpropagate through this, the same algorithm with the same equations applies to a network with a hundred layers; the matrices simply get larger and the for-loop runs more iterations.

We start from the following parameter values. These exact numbers will be used throughout, so it is worth copying them onto a piece of scratch paper as you read.

$$\mathbf{W}^{(1)} = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.8 \end{pmatrix}, \qquad \mathbf{b}^{(1)} = \begin{pmatrix} 0.1 \\ -0.2 \end{pmatrix}$$

$$\mathbf{W}^{(2)} = \begin{pmatrix} 0.7 & -0.5 \end{pmatrix}, \qquad b^{(2)} = 0.05$$

The training example is $\mathbf{x} = (1, 0)^\top$ with target $y = 1$. You can read this concretely if it helps: the network is being told that whenever the first feature is on and the second is off, the correct answer is one. A single training example is not enough to learn anything in general, but it is enough to demonstrate one full step of the algorithm. The loss is half the squared error, $\mathcal{L} = \tfrac{1}{2}(y - \hat y)^2$. The factor of $\tfrac{1}{2}$ is a cosmetic convention that makes the derivative come out neat: differentiating $\tfrac{1}{2}(y-\hat y)^2$ with respect to $\hat y$ gives $-(y-\hat y)$, with no stray factor of two left over to clutter the algebra. The learning rate for our single SGD step is $\eta = 0.5$, which is large by the standards of real training (typical values for neural networks lie between $10^{-4}$ and $10^{-2}$) but convenient for showing a visible change in one step. With a smaller $\eta$ the loss would still decrease, just by a much smaller amount.

Our goal in the rest of the section is straightforward. We will run the forward pass to produce a prediction $\hat y$ and a loss $\mathcal{L}$. We will then compute, by hand, the partial derivative of $\mathcal{L}$ with respect to every entry of $\mathbf{W}^{(1)}$, $\mathbf{b}^{(1)}$, $\mathbf{W}^{(2)}$ and $b^{(2)}$. That is nine numbers in total: four entries in $\mathbf{W}^{(1)}$, two in $\mathbf{b}^{(1)}$, two in $\mathbf{W}^{(2)}$, and one for $b^{(2)}$. We will apply one step of stochastic gradient descent to all of those parameters at once. Finally we will run the forward pass again with the updated parameters and check that the loss has dropped. Each of these stages is a separate subsection below.
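If you would like to follow along in code, the setup translates directly into NumPy. This is a sketch to accompany the hand calculation, anticipating the fuller implementation of Section 9.8; the variable names are my own choice.

```python
import numpy as np

# Parameters of the 2-2-1 network, exactly as given above.
W1 = np.array([[0.5, -0.3],
               [0.2,  0.8]])   # hidden-layer weights, shape (2, 2)
b1 = np.array([0.1, -0.2])     # hidden-layer biases, shape (2,)
W2 = np.array([0.7, -0.5])     # output-layer weights, shape (2,)
b2 = 0.05                      # output-layer bias, a scalar

x = np.array([1.0, 0.0])       # the single training example
y = 1.0                        # its target
eta = 0.5                      # learning rate for the SGD step
```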

Step 1: Forward pass

The forward pass is just a sequence of matrix multiplications and element-wise activations. We start with $\mathbf{x} = (1, 0)^\top$ and propagate forwards.

The hidden-layer pre-activation is $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$. Writing out the two components of that matrix-vector product separately:

$z^{(1)}_1 = 0.5 \cdot 1 + (-0.3) \cdot 0 + 0.1 = 0.5 + 0 + 0.1 = 0.6$

$z^{(1)}_2 = 0.2 \cdot 1 + 0.8 \cdot 0 + (-0.2) = 0.2 + 0 - 0.2 = 0$

So $\mathbf{z}^{(1)} = (0.6, 0)^\top$. The second component happens to be exactly zero because the only contribution from $x_1 = 1$ is $0.2$, and the bias $-0.2$ cancels it. This is an accident of the chosen numbers, but it makes the arithmetic in later steps slightly easier.

The hidden-layer activation is $\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$, applying the sigmoid element-wise. The sigmoid function is $\sigma(z) = 1/(1 + e^{-z})$.

$\sigma(0.6)$: we need $e^{-0.6}$. Using the standard value $e^{-0.6} \approx 0.5488$, we get $\sigma(0.6) = 1/(1 + 0.5488) = 1/1.5488 \approx 0.6457$.

$\sigma(0) = 1/(1 + e^{0}) = 1/(1 + 1) = 1/2 = 0.5$. The sigmoid passes through $(0, 0.5)$, so this is exact.

So $\mathbf{a}^{(1)} = (0.6457, 0.5)^\top$.

Now propagate to the output layer. The output pre-activation is the scalar $z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)}$:

$z^{(2)} = 0.7 \cdot 0.6457 + (-0.5) \cdot 0.5 + 0.05 = 0.4520 - 0.2500 + 0.0500 = 0.2520$

The output activation is $\hat y = \sigma(z^{(2)}) = \sigma(0.2520)$. We need $e^{-0.2520}$. Using $e^{-0.2520} \approx 0.7773$, we have $\sigma(0.2520) = 1/(1 + 0.7773) = 1/1.7773 \approx 0.5627$.

Therefore $\hat y \approx 0.5627$.

The loss is $\mathcal{L} = \tfrac{1}{2}(y - \hat y)^2 = \tfrac{1}{2}(1 - 0.5627)^2 = \tfrac{1}{2}(0.4373)^2 = \tfrac{1}{2} \cdot 0.1912 \approx 0.0956$.

A quick sanity check is in order before we move on. The output sigmoid means $\hat y$ must lie between 0 and 1; our value of $0.5627$ is in range, so nothing has gone obviously wrong. The target is $y = 1$, so the network's prediction is undershooting by $0.4373$. That is a meaningful error, large enough that one SGD step should shift things noticeably. If $\hat y$ had come out close to 1, we would already be near the target and the gradient would be small; if it had come out close to 0, the gradient would also be small because $\sigma'$ vanishes near both extremes. A mid-range output like $0.5627$ is roughly where the sigmoid's derivative is largest and where SGD takes the most decisive steps. This is one of several reasons why initialisation schemes for neural networks aim to put the pre-activations in the middle of the activation function's dynamic range; we will return to this idea in Section 9.10.

We now have everything we need to start the backward pass: the inputs, the pre-activations, the activations, the prediction, and the loss. Backpropagation reuses every one of these intermediate values, which is why a real implementation caches them during the forward pass rather than recomputing them. In a NumPy implementation each intermediate quantity is simply kept in a local variable; in a framework like PyTorch each one becomes a node on the computation graph that the backward pass will traverse in reverse.
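Here is the same forward pass in NumPy, continuing the setup sketch above; the `sigmoid` helper is mine. With double-precision floats the values agree with the hand arithmetic to about three decimal places.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass; every intermediate value is kept, because the
# backward pass will reuse all of them.
z1 = W1 @ x + b1                 # hidden pre-activation: [0.6, 0.0]
a1 = sigmoid(z1)                 # hidden activation:     [0.6457, 0.5]
z2 = W2 @ a1 + b2                # output pre-activation: 0.2520
y_hat = sigmoid(z2)              # prediction:            0.5627
loss = 0.5 * (y - y_hat) ** 2    # loss:                  0.0956
```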

Step 2: Output-layer error

Backpropagation is just the chain rule applied carefully. We want to know how the loss depends on the output pre-activation $z^{(2)}$, because once we know that, the gradients of the output-layer parameters $\mathbf{W}^{(2)}$ and $b^{(2)}$ follow immediately. The quantity $\delta^{(2)} = \partial \mathcal{L} / \partial z^{(2)}$ is what we are after.

Decompose it into two factors using the chain rule:

$\delta^{(2)} = \dfrac{\partial \mathcal{L}}{\partial z^{(2)}} = \dfrac{\partial \mathcal{L}}{\partial \hat y} \cdot \dfrac{\partial \hat y}{\partial z^{(2)}}$

The first factor is the derivative of the loss with respect to the prediction. Since $\mathcal{L} = \tfrac{1}{2}(y - \hat y)^2$, differentiating with respect to $\hat y$ gives $-(y - \hat y)$:

$\dfrac{\partial \mathcal{L}}{\partial \hat y} = -(y - \hat y) = -(1 - 0.5627) = -0.4373$

The second factor is the derivative of the sigmoid at $z^{(2)}$. The sigmoid has the convenient property that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so we can express the derivative entirely in terms of the activation we already computed:

$\dfrac{\partial \hat y}{\partial z^{(2)}} = \sigma'(z^{(2)}) = \hat y (1 - \hat y) = 0.5627 \cdot (1 - 0.5627) = 0.5627 \cdot 0.4373 \approx 0.2461$

Multiplying:

$\delta^{(2)} = (-0.4373) \cdot 0.2461 \approx -0.1076$

The output-layer error is $\delta^{(2)} \approx -0.1076$. Take a moment to interpret this number. It tells us how much the loss changes for a small change in $z^{(2)}$. The negative sign means that increasing $z^{(2)}$ decreases the loss, which is exactly what we want, because $z^{(2)} = 0.2520$ at present and pushing it upwards drives $\hat y$ closer to the target value of 1. SGD will exploit this by adjusting the parameters in the direction that increases $z^{(2)}$.

The output-layer parameter gradients now drop out almost for free. The output pre-activation is $z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)}$, so $\partial z^{(2)} / \partial \mathbf{W}^{(2)} = (\mathbf{a}^{(1)})^\top$ and $\partial z^{(2)} / \partial b^{(2)} = 1$. Multiplying by $\delta^{(2)}$:

$\nabla_{\mathbf{W}^{(2)}} \mathcal{L} = \delta^{(2)} \cdot (\mathbf{a}^{(1)})^\top = -0.1076 \cdot (0.6457, 0.5) = (-0.0695, -0.0538)$

$\nabla_{b^{(2)}} \mathcal{L} = \delta^{(2)} = -0.1076$

Both gradient entries of $\mathbf{W}^{(2)}$ are negative, and so is the bias gradient, telling us that all three output-layer parameters should be increased (because SGD subtracts the gradient). That makes intuitive sense: increasing any of these parameters increases $z^{(2)}$, which increases $\hat y$, which moves the prediction towards the target $y = 1$. Backpropagation has rediscovered, automatically, what the algebra was telling us a paragraph ago.
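In code, the whole of Step 2 is a handful of lines, continuing the running sketch; the names `delta2`, `grad_W2` and `grad_b2` are mine.

```python
# Output-layer error dL/dz2, split into the two chain-rule factors.
dL_dyhat = -(y - y_hat)            # -0.4373
dyhat_dz2 = y_hat * (1 - y_hat)    # sigmoid derivative: 0.2461
delta2 = dL_dyhat * dyhat_dz2      # -0.1076

# Output-layer parameter gradients.
grad_W2 = delta2 * a1              # [-0.0695, -0.0538]
grad_b2 = delta2                   # -0.1076
```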

Step 3: Hidden-layer error

We now propagate the error backwards through the output weights to find the hidden-layer error vector $\boldsymbol{\delta}^{(1)} = \partial \mathcal{L} / \partial \mathbf{z}^{(1)}$. The general backpropagation recurrence from Section 9.6 says

$\boldsymbol{\delta}^{(1)} = \big( (\mathbf{W}^{(2)})^\top \delta^{(2)} \big) \odot \sigma'(\mathbf{z}^{(1)})$

The left factor reweights the output error by the output weights, sending each unit's share of the blame back to the hidden neurons that fed into it. The right factor multiplies element-wise by the local slope of the hidden activation, because a hidden neuron sitting on a flat part of its sigmoid contributes very little to the output regardless of how its weights change.

Compute the left factor first. $\mathbf{W}^{(2)} = (0.7, -0.5)$, so $(\mathbf{W}^{(2)})^\top$ is the same numbers laid out as a column. Multiplying by the scalar $\delta^{(2)} = -0.1076$:

$(\mathbf{W}^{(2)})^\top \delta^{(2)} = \begin{pmatrix} 0.7 \\ -0.5 \end{pmatrix} \cdot (-0.1076) = \begin{pmatrix} 0.7 \cdot (-0.1076) \\ -0.5 \cdot (-0.1076) \end{pmatrix} = \begin{pmatrix} -0.0753 \\ 0.0538 \end{pmatrix}$

Now the right factor, $\sigma'(\mathbf{z}^{(1)})$, computed element-wise from $\mathbf{z}^{(1)} = (0.6, 0)^\top$ and the values of $\mathbf{a}^{(1)}$ already in hand. Using $\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1 - a)$:

$\sigma'(0.6) = 0.6457 \cdot (1 - 0.6457) = 0.6457 \cdot 0.3543 \approx 0.2287$

$\sigma'(0) = 0.5 \cdot (1 - 0.5) = 0.5 \cdot 0.5 = 0.25$

The Hadamard product combines the two component-wise:

$\boldsymbol{\delta}^{(1)} = \begin{pmatrix} -0.0753 \\ 0.0538 \end{pmatrix} \odot \begin{pmatrix} 0.2287 \\ 0.2500 \end{pmatrix} = \begin{pmatrix} -0.0753 \cdot 0.2287 \\ 0.0538 \cdot 0.2500 \end{pmatrix} = \begin{pmatrix} -0.0172 \\ 0.01345 \end{pmatrix}$

So $\boldsymbol{\delta}^{(1)} \approx (-0.0172, 0.01345)^\top$. Notice how much smaller these numbers are than $\delta^{(2)}$. The hidden error has been attenuated both by the modest output weights and by the sigmoid derivatives, which are bounded above by $0.25$. This shrinkage is the seed of the vanishing-gradient problem we will revisit in Section 9.11; in deep networks the same multiplications happen at every layer and the signal can fade to nothing.

The hidden-layer parameter gradients follow from the same identity used at the output layer. We have $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$, so

$\nabla_{\mathbf{W}^{(1)}} \mathcal{L} = \boldsymbol{\delta}^{(1)} \mathbf{x}^\top$

The outer product of a $2 \times 1$ column with a $1 \times 2$ row gives a $2 \times 2$ matrix. With $\mathbf{x} = (1, 0)^\top$ this is

$\nabla_{\mathbf{W}^{(1)}} \mathcal{L} = \begin{pmatrix} -0.0172 \\ 0.01345 \end{pmatrix} \cdot (1, 0) = \begin{pmatrix} -0.0172 \cdot 1 & -0.0172 \cdot 0 \\ 0.01345 \cdot 1 & 0.01345 \cdot 0 \end{pmatrix} = \begin{pmatrix} -0.0172 & 0 \\ 0.01345 & 0 \end{pmatrix}$

The right column is all zeros because $x_2 = 0$ contributed nothing to either hidden pre-activation, so changing the weights that multiply $x_2$ cannot change the loss for this particular training example. With a different input vector, those entries would be non-zero.

The hidden bias gradient is just $\boldsymbol{\delta}^{(1)}$ itself, since $\partial \mathbf{z}^{(1)} / \partial \mathbf{b}^{(1)}$ is the identity matrix:

$\nabla_{\mathbf{b}^{(1)}} \mathcal{L} = \boldsymbol{\delta}^{(1)} = \begin{pmatrix} -0.0172 \\ 0.01345 \end{pmatrix}$

We now have the gradient of the loss with respect to every parameter. That is the entire purpose of backpropagation, and we have done it with no machinery beyond the chain rule, sigmoid arithmetic, and a few matrix-vector products. Crucially, we did not need to compute each parameter's gradient from scratch. The error signal $\delta^{(2)}$ was reused to produce both the output-layer gradients and (via the recurrence) the hidden-layer error $\boldsymbol{\delta}^{(1)}$, which in turn produced the hidden-layer gradients. That reuse is what gives backpropagation its computational efficiency: the cost of computing all the gradients is roughly the same as the cost of one extra forward pass, no matter how many parameters the network has.
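The hidden-layer part of the backward pass, continuing the sketch. Because `W2` is stored as a flat array of shape `(2,)` and `delta2` is a scalar, the transpose-and-multiply in the recurrence reduces to an ordinary element-wise product.

```python
# Hidden-layer error: backpropagate through W2, then apply the
# sigmoid derivative element-wise (the Hadamard product).
delta1 = (W2 * delta2) * a1 * (1 - a1)   # [-0.0172, 0.01345]

# Hidden-layer parameter gradients: an outer product with the input.
grad_W1 = np.outer(delta1, x)            # [[-0.0172, 0.0], [0.01345, 0.0]]
grad_b1 = delta1                         # [-0.0172, 0.01345]
```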

Step 4: One SGD update

Stochastic gradient descent applies the update rule $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$ to every parameter, using the gradient we just computed. The learning rate is $\eta = 0.5$.

For the hidden weight matrix:

$\mathbf{W}^{(1)}_{\text{new}} = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.8 \end{pmatrix} - 0.5 \cdot \begin{pmatrix} -0.0172 & 0 \\ 0.01345 & 0 \end{pmatrix}$

Computing each entry of $0.5 \cdot \nabla_{\mathbf{W}^{(1)}} \mathcal{L}$ first: $0.5 \cdot (-0.0172) = -0.0086$ and $0.5 \cdot 0.01345 = 0.006725$. The other two entries are zero. Subtracting:

$\mathbf{W}^{(1)}_{\text{new}} = \begin{pmatrix} 0.5 - (-0.0086) & -0.3 - 0 \\ 0.2 - 0.006725 & 0.8 - 0 \end{pmatrix} = \begin{pmatrix} 0.5086 & -0.3 \\ 0.1933 & 0.8 \end{pmatrix}$

For the hidden bias:

$\mathbf{b}^{(1)}_{\text{new}} = \begin{pmatrix} 0.1 \\ -0.2 \end{pmatrix} - 0.5 \cdot \begin{pmatrix} -0.0172 \\ 0.01345 \end{pmatrix} = \begin{pmatrix} 0.1 - (-0.0086) \\ -0.2 - 0.006725 \end{pmatrix} = \begin{pmatrix} 0.1086 \\ -0.2067 \end{pmatrix}$

For the output weights, $0.5 \cdot \nabla_{\mathbf{W}^{(2)}} \mathcal{L} = 0.5 \cdot (-0.0695, -0.0538) = (-0.03475, -0.0269)$:

$\mathbf{W}^{(2)}_{\text{new}} = (0.7, -0.5) - (-0.03475, -0.0269) = (0.7348, -0.4731)$

For the output bias:

$b^{(2)}_{\text{new}} = 0.05 - 0.5 \cdot (-0.1076) = 0.05 - (-0.0538) = 0.05 + 0.0538 = 0.1038$

Collecting the new parameter values for reference:

$\mathbf{W}^{(1)}_{\text{new}} = \begin{pmatrix} 0.5086 & -0.3 \\ 0.1933 & 0.8 \end{pmatrix}, \quad \mathbf{b}^{(1)}_{\text{new}} = \begin{pmatrix} 0.1086 \\ -0.2067 \end{pmatrix}$

$\mathbf{W}^{(2)}_{\text{new}} = (0.7348, -0.4731), \quad b^{(2)}_{\text{new}} = 0.1038$

Notice the pattern: every parameter has moved in the opposite direction from its gradient. The two parameters with zero gradient have not moved at all, because for this particular input $x_2 = 0$, those weights had no effect on the loss and SGD has nothing to say about them. They will be updated on a future training example where $x_2 \neq 0$.
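In the running sketch, the entire update is four lines, one per parameter group.

```python
# One SGD step: theta <- theta - eta * grad, for every parameter at once.
W1 -= eta * grad_W1   # [[0.5086, -0.3], [0.1933, 0.8]]
b1 -= eta * grad_b1   # [0.1086, -0.2067]
W2 -= eta * grad_W2   # [0.7348, -0.4731]
b2 -= eta * grad_b2   # 0.1038
```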

Step 5: Verify the loss decreased

The whole point of an SGD step is that the loss should drop. Let us check. We re-run the forward pass with the new parameters and compute the new loss.

The new hidden pre-activations:

$z^{(1)}_{1,\text{new}} = 0.5086 \cdot 1 + (-0.3) \cdot 0 + 0.1086 = 0.5086 + 0 + 0.1086 = 0.6172$

$z^{(1)}_{2,\text{new}} = 0.1933 \cdot 1 + 0.8 \cdot 0 + (-0.2067) = 0.1933 + 0 - 0.2067 = -0.0134$

The new hidden activations:

$a^{(1)}_{1,\text{new}} = \sigma(0.6172)$: with $e^{-0.6172} \approx 0.5395$, we get $1/(1 + 0.5395) = 1/1.5395 \approx 0.6496$.

$a^{(1)}_{2,\text{new}} = \sigma(-0.0134)$: with $e^{0.0134} \approx 1.0135$, we get $1/(1 + 1.0135) = 1/2.0135 \approx 0.4966$.

The new output pre-activation:

$z^{(2)}_{\text{new}} = 0.7348 \cdot 0.6496 + (-0.4731) \cdot 0.4966 + 0.1038$

Multiplying out: $0.7348 \cdot 0.6496 \approx 0.4774$ and $-0.4731 \cdot 0.4966 \approx -0.2349$.

$z^{(2)}_{\text{new}} = 0.4774 - 0.2349 + 0.1038 = 0.3463$

The new prediction: $\hat y_{\text{new}} = \sigma(0.3463)$. With $e^{-0.3463} \approx 0.7073$, we get $1/(1 + 0.7073) = 1/1.7073 \approx 0.5857$.

The new loss:

$\mathcal{L}_{\text{new}} = \tfrac{1}{2}(1 - 0.5857)^2 = \tfrac{1}{2}(0.4143)^2 = \tfrac{1}{2} \cdot 0.1716 \approx 0.0858$

Compare. The old loss was $0.0956$. The new loss is $0.0858$. The loss has dropped by about $0.0098$, or roughly ten per cent of its starting value, in a single step. The prediction has moved from $0.5627$ to $0.5857$, that is, $0.023$ closer to the target value of $1$.

This is exactly what backpropagation promises: by computing the gradient and stepping in its negative direction, the loss reduces. We did not have to be clever about which parameter to change first or by how much; the gradient told us in one calculation. A real training run repeats this process for thousands of training examples and thousands of steps. Each individual step might decrease, hold, or even slightly increase the loss on a particular minibatch (because the gradient is only an average over that minibatch, not the whole dataset). But the expected direction is downward, and over many steps the loss falls from its starting value to something close to zero.

A small caveat for the careful reader. If you redo this calculation with a calculator that keeps more decimal places than four, you will get answers that differ in the third or fourth decimal place from what is shown here. That is rounding, not error. The verification that matters is the comparison of the two losses: $0.0858 < 0.0956$. The arithmetic is robust to small rounding choices and the conclusion does not change. If you implement the same network in NumPy with double-precision floats, you will get values that agree with ours to about three decimal places, and the loss will of course still decrease.
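Here is that check in code, continuing the running sketch: re-run the forward pass with the updated parameters and compare the two losses.

```python
# Forward pass again, now with the updated parameters.
z1_new = W1 @ x + b1
a1_new = sigmoid(z1_new)
z2_new = W2 @ a1_new + b2
y_hat_new = sigmoid(z2_new)              # ~ 0.5857
loss_new = 0.5 * (y - y_hat_new) ** 2    # ~ 0.0858

# `loss` still holds the pre-update value computed earlier.
assert loss_new < loss                   # 0.0858 < 0.0956: the step helped
```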

It is also worth observing that one step does not solve the problem. The prediction is still well below the target, and the loss is far from zero. To get $\hat y$ to within, say, $0.01$ of the target, we would need many more SGD steps. With a single training example and the same loss function, we could keep iterating and the loss would continue to fall, asymptoting towards zero. With a real dataset and many examples, the loss instead asymptotes to a non-zero value reflecting the irreducible error of the task and the capacity of the model. Training proceeds until the loss either flattens out or until a held-out validation loss starts to increase, signalling overfitting.

What you should take away

  1. Backpropagation is mechanical chain-rule arithmetic. There is no clever insight hiding in any individual step; we just compute one partial derivative after another, working backwards from the loss to the parameters.

  2. The per-layer error $\boldsymbol{\delta}^{(\ell)}$ is the central reusable quantity. Once we have it, the gradients of that layer's weights and biases drop out trivially, and it feeds straight into the next layer back through the recurrence $\boldsymbol{\delta}^{(\ell-1)} = ((\mathbf{W}^{(\ell)})^\top \boldsymbol{\delta}^{(\ell)}) \odot \sigma'(\mathbf{z}^{(\ell-1)})$. Reusing $\boldsymbol{\delta}$ is what makes backpropagation efficient compared with computing each parameter gradient from scratch.

  3. One SGD step decreases the loss by a small amount. Our step took the loss from $0.0956$ to $0.0858$, which is real progress but nowhere near solving the problem. Training neural networks is a long sequence of such modest improvements.

  4. Thousands of small steps add up. Repeating this process across thousands of training examples and many epochs is what drives the loss from chance-level down to near-zero, which is how the same algorithm trains networks with billions of parameters on web-scale corpora.

  5. Modern frameworks automate every step in this section. PyTorch, JAX and TensorFlow build a computational graph during the forward pass and traverse it backwards on a single call to .backward(). You will essentially never differentiate by hand again. But understanding what those frameworks are doing, in the form of the calculation above, is essential for diagnosing the things that go wrong during training: vanishing gradients, exploding gradients, dead neurons, learning rates set at the wrong order of magnitude, and the many other failure modes we will meet in the remainder of this chapter.
