3.14 Putting it all together: training a one-hidden-layer network by hand
Every theorem and identity in this chapter has been pointing at one thing: the moment when a neural network is trained. So far we have built the pieces in isolation — the gradient, the chain rule, the computational graph, reverse-mode automatic differentiation and gradient descent.
In this final section of the chapter we will put all five pieces together at once. We will pick the smallest neural network worth the name, one input, one hidden unit, one output, four parameters in total, and train it by hand on a single training example. Every number on this page can be checked with a pocket calculator. By the end of the section you will have performed every operation that takes place inside a modern training loop, on a network so small that nothing is hidden behind matrix shorthand. The reader who can follow the four steps below can, in principle, train any neural network. The networks of Chapter 9 and beyond will be larger, the matrices will be wider, the activation functions will be more interesting and the training data will be more abundant, but the calculation, line for line, will be the same.
It helps to know in advance what to look for. The forward pass is just function evaluation: you walk left to right through the architecture and compute one number at a time. The backward pass is the chain rule applied systematically: you walk right to left through the same graph, multiplying local derivatives, and you collect a gradient at every parameter you pass. The update is one line of subtraction per parameter. The verification is a second forward pass with the new parameters. None of these steps requires anything you have not already seen in this chapter. What is new is the choreography, the way the four steps fit together into a single repeatable cycle.
Chapter 9 returns to this picture with proper matrix notation, multilayer perceptrons, ReLU activations, mini-batches and the full backpropagation algorithm. For now we keep everything scalar so that the calculus is uncluttered.
Setup
Our network has one scalar input, one hidden unit with a sigmoid non-linearity, and one scalar output produced by a linear combination of the hidden activation. Written out, the architecture is
$$ h = \sigma(w_1 x + b_1), \qquad \hat y = w_2 h + b_2, $$
where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic sigmoid that we met in §3.3 and §3.12. The first equation says: take the input, multiply by a weight, add a bias, then squash through the sigmoid. The second equation says: take that hidden activation, multiply by another weight, add another bias, and that is the prediction. There are four numbers, $w_1$, $b_1$, $w_2$ and $b_2$, that the network is free to choose, and training is the process of choosing them so that predictions match targets.
To measure the mismatch between prediction and target we use the squared-error loss
$$ \mathcal{L} = \tfrac{1}{2}(y - \hat y)^2. $$
The factor of one half is purely cosmetic: it cancels the factor of two that arises when we differentiate. The loss is zero when the prediction equals the target and grows quadratically as they diverge.
We need a training example. The simplest possible setup is one example: $(x, y) = (1, 0.5)$. The input is one and the target is one half. Because there is only one example, gradient descent on this example is identical to stochastic gradient descent on a dataset of size one: there is no averaging to worry about.
We need initial values for the four parameters. We pick them by hand: $w_1 = 0.5$, $b_1 = 0$, $w_2 = 1$ and $b_2 = 0$. In real training these would be drawn from a Gaussian or uniform distribution scaled by a small constant; here we just write them down. Finally, we choose the learning rate $\eta = 1.0$, which is large by the standards of modern training but acceptable for a four-parameter, single-example demonstration. The training loop has four steps, repeated until the loss is small enough: forward pass, backward pass, parameter update, and a check that the loss has decreased. We will perform one full cycle below, then say what happens if the cycle is repeated.
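For readers who prefer to check the arithmetic in code rather than on a calculator, here is a minimal sketch of this setup in plain Python. The variable names (`w1`, `b1`, `w2`, `b2`, `eta`) are ours, chosen to mirror the notation above; the snippets in the steps below continue from these definitions.

```python
import math

def sigmoid(z):
    # logistic sigmoid from §3.3: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

# the single training example (x, y) = (1, 0.5)
x, y = 1.0, 0.5

# hand-picked initial parameters
w1, b1 = 0.5, 0.0
w2, b2 = 1.0, 0.0

# learning rate
eta = 1.0
```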
Step 1: Forward pass
The forward pass evaluates the network on the input. We follow the architecture from left to right, and we name every intermediate quantity, because in §3.7 we saw that reverse-mode AD needs those intermediates again on the way back. Three lines of arithmetic and one line for the loss complete the forward pass.
First the pre-activation of the hidden unit:
$$ z_1 = w_1 x + b_1 = 0.5 \cdot 1 + 0 = 0.5. $$
We multiplied the input by the weight, then added the bias. With our chosen values the pre-activation is exactly one half. Next we apply the sigmoid:
$$ h = \sigma(z_1) = \sigma(0.5) = \frac{1}{1 + e^{-0.5}}. $$
The denominator is $1 + e^{-0.5}$. Numerically, $e^{-0.5} \approx 0.6065$, so the denominator is approximately $1.6065$. Dividing one by this gives $h \approx 0.6225$. The sigmoid has taken our pre-activation of $0.5$ and squashed it to roughly $0.6225$, which lies (as it must) in the open interval $(0, 1)$. This is the activation of the hidden unit.
Now the output:
$$ \hat y = w_2 h + b_2 = 1 \cdot 0.6225 + 0 = 0.6225. $$
Because $w_2 = 1$ and $b_2 = 0$, the output happens to coincide with the hidden activation. In a more general network the output would be a different number.
Finally the loss:
$$ \mathcal{L} = \tfrac{1}{2}(y - \hat y)^2 = \tfrac{1}{2}(0.5 - 0.6225)^2 = \tfrac{1}{2}(-0.1225)^2 \approx 0.0075. $$
The current prediction is too high by about 0.1225, so the squared error is roughly 0.015 and the half-squared error is about 0.0075. That number is small but non-zero. The job of training is to drive it lower.
Notice that the forward pass has produced a chain of intermediates: $z_1 = 0.5$, $h \approx 0.6225$, $\hat y \approx 0.6225$, $\mathcal{L} \approx 0.0075$. We will need every one of them on the way back. The forward pass is, at heart, just the evaluation of a composition of simple functions, exactly as in §3.6.
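In code, the forward pass is four lines, continuing the sketch from the setup; each line mirrors one equation above.

```python
z1 = w1 * x + b1                 # pre-activation: 0.5
h = sigmoid(z1)                  # hidden activation: ~0.6225
y_hat = w2 * h + b2              # prediction: ~0.6225
loss = 0.5 * (y - y_hat) ** 2    # half squared error: ~0.0075
```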
Step 2: Backward pass
The backward pass computes the partial derivative of the loss with respect to each parameter. We traverse the computational graph in the reverse direction, multiplying local derivatives as we go, and accumulating the gradient on each parameter. Every local derivative is something we have already met in §3.3 or §3.12.
Begin at the output. The loss depends on the prediction through $\mathcal{L} = \tfrac{1}{2}(y - \hat y)^2$. Differentiating with respect to $\hat y$,
$$ \frac{\partial \mathcal{L}}{\partial \hat y} = -(y - \hat y) = -(0.5 - 0.6225) = 0.1225. $$
The minus sign comes from the chain rule applied to $(y - \hat y)$ as a function of $\hat y$. The number 0.1225 is the residual $\hat y - y$: it is positive because we over-predicted.
The prediction depends on $w_2$, $b_2$ and $h$ through $\hat y = w_2 h + b_2$. The local derivatives are immediate:
$$ \frac{\partial \hat y}{\partial w_2} = h, \qquad \frac{\partial \hat y}{\partial b_2} = 1, \qquad \frac{\partial \hat y}{\partial h} = w_2. $$
Multiplying by $\partial \mathcal{L}/\partial \hat y$ in each case,
$$ \frac{\partial \mathcal{L}}{\partial w_2} = 0.1225 \cdot 0.6225 \approx 0.0762, \quad \frac{\partial \mathcal{L}}{\partial b_2} = 0.1225 \cdot 1 = 0.1225, \quad \frac{\partial \mathcal{L}}{\partial h} = 0.1225 \cdot 1 = 0.1225. $$
Two of the four parameter gradients are now known: those for $w_2$ and $b_2$. The third quantity, $\partial \mathcal{L}/\partial h$, is not a parameter gradient; it is the adjoint of the hidden activation, a quantity that propagates further back into the network.
Continue back through the sigmoid. The hidden activation depends on the pre-activation through $h = \sigma(z_1)$. The derivative of the sigmoid was derived in §3.12: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. With $h = \sigma(z_1)$ this is
$$ \frac{\partial h}{\partial z_1} = h(1 - h) = 0.6225 \cdot 0.3775 \approx 0.2350. $$
This is a tidy identity: the derivative of the sigmoid at $z_1$ is the activation times one minus the activation, so we do not need to recompute the sigmoid. Multiplying,
$$ \frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial h} \cdot \frac{\partial h}{\partial z_1} = 0.1225 \cdot 0.2350 \approx 0.0288. $$
This is the adjoint of $z_1$. It tells us by how much the loss would change if we perturbed the pre-activation by a unit amount. Note that the adjoint has shrunk, from 0.1225 at the output to 0.0288 at the pre-activation, because the sigmoid's local slope here is only about 0.235. One quarter, attained at $z = 0$, is the largest slope the sigmoid ever has; the further the pre-activation lies from zero, the smaller the slope becomes. This is the famous saturation effect: gradients flowing back through a sigmoid are always attenuated, and severely so when the unit saturates.
Finally, the pre-activation depends on the remaining two parameters through $z_1 = w_1 x + b_1$, with local derivatives
$$ \frac{\partial z_1}{\partial w_1} = x = 1, \qquad \frac{\partial z_1}{\partial b_1} = 1. $$
Multiplying through,
$$ \frac{\partial \mathcal{L}}{\partial w_1} = 0.0288 \cdot 1 \approx 0.0288, \qquad \frac{\partial \mathcal{L}}{\partial b_1} = 0.0288 \cdot 1 \approx 0.0288. $$
All four parameter gradients are now in hand: $\partial \mathcal{L}/\partial w_1 \approx 0.0288$, $\partial \mathcal{L}/\partial b_1 \approx 0.0288$, $\partial \mathcal{L}/\partial w_2 \approx 0.0762$ and $\partial \mathcal{L}/\partial b_2 = 0.1225$. Each one is a single product of numbers we already computed during the forward pass and the backward pass. This is the substance of reverse-mode automatic differentiation: a backward sweep through the graph, multiplying local derivatives, with one parameter gradient produced at every parameter node.
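The backward pass is the same handful of products in code, continuing the running sketch; each comment names the quantity computed above.

```python
# adjoint of the prediction: dL/dy_hat = -(y - y_hat) ~ 0.1225
dL_dyhat = -(y - y_hat)

# output layer, y_hat = w2 * h + b2
dL_dw2 = dL_dyhat * h            # ~0.0762, parameter gradient
dL_db2 = dL_dyhat * 1.0          # 0.1225, parameter gradient
dL_dh = dL_dyhat * w2            # adjoint of h, propagates back

# through the sigmoid: sigma'(z1) = h * (1 - h)
dL_dz1 = dL_dh * h * (1.0 - h)   # ~0.0288, adjoint of z1

# hidden layer, z1 = w1 * x + b1
dL_dw1 = dL_dz1 * x              # ~0.0288, parameter gradient
dL_db1 = dL_dz1 * 1.0            # ~0.0288, parameter gradient
```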
Step 3: SGD update with $\eta = 1.0$
Stochastic gradient descent is the rule that turns gradients into a parameter update. Each parameter is moved a small step opposite to its gradient, with the step size controlled by the learning rate $\eta$. The update rule is identical for every parameter:
$$ \theta \leftarrow \theta - \eta \cdot \frac{\partial \mathcal{L}}{\partial \theta}. $$
The minus sign is essential. The gradient points in the direction in which the loss increases fastest; we want the loss to decrease, so we go the other way. The learning rate scales the step. Too small and progress is slow; too large and we may overshoot the minimum entirely. With our chosen learning rate of $\eta = 1.0$ the update is just "subtract the gradient".
Apply the rule to each of the four parameters in turn. The first hidden weight:
$$ w_1 \leftarrow 0.5 - 1.0 \cdot 0.0288 = 0.4712. $$
The first hidden bias:
$$ b_1 \leftarrow 0 - 1.0 \cdot 0.0288 = -0.0288. $$
The output weight:
$$ w_2 \leftarrow 1 - 1.0 \cdot 0.0762 = 0.9238. $$
The output bias:
$$ b_2 \leftarrow 0 - 1.0 \cdot 0.1225 = -0.1225. $$
Every parameter has been nudged by a small amount in the direction the chain rule judged would reduce the loss. The output bias moved the most, by 0.1225, because it had the largest gradient: its local derivative with respect to $\hat y$ is exactly one, so it inherits the residual unattenuated. The hidden parameters $w_1$ and $b_1$ moved the least, by 0.0288, because the sigmoid attenuated the gradient as it flowed back. The output weight $w_2$ took an intermediate step.
This is gradient descent in its purest form, exactly as defined in §3.9. There is no momentum here, no Adam, no weight decay. Every modern optimiser is a refinement of this single line, and the refinements only matter once we are training on many examples or many parameters.
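In code, the whole update step is four subtractions, continuing the running sketch:

```python
# theta <- theta - eta * dL/dtheta, once per parameter
w1 -= eta * dL_dw1   # 0.5 -> 0.4712
b1 -= eta * dL_db1   # 0.0 -> -0.0288
w2 -= eta * dL_dw2   # 1.0 -> 0.9238
b2 -= eta * dL_db2   # 0.0 -> -0.1225
```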
Step 4: Verify the loss decreased
Gradient descent only earns its name if the loss actually decreases. It is good discipline, especially while learning, to check. We do that by running the forward pass again with the new parameters and comparing the loss with what we had before.
The new pre-activation:
$$ z_1 = 0.4712 \cdot 1 + (-0.0288) = 0.4424. $$
The pre-activation has dropped slightly, from 0.5 to 0.4424, because both $w_1$ and $b_1$ moved by the same amount in the same direction.
The new hidden activation:
$$ h = \sigma(0.4424). $$
Computing carefully: $e^{-0.4424} \approx 0.6425$, so the denominator $1 + e^{-0.4424} \approx 1.6425$, and $h \approx 1/1.6425 \approx 0.6088$. The hidden activation has dropped slightly from 0.6225 to 0.6088.
The new prediction:
$$ \hat y = 0.9238 \cdot 0.6088 + (-0.1225) \approx 0.5624 - 0.1225 \approx 0.4400. $$
This is the moment of truth. The old prediction was 0.6225, missing the target of 0.5 by about $+0.1225$. The new prediction is 0.4400, missing the target by about $-0.0600$. We have actually overshot in the opposite direction: the prediction is now too low. But the magnitude of the residual has shrunk from 0.1225 to about 0.0600, so the miss is much smaller.
The new loss:
$$ \mathcal{L}_{\text{new}} = \tfrac{1}{2}(0.5 - 0.4400)^2 \approx 0.0018. $$
The loss has fallen from approximately 0.0075 to approximately 0.0018, a reduction of about 75 per cent in a single step. Stochastic gradient descent is working exactly as the calculus predicted. The fact that we briefly overshot the target is harmless: with a smaller learning rate we would have approached more cautiously and reached a small loss in more steps rather than fewer.
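The verification is a second forward pass with the updated parameters; continuing the running sketch:

```python
z1_new = w1 * x + b1                    # 0.4424
h_new = sigmoid(z1_new)                 # ~0.6088
y_hat_new = w2 * h_new + b2             # ~0.4400
loss_new = 0.5 * (y - y_hat_new) ** 2   # ~0.0018

assert loss_new < loss                  # the step reduced the loss
```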
Iterate to convergence
A single step is not training. Training is the same step performed over and over until the loss is acceptable. To iterate, we feed the new parameters back into the forward pass, recompute the gradients with respect to the new prediction, perform another update, and continue. The loss will not always fall by 75 per cent on each step: early steps tend to make rapid progress and later steps make finer corrections. But on a problem this small the trajectory converges quickly, and after roughly fifty iterations the loss is essentially zero: the network has memorised our single training example almost exactly.
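One way to package the full cycle is the loop below, a sketch that starts from whatever the current parameters are and repeats forward pass, backward pass and update; fifty iterations is the rough figure quoted above, not a tuned choice.

```python
for step in range(50):
    # forward pass
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y_hat = w2 * h + b2
    loss = 0.5 * (y - y_hat) ** 2

    # backward pass (reverse-mode, as in Step 2)
    dL_dyhat = -(y - y_hat)
    dL_dw2 = dL_dyhat * h
    dL_db2 = dL_dyhat
    dL_dz1 = dL_dyhat * w2 * h * (1.0 - h)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # SGD update
    w1 -= eta * dL_dw1
    b1 -= eta * dL_db1
    w2 -= eta * dL_dw2
    b2 -= eta * dL_db2
```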
Two refinements scale this picture up to real training. The first concerns multiple training examples. With a dataset of $N$ examples, the loss is the sum (or average) of per-example losses, and by linearity of differentiation the gradient is the sum (or average) of per-example gradients. In practice we do not visit all $N$ examples per update; we draw a mini-batch of size, say, 32 or 256, average gradients within the mini-batch, and update. The mathematics of each step is identical to the calculation above, with the addition of a sum and a division by the batch size.
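Here is a sketch of the first refinement, averaging per-example gradients. The three examples in `batch` are hypothetical, and only the gradient for $w_2$ is shown, since the other three average in exactly the same way.

```python
batch = [(1.0, 0.5), (2.0, 0.3), (0.5, 0.9)]   # hypothetical mini-batch

grad_w2 = 0.0
for x_i, y_i in batch:
    h_i = sigmoid(w1 * x_i + b1)               # per-example forward pass
    y_hat_i = w2 * h_i + b2
    grad_w2 += -(y_i - y_hat_i) * h_i          # per-example dL/dw2
grad_w2 /= len(batch)                          # average over the batch
```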
The second concerns multiple parameters. With many weights and biases organised into matrices, the local derivatives become Jacobians and the chain-rule products become matrix multiplications. Reverse-mode AD handles this automatically and at the same cost as a forward pass. The four-parameter calculation we just did becomes a million-parameter calculation, but every single multiplication has the same character as the multiplications above.
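And here is a sketch of the second refinement, with a small vector of hidden units in place of one; the three-unit width and the initialisation scale are ours, for illustration only.

```python
import numpy as np

m = 3                                        # hypothetical hidden width
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(m, 1))      # hidden weights, now a matrix
b1v = np.zeros(m)
w2v = rng.normal(scale=0.1, size=m)          # output weights, now a vector
xv = np.array([1.0])

# forward pass: the same calculation, vectorised
z = W1 @ xv + b1v                            # pre-activations, shape (m,)
hv = 1.0 / (1.0 + np.exp(-z))                # element-wise sigmoid
y_hat = float(w2v @ hv)                      # scalar prediction

# backward pass: chain-rule products are now matrix-vector products
dL_dyhat = -(0.5 - y_hat)                    # residual against target 0.5
dL_dh = dL_dyhat * w2v                       # vector adjoint of h
dL_dz = dL_dh * hv * (1.0 - hv)              # element-wise sigmoid slope
dL_dW1 = np.outer(dL_dz, xv)                 # gradient with the shape of W1
```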
A practical remark before we close. In modern frameworks such as PyTorch or JAX you do not perform the backward pass by hand: you build the forward graph, call a method that triggers reverse-mode AD, and read the parameter gradients off automatically. The reason the manual calculation is still worth doing once is that, when training fails, when the loss does not decrease, or grows, or oscillates, the only way to diagnose the failure is to know exactly what the framework is doing on your behalf. Every line above is what the framework computes, and being able to picture those lines is the difference between a useful intuition and a black box.
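For comparison, here is one way of writing the same four-parameter network so that PyTorch performs the backward pass; the framework reproduces, up to rounding, the gradients computed by hand in Step 2.

```python
import torch

x, y = torch.tensor(1.0), torch.tensor(0.5)
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)
b2 = torch.tensor(0.0, requires_grad=True)

h = torch.sigmoid(w1 * x + b1)              # forward pass builds the graph
y_hat = w2 * h + b2
loss = 0.5 * (y - y_hat) ** 2

loss.backward()                             # reverse-mode AD in one call
print(w1.grad, b1.grad, w2.grad, b2.grad)   # ~0.0288, 0.0288, 0.0762, 0.1225
```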
What you should take away
- Every neural network training step is the same three-part cycle: forward pass to evaluate the prediction and the loss, backward pass to compute the gradient of the loss with respect to every parameter, and an update step that moves each parameter opposite to its gradient.
- The chain rule does all the work in the backward pass. Each local derivative is something an undergraduate calculus student can write down; the chain rule is what stitches them together into a gradient for every parameter, no matter how deep the network.
- One step of stochastic gradient descent shrinks the loss by a small (or, on early steps, sometimes large) amount. Here the loss fell from approximately 0.0075 to approximately 0.0018, a reduction of roughly 75 per cent on the first step.
- Thousands or millions of such steps drive the loss from chance level to near-zero. Convergence is the cumulative effect of a great many small, near-greedy improvements, each justified by the local linearisation that the gradient provides.
- Bigger networks trained on bigger datasets are, mathematically, exactly this calculation, scaled. The matrices grow, the activation functions vary, the optimiser may add momentum or adaptive scaling, but the forward–backward–update loop is the same loop you have just performed by hand.