9.8 Implementing backprop from scratch in NumPy

Section 9.7 walked, by hand, through one forward pass and one backward pass for a tiny two-layer network on a single training example. Every product was multiplied out on paper; every partial derivative was written on its own line; the chain rule was applied in plain sight. That kind of exercise is invaluable for building intuition, but it is plainly not a way to train a real model. A practical network has thousands of parameters, sees tens of thousands of examples per epoch, and needs to repeat the same forward-then-backward sweep millions of times. Section 9.8 takes the hand calculation of the previous section and turns it into roughly eighty lines of NumPy that scale gracefully to mini-batches of any size. The algorithm does not change at all. What changes is the bookkeeping: instead of one example flowing through one neuron, a whole batch of examples flows through entire layers in a single matrix multiply, and the gradients are accumulated across the batch in a single sum.

The code below translates the equations of Section 9.6 into vectorised NumPy that you could paste into a notebook and run; Section 9.16 plugs it into MNIST and reports a training curve reaching roughly 97 per cent test accuracy with no library help beyond NumPy.

A complete from-scratch implementation

import numpy as np

def init_params(layer_sizes, seed=0):
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)  # He init
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift so the largest logit is 0: prevents exp overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, params):
    activations = [X]         # post-activations of every layer, input included
    pre_activations = []      # Z values, needed for the ReLU derivative in backward
    for i, (W, b) in enumerate(params):
        Z = activations[-1] @ W + b
        pre_activations.append(Z)
        A = softmax(Z) if i == len(params) - 1 else relu(Z)   # softmax on the output layer only
        activations.append(A)
    return activations, pre_activations

def cross_entropy(Y_hat, Y):
    return -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))  # epsilon guards against log(0)

def backward(activations, pre_activations, Y, params):
    grads = []
    B = Y.shape[0]
    Y_hat = activations[-1]
    delta = (Y_hat - Y) / B              # fused softmax-plus-cross-entropy gradient (BP1)
    for i in reversed(range(len(params))):
        A_prev = activations[i]          # this layer's input
        dW = A_prev.T @ delta            # weight gradient, summed over the batch (BP4)
        db = delta.sum(axis=0)           # bias gradient, summed over the batch (BP3)
        grads.append((dW, db))
        if i > 0:
            W, _ = params[i]
            # propagate the error through the linear map, then the ReLU derivative (BP2)
            delta = (delta @ W.T) * (pre_activations[i-1] > 0).astype(np.float32)
    return list(reversed(grads))

init_params builds a list of weight-and-bias tuples, one per layer. The shapes follow the batch-axis-first convention used by modern frameworks: a weight matrix of shape (n_in, n_out) maps a row vector of n_in activations to a row vector of n_out pre-activations, and a bias vector of length n_out is broadcast across the batch. He initialisation, sampling from a Gaussian with variance 2/n_in, keeps the variance of activations roughly constant from layer to layer when ReLU is the non-linearity; we examine this choice in detail in Section 9.10. The biases are initialised to zero, which is harmless because the random weights have already broken the symmetry that an all-zero initialisation would preserve.
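
The variance claim is easy to check empirically. The sketch below (depth and widths chosen arbitrarily for illustration) pushes standard-normal inputs through several He-initialised ReLU layers and prints the mean squared activation, which should hover near 1.0 at every depth rather than decaying.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 256))                 # batch of standard-normal inputs
print(f"input : {(A ** 2).mean():.3f}")              # about 1.0 by construction
for layer in range(4):
    W = rng.standard_normal((256, 256)) * np.sqrt(2.0 / 256)   # He init, as in init_params
    A = np.maximum(0, A @ W)                         # linear map followed by ReLU
    print(f"layer {layer}: {(A ** 2).mean():.3f}")   # stays near 1.0 at every depth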

relu and softmax are the two activation functions used in this network. ReLU is one line. Softmax is three: subtract the per-example maximum from each row before exponentiating (a numerical-stability step; without it, large logits overflow to inf), exponentiate, then divide by the row sum so each row becomes a probability distribution. The axis=-1, keepdims=True arguments operate on the last axis, which here is the class axis, and preserve dimensions so that broadcasting works correctly during the division.
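
To make the stability point concrete, here is a small illustration with deliberately extreme, made-up logits:

import numpy as np

logits = np.array([[1000.0, 0.0, -1000.0]])          # extreme logits, invented for the demo

naive = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(naive)                                          # [[nan 0. 0.]]: exp(1000) is inf, inf/inf is nan

shifted = logits - logits.max(axis=-1, keepdims=True)
print(np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True))  # [[1. 0. 0.]]: the shift cancels

NumPy also emits an overflow warning on the naive version, which is itself a useful canary during debugging.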

forward performs the forward pass in a single loop. It carries two lists. activations records the post-activation outputs of every layer including the input, because the backward pass needs each layer's input to compute the gradient with respect to that layer's weights. pre_activations records the values of Z = A_prev @ W + b before the non-linearity, because the backward pass needs them to evaluate the derivative of ReLU. The final layer uses softmax instead of ReLU, so the conditional if i == len(params) - 1 swaps activation functions on the last iteration only.
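
A quick way to see what the two lists contain is to trace shapes through a hypothetical [4, 8, 3] network on a batch of 32 examples, using the functions defined above:

import numpy as np

params = init_params([4, 8, 3], seed=0)              # hypothetical two-layer network
X = np.random.default_rng(0).standard_normal((32, 4))

activations, pre_activations = forward(X, params)
print([a.shape for a in activations])                # [(32, 4), (32, 8), (32, 3)]: input included
print([z.shape for z in pre_activations])            # [(32, 8), (32, 3)]: one Z per layer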

cross_entropy averages the negative log-probability of the correct class across the batch. The targets Y are one-hot encoded with shape (batch_size, n_classes). The + 1e-12 term inside the logarithm guards against log(0) when an unlucky float underflow drives the predicted probability of the correct class to exactly zero. Without the guard the reported loss occasionally becomes inf or nan; in an implementation that differentiates through the logarithm (rather than using the fused Y_hat - Y gradient in backward) the nan propagates into the gradients on the next step and training silently destroys itself.
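
The failure is easy to reproduce with a single hand-crafted prediction (made up for illustration):

import numpy as np

Y = np.array([[0.0, 1.0]])                  # one-hot target: the true class is class 1
Y_hat = np.array([[1.0, 0.0]])              # confidently wrong: zero probability on the true class

print(-np.mean(np.sum(Y * np.log(Y_hat), axis=1)))           # inf, since np.log(0) is -inf
print(-np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1)))   # about 27.6: huge but finite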

backward is the heart of the implementation. The combined gradient of softmax and cross-entropy with respect to the final-layer pre-activations is simply Y_hat - Y, divided by the batch size to give the mean-loss gradient (we derive this result in Section 9.9). That quantity is delta. The loop walks the layers in reverse. For each layer it accumulates two gradients: dW = A_prev.T @ delta, summing the outer product of the layer's input and the upstream error across the batch in one matrix multiply, and db = delta.sum(axis=0), summing the error itself across the batch. Then, if there is a layer below, it propagates delta backward through the linear map by multiplying by W.T, and through the ReLU by multiplying element-wise by the indicator pre_activations[i-1] > 0, which is the derivative of ReLU evaluated at the pre-activation. Finally the list is reversed so the gradients come out in forward order.
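
Before trusting backward, it is worth checking it against a finite-difference approximation of the loss gradient. The sketch below reuses the functions defined above on a hypothetical tiny network with random data; numerical_grad_W is an illustrative helper, not part of the training code. Agreement to many decimal places is strong evidence the analytic gradients are right, and the check immediately exposes the indexing bug discussed later in this section.

import numpy as np

def numerical_grad_W(X, Y, params, layer, eps=1e-6):
    """Central-difference estimate of the mean-loss gradient w.r.t. one weight matrix."""
    W = params[layer][0]
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = cross_entropy(forward(X, params)[0][-1], Y)
        W[idx] = old - eps
        loss_minus = cross_entropy(forward(X, params)[0][-1], Y)
        W[idx] = old                                  # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
params = init_params([5, 4, 3], seed=1)
X = rng.standard_normal((8, 5))
Y = np.eye(3)[rng.integers(0, 3, size=8)]             # random one-hot targets

acts, pre_acts = forward(X, params)
grads = backward(acts, pre_acts, Y, params)
for layer in range(len(params)):
    diff = np.abs(numerical_grad_W(X, Y, params, layer) - grads[layer][0]).max()
    print(layer, diff)                                # around 1e-9, barring points near ReLU kinks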

Training loop

def train(X, Y, X_test, Y_test, params, lr=0.5, epochs=10, batch_size=64):
    rng = np.random.default_rng(42)
    n = X.shape[0]
    for epoch in range(epochs):
        idx = rng.permutation(n)
        X_shuf, Y_shuf = X[idx], Y[idx]
        for i in range(0, n, batch_size):
            X_b, Y_b = X_shuf[i:i+batch_size], Y_shuf[i:i+batch_size]
            activations, pre_activations = forward(X_b, params)
            grads = backward(activations, pre_activations, Y_b, params)
            for j, (dW, db) in enumerate(grads):
                W, b = params[j]
                params[j] = (W - lr*dW, b - lr*db)
        # Evaluate
        test_acts, _ = forward(X_test, params)
        preds = test_acts[-1].argmax(axis=1)
        truth = Y_test.argmax(axis=1)
        acc = (preds == truth).mean()
        print(f"epoch {epoch}: test acc = {acc:.4f}")
    return params

The training loop is simpler than the gradient computation. Each epoch the indices 0 through n-1 are shuffled and the rows of X and Y are reordered to match. Shuffling between epochs is essential: the convergence analyses of stochastic gradient descent assume each mini-batch behaves like a fresh random sample, whereas a fixed order would give every epoch identical batches and therefore identical gradient noise, which in practice slows convergence and can leave the optimiser in worse minima. The reshuffling is cheap, requiring only an integer permutation and two fancy-indexed copies.

Inside the epoch the loop chops the shuffled data into mini-batches of batch_size rows. For each batch it calls forward, then backward, then a parameter-update sweep that overwrites every (W, b) tuple with W - lr*dW and b - lr*db. This is plain stochastic gradient descent: no momentum, no adaptive learning rate, no weight decay. The constant lr=0.5 looks aggressive by 2020s standards, where 1e-3 is typical for Adam, but with a He-initialised ReLU network and a gradient already averaged over the batch (the 1/B factor in delta), a learning rate of order 0.1 to 1.0 is reasonable for plain SGD.

After every epoch the loop runs a forward pass on the test set and prints accuracy. There are three subtleties hidden in this evaluation. First, the test forward pass uses the same forward function as training; there is no train/eval mode switch because the network has no dropout or batch normalisation, both of which are introduced later in the chapter. Second, the prediction is argmax(axis=1) over the final-layer probabilities: the most-probable class wins, with ties broken by index. Third, the accuracy is computed against Y_test.argmax(axis=1) rather than against integer labels, on the assumption that targets are one-hot encoded; this is consistent with how cross_entropy consumes Y during training.

Notice what is not in this loop. There is no validation split distinct from the test set, no early-stopping criterion, no learning-rate schedule, no checkpoint saving. Each of these is genuinely useful in practice, and each is added in Sections 9.12 to 9.14. The minimal loop above is enough to take a freshly initialised network and drive its test accuracy from 10 per cent (random guessing on ten classes) to roughly 97 per cent on MNIST in under a minute on a laptop CPU. That is the whole point: backpropagation, mini-batch SGD, and a sensibly initialised ReLU MLP form a complete training recipe.
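
To see the recipe end to end without waiting for MNIST, here is a sketch that wires init_params and train together on a small synthetic classification task; the blob-generating helper and all sizes are invented for illustration.

import numpy as np

rng = np.random.default_rng(7)
n_classes, n_features = 3, 20
centres = rng.standard_normal((n_classes, n_features)) * 3.0   # well-separated class centres

def make_blobs(n):
    labels = rng.integers(0, n_classes, size=n)
    X = centres[labels] + rng.standard_normal((n, n_features))
    return X, np.eye(n_classes)[labels]                        # features and one-hot targets

X_train, Y_train = make_blobs(2000)
X_test, Y_test = make_blobs(500)

params = init_params([n_features, 32, n_classes], seed=0)
params = train(X_train, Y_train, X_test, Y_test, params, lr=0.5, epochs=5, batch_size=64)
# test accuracy should climb towards 1.0 within a few epochs on this easy task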

Where each line maps to the maths

It is worth setting the equations of Section 9.6 alongside the code so that the correspondence is unambiguous. Let layer l have weights W^l, biases b^l, pre-activations z^l = W^l a^{l-1} + b^l, and post-activations a^l = sigma(z^l). The four backpropagation equations are: (BP1) delta^L = nabla_a L * sigma'(z^L); (BP2) delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l); (BP3) partial L / partial b^l = delta^l; (BP4) partial L / partial W^l = a^{l-1} (delta^l)^T. In the code, delta^l is the variable delta, a^{l-1} is activations[i], W^{l+1} is params[i][0], unpacked into the local name W inside the loop, and sigma'(z^l) is (pre_activations[i-1] > 0).astype(np.float32) for ReLU.

The boundary condition (BP1) appears in the opening lines of backward. Because the loss is cross-entropy and the final activation is softmax, the elegant identity delta^L = (Y_hat - Y) / B collapses (BP1) into a one-liner. The factor 1/B performs the average across the batch that the mean-loss formula calls for. From there the loop steps backwards. The recursion (BP2) is the line delta = (delta @ W.T) * (pre_activations[i-1] > 0)...: the matrix multiplication delta @ W.T propagates the error backwards through the linear map, and the elementwise product with the ReLU indicator implements sigma'(z^{l-1}). The bias gradient (BP3) is db = delta.sum(axis=0): summing over the batch axis is exactly what partial L / partial b^l = sum_b delta^l_b requires. The weight gradient (BP4) is dW = A_prev.T @ delta: the matrix multiplication is a batched outer product, computing sum_b a^{l-1}_b (delta^l_b)^T in one BLAS call.
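
The claim that A_prev.T @ delta is a batched outer product can be verified directly: summing per-example outer products in a Python loop gives the same matrix, just much more slowly (shapes here are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 16, 5, 3
A_prev = rng.standard_normal((B, n_in))
delta = rng.standard_normal((B, n_out))

looped = sum(np.outer(A_prev[b], delta[b]) for b in range(B))  # B explicit outer products
batched = A_prev.T @ delta                                     # one matrix multiply
print(np.allclose(looped, batched))                            # True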

The forward function maps just as cleanly. The line Z = activations[-1] @ W + b is z^l = W^l a^{l-1} + b^l written batch-first instead of column-first; relu(Z) and softmax(Z) apply sigma. The activations and pre_activations lists are the storage that backprop needs to evaluate its derivatives; without them, you would have to recompute every forward value during the backward pass, roughly doubling the cost.

Two transposes deserve a sentence. In dW = A_prev.T @ delta, the transpose flips A_prev from (B, n_in) to (n_in, B) so the matrix multiply contracts over the batch axis and yields an (n_in, n_out) weight gradient of the right shape. In delta @ W.T, the transpose flips W from (n_in, n_out) to (n_out, n_in) so the propagated error has shape (B, n_in), matching the previous layer's activations. Both are adjoint operations in the linear-algebra sense: whatever operator is applied on the way forward, its transpose is applied on the way back. This is why backpropagation is described as reverse-mode automatic differentiation.
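
That adjoint relationship can be stated as an inner-product identity, <X @ W, G> = <X, G @ W.T>, for any conformable X, W and upstream gradient G. A quick numerical check, with shapes chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))    # activations
W = rng.standard_normal((5, 3))    # forward operator
G = rng.standard_normal((8, 3))    # upstream gradient
print(np.allclose(((X @ W) * G).sum(), (X * (G @ W.T)).sum()))  # True: W.T is the adjoint of W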

Common bugs

A handful of mistakes recur every time someone implements backprop from scratch. The first is forgetting the + 1e-12 inside cross_entropy. When training is going well the predicted probability of the correct class is close to one and there is no problem, but in the very first epochs, or whenever a mini-batch contains a hard example, the predicted probability can underflow to zero and np.log(0) produces -inf. In this implementation the damage is limited to the reported loss, because the gradient comes from the fused Y_hat - Y identity, but any implementation that differentiates through the logarithm turns the next batch into nan gradients and the model is destroyed. Add the small constant or compute the loss from a fused log-softmax; do not skip it.

The second is a softmax that does not subtract the per-example maximum before exponentiating. Large logits are easily reached when training is not yet stable, and the exponential overflows quickly: np.exp(30) is still finite, but np.exp(90) overflows float32 (and np.exp(710) overflows even float64) to inf, the row sum becomes inf, and the division produces nan entries. Subtracting the row maximum is mathematically equivalent (the constant cancels in numerator and denominator) and shifts the largest exponent to zero, eliminating overflow. It is one extra line; never omit it.

The third is the question of where to evaluate the ReLU derivative. The code above evaluates (pre_activations[i-1] > 0), the derivative at the pre-activation z. You could instead test the post-activation, (activations[i] > 0), and for ReLU the two indicators agree everywhere, since a = max(0, z) is positive exactly when z is; the only delicate point is z = 0, where ReLU is non-differentiable and assigning the derivative the value 0, as the strict inequality does, is a conventional subgradient choice. Both work for ReLU, but evaluating at z is the faithful translation of the chain rule and the safer convention, because it generalises to non-monotonic activations (such as GELU) where the post-activation does not determine the derivative.
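
A two-line check confirms the equivalence for ReLU:

import numpy as np

z = np.array([-2.0, 0.0, 0.5, 3.0])
a = np.maximum(0, z)
print(np.array_equal(z > 0, a > 0))   # True: the two indicators coincide for ReLU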

The fourth bug is off-by-one indexing in the reverse loop. The line delta = (delta @ W.T) * (pre_activations[i-1] > 0) uses index i-1 because multiplying by W.T has just carried the error back one layer, so the ReLU derivative must be evaluated at that earlier layer's pre-activation, not the current one. Writing pre_activations[i] here is perhaps the most common backprop bug in the wild. When adjacent layers happen to have the same width the shapes still match and the bug is silent, propagating an otherwise clean gradient through the wrong activation derivative and producing wrong but plausible-looking loss curves; with unequal widths NumPy at least raises a broadcasting error. The finite-difference check sketched earlier catches it immediately either way.

What you should take away

  1. Backpropagation in vectorised NumPy is roughly eighty lines, including initialisation, forward, loss, backward and the SGD loop; there is no hidden machinery beyond matrix multiplication and elementwise arithmetic.
  2. Every line of code maps to one symbol in the four governing equations of Section 9.6; if you can read the code you can read the maths, and vice versa.
  3. Numerical stability is not optional. The two non-negotiable lines are the + 1e-12 inside the log and the per-row max subtraction inside softmax; both prevent silent training collapse.
  4. The softmax-with-cross-entropy gradient is Y_hat - Y divided by the batch size; this single identity removes the need ever to compute the softmax Jacobian by hand.
  5. The same eighty lines, with no library help beyond NumPy, will reach roughly 97 per cent test accuracy on MNIST; we plug them into the dataset and run the experiment in Section 9.16.
