9.16 A from-scratch MNIST MLP

The previous sections of this chapter introduced the building blocks of a modern neural network one at a time. Section 9.6 derived backpropagation; 9.7 walked through one forward and backward pass by hand on a tiny 2-2-1 network; 9.10 explained why we initialise weights with small random numbers; 9.12 discussed regularisation; 9.13 covered normalisation; and 9.14 collected the practical tips that make training actually work. Each of those ideas was treated in isolation. In this section we put them together into a single, complete, runnable program.

The point of the exercise is that there is no magic. A working image classifier, one that reads in a 28-by-28 picture of a handwritten digit and names the right digit about ninety-seven times out of a hundred, is roughly fifty lines of NumPy. No PyTorch, no TensorFlow, no JAX, no GPU. Just arrays of floating-point numbers, a few matrix multiplications, and the chain rule. Every line is visible; nothing is hidden inside a framework call.

This is also a useful baseline. Once you have run this code, seen the loss go down, and watched test accuracy climb past ninety-five per cent, you will know exactly what every modern framework is doing for you. The frameworks add convenience, GPU support, automatic differentiation, and a great many optimisations, but the underlying algorithm is precisely what we are about to write out by hand.

Section 9.7 pushed a couple of training examples through one forward and one backward pass on a 2-2-1 network by hand. This section scales the same calculation to a 784-128-64-10 network trained on sixty thousand MNIST examples for twenty epochs. Section 9.17 then re-implements exactly the same network in PyTorch so you can see how much shorter the framework version is and confirm that it does the same thing.

Symbols used here

$\mathbf{X}$: minibatch matrix, shape $(B, 784)$, each row a flattened MNIST image
$\mathbf{y}$: target labels, shape $(B,)$, integer classes 0–9
$\mathbf{Y}$: one-hot targets, shape $(B, 10)$
$\mathbf{W}^{(\ell)}, \mathbf{b}^{(\ell)}$: weights and biases of layer $\ell$
$\mathbf{Z}^{(\ell)}, \mathbf{A}^{(\ell)}$: pre-activations and activations of layer $\ell$
$\hat{\mathbf{Y}}$: predicted probabilities (softmax output), shape $(B, 10)$
$\mathcal{L}$: average cross-entropy loss over the minibatch
$B$: batch size (we use 64)
$\eta$: learning rate (we use 0.5)

What MNIST is

MNIST is the standard "hello world" dataset of machine learning. It contains seventy thousand small grayscale images of handwritten digits, split into sixty thousand for training and ten thousand for testing. Each image is twenty-eight pixels wide and twenty-eight pixels tall. Each pixel is a single intensity value from 0 (blank background) to 255 (the darkest part of a pen stroke). Each image is labelled with the digit it depicts: a whole number from 0 to 9. So the task is straightforward: given the 784 pixel values of an image, predict which of the ten digits is written.

The dataset was assembled by Yann LeCun, Corinna Cortes and Christopher Burges in the 1990s from handwriting samples collected by the National Institute of Standards and Technology, written by Census Bureau employees and American high-school students, hence the name (Modified NIST). It became the default benchmark for early neural-network work because it is small enough to train on a laptop in minutes, large enough that simple methods like nearest-neighbour do not solve it perfectly, and clean enough that the labels are unambiguous. A modern convolutional network can score above 99.7 per cent test accuracy on MNIST. Our humble multilayer perceptron will reach about 97 per cent, which is more than enough to demonstrate that the algorithm is learning something real.

If you imagine a single MNIST image of the digit 3, it might look something like this in ASCII art (light pixels shown as dots, dark pixels as hashes):

............................
............................
.........#######............
........##########..........
.......##......###..........
...............###..........
.............####...........
...........#####............
.........#######............
............#####...........
...............###..........
................##..........
................##..........
......##........##..........
.....###.......###..........
.....##########.............
......#######...............
............................

The training set contains sixty thousand of these. We can download MNIST from several places: Yann LeCun's original site, the OpenML repository, the scikit-learn helper fetch_openml, or PyTorch's torchvision.datasets. We will use scikit-learn because it is the simplest path that does not require any deep-learning library.

Loading and preprocessing

import numpy as np
from sklearn.datasets import fetch_openml

# Download MNIST (takes a minute the first time, then cached on disk)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X.astype(np.float32) / 255.0  # normalise pixel values to [0, 1]
y = y.astype(np.int64)

# Train/test split: first 60k train, last 10k test (the canonical split)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# One-hot encode the targets so they match the network's 10-class output
def one_hot(y, k=10):
    Y = np.zeros((y.shape[0], k), dtype=np.float32)
    Y[np.arange(y.shape[0]), y] = 1.0
    return Y

Y_train = one_hot(y_train)
Y_test = one_hot(y_test)

The first call, fetch_openml('mnist_784', ...), downloads all seventy thousand images as a single matrix X of shape $(70000, 784)$, where each row is one image flattened from a 28-by-28 grid into a 784-element vector. The labels y come back as a vector of length seventy thousand, stored as strings, which is why the code casts them to integers. The argument as_frame=False asks for plain NumPy arrays rather than a pandas DataFrame.

The next line divides every pixel value by 255. Why? Pixel intensities range from 0 to 255, but neural networks train more stably when their inputs are small numbers near zero, typically between 0 and 1 or between $-1$ and $1$. Large input magnitudes lead to large pre-activations, which lead to large gradients, which can blow up the weights. Dividing by 255 puts every input pixel in the range $[0, 1]$, a scale that works well with the He initialisation we will use in a moment. We also cast to float32 because that is the standard precision for neural-network arithmetic and uses half the memory of float64.

We then split into a training set of sixty thousand and a test set of ten thousand. This is the canonical split distributed by every source of MNIST, so our results will be directly comparable to anyone else's.

Finally we one-hot-encode the labels. The label 7 becomes the vector $(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)$, a 10-element vector that is zero everywhere except at index 7. The reason is that our network's output layer will produce ten probabilities, one for each digit, and the cross-entropy loss compares those ten predicted probabilities to a target distribution. The "true" distribution is one-hot: probability 1 on the correct class, 0 on all the others. The function one_hot builds this matrix by creating a zeros array and using fancy indexing to put a 1 in the correct column of each row.
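
Before moving on it is worth a quick sanity check that the shapes, value ranges and one-hot encoding came out as expected; the few lines below are optional and not part of the training code.

# Sanity checks on the loaded and preprocessed data
print(X_train.shape, X_test.shape)   # (60000, 784) (10000, 784)
print(X_train.min(), X_train.max())  # 0.0 1.0
print(y_train[0], Y_train[0])        # the first label and its one-hot row
assert np.allclose(Y_train.sum(axis=1), 1.0)         # exactly one 1 per row
assert Y_train[np.arange(10), y_train[:10]].all()    # the 1 sits in the right column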

Initialising the network

def init_params(layer_sizes, seed=0):
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # He initialisation: standard normal scaled by sqrt(2/n_in), suitable for ReLU
        W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        b = np.zeros(n_out)
        params.append((W, b))
    return params

# Architecture: 784 inputs -> 128 hidden -> 64 hidden -> 10 outputs
params = init_params([784, 128, 64, 10])

The function init_params takes a list describing the size of each layer and returns a list of (W, b) tuples, one weight matrix and one bias vector per layer. For our 784-128-64-10 architecture there are three weight matrices because there are three sets of connections: input-to-first-hidden, first-hidden-to-second-hidden, and second-hidden-to-output. The shapes are $(784, 128)$, $(128, 64)$ and $(64, 10)$ respectively. Each weight matrix W connects n_in units to n_out units, so it has n_in rows and n_out columns.

Why He initialisation? Section 9.10 explained that if you initialise weights too large the activations explode and gradients overflow; too small and signals shrink to zero as they pass through layers. He initialisation, named after Kaiming He, sets the standard deviation of each weight to $\sqrt{2 / n_\text{in}}$. The factor of 2 in the numerator accounts for ReLU activations, which zero out half their inputs on average. With this scaling, the variance of the activations stays roughly constant from layer to layer, neither growing nor shrinking, so signals propagate cleanly through the network at the start of training.
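
You can watch this scaling at work with a few lines of NumPy. The sketch below (the layer sizes are just illustrative) pushes a batch of standard-normal inputs through a stack of He-initialised ReLU layers and prints the root-mean-square activation at each depth; it stays close to 1 instead of shrinking or exploding.

# Empirical check: He scaling keeps the activation magnitude roughly constant
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 784))
for n_in, n_out in [(784, 256), (256, 128), (128, 64), (64, 32)]:
    W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    A = np.maximum(0, A @ W)
    print(round(float(np.sqrt((A ** 2).mean())), 3))   # stays near 1.0 at every depth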

Why zero biases? Biases do not suffer from the symmetry-breaking problem that weights have. If you initialised every weight to zero, every neuron in a layer would compute the same thing forever, because they would all see the same inputs and update by the same gradients. But if the weights are random, each neuron starts off computing something different, so zero biases are perfectly fine; they will be learned during training.

Counting the parameters: the input-to-hidden-1 layer has $784 \times 128 = 100\,352$ weights and 128 biases, so $100\,480$ parameters. The hidden-1-to-hidden-2 layer has $128 \times 64 = 8\,192$ weights and 64 biases, so $8\,256$ parameters. The hidden-2-to-output layer has $64 \times 10 = 640$ weights and 10 biases, so 650 parameters. Adding these up gives $100\,480 + 8\,256 + 650 = 109\,386$ parameters. By modern standards a hundred thousand parameters is tiny (large language models have tens or hundreds of billions), but it is more than enough to learn MNIST. We have sixty thousand training examples, so the ratio of parameters to examples is about 1.8, which means the network has plenty of capacity to memorise the training set if we let it.
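
You can verify the count directly from the params list; the quick check below is not part of the training code.

# Check the layer shapes and the total parameter count
for W, b in params:
    print(W.shape, b.shape)                        # (784, 128) (128,), (128, 64) (64,), (64, 10) (10,)
print(sum(W.size + b.size for W, b in params))     # 109386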

Forward pass with softmax

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    # numerically stable softmax: subtract row-wise maximum before exponentiating
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, params):
    activations = [X]      # store every layer's output, including the input
    pre_activations = []   # store every Z = AW + b before the activation
    for i, (W, b) in enumerate(params):
        Z = activations[-1] @ W + b
        pre_activations.append(Z)
        if i < len(params) - 1:
            A = relu(Z)         # ReLU on hidden layers
        else:
            A = softmax(Z)      # softmax on the output layer
        activations.append(A)
    return activations, pre_activations

The forward pass takes a batch of inputs X of shape $(B, 784)$, where $B$ is the batch size, and pushes it through the network layer by layer. At each layer it computes the pre-activation $\mathbf{Z} = \mathbf{A}_\text{prev}\mathbf{W} + \mathbf{b}$, then applies the activation function. For our hidden layers the activation is ReLU, defined as $\text{ReLU}(z) = \max(0, z)$, a piecewise-linear function that simply replaces negative numbers with zero. For the output layer the activation is softmax, which converts the ten raw scores into ten probabilities that sum to one.

The numerically stable softmax deserves a moment of attention. Naively, $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The problem is that $e^{z}$ overflows to infinity when $z$ is even moderately large, about 89 in single precision. The trick is that softmax is invariant to additive constants: subtracting any number from every element of $z$ before exponentiating gives the same result. So we subtract the per-row maximum, which guarantees the largest exponent we ever compute is $e^0 = 1$ and everything else is at most that. The division by the sum normalises the result back to a probability distribution. The argument keepdims=True keeps the shape compatible for broadcasting along the last axis.

The function returns two lists: activations, with $L+1$ entries (the input plus one output per layer), and pre_activations, with $L$ entries. We need both during the backward pass: the activations to compute the weight gradients, and the pre-activations to know where ReLU was active and where it was zero.

You can sanity-check the softmax on a small example. Calling softmax(np.array([[1.0, 2.0, 3.0]])) returns approximately $(0.090, 0.245, 0.665)$, which sums to 1 and is monotonic in the inputs. That is exactly what a probability distribution over three classes should look like.
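
In code the check looks like this; the second row uses deliberately huge logits to confirm that the max-subtraction trick avoids overflow (a naive softmax would return nan here).

# Sanity check: small logits give the expected distribution; huge logits do not overflow
print(softmax(np.array([[1.0, 2.0, 3.0]])))           # approximately [[0.090 0.245 0.665]]
print(softmax(np.array([[1000.0, 1001.0, 1002.0]])))  # same values, thanks to the max subtraction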

Cross-entropy loss

def cross_entropy(Y_hat, Y):
    # Y_hat shape (B, 10), predicted probabilities
    # Y     shape (B, 10), one-hot true labels
    eps = 1e-12  # tiny constant to avoid log(0)
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

Cross-entropy measures how far the predicted distribution Y_hat is from the true distribution Y. For a single example the formula is $-\sum_k Y_k \log \hat Y_k$. Because Y is one-hot, exactly one entry is 1 and the rest are 0, only the term for the correct class survives, and the loss for that example reduces to $-\log \hat Y_\text{correct class}$. So if the network puts probability 0.7 on the correct class, the loss is $-\log(0.7) \approx 0.357$. If it puts probability 0.99, the loss drops to $-\log(0.99) \approx 0.010$. If it puts only 0.01 on the correct class, the loss balloons to $-\log(0.01) \approx 4.61$. Lower is better, and the function blows up to infinity when the network is confidently wrong.

The eps = 1e-12 term is defensive: if the network ever predicts a probability of exactly zero, $\log(0)$ is negative infinity and our training crashes. Adding a tiny constant prevents that. The np.mean over the batch dimension averages the per-example losses, giving a single scalar loss for the whole minibatch.
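
The worked numbers above are easy to reproduce with the functions we have already defined; the probability vectors below are made up purely for illustration.

# Reproduce the worked examples: loss is -log(probability assigned to the correct class)
Y_true = one_hot(np.array([2]))   # pretend the correct class is digit 2
print(cross_entropy(np.array([[0.1, 0.2, 0.7, 0, 0, 0, 0, 0, 0, 0]]), Y_true))      # ~0.357
print(cross_entropy(np.array([[0.005, 0.005, 0.99, 0, 0, 0, 0, 0, 0, 0]]), Y_true)) # ~0.010
print(cross_entropy(np.array([[0.5, 0.49, 0.01, 0, 0, 0, 0, 0, 0, 0]]), Y_true))    # ~4.61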

Backward pass

def backward(activations, pre_activations, Y, params):
    grads = []
    B = Y.shape[0]
    Y_hat = activations[-1]
    # Combined softmax + cross-entropy gradient simplifies to (Y_hat - Y), divided by batch size
    delta = (Y_hat - Y) / B
    for i in reversed(range(len(params))):
        A_prev = activations[i]
        W, b = params[i]
        dW = A_prev.T @ delta
        db = delta.sum(axis=0)
        grads.append((dW, db))
        if i > 0:
            # Backpropagate through ReLU: zero the gradient where the pre-activation was negative
            delta = (delta @ W.T) * (pre_activations[i-1] > 0).astype(np.float32)
    return list(reversed(grads))

This is where the chain rule lives. The variable delta carries the gradient of the loss with respect to the current layer's pre-activation. We start at the output layer. Section 9.6 derived a small miracle: when you combine softmax with cross-entropy, the gradient of the loss with respect to the pre-activation is simply $\hat{\mathbf{Y}} - \mathbf{Y}$. All the messy partial derivatives of softmax cancel against the messy partial derivatives of cross-entropy, leaving an answer that is no harder than the gradient for linear regression. We divide by B straight away because our loss is the mean over the batch; folding that factor into delta once means every downstream weight and bias gradient inherits it automatically.
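
The cancellation is quick to verify for a single example. With $\hat Y_k = e^{z_k} / \sum_j e^{z_j}$ and $\mathcal{L} = -\sum_k Y_k \log \hat Y_k$, the softmax Jacobian gives $\partial \log \hat Y_j / \partial z_k = \delta_{jk} - \hat Y_k$ (with $\delta_{jk}$ the Kronecker delta), so $\partial \mathcal{L} / \partial z_k = -\sum_j Y_j (\delta_{jk} - \hat Y_k) = \hat Y_k \sum_j Y_j - Y_k = \hat Y_k - Y_k$, using the fact that the one-hot targets sum to one.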

For each layer, working backwards, we need three things: the gradient with respect to the weights W, the gradient with respect to the biases b, and the gradient with respect to the previous layer's activation, which becomes the next iteration's delta. The weight gradient is $\mathbf{A}_\text{prev}^\top \boldsymbol{\delta}$, a matrix multiplication that gives a result of the same shape as W. Intuitively, each weight's gradient is the input flowing into it times the error coming out of it. The bias gradient is $\sum_{i=1}^{B} \boldsymbol{\delta}_i$, the sum of the rows of $\boldsymbol{\delta}$ over the batch dimension, because the bias is added equally to every example.

To pass the gradient back through to the previous layer we first multiply by $\mathbf{W}^\top$, undoing the linear transformation, and then multiply elementwise by the derivative of ReLU. ReLU's derivative is 1 where the pre-activation was positive and 0 where it was negative, so (pre_activations[i-1] > 0).astype(np.float32) is exactly the right mask. This zeroes out the gradient on units that were not active during the forward pass, which is correct because changing a weight feeding into a dead ReLU has no effect on the output.

We accumulate gradients in reverse order (from the output layer back to the input) and then reverse the list at the end so it lines up with params for the update step.

A subtle point: notice that we never compute the loss inside backward. We only need the predicted and true distributions and the cached forward-pass tensors. The loss value is useful for monitoring training, but the gradient computation does not need it.
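
A good habit before trusting any hand-written backward pass is a finite-difference gradient check: nudge one weight up and down by a tiny amount, measure how the loss changes, and compare with the analytic gradient. The sketch below does this for a single, arbitrarily chosen weight on a tiny batch; it is a debugging aid, not part of the training code.

# Finite-difference check of one weight gradient against backward()
def batch_loss(Xb, Yb, params):
    acts, _ = forward(Xb, params)
    return cross_entropy(acts[-1], Yb)

Xb, Yb = X_train[:8], Y_train[:8]                 # a tiny batch keeps this fast
acts, pre_acts = forward(Xb, params)
grads = backward(acts, pre_acts, Yb, params)

layer, row, col, h = 0, 406, 0, 1e-5              # a weight from a centre pixel, chosen arbitrarily
W, b = params[layer]
W_plus, W_minus = W.copy(), W.copy()
W_plus[row, col] += h
W_minus[row, col] -= h
params_plus = list(params);  params_plus[layer] = (W_plus, b)
params_minus = list(params); params_minus[layer] = (W_minus, b)

numeric = (batch_loss(Xb, Yb, params_plus) - batch_loss(Xb, Yb, params_minus)) / (2 * h)
analytic = grads[layer][0][row, col]
print(numeric, analytic)                          # the two numbers should agree closely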

Training loop

def train(X_train, Y_train, X_test, Y_test, params, lr=0.5, epochs=20, batch_size=64):
    n = X_train.shape[0]
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        # Shuffle the training set at the start of each epoch
        idx = rng.permutation(n)
        X_shuf, Y_shuf = X_train[idx], Y_train[idx]
        # Iterate over the training set in minibatches
        for i in range(0, n, batch_size):
            X_batch = X_shuf[i:i+batch_size]
            Y_batch = Y_shuf[i:i+batch_size]
            activations, pre_activations = forward(X_batch, params)
            grads = backward(activations, pre_activations, Y_batch, params)
            # Plain SGD update
            for j, (dW, db) in enumerate(grads):
                W, b = params[j]
                params[j] = (W - lr * dW, b - lr * db)
        # Evaluate train and test accuracy at the end of each epoch
        train_acc = accuracy(X_train, Y_train, params)
        test_acc = accuracy(X_test, Y_test, params)
        print(f"Epoch {epoch+1}: train acc {train_acc:.4f}, test acc {test_acc:.4f}")

def accuracy(X, Y, params):
    activations, _ = forward(X, params)
    pred = activations[-1].argmax(axis=1)
    true = Y.argmax(axis=1)
    return (pred == true).mean()

train(X_train, Y_train, X_test, Y_test, params)

The training loop has two nested levels. The outer loop runs for twenty epochs, where one epoch means one full pass over the entire training set. The inner loop chops that pass into minibatches of sixty-four examples. For each minibatch we run a forward pass, compute the gradients with a backward pass, and update every weight and bias by subtracting the learning rate times its gradient. That last step is plain stochastic gradient descent: $\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_\mathbf{W} \mathcal{L}$.

Why shuffle? If we always presented the training examples in the same order, every epoch would visit exactly the same sequence of minibatches, so the minibatch gradients would be correlated in the same way epoch after epoch and the optimiser could settle into a cycle driven by that fixed ordering rather than by the data. Reshuffling once per epoch breaks this correlation and keeps each minibatch gradient close to the unbiased estimate of the full-batch gradient that stochastic gradient descent assumes.

Why a learning rate of 0.5? It looks large compared to the 0.001 or 0.0001 values you might see in PyTorch tutorials, but those typically use the Adam optimiser, which adapts the per-parameter step size. Plain SGD with our He-initialised network and properly normalised inputs likes a step size around 0.1 to 1. We picked 0.5 by trying a few values on a single epoch.
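
If you want to reproduce that little experiment, a one-epoch sweep from a fresh initialisation is cheap; the candidate values below are just illustrative.

# Rough learning-rate sweep: train one epoch from scratch at each candidate value
for lr in (0.05, 0.1, 0.5, 1.0):
    print(f"--- lr = {lr} ---")
    trial_params = init_params([784, 128, 64, 10])
    train(X_train, Y_train, X_test, Y_test, trial_params, lr=lr, epochs=1)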

Why batch size 64? Sixty-four is a common default, large enough to give a smooth gradient estimate and to make use of vectorised matrix multiplication, small enough that we get many updates per epoch. Sixty thousand examples divided by sixty-four gives roughly 938 weight updates per epoch, or close to nineteen thousand updates over twenty epochs.

The accuracy function is straightforward: run a forward pass, take the argmax along the class axis to convert probabilities into predicted classes, and compute the fraction that match the true labels.

When you actually run this on a laptop you should see output that looks something like:

Epoch 1: train acc 0.9293, test acc 0.9285
Epoch 2: train acc 0.9521, test acc 0.9482
Epoch 3: train acc 0.9651, test acc 0.9583
...
Epoch 20: train acc 0.9881, test acc 0.9712

The numbers will vary a little depending on your NumPy version and random seed, but you should comfortably reach about 97 per cent test accuracy after twenty epochs and roughly 98 to 99 per cent training accuracy. The 1 to 2 per cent gap between training and test accuracy is overfitting: the network has learned the training set slightly better than the underlying task. The next chapter, on regularisation in practice, covers techniques (weight decay, dropout, data augmentation) that close this gap. For now, 97 per cent is more than respectable for a fifty-line program.
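
Once training has finished it is worth poking at individual predictions. The snippet below (an optional extra, not counted in the fifty lines) classifies the first test image and then lists a few of the examples the network still gets wrong.

# Inspect the trained network: one prediction, then a handful of mistakes
probs = forward(X_test, params)[0][-1]
print("first test image: predicted", probs[0].argmax(),
      "true", y_test[0], "confidence", round(float(probs[0].max()), 3))

wrong = np.where(probs.argmax(axis=1) != y_test)[0]
print(len(wrong), "of 10000 test images misclassified")
for i in wrong[:5]:
    print(f"index {i}: predicted {probs[i].argmax()}, true {y_test[i]}")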

What you should take away

  1. A working MLP is about fifty lines of NumPy. No framework, no GPU, no special tricks. Everything in this section runs on a laptop CPU in a few minutes.

  2. The algorithm is exactly the math we already derived. The forward pass is matrix multiplications followed by activation functions; the backward pass is the chain rule applied layer by layer. Section 9.7 walked through the same calculation on a 2-2-1 network; this section is the same calculation at scale.

  3. Ninety-seven per cent MNIST test accuracy is achievable with no tricks. No momentum, no learning-rate schedule, no dropout, no batch norm, no data augmentation, just a plain MLP, plain SGD, He initialisation, and a sensible learning rate.

  4. Modern deep learning is the same algorithm with bigger data and bigger networks. The forward and backward passes inside GPT-4 are made of the same building blocks, just stacked higher and trained on far more text. Replacing 109 thousand parameters with 100 billion does not change the underlying mathematics, only the engineering needed to keep it running.

  5. PyTorch and JAX are conveniences over this skeleton, not different things. When you write loss.backward() in PyTorch, the framework runs the same chain-rule traversal we wrote out by hand. The next section, 9.17, re-implements the same network in PyTorch so you can see exactly which lines disappear and which lines remain.
