9.3 Multilayer perceptrons
Section 9.2 ended on a discouraging note. A single perceptron, no matter how cleverly its weights are chosen, cannot decide whether two binary inputs disagree. The XOR function, true when exactly one of the two inputs is true, defeats it. The reason was geometric: a single perceptron draws one straight line through the input space, and no straight line separates the two XOR classes. For a long stretch of the 1970s and early 1980s this small fact was taken as evidence that artificial neurons were a dead end.
The way out is simple. Stack the perceptrons. Put one row of neurons after another, feed the output of the first row into the second, and the resulting machine can compute XOR, and, as we shall see in §9.5, almost any function you might reasonably want. The stacked machine is called a multilayer perceptron, abbreviated MLP, and it is the foundation on which every modern neural network rests. Convolutional networks (§9.18) are MLPs with a special weight-sharing pattern; transformer language models (Chapter 13) are MLPs with attention layers added between them; even the diffusion models that produce photographic images run stacks of such layers over and over, once per denoising step. Once you understand the MLP, the rest is augmentation rather than reinvention.
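To make the claim concrete before any theory, here is a minimal sketch, in the NumPy style used throughout this chapter, of a two-layer network that computes XOR. The weights are one hand-picked choice among many, not the output of any training procedure, and the hard threshold is the perceptron rule of §9.2.

```python
import numpy as np

def step(z):
    """Perceptron threshold: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

# One hand-picked (not unique) set of weights that computes XOR.
# Hidden neuron 1 fires when x1 OR x2; hidden neuron 2 fires when x1 AND x2.
W1 = np.array([[1.0, 1.0],    # OR detector
               [1.0, 1.0]])   # AND detector
b1 = np.array([-0.5, -1.5])
# The output fires when OR is true but AND is not: exactly one input is true.
W2 = np.array([[1.0, -2.0]])
b2 = np.array([-0.5])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(W1 @ np.array(x, float) + b1)   # hidden layer
    y = step(W2 @ h + b2)                    # output layer
    print(x, "->", int(y[0]))                # prints 0, 1, 1, 0
```

No single line through the input plane separates the four cases, but the hidden layer re-describes each input as (OR, AND), and in that new space one line suffices.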
A layer is the central object. In plain words, a layer is a row of artificial neurons that all see the same input and all produce one output each. A layer of three neurons that receives a vector of five numbers will produce a vector of three numbers, one number per neuron. The numbers leaving the layer become the input to the next layer, which has its own row of neurons, and so on, until the last layer produces what the network calls its prediction. This chapter describes that machine carefully and then walks through a small concrete example by hand.
This section adds depth, literally, to the single neuron of §9.1: multiple rows of neurons arranged in sequence. §9.4 asks which nonlinear functions to use between the rows (sigmoid, tanh, ReLU and their relatives); §9.5 states the universal approximation theorem; §9.6 explains backpropagation. By the end of the chapter you will have built one such network from scratch in NumPy.
What is a layer?
Imagine three artificial neurons stacked one above another, like three light bulbs on a strip. Each neuron has its own weights and its own bias. When an input vector $\mathbf{x}$ arrives, every one of the three neurons sees the whole vector: not a part of it, not a slice, but every component. Each neuron then computes a weighted sum of those components, adds its bias, and passes the result through a nonlinear function such as the sigmoid. The result is a single number. Three neurons, three numbers. We collect those three numbers into a new vector, and that vector is the layer's output.
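In NumPy the whole layer is one line of arithmetic. The sketch below uses the three-neuron, five-input layer described above; the random weights are purely illustrative stand-ins for trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)    # arbitrary illustrative parameters
W = rng.normal(size=(3, 5))       # one row per neuron, one column per input
b = np.zeros(3)                   # one bias per neuron

x = rng.normal(size=5)            # a five-component input vector
a = sigmoid(W @ x + b)            # three numbers out, one per neuron
print(a.shape)                    # (3,)
```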
Two pieces of vocabulary follow immediately. The width of a layer is the number of neurons it contains; in our example the width is three. The depth of a network is the number of layers, which we write as $L$. A network with one hidden layer between input and output has depth $L = 2$ in the convention used here (the input layer is not counted because it does no computation; it is simply the place where the data enters). When people speak loosely about a "deep" network they usually mean one with many hidden layers, anywhere from a handful in older work to several hundred in some modern transformers.
It helps to picture the data flow. The input vector $\mathbf{a}^{(0)} = \mathbf{x}$ enters from the left. The first layer turns it into a new vector $\mathbf{a}^{(1)}$. The second layer takes $\mathbf{a}^{(1)}$ as its input and produces $\mathbf{a}^{(2)}$. This process repeats until the final layer, which produces $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$, the network's prediction. At every stage the dimensions can change: the input might have 784 components (if it is a 28-by-28 grayscale image flattened), the first hidden layer might widen this to 256 components, the next narrow it to 128, and the output layer compress it down to 10 (one per digit class).
A useful intuition is that each hidden layer is in the business of re-describing the input. The first hidden layer might learn features such as "is the top-left pixel dark?" or "is there a roughly horizontal edge near the middle?". The second hidden layer takes those features and combines them into more abstract ones, "is there a closed loop?", "is there a vertical stroke meeting a horizontal one?". The final layer combines those still-more-abstract features into the answer the task demands. None of this hierarchy is hand-coded; it emerges from training. When we look inside a trained image classifier, we typically find the early layers detecting edges and textures, the middle layers detecting parts of objects, and the late layers detecting whole objects, a pattern repeated across architectures from the 1990s LeNet to today's vision transformers.
The layers between the input and the output are called hidden layers, a name inherited from early neural-network papers. The "hidden" simply means "not directly observed": these layers' values are neither the data we feed in nor the prediction we read out, so during training they are inferred indirectly through their effect on the final prediction. A network with no hidden layers is just a single layer of neurons applied to the input, exactly the perceptron of §9.2. Adding hidden layers is what gives the MLP its expressive power.
Forward propagation, formally
The arithmetic that turns the input into the prediction is called the forward pass or forward propagation. For each layer $\ell = 1, \ldots, L$ we compute two quantities:
$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}) .$$
The first quantity, $\mathbf{z}^{(\ell)}$, is called the pre-activation. It is what each neuron in the layer would output if there were no nonlinearity at all, just a weighted sum of inputs plus a bias. The second quantity, $\mathbf{a}^{(\ell)}$, is the activation: the pre-activation passed through the nonlinear function $\sigma$.
The shapes of every array are worth checking carefully, because shape errors are by far the commonest bug when implementing networks. The previous layer's activation $\mathbf{a}^{(\ell-1)}$ is a vector of length $d_{\ell-1}$. The weight matrix $\mathbf{W}^{(\ell)}$ has $d_\ell$ rows and $d_{\ell-1}$ columns: one row per neuron in the layer, one column per input the layer receives. The matrix-vector product $\mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)}$ is therefore a vector of length $d_\ell$. The bias $\mathbf{b}^{(\ell)}$ also has length $d_\ell$, so the addition is element-wise and the result $\mathbf{z}^{(\ell)}$ has length $d_\ell$. Finally, $\sigma$ is applied element-wise, meaning the function is run separately on each component: if $\mathbf{z}^{(\ell)} = (1.0, -0.5, 2.3)$ and $\sigma$ is the sigmoid, then $\mathbf{a}^{(\ell)} = (\sigma(1.0), \sigma(-0.5), \sigma(2.3))$. The activation vector $\mathbf{a}^{(\ell)}$ keeps the same length, $d_\ell$, and becomes the input to the next layer.
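The two formulas translate directly into a short NumPy function. The sketch below is a minimal implementation under the shape conventions just described; the assertions make the bookkeeping explicit, and the usage line runs the 784-256-128-10 sizes quoted above with arbitrary untrained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Run a^(0) = x through every layer; return the final activation.

    weights[l] has shape (d_l, d_{l-1}); biases[l] has shape (d_l,).
    """
    a = x
    for W, b in zip(weights, biases):
        assert W.shape[1] == a.shape[0], "columns of W must match incoming activation"
        assert W.shape[0] == b.shape[0], "one bias per neuron"
        z = W @ a + b          # pre-activation, length d_l
        a = sigmoid(z)         # activation, same length d_l
    return a

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=784), weights, biases).shape)   # (10,)
```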
The output of the whole network is whatever the final layer produces: $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$. The hat denotes a prediction, distinguishing it from the true target $\mathbf{y}$ that the network is trying to match.
Worked numerical example: a 2-2-1 sigmoid network
Concrete numbers anchor the algebra. Consider a tiny network with two inputs, one hidden layer of two neurons, and one output neuron, all using the sigmoid activation. The architecture is sometimes called "2-2-1", two inputs, two hidden, one output. We will use weights and biases small enough to compute by hand.
Take the input $\mathbf{x} = (1, 0)^\top$ and suppose the target is $y = 1$. The hidden layer has two sigmoid neurons with parameters
$$\mathbf{W}^{(1)} = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.8 \end{pmatrix}, \qquad \mathbf{b}^{(1)} = \begin{pmatrix} 0.1 \\ -0.2 \end{pmatrix} .$$
Each row of $\mathbf{W}^{(1)}$ holds the weights of one hidden neuron: the top row $(0.5, -0.3)$ gives the first neuron's weights on inputs one and two; the bottom row $(0.2, 0.8)$ gives the second neuron's. The output layer is a single sigmoid neuron with parameters
$$\mathbf{W}^{(2)} = \begin{pmatrix} 0.7 & -0.5 \end{pmatrix}, \qquad b^{(2)} = 0.05 .$$
The 1-by-2 weight matrix means the output neuron weights the two hidden activations by 0.7 and -0.5 respectively, and adds 0.05.
We now compute the forward pass step by step.
Step 1. Hidden pre-activation. The matrix-vector product $\mathbf{W}^{(1)} \mathbf{x}$ has two components:
$$\mathbf{z}^{(1)} = \begin{pmatrix} 0.5 \cdot 1 + (-0.3) \cdot 0 \\ 0.2 \cdot 1 + 0.8 \cdot 0 \end{pmatrix} + \begin{pmatrix} 0.1 \\ -0.2 \end{pmatrix} = \begin{pmatrix} 0.5 + 0.1 \\ 0.2 - 0.2 \end{pmatrix} = \begin{pmatrix} 0.6 \\ 0 \end{pmatrix} .$$
Step 2. Hidden activation. Apply the sigmoid element-wise:
$$\mathbf{a}^{(1)} = \begin{pmatrix} \sigma(0.6) \\ \sigma(0) \end{pmatrix} = \begin{pmatrix} 1/(1 + e^{-0.6}) \\ 1/(1 + e^{0}) \end{pmatrix} = \begin{pmatrix} 0.6457 \\ 0.5000 \end{pmatrix} ,$$
rounded to four decimal places. The value $\sigma(0) = 0.5$ exactly; the value $\sigma(0.6) \approx 0.6457$ comes from $e^{-0.6} \approx 0.5488$ so $1/(1 + 0.5488) = 1/1.5488 \approx 0.6457$.
Step 3. Output pre-activation. Multiply the hidden activation by the output weights and add the output bias:
$$z^{(2)} = 0.7 \cdot 0.6457 + (-0.5) \cdot 0.5000 + 0.05 = 0.4520 - 0.2500 + 0.0500 = 0.2520 .$$
Step 4. Output activation. Apply sigmoid one final time:
$$\hat{y} = \sigma(0.2520) = 1 / (1 + e^{-0.2520}) \approx 0.5627 .$$
So the network predicts $\hat{y} \approx 0.5627$ when the target is $y = 1$. Two sanity checks confirm this is not nonsense. First, the output lies between 0 and 1, as it must when the final layer is a sigmoid. Second, the prediction is closer to 1 than to 0, that is, on the correct side, but only weakly so; if these weights were used to classify by the rule "predict 1 when $\hat y > 0.5$" the example would be classified correctly, but the confidence is poor. Training, which we describe in §9.6, would adjust the weights so the network produces a value much closer to 1 the next time it sees this same input.
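The whole worked example fits in a few lines of NumPy, which is a convenient way to check the hand arithmetic and to experiment with other inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters of the 2-2-1 network from the text.
W1 = np.array([[0.5, -0.3],
               [0.2,  0.8]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, -0.5]])
b2 = np.array([0.05])

x = np.array([1.0, 0.0])
a1 = sigmoid(W1 @ x + b1)       # hidden activation: [0.6457, 0.5   ]
y_hat = sigmoid(W2 @ a1 + b2)   # prediction:        [0.5627]
print(a1.round(4), y_hat.round(4))
```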
Why depth helps
A natural question is whether all this layering is necessary. Could we not just use one very wide layer? The answer, given properly in §9.5, is yes, in principle. The universal approximation theorem of Cybenko (1989) and Hornik (1991) says that a network with a single hidden layer of sufficient width can approximate any continuous function to arbitrary accuracy. From this fact one might conclude that depth buys nothing.
Modern theory says otherwise. Telgarsky (2016) and Eldan and Shamir (2016) proved depth-separation theorems: there exist functions that a deep network can represent using a polynomial number of neurons but that any shallow network needs an exponential number of neurons to approximate. Translated into plain language: there are problems for which a 10-layer network of modest width does what a 1-layer network would need a billion neurons to match. Depth gives the network the ability to compose simple features, edges into corners, corners into shapes, shapes into objects, and that compositional structure is exponentially more efficient than building everything in a single step.
There is also a practical reason. Training a wide-but-shallow network is, in current practice, harder than training a deep one. The optimisation landscape of a single huge layer is ill-behaved; gradients are noisy, learning is slow, and the parameter count is enormous. Deep networks, when combined with the modern tools that fix their training pathologies (careful initialisation, §9.10; normalisation, §9.13; residual connections, §9.18), turn out to be both more parameter-efficient and easier to optimise. The dominance of deep architectures in every application area, from speech recognition to protein structure prediction, reflects this empirical lesson. Theory and practice agree: depth is not optional.
Why width helps
Width matters too, but in a different way. At any fixed depth, more neurons per layer give the network more capacity: more directions in which it can carve up the input space, more basis functions it can combine. A network that is too narrow simply cannot express the function the data demands; the prediction error stays high no matter how long training continues. A network that is too wide, on the other hand, has so many degrees of freedom that it can memorise the training data and fail to generalise to new examples, a phenomenon known as overfitting (§9.12). It also wastes memory and compute. Modern practice picks depth and width together, often by trying several combinations and selecting the smallest network whose validation error is acceptable.
A small back-of-the-envelope calculation gives a feel for parameter counts. An MLP with sizes $(784, 256, 128, 10)$, the kind one might use to classify the MNIST handwritten digits dataset, has weight matrices of size $256 \times 784$, $128 \times 256$ and $10 \times 128$. That is $200{,}704 + 32{,}768 + 1{,}280 = 234{,}752$ weights, plus $256 + 128 + 10 = 394$ biases, totalling $235{,}146$ parameters. At single precision (32-bit floats, 4 bytes per parameter) the network occupies roughly 940 kilobytes of memory. A modern transformer language model multiplies both numbers by four to six orders of magnitude, billions of parameters, occupying tens of gigabytes, but the underlying arithmetic per layer is exactly what we have just described. The cost of one forward pass is dominated by the matrix multiplications, which is why graphics processing units, originally designed to render polygons, became the workhorses of deep learning: they are exactly the right hardware for dense matrix-vector products.
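The count is easy to reproduce, and adapting the sizes list gives the count for any MLP:

```python
sizes = [784, 256, 128, 10]
n_weights = sum(m * n for n, m in zip(sizes, sizes[1:]))   # 234,752
n_biases  = sum(sizes[1:])                                 # 394
total     = n_weights + n_biases                           # 235,146
print(f"{total} parameters, {total * 4 / 1e3:.0f} kB at float32")  # ~941 kB
```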
How modern networks differ from a textbook MLP
The MLP described in this section is the foundation, but a state-of-the-art language model is not a plain MLP. Three of the most important augmentations are worth flagging now. Residual connections (§9.18) add the input of a layer to its output, so the network learns a small correction rather than a fresh mapping; this stabilises training in very deep networks. Layer normalisation (§9.13) rescales the pre-activations so their distribution stays roughly constant during training, which prevents gradients from vanishing or exploding. Attention (Chapter 13) replaces the fixed weight matrix $\mathbf{W}^{(\ell)}$ with one that is computed on the fly from the input itself, allowing the network to route information dynamically. Each of these is a well-defined modification of the basic forward pass we have just written down, and once you understand the textbook MLP they all become natural extensions rather than fresh mysteries.
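As a preview, here is a minimal sketch of the first two augmentations. The details, in particular where the normalisation sits relative to the layer and the learnable scale and shift that full layer normalisation carries, are deferred to §9.13 and §9.18, so treat this as an illustration of the shapes involved rather than the complete mechanism.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (simplified; see §9.13)."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def residual_block(x, W, b):
    """Learn a correction rather than a fresh mapping: output = x + f(x).
    W must be square (d-by-d) so the sum with x is defined (see §9.18)."""
    return x + np.tanh(W @ layer_norm(x) + b)
```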
In addition, modern practice rarely processes one example at a time. The forward pass is almost always run on a mini-batch of examples in parallel, typically dozens to thousands at once, by stacking the input vectors into a matrix and replacing matrix-vector multiplies with matrix-matrix multiplies. The mathematics is unchanged; the gain is in throughput, because GPUs can multiply two large matrices much faster than they can perform the same number of small matrix-vector products one after another. The output activations of the hidden layers are also stored, because they will be needed by the backpropagation algorithm of §9.6 to compute gradients. Storing activations costs memory in proportion to the batch size and the network width; this is one of the practical reasons why training very deep networks on long sequences pushes hardware to its limits.
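A sketch of the batched forward pass follows, stacking one example per column so that each layer becomes a single matrix-matrix multiply; stacking examples as rows, with the weight matrices transposed, is an equally common convention. The stored list of activations is exactly what §9.6 will consume.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_batch(X, weights, biases):
    """Forward pass over a whole mini-batch at once.

    X stacks one example per column, shape (d_0, batch_size). Every
    layer's activations are kept because backpropagation (§9.6) needs them.
    """
    activations = [X]
    for W, b in zip(weights, biases):
        Z = W @ activations[-1] + b[:, None]   # bias broadcast across columns
        activations.append(sigmoid(Z))
    return activations[-1], activations

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]
Y_hat, acts = forward_batch(rng.normal(size=(784, 64)), weights, biases)
print(Y_hat.shape)   # (10, 64): one prediction column per example
```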
What you should take away
- A layer is a row of artificial neurons that all see the same input and each produce one number; the layer's output is the vector of those numbers.
- A multilayer perceptron is a sequence of such layers, each fed by the one before it, ending in a prediction $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$.
- The forward pass at each layer is a matrix-vector multiply, a bias addition, and an element-wise nonlinearity: $\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}$, $\mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})$.
- Width is neurons-per-layer, depth is number of layers; both contribute to capacity, with depth being exponentially more efficient for compositional functions.
- Every modern architecture (CNNs, transformers, diffusion models) is built from this skeleton with extra mechanisms added; mastering the MLP is the gateway to understanding all of them.