9.6 Backpropagation: the full derivation
This section derives backpropagation in full. The result is the algorithm that powers every modern neural network, from a two-neuron toy model to a frontier large language model. §9.7 walks through a numerical example by hand and §9.8 implements the same equations in NumPy.
The problem stated plainly
The forward pass of an MLP defines a function $\hat{\mathbf{y}} = f(\mathbf{x}; \mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \ldots, \mathbf{W}^{(L)}, \mathbf{b}^{(L)})$, where the parameters are the weights and biases of every layer. Given a training example $(\mathbf{x}, \mathbf{y})$, the loss $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ is a single real number that says how badly the prediction matches the target. To improve the network we want $\partial \mathcal{L} / \partial w$ for every scalar parameter $w$ in every weight matrix and bias vector. That is the gradient, and gradient descent will use it to nudge each parameter a little in the loss-decreasing direction.
How might we compute this naively? One option is symbolic differentiation: write $\mathcal{L}$ as a giant algebraic expression in all the parameters, then differentiate. This produces formulae whose size grows exponentially with depth, a phenomenon sometimes called expression swell, and is wholly impractical. A second option is numerical differentiation: perturb each parameter by a tiny amount $\epsilon$, run the forward pass, and approximate the partial derivative as $(\mathcal{L}(w + \epsilon) - \mathcal{L}(w))/\epsilon$. This works, but it costs one forward pass per parameter. A network with one million parameters would need a million forward passes to compute one gradient. For a model with a hundred billion parameters this is hopeless.
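To see the cost concretely, here is a minimal numerical-gradient sketch in NumPy. The helper name `numerical_grad` and the flat-parameter-vector interface are our own; we use the slightly more accurate central difference, which costs two forward passes per parameter rather than one.

```python
import numpy as np

def numerical_grad(loss_fn, params, eps=1e-6):
    """Central-difference gradient estimate: two forward passes per parameter.
    loss_fn maps a flat parameter vector to a scalar loss."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        bump = np.zeros_like(params)
        bump[i] = eps
        # One loop iteration per parameter: a million parameters means
        # two million calls to loss_fn for a single gradient.
        grad[i] = (loss_fn(params + bump) - loss_fn(params - bump)) / (2 * eps)
    return grad
```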
Backpropagation computes all the gradients in roughly the cost of two forward passes (one forward, one backward), regardless of how many parameters the network has. That linear-in-parameters cost is what made deep learning practical. The trick is that the chain rule lets us reuse partial computations: instead of redoing the work for each weight, we compute one shared "error signal" at each layer and combine it with cached forward-pass quantities to read off every gradient.
The chain rule, in two flavours
Backpropagation is one application of the chain rule. We need it in two forms. The scalar chain rule, taught in introductory calculus, says that if $z = g(y)$ and $y = f(x)$, then
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}.$$
The interpretation is: a small change in $x$ produces a small change in $y$ at rate $dy/dx$, and that small change in $y$ produces a change in $z$ at rate $dz/dy$. Multiply the rates and you have the rate at which $x$ moves $z$.
The vector chain rule generalises this to vector-valued functions. If $\mathbf{y} = f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m$, the Jacobian $\partial \mathbf{y}/\partial \mathbf{x}$ is the $m \times n$ matrix whose $(i, j)$ entry is $\partial y_i / \partial x_j$. If we then form $\mathbf{z} = g(\mathbf{y})$, the Jacobian of the composition is the matrix product
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}.$$
Take a small worked example. Let $\mathbf{x} = (x_1, x_2)$ and define $\mathbf{y} = (x_1^2,\, x_1 x_2)$ and $\mathbf{z} = (y_1 + y_2,\, y_1 - y_2)$. We compute $\partial \mathbf{z} / \partial \mathbf{x}$ in two ways.
Direct. Substitute: $z_1 = x_1^2 + x_1 x_2$ and $z_2 = x_1^2 - x_1 x_2$. Differentiating each component,
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \begin{pmatrix} 2x_1 + x_2 & x_1 \\ 2x_1 - x_2 & -x_1 \end{pmatrix}.$$
By the chain rule. The Jacobians are
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix} 2x_1 & 0 \\ x_2 & x_1 \end{pmatrix}, \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{y}} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$$
Multiplying,
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 2x_1 & 0 \\ x_2 & x_1 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 & x_1 \\ 2x_1 - x_2 & -x_1 \end{pmatrix},$$
which matches. Backpropagation is exactly this matrix-product chain rule, applied $L$ times, once for each layer. The only subtlety is that we will multiply Jacobians from the output side inward, because the loss is a scalar, and that lets us use vector–matrix products instead of full matrix–matrix products at each step.
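The same check is easy to run numerically; a small NumPy sketch (variable names are ours) confirms the two Jacobians agree at an arbitrary test point:

```python
import numpy as np

x1, x2 = 1.5, -2.0  # any test point

# Jacobians of the two stages, as derived above
J_y_x = np.array([[2 * x1, 0.0],
                  [x2,     x1]])
J_z_y = np.array([[1.0,  1.0],
                  [1.0, -1.0]])

# Direct differentiation of z1 = x1^2 + x1*x2, z2 = x1^2 - x1*x2
J_direct = np.array([[2 * x1 + x2,  x1],
                     [2 * x1 - x2, -x1]])

assert np.allclose(J_z_y @ J_y_x, J_direct)  # the chain rule agrees
```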
Forward pass, written out
Before we differentiate we need to know what we are differentiating. The forward pass for a generic MLP is the chain of equations
$$\mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}),$$
evaluated for $\ell = 1, 2, \ldots, L$, with the prediction $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$. In words: at each layer we take the previous layer's activation, hit it with a linear map $\mathbf{W}^{(\ell)}$, add a bias, then squash through the activation function $\sigma$ to produce the next layer's activation.
To make this concrete we will work through a tiny network with two inputs, one hidden layer of two sigmoid neurons, and one sigmoid output neuron. So $L = 2$, $d_0 = 2$, $d_1 = 2$, $d_2 = 1$. Pick weights with clean numbers:
$$\mathbf{W}^{(1)} = \begin{pmatrix} 0.5 & 1.0 \\ -1.0 & 1.0 \end{pmatrix}, \quad \mathbf{b}^{(1)} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mathbf{W}^{(2)} = \begin{pmatrix} 1.0 & -1.0 \end{pmatrix}, \quad b^{(2)} = 0.$$
Take input $\mathbf{x} = (1, 0)$ and target $y = 1$. Use the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, whose derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
Layer 1:
$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} = \begin{pmatrix} 0.5 \cdot 1 + 1.0 \cdot 0 \\ -1.0 \cdot 1 + 1.0 \cdot 0 \end{pmatrix} = \begin{pmatrix} 0.5 \\ -1.0 \end{pmatrix}.$$
Then
$$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \begin{pmatrix} \sigma(0.5) \\ \sigma(-1.0) \end{pmatrix} \approx \begin{pmatrix} 0.6225 \\ 0.2689 \end{pmatrix}.$$
Layer 2:
$$z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)} = 1.0 \cdot 0.6225 + (-1.0) \cdot 0.2689 = 0.3536,$$
$$\hat{y} = a^{(2)} = \sigma(0.3536) \approx 0.5875.$$
So with this input, this particular set of weights, and this target, the network outputs $\hat{y} \approx 0.5875$ when we wanted $1$. Using mean squared error $\mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2$, the loss is $\tfrac{1}{2}(0.5875 - 1)^2 \approx 0.0851$. We have stored $\mathbf{z}^{(1)}, \mathbf{a}^{(1)}, z^{(2)}, a^{(2)}$ along the way; we will need every one of them on the backward pass.
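The whole forward pass fits in a few lines of NumPy. This sketch mirrors the equations above (helper names such as `sigmoid` are our own choices, anticipating §9.8); we will keep extending it on the backward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[ 0.5, 1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, -1.0]])
b2 = np.zeros(1)

x = np.array([1.0, 0.0])
y = 1.0

z1 = W1 @ x + b1              # [0.5, -1.0]
a1 = sigmoid(z1)              # [0.6225, 0.2689]
z2 = W2 @ a1 + b2             # [0.3535] (the text rounds intermediates, giving 0.3536)
yhat = sigmoid(z2)            # [0.5875]
loss = 0.5 * (yhat - y) ** 2  # [0.0851] -- cache z1, a1, z2, yhat for the backward pass
```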
Computing the output-layer error
Define the error signal at layer $\ell$ as the partial derivative of the loss with respect to the pre-activation of that layer:
$$\boldsymbol{\delta}^{(\ell)} \;\triangleq\; \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} \in \mathbb{R}^{d_\ell}.$$
Why this particular quantity, rather than $\partial \mathcal{L}/\partial \mathbf{a}^{(\ell)}$ or $\partial \mathcal{L}/\partial \mathbf{W}^{(\ell)}$? Because pre-activations sit at the linear-meets-nonlinear boundary inside each layer: once we know $\boldsymbol{\delta}^{(\ell)}$, the gradients with respect to weights and biases are immediate, and the recursion to the previous layer is a single matrix multiply. The pre-activation is the right place to do the bookkeeping.
Start at the output layer. By the chain rule,
$$\delta^{(L)}_i = \frac{\partial \mathcal{L}}{\partial z^{(L)}_i} = \sum_j \frac{\partial \mathcal{L}}{\partial a^{(L)}_j} \cdot \frac{\partial a^{(L)}_j}{\partial z^{(L)}_i}.$$
If the output activation is applied element-wise (sigmoid, tanh, identity), then $\partial a^{(L)}_j / \partial z^{(L)}_i$ is zero unless $i = j$, in which case it equals $\sigma'(z^{(L)}_i)$. The sum collapses, giving the vector form
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \;\odot\; \sigma'(\mathbf{z}^{(L)}).$$
Two important special cases yield the same elegant simplification.
Mean squared error with linear output. Take $\mathcal{L} = \tfrac{1}{2}\lVert \hat{\mathbf{y}} - \mathbf{y} \rVert^2$ and $\mathbf{a}^{(L)} = \mathbf{z}^{(L)}$ (no nonlinearity at the output). Then $\nabla_{\mathbf{a}^{(L)}} \mathcal{L} = \hat{\mathbf{y}} - \mathbf{y}$ and $\sigma'(\mathbf{z}^{(L)}) = \mathbf{1}$, so $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$.
Cross-entropy with softmax output. For a $K$-class classifier, $a^{(L)}_k = e^{z^{(L)}_k} / \sum_j e^{z^{(L)}_j}$ and $\mathcal{L} = -\sum_k y_k \log a^{(L)}_k$, where $\mathbf{y}$ is one-hot. The softmax couples its inputs (each output depends on every input), so the element-wise trick above does not apply directly. Instead, expand the chain rule with the softmax Jacobian $\partial a^{(L)}_k / \partial z^{(L)}_i = a^{(L)}_k(\delta_{ki} - a^{(L)}_i)$, where the Kronecker delta $\delta_{ki}$ equals $1$ when $k=i$ and $0$ otherwise (not to be confused with the error signal $\boldsymbol{\delta}$). The logarithm's derivative cancels the leading $a^{(L)}_k$, leaving $\partial \mathcal{L}/\partial z^{(L)}_i = -\sum_k y_k (\delta_{ki} - a^{(L)}_i) = a^{(L)}_i - y_i$, using $\sum_k y_k = 1$. The result has the same shape: $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$.
It is genuinely useful that classification and regression networks both reduce to "predicted minus target" at the output. The shared form is the reason cross-entropy is paired with softmax (and squared error with a linear head) by default in every framework.
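That cancellation is easy to verify numerically. A small sketch, assuming arbitrary logits and a one-hot target (the helper names `softmax` and `ce_loss` are ours), checks $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$ against a central-difference estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(z) - y        # the claimed output error, yhat - y

eps = 1e-6
numeric = np.array([
    (ce_loss(z + eps * e_i, y) - ce_loss(z - eps * e_i, y)) / (2 * eps)
    for e_i in np.eye(3)         # rows of the identity are unit perturbations
])
assert np.allclose(analytic, numeric, atol=1e-6)
```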
For the toy example, our output is a sigmoid with MSE loss. So
$$\delta^{(2)} = (\hat{y} - y)\,\sigma'(z^{(2)}) = (0.5875 - 1) \cdot \sigma(0.3536)\big(1 - \sigma(0.3536)\big) \approx -0.4125 \cdot 0.5875 \cdot 0.4125 \approx -0.0999.$$
The negative sign tells us the loss falls as $z^{(2)}$ rises: with a target of $1$ and an output of only $0.5875$, the pre-activation should increase.
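Continuing the NumPy sketch from the forward pass, the output error is one line (assuming a `sigmoid_prime` helper for $\sigma'$):

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z)(1 - sigma(z))

delta2 = (yhat - y) * sigmoid_prime(z2)   # ~ [-0.0999]
```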
Backpropagating through the layers
We now need $\boldsymbol{\delta}^{(\ell)}$ for $\ell < L$. Apply the chain rule. The pre-activation at layer $\ell + 1$ depends on every pre-activation at layer $\ell$ through the formula $z^{(\ell+1)}_k = \sum_j W^{(\ell+1)}_{kj}\, \sigma(z^{(\ell)}_j) + b^{(\ell+1)}_k$. So
$$\delta^{(\ell)}_i = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}_i} = \sum_k \frac{\partial \mathcal{L}}{\partial z^{(\ell+1)}_k} \cdot \frac{\partial z^{(\ell+1)}_k}{\partial z^{(\ell)}_i} = \sum_k \delta^{(\ell+1)}_k \cdot \frac{\partial}{\partial z^{(\ell)}_i}\!\left[ \sum_j W^{(\ell+1)}_{kj}\, \sigma(z^{(\ell)}_j) + b^{(\ell+1)}_k \right].$$
Here $i$ indexes a neuron in layer $\ell$, $k$ indexes a neuron in layer $\ell + 1$, and $j$ indexes the inputs of layer $\ell + 1$ (which are the same as the outputs of layer $\ell$). The inner derivative is non-zero only for $j = i$, giving $W^{(\ell+1)}_{ki}\, \sigma'(z^{(\ell)}_i)$. Therefore
$$\delta^{(\ell)}_i = \left( \sum_k W^{(\ell+1)}_{ki}\, \delta^{(\ell+1)}_k \right) \sigma'(z^{(\ell)}_i).$$
The bracketed sum is the $i$th component of $(\mathbf{W}^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)}$. In matrix form,
$$\boxed{\;\boldsymbol{\delta}^{(\ell)} = \left( \mathbf{W}^{(\ell+1)} \right)^\top \boldsymbol{\delta}^{(\ell+1)} \,\odot\, \sigma'(\mathbf{z}^{(\ell)}).\;}$$
This is the recursive heart of backpropagation. Three complementary ways to think about it.
As a chain-rule application. It is just $\partial \mathcal{L}/\partial \mathbf{z}^{(\ell)}$ written as a sum over paths through layer $\ell + 1$. Each path contributes one factor for the linear map (the weight) and one factor for the activation derivative.
As an error pulled back through transposed weights. On the forward pass, a layer's pre-activations are computed by multiplying the previous activation by $\mathbf{W}^{(\ell+1)}$. On the backward pass, the error at the layer above is multiplied by $(\mathbf{W}^{(\ell+1)})^\top$ to redistribute it across the neurons of the layer below. The transpose appears because we are tracing dependencies in reverse: forward says "neuron $k$ in the next layer takes $W^{(\ell+1)}_{kj}$ times the output of neuron $j$"; backward says "neuron $j$'s output influenced neuron $k$ by exactly that weight, so neuron $j$'s error picks up $W^{(\ell+1)}_{kj}$ times the error at neuron $k$". The Hadamard product with $\sigma'(\mathbf{z}^{(\ell)})$ scales each component by how responsive that neuron's activation is at its current pre-activation: a saturated sigmoid neuron has $\sigma'$ close to zero, killing the signal.
As a graph of computational dependencies. Imagine drawing arrows from each parameter forward through the network to the loss. Backprop traverses those arrows in reverse, accumulating the product of local derivatives along every path. The recursion writes that traversal as a single matrix–vector multiplication per layer, so the total work is linear in the number of edges in the graph.
For the toy example, $\boldsymbol{\delta}^{(2)}$ is the scalar $-0.0999$. Now propagate back:
$$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)})^\top\, \delta^{(2)} \,\odot\, \sigma'(\mathbf{z}^{(1)}) = \begin{pmatrix} 1.0 \\ -1.0 \end{pmatrix} \cdot (-0.0999) \,\odot\, \begin{pmatrix} \sigma'(0.5) \\ \sigma'(-1.0) \end{pmatrix}.$$
Compute $\sigma'(0.5) = 0.6225 \cdot (1 - 0.6225) \approx 0.2350$ and $\sigma'(-1.0) = 0.2689 \cdot (1 - 0.2689) \approx 0.1966$. Then
$$\boldsymbol{\delta}^{(1)} \approx \begin{pmatrix} -0.0999 \cdot 0.2350 \\ +0.0999 \cdot 0.1966 \end{pmatrix} \approx \begin{pmatrix} -0.0235 \\ +0.0196 \end{pmatrix}.$$
The first hidden neuron, with positive forward weight to the output, picks up the same sign of error as the output; the second, with negative forward weight, picks up the opposite sign. That is exactly what the transposed-weight intuition predicts.
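In the running NumPy sketch, the recursion is again one line:

```python
# Pull the error back through the transposed weights, then mask
# by each hidden neuron's local responsiveness sigma'(z1).
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)   # ~ [-0.0235, 0.0196]
```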
Computing the parameter gradients
Once $\boldsymbol{\delta}^{(\ell)}$ is known for every layer, the gradients with respect to the parameters are immediate. Recall $z^{(\ell)}_i = \sum_j W^{(\ell)}_{ij}\, a^{(\ell-1)}_j + b^{(\ell)}_i$. The chain rule gives
$$\frac{\partial \mathcal{L}}{\partial W^{(\ell)}_{ij}} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}_i} \cdot \frac{\partial z^{(\ell)}_i}{\partial W^{(\ell)}_{ij}} = \delta^{(\ell)}_i\, a^{(\ell-1)}_j,$$
$$\frac{\partial \mathcal{L}}{\partial b^{(\ell)}_i} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}_i} \cdot \frac{\partial z^{(\ell)}_i}{\partial b^{(\ell)}_i} = \delta^{(\ell)}_i.$$
In matrix form (recognising that the outer product of a column vector and a row vector is a rank-one matrix),
$$\boxed{\;\nabla_{\mathbf{W}^{(\ell)}} \mathcal{L} = \boldsymbol{\delta}^{(\ell)} \big(\mathbf{a}^{(\ell-1)}\big)^\top, \qquad \nabla_{\mathbf{b}^{(\ell)}} \mathcal{L} = \boldsymbol{\delta}^{(\ell)}.\;}$$
The shapes work out: $\boldsymbol{\delta}^{(\ell)}$ is $(d_\ell,)$, $\mathbf{a}^{(\ell-1)}$ is $(d_{\ell-1},)$, and the outer product is $(d_\ell,\, d_{\ell-1})$, which matches $\mathbf{W}^{(\ell)}$. Each weight $W^{(\ell)}_{ij}$ has gradient $\delta^{(\ell)}_i\, a^{(\ell-1)}_j$, the local error times the upstream activation. The bias gradient is just the error itself, because the bias enters the pre-activation with coefficient one.
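Continuing the running sketch, each gradient is a single outer product of quantities we already hold:

```python
gW2 = np.outer(delta2, a1)   # ~ [[-0.0622, -0.0269]], shape (1, 2)
gb2 = delta2                 # ~ [-0.0999]
gW1 = np.outer(delta1, x)    # ~ [[-0.0235, 0.0], [0.0196, 0.0]]
gb1 = delta1
```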
The stochastic gradient descent update then nudges every parameter against its gradient,
$$\mathbf{W}^{(\ell)} \leftarrow \mathbf{W}^{(\ell)} - \eta\, \nabla_{\mathbf{W}^{(\ell)}} \mathcal{L}, \qquad \mathbf{b}^{(\ell)} \leftarrow \mathbf{b}^{(\ell)} - \eta\, \nabla_{\mathbf{b}^{(\ell)}} \mathcal{L},$$
where $\eta > 0$ is the learning rate.
Continue the toy example. We have $\delta^{(2)} \approx -0.0999$, $\boldsymbol{\delta}^{(1)} \approx (-0.0235,\, +0.0196)^\top$, $\mathbf{a}^{(0)} = (1,\, 0)^\top$, and $\mathbf{a}^{(1)} \approx (0.6225,\, 0.2689)^\top$. The output-layer weight gradient is
$$\nabla_{\mathbf{W}^{(2)}} \mathcal{L} = \delta^{(2)}\, (\mathbf{a}^{(1)})^\top \approx -0.0999 \cdot (0.6225,\, 0.2689) \approx (-0.0622,\, -0.0269).$$
So the first output weight ($1.0$) should increase, since its gradient is negative; a step with $\eta = 1.0$ would move it to $1.0622$. The hidden-layer weight gradient is
$$\nabla_{\mathbf{W}^{(1)}} \mathcal{L} = \boldsymbol{\delta}^{(1)} (\mathbf{a}^{(0)})^\top \approx \begin{pmatrix} -0.0235 \\ +0.0196 \end{pmatrix} (1,\, 0) = \begin{pmatrix} -0.0235 & 0 \\ +0.0196 & 0 \end{pmatrix}.$$
The second column is zero because the second input was zero; that input never participated in the forward pass, so it cannot have a gradient. Pick a single weight to update: $W^{(1)}_{11}$ from $0.5$ to $0.5 - 1.0 \cdot (-0.0235) = 0.5235$.
To verify the loss really did decrease, rerun the forward pass with this one weight changed (and all others held fixed). Recompute: $z^{(1)}_1 = 0.5235 \cdot 1 + 1.0 \cdot 0 = 0.5235$, $a^{(1)}_1 = \sigma(0.5235) \approx 0.6280$. The other hidden neuron is unchanged: $a^{(1)}_2 \approx 0.2689$. Then $z^{(2)} = 1.0 \cdot 0.6280 - 1.0 \cdot 0.2689 = 0.3591$, $\hat{y} = \sigma(0.3591) \approx 0.5888$, and $\mathcal{L} \approx \tfrac{1}{2}(0.5888 - 1)^2 \approx 0.0845$. The loss has dropped from $0.0851$ to $0.0845$, exactly as backpropagation promised. (Updating all weights together would move the loss down further still.)
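The same verification in the running sketch, updating only $W^{(1)}_{11}$ as in the text:

```python
eta = 1.0
W1_new = W1.copy()
W1_new[0, 0] -= eta * gW1[0, 0]           # 0.5 -> ~0.5235

a1_new = sigmoid(W1_new @ x + b1)         # only the first hidden neuron changes
yhat_new = sigmoid(W2 @ a1_new + b2)      # ~ [0.5888]
loss_new = 0.5 * (yhat_new - y) ** 2      # ~ [0.0845] < 0.0851
```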
That is the entire algorithm. Run forward, compute $\boldsymbol{\delta}^{(L)}$ at the output, recurse $\boldsymbol{\delta}^{(\ell)} = (\mathbf{W}^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)} \odot \sigma'(\mathbf{z}^{(\ell)})$ from $\ell = L-1$ down to $\ell = 1$, read off gradients $\boldsymbol{\delta}^{(\ell)} (\mathbf{a}^{(\ell-1)})^\top$ and $\boldsymbol{\delta}^{(\ell)}$, take an SGD step. Everything else in modern deep learning (momentum, Adam, weight decay, learning-rate schedules, data parallelism) is decoration on top of these five lines.
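Here are those five lines as one self-contained NumPy function, a minimal sketch assuming the `sigmoid` and `sigmoid_prime` helpers defined earlier (the function name and list-of-matrices interface are our own):

```python
def backprop(Ws, bs, x, y):
    """One forward/backward pass for a sigmoid MLP with MSE loss.
    Ws, bs: lists of weight matrices and bias vectors for layers 1..L."""
    # Forward pass, caching every pre-activation and activation.
    acts, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    # Output error, then the backward recursion.
    delta = (acts[-1] - y) * sigmoid_prime(zs[-1])
    gWs, gbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, acts[l])   # delta^{(l+1)} (a^{(l)})^T, 1-based
        gbs[l] = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return gWs, gbs

gWs, gbs = backprop([W1, W2], [b1, b2], x, y)   # matches gW1, gW2 above
```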
Why backprop is efficient
The total cost of one training step is the cost of the forward pass plus the cost of the backward pass, and both are linear in the number of parameters. The forward pass uses each $W^{(\ell)}_{ij}$ in one multiply-accumulate to build $\mathbf{z}^{(\ell)}$; the backward pass uses each $W^{(\ell+1)}_{ij}$ in one multiply-accumulate to build $\boldsymbol{\delta}^{(\ell)}$, plus one multiply $\delta^{(\ell)}_i\, a^{(\ell-1)}_j$ per weight to read off the gradient. Empirically, the backward pass costs about as much as the forward pass, and often up to twice as much, since it does roughly two operations per weight where the forward pass does one.
Compare this with the alternatives. Finite-difference numerical gradients require one forward pass per parameter. For a million-parameter model, that is a million forward passes per gradient, more than five orders of magnitude slower than backprop. Symbolic differentiation writes the loss as a giant algebraic expression and differentiates the expression; the resulting formulae blow up exponentially with depth (the same sub-expressions are duplicated many times), a problem known as expression swell.
Backpropagation wins because it caches each layer's forward activations and reuses them on the backward pass. Each $\mathbf{a}^{(\ell)}$ is computed once and used twice: once to compute the next pre-activation in the forward direction, once to compute the weight gradient in the backward direction. The same observation explains why modern automatic-differentiation frameworks (PyTorch's autograd, JAX's grad, TensorFlow's GradientTape) all implement the reverse mode of automatic differentiation: it is the same algorithm as backpropagation, generalised from MLPs to arbitrary computational graphs. Whenever you see a vast neural network train, reverse-mode AD is doing the work.
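If PyTorch is installed, the hand-derived numbers can be checked against its autograd. This illustrative sketch rebuilds the toy network (biases are zero and omitted for brevity; run it in a fresh session, since the names shadow the NumPy versions):

```python
import torch

W1 = torch.tensor([[0.5, 1.0], [-1.0, 1.0]], requires_grad=True)
W2 = torch.tensor([[1.0, -1.0]], requires_grad=True)
x = torch.tensor([1.0, 0.0])
y = torch.tensor([1.0])

yhat = torch.sigmoid(W2 @ torch.sigmoid(W1 @ x))
loss = (0.5 * (yhat - y) ** 2).sum()   # reduce to a scalar for .backward()
loss.backward()                        # reverse-mode AD: the backward pass

print(W1.grad)   # ~ [[-0.0235, 0.0], [0.0196, 0.0]]
print(W2.grad)   # ~ [[-0.0622, -0.0269]]
```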
Vectorised backprop with mini-batches
In practice we rarely process a single example at a time. We process a mini-batch of $B$ examples in parallel, because GPUs execute one large matrix multiplication far more efficiently than many small ones. The equations all generalise cleanly. Stack the activations of $B$ examples as columns of a matrix:
$$\mathbf{A}^{(\ell)} \in \mathbb{R}^{d_\ell \times B}, \qquad \mathbf{Z}^{(\ell)} \in \mathbb{R}^{d_\ell \times B}, \qquad \boldsymbol{\Delta}^{(\ell)} \in \mathbb{R}^{d_\ell \times B}.$$
The forward pass becomes $\mathbf{Z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{A}^{(\ell-1)} + \mathbf{b}^{(\ell)}$ (the bias broadcasts across the $B$ columns), $\mathbf{A}^{(\ell)} = \sigma(\mathbf{Z}^{(\ell)})$. The backward recursion is $\boldsymbol{\Delta}^{(\ell)} = (\mathbf{W}^{(\ell+1)})^\top \boldsymbol{\Delta}^{(\ell+1)} \odot \sigma'(\mathbf{Z}^{(\ell)})$. The parameter gradients sum over the batch:
$$\nabla_{\mathbf{W}^{(\ell)}} \mathcal{L}_{\text{batch}} = \boldsymbol{\Delta}^{(\ell)} \big(\mathbf{A}^{(\ell-1)}\big)^\top, \qquad \nabla_{\mathbf{b}^{(\ell)}} \mathcal{L}_{\text{batch}} = \boldsymbol{\Delta}^{(\ell)} \mathbf{1}_B,$$
where $\mathbf{1}_B$ is a column of $B$ ones, summing the per-example bias gradients. The matrix product $\boldsymbol{\Delta}^{(\ell)} (\mathbf{A}^{(\ell-1)})^\top$ has shape $(d_\ell, d_{\ell-1})$ regardless of $B$, so the gradient lives in the same space as the weight matrix, just summed across the batch. If we are using a mean batch loss rather than a sum (the more common choice), divide by $B$ once at the output delta and the recursion automatically threads the $1/B$ factor through every subsequent gradient.
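A batched version of the running NumPy sketch, assuming a mean batch loss and the toy network's parameters from earlier (the batch size and random data are placeholders):

```python
B = 4
rng = np.random.default_rng(0)
X = rng.standard_normal((2, B))      # B examples as columns
Y = rng.random((1, B))               # one target per column

# Forward pass, batched
Z1 = W1 @ X + b1[:, None]            # bias broadcasts across the B columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2[:, None]
A2 = sigmoid(Z2)

# Backward pass for a mean MSE batch loss
D2 = (A2 - Y) * sigmoid_prime(Z2) / B   # divide by B once, at the output
D1 = (W2.T @ D2) * sigmoid_prime(Z1)

gW2 = D2 @ A1.T                      # shape (1, 2), summed over the batch
gb2 = D2 @ np.ones(B)                # == D2.sum(axis=1)
gW1 = D1 @ X.T                       # shape (2, 2)
gb1 = D1.sum(axis=1)
```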
What you should take away
- Backpropagation is the chain rule applied recursively, layer by layer, from the output of the network toward the input. It is not a separate piece of mathematics; it is bookkeeping.
- The trick that makes it efficient is to compute and cache one error signal per layer, $\boldsymbol{\delta}^{(\ell)} = \partial \mathcal{L} / \partial \mathbf{z}^{(\ell)}$, and reuse it for the recursion and for the parameter gradients.
- The recursion has a memorable form, $\boldsymbol{\delta}^{(\ell)} = (\mathbf{W}^{(\ell+1)})^\top \boldsymbol{\delta}^{(\ell+1)} \odot \sigma'(\mathbf{z}^{(\ell)})$, error pulled back through transposed weights, masked by local activation responsiveness.
- Total cost is linear in the number of parameters: roughly two forward passes, regardless of network depth or width. That is what made deep learning practical at scale.
- Classification networks enjoy a small algebraic miracle: cross-entropy with softmax outputs gives the same simple output error $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$ as squared error with a linear head, which is why these pairings are the default in every framework.
- The same five-line algorithm scales from a 2-neuron toy example to a 70-billion-parameter language model. Modern frameworks (PyTorch, JAX) automate it via reverse-mode automatic differentiation, but the chain rule, the transposed weights, and the per-layer error signal are still doing all the work underneath.