3.7 Reverse-mode automatic differentiation (backpropagation)
Backpropagation is the algorithm that powers every modern neural network. Whenever you read that GPT-4, Claude, Gemini, AlphaFold, or Stable Diffusion was "trained", what happened underneath is that someone ran backpropagation many billions of times. It is the single most important computational procedure in artificial intelligence, and yet, once you have seen it twice, it is not really very mysterious. It is simply the chain rule, applied carefully, to a recorded list of arithmetic steps.
In the language we have built up over the previous sections, backpropagation is the special case of reverse-mode automatic differentiation that we apply to a computational graph whose final node produces one number, the loss. The loss measures how badly the network performed on some training example, and the gradients computed by backpropagation tell us exactly how to nudge each of the network's parameters so that the loss goes down. This section gives the algorithm in its most general form, walks through a small numerical example by hand, and then surveys the local rules and the engineering issues that arise when the graph contains millions or billions of parameters.
This section makes the computational graph of §3.6 differentiable by attaching a small piece of calculus to each node. §3.8 contrasts reverse-mode autodiff with forward-mode; §3.9 uses the gradients computed here to move the parameters via gradient descent.
The algorithm in three lines
Reverse-mode automatic differentiation can be stated in three steps. They are deceptively short.
- Run the forward pass, recording every intermediate value at each node. This is just normal arithmetic: feed the inputs in, follow the arrows of the graph, and write down what each node computes. The recording matters. We will need those numbers later.
- Initialise the adjoint of the output: $\bar{\mathcal{L}} = 1$. The reasoning is purely formal. The adjoint $\bar v$ is defined to be $\partial \mathcal{L}/\partial v$, the partial derivative of the loss with respect to the value at that node. The derivative of $\mathcal{L}$ with respect to itself is $1$, by definition.
- Walk the graph in reverse topological order, that is, from the loss back towards the inputs, visiting each node only after every node downstream of it has been visited. For each node $v$ that fed into nodes $w_1, w_2, \ldots$, accumulate $$ \bar v \mathrel{+}= \sum_k \frac{\partial w_k}{\partial v} \cdot \bar w_k. $$ The accumulation symbol $\mathrel{+}=$ matters: a node may have many children, and each contributes a separate term. The local derivatives $\partial w_k/\partial v$ depend only on the small piece of arithmetic at the child node, never on the rest of the graph.
When the walk finishes, every node's adjoint $\bar v$ holds $\partial \mathcal{L}/\partial v$. In particular, the adjoints of the leaf nodes, the inputs and the parameters, give us exactly the gradient we wanted. There is no need for symbolic algebra, no need for finite differences, no need to derive a single formula by hand. We only need to know how to take a derivative of each elementary operation, and then to glue those local pieces together with the chain rule. That is the whole idea.
The reason the recipe works is the multivariable chain rule from §3.5. Each path from a node $v$ up to the loss contributes one term to $\partial \mathcal{L}/\partial v$, and the chain rule tells us that these contributions add. The reverse walk visits every path exactly once and accumulates each contribution into the parent's adjoint, so by the time we reach the leaves, we have summed over all paths through the graph.
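The three steps translate almost directly into code. Below is a minimal sketch in the spirit of micrograd, the small educational library mentioned at the end of this section: a scalar Value class records its parents and local derivatives as the forward pass runs, then replays them in reverse. The class and its structure are illustrative, not taken from any particular framework.

```python
class Value:
    """A scalar node in a computational graph."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                  # forward value, recorded at creation
        self.grad = 0.0                   # adjoint, filled in by backward()
        self._parents = parents           # nodes this value was computed from
        self._local_grads = local_grads   # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Step three: build a reverse topological order, then accumulate adjoints.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                   # step two: adjoint of the output is 1
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad   # the += of the chain rule

a, b = Value(2.0), Value(5.0)
y = a * b + a        # dy/da = b + 1 = 6, dy/db = a = 2
y.backward()
print(y.data, a.grad, b.grad)   # 12.0 6.0 2.0
```

Only two operations are implemented here, addition and multiplication; every rule in the library of local Jacobians later in this section slots in the same way.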
Why "reverse mode"?
The label is descriptive. We push gradients from the output back to the inputs, in the opposite direction to the forward pass. The forward pass goes parameters → loss; the backward pass goes loss → parameters. Forward-mode autodiff, which we will meet in §3.8, instead pushes tangents, partial derivatives with respect to one chosen input, forward through the graph in the same direction as the values. Reverse mode pushes adjoints, gradients with respect to one chosen output, backward through the graph.
Why bother with two flavours? Because the cost depends sharply on the shape of the function you are differentiating. If your function is $f: \mathbb{R}^n \to \mathbb{R}^m$, that is, $n$ inputs and $m$ outputs, then forward mode costs about $n$ forward passes (one per input, each propagating a tangent), and reverse mode costs about $m$ forward passes worth of work (one per output). Neural-network training is the extreme case where $n$ is enormous (billions of parameters) and $m$ is one (the scalar loss). Reverse mode is then unbeatable.
Some concrete numbers help. For a one-billion-parameter model, one forward + one backward pass is roughly twice the cost of a forward pass alone, call it 2×, and gives us all one billion gradients in one go. Forward mode would need one billion separate runs of the network to get the same gradients, each costing one forward pass. The ratio is on the order of half a billion. That is the difference between training a modern language model in weeks and never being able to train one at all.
The cost ratio is sometimes called the cheap-gradient principle: gradients of a scalar function cost only a small constant times the cost of evaluating the function itself. This is the single most important fact about reverse mode and the reason it underpins all of modern deep learning. In practical terms, the constant is usually between two and four, depending on how much recomputation the framework trades for memory savings.
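One way to see the constant for yourself is to time a forward pass with and without the backward pass attached. The sketch below uses PyTorch on a toy multilayer perceptron; the layer sizes, iteration count, and measured ratio are illustrative and will vary with hardware and framework version.

```python
import time
import torch

# A toy MLP; the sizes are illustrative.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1),
)
x = torch.randn(256, 1024)

def forward_only():
    with torch.no_grad():        # no graph is recorded
        return model(x).sum()

def forward_and_backward():
    model.zero_grad()
    loss = model(x).sum()        # forward pass, graph recorded
    loss.backward()              # reverse walk over that graph
    return loss

for name, fn in [("forward only", forward_only),
                 ("forward + backward", forward_and_backward)]:
    fn()                         # warm-up
    start = time.perf_counter()
    for _ in range(20):
        fn()
    print(f"{name:>18}: {(time.perf_counter() - start) / 20 * 1e3:.2f} ms")
```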
You should also notice what reverse mode is not. It is not a numerical approximation. The output of reverse-mode autodiff agrees with the analytic gradient up to floating-point round-off, the same way a hand-derived formula does. It is not slow. It is the asymptotically optimal way to compute the gradient of a scalar function with respect to many parameters. And it is not specific to neural networks: any function expressible as a computational graph of differentiable primitives can be backpropagated, including physics simulators, rendering pipelines, and SAT solvers with relaxed constraints.
Worked example: $f(x_1, x_2) = (x_1 + x_2) \cdot x_1$
Let us take the algorithm out for a spin on a function small enough to do entirely by hand. Set $x_1 = 3$ and $x_2 = 4$. The function is
$$ f(x_1, x_2) = (x_1 + x_2) \cdot x_1. $$
Step one: build the computational graph. We name each intermediate so we can refer to it. Let $$ v_1 = x_1, \quad v_2 = x_2, \quad v_3 = v_1 + v_2, \quad v_4 = v_3 \cdot v_1. $$ The output of the graph is $v_4$, which equals $f$. Notice that $v_1$ feeds two consumers: it appears once as a summand in $v_3$ and again as a multiplier in $v_4$. This is the case where the chain rule has to sum over multiple paths, and it is the most instructive part of the example.
Step two: forward pass. Substitute the numbers in. $$ v_1 = 3, \quad v_2 = 4, \quad v_3 = 3 + 4 = 7, \quad v_4 = 7 \cdot 3 = 21. $$ We have $f(3, 4) = 21$. Crucially, we record all four numbers. We are about to need $v_1, v_3$ in the backward pass; if we had thrown them away after computing $v_4$ we would not be able to take the gradient.
Step three: backward pass. Initialise the adjoint of the output: $\bar v_4 = 1$. Now walk the graph backwards. The reverse topological order is $v_4, v_3, v_2, v_1$.
Visit $v_4 = v_3 \cdot v_1$. The local partials are $\partial v_4 / \partial v_3 = v_1 = 3$ and $\partial v_4 / \partial v_1 = v_3 = 7$. Accumulate the adjoint contributions: $$ \bar v_3 \mathrel{+}= 3 \cdot \bar v_4 = 3, \qquad \bar v_1 \mathrel{+}= 7 \cdot \bar v_4 = 7. $$ After this step, $\bar v_3 = 3$ and $\bar v_1 = 7$ (they were zero before).
Visit $v_3 = v_1 + v_2$. The local partials are $\partial v_3 / \partial v_1 = 1$ and $\partial v_3 / \partial v_2 = 1$. Accumulate: $$ \bar v_1 \mathrel{+}= 1 \cdot \bar v_3 = 3, \qquad \bar v_2 \mathrel{+}= 1 \cdot \bar v_3 = 3. $$ After this step, $\bar v_1 = 7 + 3 = 10$ and $\bar v_2 = 3$. The accumulation into $\bar v_1$ is exactly the place where the multivariable chain rule sums over the two paths from $v_1$ to $v_4$.
The remaining nodes $v_1, v_2$ are leaves; there is no computation upstream of them to pass adjoints to. The walk is finished.
Final adjoints: $$ \bar v_1 = 10 = \frac{\partial f}{\partial x_1}, \qquad \bar v_2 = 3 = \frac{\partial f}{\partial x_2}. $$
Verification by elementary calculus. Expand the function: $f = (x_1 + x_2) x_1 = x_1^2 + x_1 x_2$. Then $\partial f / \partial x_1 = 2 x_1 + x_2 = 2 \cdot 3 + 4 = 10$ and $\partial f / \partial x_2 = x_1 = 3$. The hand-derived gradient and the autodiff gradient agree exactly. The whole point of automatic differentiation is that this agreement is guaranteed by construction, no matter how complicated the graph becomes.
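The same computation can be checked against a production autodiff system. Assuming PyTorch is installed, the following reproduces the forward value and both adjoints:

```python
import torch

x1 = torch.tensor(3.0, requires_grad=True)
x2 = torch.tensor(4.0, requires_grad=True)

f = (x1 + x2) * x1   # forward pass; the graph is recorded as a side effect
f.backward()         # reverse walk, seeded with the output adjoint 1

print(f.item())        # 21.0
print(x1.grad.item())  # 10.0
print(x2.grad.item())  # 3.0
```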
Local Jacobians for common operations
In real software the work of writing the backward pass is reduced to choosing the right rule for each kind of node. Each rule says: if the forward operation looks like this, then the contribution from each input adjoint to its parent's adjoint looks like that. Together these rules form a small library, and a complete framework such as PyTorch or JAX is essentially that library plus a graph-walking driver.
Below is the working set of rules for the operations that occur in nearly every neural network. The pattern is always the same: the upstream adjoint $\bar v$ is the gradient of the loss with respect to the output of the node, and we write down what each input adjoint receives.
- Add ($v = a + b$): $\bar a \mathrel{+}= \bar v$, $\bar b \mathrel{+}= \bar v$. Addition just copies the gradient unchanged to both inputs. Each summand contributes equally to the output, so each carries an equal share of the blame.
- Subtract ($v = a - b$): $\bar a \mathrel{+}= \bar v$, $\bar b \mathrel{-}= \bar v$. The minuend behaves like an addition; the subtrahend gets the negative.
- Multiply ($v = a \cdot b$): $\bar a \mathrel{+}= b \cdot \bar v$, $\bar b \mathrel{+}= a \cdot \bar v$. Each input is scaled by the value of the other. This is why we had to record forward values: the backward formula refers to them.
- Divide ($v = a / b$): $\bar a \mathrel{+}= \bar v / b$, $\bar b \mathrel{-}= a \bar v / b^2$. Quotient rule, written as a backward update.
- Power ($v = a^n$, with constant $n$): $\bar a \mathrel{+}= n \, a^{n-1} \bar v$. The familiar derivative of a power.
- Exponential ($v = e^a$): $\bar a \mathrel{+}= v \cdot \bar v$. We can reuse the already-computed forward value $v = e^a$ instead of recomputing the exponential.
- Logarithm ($v = \ln a$): $\bar a \mathrel{+}= \bar v / a$. Provided $a > 0$, of course.
- Sigmoid ($v = \sigma(a) = 1/(1 + e^{-a})$): $\bar a \mathrel{+}= v(1 - v) \bar v$. The derivative of the sigmoid has the elegant form $\sigma'(a) = \sigma(a)(1 - \sigma(a))$, so once again we can reuse the forward value and avoid extra exponentials.
- ReLU ($v = \max(0, a)$): $\bar a \mathrel{+}= \mathbb{1}[a > 0] \cdot \bar v$. The gradient passes through unchanged when the input is positive and is zeroed out otherwise. The non-differentiability at exactly zero is conventionally resolved by setting the derivative there to zero or to one; in practice the choice almost never matters, because exact zeros essentially never occur in floating-point arithmetic.
- Matrix multiplication ($\mathbf{V} = \mathbf{A} \mathbf{B}$): $\bar{\mathbf{A}} \mathrel{+}= \bar{\mathbf{V}} \mathbf{B}^\top$, $\bar{\mathbf{B}} \mathrel{+}= \mathbf{A}^\top \bar{\mathbf{V}}$. Two further matrix multiplies, with the partners transposed. This single rule, applied at every linear layer of every neural network, accounts for a large share of the floating-point operations performed worldwide. Because the rule is itself a matrix multiply, it can run on the same GPU tensor cores as the forward pass.
These rules are independent of the rest of the graph. You can study and verify each one in isolation. The framework's job is to look up the right rule for each node it encounters during the backward walk and to plumb the adjoints from one node to the next.
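A useful habit when implementing or debugging a local rule is to compare it against a finite-difference estimate. The sketch below uses NumPy to apply the matrix-multiplication rule to the scalar loss $\mathcal{L} = \sum_{ij} (\mathbf{A}\mathbf{B})_{ij}$ and checks one entry of the resulting gradient; the shapes, random seed, and probed entry are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

# Loss = sum of all entries of V = A @ B, so the upstream adjoint of V is all ones.
V_bar = np.ones((3, 2))

# The local rule for matrix multiplication, as stated above.
A_bar = V_bar @ B.T
B_bar = A.T @ V_bar

# Check one entry of A_bar against a central finite difference.
eps = 1e-6
E = np.zeros_like(A)
E[1, 2] = eps
numeric = (np.sum((A + E) @ B) - np.sum((A - E) @ B)) / (2 * eps)
print(A_bar[1, 2], numeric)   # the two numbers should agree to many decimal places
```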
Memory cost
The forward pass uses memory proportional to the depth of the graph if you only need the final value. The backward pass needs much more, because the local rules above repeatedly refer to forward-pass values. Multiply needs $a$ and $b$; sigmoid needs $\sigma(a)$; matrix multiply needs $\mathbf{A}$ and $\mathbf{B}$. In the standard implementation, every intermediate activation produced by the forward pass has to be retained until its corresponding backward step has run.
For a 100-layer transformer with batch size 32, sequence length 2048, and hidden width 4096, the activations alone occupy tens of gigabytes. Memory, not arithmetic, is the binding constraint on training large models.
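That figure is easy to reproduce with a back-of-envelope calculation. The sketch below assumes a single stored activation tensor per layer in 16-bit precision, which understates a real transformer, where the attention and MLP blocks each retain several intermediates per layer:

```python
# Illustrative estimate using the numbers quoted above.
layers, batch, seq_len, hidden = 100, 32, 2048, 4096
bytes_per_value = 2                       # bfloat16 / float16

values_per_layer = batch * seq_len * hidden
total_bytes = layers * values_per_layer * bytes_per_value
print(f"{total_bytes / 1e9:.0f} GB")      # roughly 54 GB
```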
Three families of mitigation are common. Gradient checkpointing: instead of storing every activation, store only a sparse subset (every $k$th layer, say) and recompute the rest by re-running the forward pass during the backward walk. With $k$ chosen near $\sqrt{N}$ for a network of depth $N$, the stored-activation memory drops to $O(\sqrt{N})$ while the compute cost rises by roughly 30%, about one extra forward pass of recomputation. This is the default in most large-model training pipelines.
Mixed precision: store activations in bfloat16 or float16 instead of float32, halving the memory bill. A small number of sensitive accumulators are kept at higher precision to preserve numerical fidelity.
Activation offloading: stream activations off the GPU to CPU memory or even to NVMe storage during the forward pass and stream them back in for the backward pass. The bandwidth cost is real but, for the largest models, it is the only way to fit the activations at all.
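In PyTorch, the first of these mitigations is exposed through torch.utils.checkpoint. The sketch below is a minimal illustration with made-up layer sizes; it assumes a reasonably recent PyTorch version, in which passing use_reentrant=False is the recommended mode.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A stack of small blocks standing in for transformer layers.
layers = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
     for _ in range(8)]
)
x = torch.randn(16, 512)

out = x
for layer in layers:
    # Activations inside `layer` are not stored; they are recomputed
    # from the block's input during the backward pass.
    out = checkpoint(layer, out, use_reentrant=False)

loss = out.sum()
loss.backward()   # gradients of every layer's weights are still produced
```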
How frameworks implement it
PyTorch implements reverse-mode autodiff with a dynamic graph, also called define-by-run. Every tensor created with requires_grad=True records the operation that produced it, along with references to its parents. When you call loss.backward(), the framework walks this recorded graph in reverse topological order and accumulates gradients into a .grad attribute on every leaf tensor. The graph can change shape from one iteration to the next, different sequence lengths, conditional branches, even Python loops, because it is rebuilt every forward pass.
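The define-by-run style means ordinary Python control flow participates in the recorded graph. A small illustration, with made-up values, where the structure of the graph depends on the data seen at run time:

```python
import torch

w = torch.tensor(0.5, requires_grad=True)

def loss_fn(x):
    # Plain Python branching and looping; the recorded graph differs per call.
    y = x * w
    if y.item() > 1.0:
        y = y * y
    for _ in range(int(x.item()) % 3):
        y = y + w
    return y

loss = loss_fn(torch.tensor(4.0))
loss.backward()      # walks whichever graph this particular call built
print(w.grad)        # tensor(17.) for this input
```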
JAX takes a different route. It traces a Python function once, building a static intermediate representation, and then exposes a small set of pure transformations: grad for the gradient, jit for compilation, vmap for batched evaluation, pmap for distributed computation. The static graph is more rigid than PyTorch's, but the rigidity enables aggressive XLA compilation and very efficient kernel fusion. JAX is the framework of choice in much of Google's research output and increasingly in the alignment and interpretability community.
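The worked example from earlier in this section looks like this in JAX's functional style: grad takes a function and returns a new function that computes the gradient.

```python
import jax

def f(x1, x2):
    return (x1 + x2) * x1

# argnums selects which arguments to differentiate with respect to.
df = jax.grad(f, argnums=(0, 1))

print(f(3.0, 4.0))    # 21.0
print(df(3.0, 4.0))   # gradients 10.0 and 3.0
```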
TensorFlow now offers both styles (eager mode and tf.function), MLX from Apple is dynamic in the PyTorch style, and small educational libraries like Andrej Karpathy's micrograd implement the same ideas in a few hundred lines. Modulo numerical artefacts at the rounding-error level, all of these frameworks produce identical gradients. Where they differ is in the engineering: graph capture, kernel fusion, distributed training, hardware support, and the amount of ceremony required to use them. For research, PyTorch is the dominant choice today; for production at the largest scale, the choice is usually whatever the deployment team is already comfortable with.
What you should take away
- Backpropagation is reverse-mode automatic differentiation applied to a scalar loss. It is not a separate algorithm. It is one specific use of a more general idea.
- The recipe is three lines: run the forward pass and record everything; set the output adjoint to one; walk the graph in reverse, accumulating each child's local gradient into its parents.
- Reverse mode is cheap: it gives you all the gradients of a scalar function at the cost of about two forward passes. This is the cheap-gradient principle, and it is the reason large neural networks are trainable at all.
- The library of local rules is small. Add, multiply, matmul, the activation functions, and a handful of others are enough to differentiate any standard network. Each rule can be checked in isolation.
- Memory, not arithmetic, is the practical bottleneck. Activation storage dominates the memory bill of training, and the techniques used to manage it, gradient checkpointing, mixed precision, offloading, are first-class engineering concerns, not afterthoughts.