3.8 Forward-mode versus reverse-mode AD

There are exactly two sensible ways to apply the chain rule across a computational graph, and modern automatic-differentiation libraries implement both. The first, forward mode, propagates tangents alongside ordinary values as the computation runs left to right. The second, reverse mode, runs the computation forwards once to record what happened, then propagates adjoints backwards from the output to the inputs. Reverse mode is what backpropagation uses; forward mode comes into its own when there are very few inputs and very many outputs. The distinction is not pedantic: it explains why training a deep network feels like a backward pass, why scientific simulations sometimes use a different flavour of differentiation, and why frameworks such as JAX expose both jvp and vjp as first-class operations.

In §3.7 we saw reverse-mode AD as the engine that powers backpropagation through a graph of additions and multiplications. This section places that result in its proper context. We will see that reverse mode is not the only way, merely the correct way for the shape of problem that machine learning hands us. By the end of the section you should be able to look at any differentiation problem and say, before writing any code, which mode is going to be cheaper.

Symbols Used Here
$\dot v$ (tangent): $\partial v/\partial x_k$ for one chosen input $x_k$
$\bar v$ (adjoint): $\partial \mathcal{L}/\partial v$ for a fixed scalar output $\mathcal{L}$
$n$: number of inputs
$m$: number of outputs

Forward mode

Forward mode is the closest thing to the chain rule as you first learnt it at school. Imagine sweeping through the computational graph from inputs to outputs, exactly as you would when evaluating the function. Alongside each numerical value $v$ you carry a second number, the tangent $\dot v$, which represents the partial derivative of $v$ with respect to one chosen input $x_k$. The pair $(v, \dot v)$ travels through every operation as a single bundle; this is sometimes called a dual number.

The propagation rule is the chain rule applied locally. If a node $v_{\text{out}}$ depends on parents $v_1, v_2, \dots, v_p$, then

$$ \dot v_{\text{out}} = \sum_{i=1}^{p} \frac{\partial v_{\text{out}}}{\partial v_i} \, \dot v_i. $$

To start the computation, we seed the tangents. We choose one input $x_k$ and set $\dot x_k = 1$, while every other input gets $\dot x_j = 0$. The interpretation is that we are differentiating with respect to $x_k$ alone. As the forward sweep unfolds, the tangent attached to every intermediate node tells us how that node would change if $x_k$ were nudged by an infinitesimal amount. When the sweep reaches the output, its tangent is exactly the partial derivative of the output with respect to $x_k$.

So one forward pass with the seed $\dot x_k = 1$ yields one column of the Jacobian. If you want the full gradient with respect to all $n$ inputs, you must repeat the forward pass $n$ times, each time with a different seed. The cost is therefore $O(n)$ multiplied by the cost of a single forward pass.
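To make the seeding concrete, here is a minimal sketch using JAX's jvp; the function f, its shapes, and the chosen seed are made up for illustration.

```python
import jax
import jax.numpy as jnp

# Hypothetical function with n = 3 inputs and m = 2 outputs.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 3.0])
seed = jnp.array([0.0, 1.0, 0.0])           # tangent seed: dot-x_1 = 1, all others 0

value, tangent = jax.jvp(f, (x,), (seed,))  # one forward sweep with tangents attached
print(tangent)                              # the x_1 column of the Jacobian: [1., 0.]
```

One call, one seed, one column; repeating with the other two seeds would fill in the remaining columns.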

Forward mode is simple to implement. There is no graph to record, no tape to walk backwards along, no extra memory beyond the tangent attached to each value. Because every tangent is computed at the moment its parent value is computed, the implementation can be a thin wrapper around ordinary arithmetic. Most undergraduate AD tutorials begin here, and many small numerical libraries support nothing else. Its conceptual purity, though, comes at a cost the moment $n$ becomes large.
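A toy dual-number wrapper shows just how thin that wrapper can be; the class and helper names below are illustrative, not taken from any particular library.

```python
import math

class Dual:
    """A value paired with its tangent: (v, dot-v)."""
    def __init__(self, value, tangent=0.0):
        self.value = value      # v
        self.tangent = tangent  # dot-v

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def sin(d):
    # chain rule for sin: (sin v)' = cos(v) * dot-v
    return Dual(math.sin(d.value), math.cos(d.value) * d.tangent)

# Differentiate f(x, y) = x*y + sin(x) with respect to x at (2, 3).
x = Dual(2.0, 1.0)   # seed: dot-x = 1
y = Dual(3.0, 0.0)   # dot-y = 0
out = x * y + sin(x)
print(out.value, out.tangent)   # f = 6 + sin(2), df/dx = y + cos(x) = 3 + cos(2)
```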

Reverse mode

Reverse mode flips the direction of travel. We first run the computation forwards once, exactly as we would if no differentiation were needed, but we record every value and every operation as we go. This record is sometimes called the tape. Once the forward pass is complete and the scalar output $y$ is known, we begin a second sweep, this time from the output back towards the inputs.

We initialise the adjoint of the output as $\bar y = 1$. The adjoint of any other node $v$ is defined as $\bar v = \partial y / \partial v$, the sensitivity of the output to that intermediate value. The propagation rule, walked backwards along the recorded tape, is

$$ \bar v \mathrel{+}= \frac{\partial v_{\text{out}}}{\partial v} \, \bar v_{\text{out}}. $$

The accumulation matters: a node may feed several downstream consumers, and each contributes its share to the adjoint. Once the backward sweep reaches an input, the adjoint at that input is the partial derivative of $y$ with respect to that input.
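A hand-unrolled backward sweep makes the accumulation visible. In this illustrative example, $x$ feeds two consumers, so its adjoint picks up one contribution from each as we walk the tape in reverse; all variable names here are ours.

```python
import math

# forward pass for f(x, y) = x*y + sin(x): record the intermediate values
x, y = 2.0, 3.0
a = x * y            # consumer 1 of x
b = math.sin(x)      # consumer 2 of x
f = a + b

# backward pass: walk the tape in reverse, accumulating adjoints
f_bar = 1.0                      # seed: df/df = 1
a_bar = f_bar * 1.0              # f = a + b
b_bar = f_bar * 1.0
x_bar = 0.0
y_bar = 0.0
x_bar += a_bar * y               # from a = x * y
y_bar += a_bar * x
x_bar += b_bar * math.cos(x)     # from b = sin(x): second contribution to x_bar

print(x_bar, y_bar)              # df/dx = y + cos(x), df/dy = x
```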

The remarkable feature is that every input gets its derivative in a single backward sweep. The total cost is one forward pass plus one backward pass, which empirically lands somewhere between two and three times the cost of a forward pass alone, regardless of how many inputs there are. Whether the model has ten parameters or ten billion, the cost of obtaining the full gradient is essentially the same multiple of one forward evaluation. This is the single reason large neural networks can be trained at all.
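The single-sweep behaviour is easy to see through JAX's vjp, which records the forward pass and hands back a pullback that we seed with $\bar y = 1$; the function here is a made-up example.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3)            # scalar output y

x = jnp.arange(1.0, 6.0)              # five inputs
y, pullback = jax.vjp(f, x)           # forward pass, tape recorded
(x_bar,) = pullback(1.0)              # seed bar-y = 1 and sweep backwards once
print(x_bar)                          # dy/dx_i = 3 x_i^2 for every input, one sweep
```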

We do, of course, pay for this favourable runtime with memory. Every intermediate value used by the backward pass must be kept alive until the gradient has flowed through it. For deep networks this can dominate the memory budget on the GPU, and a great deal of engineering work goes into reducing it: gradient checkpointing trades a little extra computation for much less memory by recomputing activations during the backward pass; mixed-precision training stores activations in sixteen-bit floats; and activation offloading parks them on the host while the GPU is busy elsewhere. None of this changes the shape of the algorithm, only its memory footprint.
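As an illustration of the first of those techniques, JAX's jax.checkpoint (also available as jax.remat) marks a block whose activations are recomputed during the backward pass rather than stored; the toy layer and sizes below are purely illustrative.

```python
import jax
import jax.numpy as jnp

@jax.checkpoint
def layer(w, x):
    # Intermediates of this block are not kept for the backward pass;
    # they are recomputed when the backward sweep needs them.
    return jnp.tanh(w @ x)

def loss(w, x):
    h = layer(w, layer(w, x))
    return jnp.sum(h ** 2)

w = jnp.eye(512)
x = jnp.ones(512)
g = jax.grad(loss)(w, x)   # same gradient, smaller peak memory, some extra compute
```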

When forward mode wins

Forward mode dominates when $m \gg n$, that is, when the function has few inputs and many outputs. The Jacobian here is a tall, narrow rectangle: many rows, one per output, and only a few columns, one per input. Forward mode fills it column by column, each column costing one forward sweep, so the total cost is proportional to $n$. Reverse mode would fill it row by row, each row costing one backward sweep, so the total cost would be proportional to $m$. When $n$ is small and $m$ is large, forward mode wins outright.

Two examples make this concrete. First, suppose a single rate constant feeds a stiff ODE solver that produces a thousand-dimensional time series, and you want to know how every component of the output responds to that one input. Forward mode does it in one sweep with a single seed ($n = 1$). Reverse mode would need a thousand backward sweeps, one per output. Second, imagine a small parametric model with a handful of hyperparameters and a richly structured output, perhaps a rendered image. Forward mode is again the natural choice.
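A sketch of the first example in JAX: jax.jacfwd differentiates the whole trajectory with respect to the single rate constant in one forward sweep. The exponential-decay function below is a made-up stand-in for the ODE solve.

```python
import jax
import jax.numpy as jnp

def simulate(rate):
    # Illustrative stand-in for an ODE solve: 1000 outputs from 1 input.
    t = jnp.linspace(0.0, 10.0, 1000)
    return jnp.exp(-rate * t)

rate = jnp.array(0.5)
sensitivity = jax.jacfwd(simulate)(rate)   # d(output_i)/d(rate) for all i, one sweep
print(sensitivity.shape)                   # (1000,)
```

Using jax.jacrev here would instead trigger a thousand backward sweeps, one per output component.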

Many problems in scientific computing, climate modelling, and engineering design have exactly this shape. Sensitivity analysis often asks how a high-dimensional output depends on a low-dimensional set of parameters, and forward mode is the right tool for it. The same is true of computer-graphics differentiation when only a few scene variables are being tuned, of small-parameter physics simulations checked against large data tensors, and of any setting where you want the directional derivative of a vector-valued function in a single named direction. In those cases reverse mode would either waste effort or, worse, force you to keep an enormous intermediate tape in memory just to recover information that forward mode hands over for free.

When reverse mode wins

Reverse mode dominates when $n \gg m$. The Jacobian here is a short, wide rectangle: few rows, many columns. Reverse mode harvests one row per backward sweep, costing $O(m)$ sweeps in total, while forward mode would chew through $O(n)$ forward sweeps to assemble the same information.

Machine learning is the canonical case, and it is canonical by an enormous margin. A modern neural network has $n$ on the order of millions, billions, or even trillions of parameters; the loss is a single scalar, so $m = 1$. Forward mode would require one forward sweep per parameter, which is hopeless: a billion-parameter model would need a billion forward passes to extract a single gradient, putting even one optimisation step far beyond any realistic compute budget. Reverse mode runs the network twice, once forwards and once backwards, and delivers the gradient for the entire parameter vector. This is the reason backpropagation became, and remains, the central algorithm of deep learning. Whenever you see a training loop call loss.backward(), you are watching reverse mode pay back its memory debt and produce a complete gradient in a single sweep.
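The asymmetry is easy to demonstrate on a toy loss: reverse mode returns the whole gradient in one call, while forward mode needs one seeded sweep per parameter. The loss below is illustrative.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    return jnp.sum(jnp.sin(theta) * theta)   # illustrative scalar loss

theta = jnp.arange(5.0)

# Reverse mode: the entire gradient from a single backward sweep.
g_reverse = jax.grad(loss)(theta)

# Forward mode: one seeded sweep per parameter, n sweeps in total.
basis = jnp.eye(theta.shape[0])
g_forward = jnp.stack([jax.jvp(loss, (theta,), (e,))[1] for e in basis])

print(jnp.allclose(g_reverse, g_forward))    # True, but forward mode paid n sweeps
```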

Forward and reverse can be combined

The two modes are not rivals; they compose. The classical example is the Hessian-vector product, a quantity that appears in second-order optimisation, conjugate-gradient solvers, and methods such as natural gradient descent. The Hessian itself is an $n \times n$ matrix and is hopelessly expensive to materialise when $n$ is large, but the product $Hv$ for a chosen vector $v$ can be obtained without ever forming $H$.

The trick is forward-over-reverse. Reverse mode gives us the gradient $g(\theta) = \nabla \mathcal{L}(\theta)$ as a function of the parameters. We then apply forward mode to that gradient function, seeding the tangent direction with $v$. The resulting output tangent is precisely $Hv$, obtained at roughly twice the cost of an ordinary gradient evaluation. PyTorch exposes this through torch.autograd.functional.hessian and torch.autograd.functional.hvp; JAX ships no dedicated helper, but the same product is a few lines of composition built directly out of jax.grad and jax.jvp.
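Here is a minimal sketch of that composition in JAX, with an illustrative loss; the hvp helper defined below is ours, not part of the library.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    return jnp.sum(jnp.cos(theta) * theta ** 2)   # illustrative scalar loss

def hvp(f, theta, v):
    # forward over reverse: the output tangent of grad(f), seeded with v, is H @ v
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

theta = jnp.array([0.5, -1.0, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(loss, theta, v))   # first column of the Hessian, with H never materialised
```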

The opposite composition, reverse-over-forward, is also valid and sometimes preferable when the underlying computation has a particularly tall Jacobian. Real autodiff systems, especially JAX, treat both modes as primitives and expect users to compose them freely: gradients of gradients, vector-Jacobian products of Jacobian-vector products, and arbitrarily higher derivatives all fall out of the same machinery. Almost every gradient you meet from §3.9 onwards in this book is computed in reverse mode, because every loss is a scalar; but it is good to know the toolkit is wider than that.

What you should take away

  1. Forward mode propagates tangents alongside values; reverse mode propagates adjoints backwards along a recorded tape.
  2. Forward mode costs $O(n)$ forward sweeps to produce a full Jacobian; reverse mode costs $O(m)$ backward sweeps.
  3. Reverse mode wins when $n \gg m$, which is the regime of every scalar-loss training problem in machine learning.
  4. Forward mode wins when $m \gg n$, which is common in scientific computing and sensitivity analysis.
  5. The two modes compose: forward-over-reverse gives Hessian-vector products at roughly twice the cost of an ordinary gradient, which is what enables practical second-order methods.
