Backpropagation, Glossary, Textbook of AI

Backpropagation is the algorithm for computing the gradient of a loss function with respect to every weight in a multilayer neural network. It applies the chain rule of calculus recursively, working backward from the loss through the layers, computing local gradients at each node and propagating them down. For a network with L layers, backpropagation computes the full gradient in time O(forward inference), the algorithmic miracle that makes training large networks tractable.

The algorithm has been independently derived several times: by Henry Kelley (1960) and Arthur Bryson (1961) as a control-theoretic adjoint method, by Seppo Linnainmaa (1970) for general computational graphs, by Paul Werbos (1974) in his Harvard PhD thesis specifically for neural networks, by David Parker (1985) and Yann LeCun (1985) independently. The 1986 Nature paper by Rumelhart, Hinton and Williams was the version the AI community read; the demonstration that backprop could discover useful internal representations triggered the late-1980s connectionist revival.

Modern deep-learning frameworks (PyTorch, TensorFlow, JAX) implement backpropagation as automatic differentiation in a computation graph. The forward pass records every operation; the backward pass walks the graph in reverse, applying each operation's predefined gradient rule. The user can write any differentiable function and obtain its gradient automatically, freeing researchers from the error-prone manual derivations that characterised pre-2010 neural-network research.

Backpropagation faces well-known difficulties, vanishing and exploding gradients in deep networks (mitigated by ReLU activations, residual connections, batch normalisation), saddle points (the dominant difficulty in high-dimensional optimisation, generally escaped by stochastic gradient descent), and the need for differentiable architectures (which precludes some discrete operations). Despite these limits, every neural network trained today is trained by backpropagation or a close variant.

Mathematics

Consider a feed-forward network with $L$ layers. The forward pass at layer $l$ computes pre-activations $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and activations $a^{(l)} = \phi(z^{(l)})$, with $a^{(0)} = x$ the input. The loss $\mathcal{L}$ depends on the final activation $a^{(L)}$ and the target $y$.

Define the error signal $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$. The chain rule gives the recurrence

$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \phi'(z^{(L)})$$

$$\delta^{(l)} = \bigl((W^{(l+1)})^\top \delta^{(l+1)}\bigr) \odot \phi'(z^{(l)})$$

where $\odot$ is element-wise multiplication. Once the $\delta^{(l)}$ are computed, the parameter gradients are

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}.$$

The crucial efficiency insight: each $\delta^{(l)}$ is computed once and reused for every parameter in layer $l$, giving total cost $O(L \cdot p)$ where $p$ is the parameter count, exactly the cost of one forward pass. This makes training a network with billions of parameters tractable.

In modern frameworks (PyTorch, JAX), backpropagation is implemented as reverse-mode automatic differentiation: every primitive operation declares its vector-Jacobian product $\bar y \mapsto \bar y \cdot J$, and the framework composes these in reverse order through the computation graph. The user writes any differentiable function and the gradient is computed automatically.

Interactive

One neuron, forward pass and backward pass. Forward pass produces a value; backprop sends gradients back through the same graph.

The chain rule on a computation graph. Gradients multiply backward along the edges of a small graph from output to input.

Gradients flow backward through the layers. Forward pass produces a loss. Reverse pass propagates the gradient through every layer, multiplying local Jacobians.

Video

Related terms: Chain Rule, Gradient Descent, Neural Network, geoffrey-hinton, paul-werbos

Discussed in:

Chapter 9: Neural Networks, Training Neural Networks
Chapter 10: Training & Optimisation, Backpropagation

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.