Backpropagation is the algorithm at the heart of training neural networks. It provides an efficient method for computing the gradient of the loss function with respect to every weight in a neural network, enabling gradient-based optimisation of models with millions or billions of parameters. The algorithm is simply the chain rule of calculus applied systematically to the computational graph of the network.
The algorithm has two phases. In the forward pass, the network computes its output layer by layer, storing the intermediate activations. In the backward pass, the gradient of the loss is propagated backward through the network: at each layer, the incoming gradient is multiplied by the local Jacobian to produce the outgoing gradient. Mathematically, the weight gradient at layer $l$ is:
$$\frac{\partial L}{\partial W_l} = \boldsymbol{\delta}_l \mathbf{h}_{l-1}^T$$
where $\boldsymbol{\delta}_l$ is the error signal, defined recursively in terms of $\boldsymbol{\delta}_{l+1}$ and the layer's Jacobian.
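The two phases and the weight-gradient formula can be sketched concretely. The following is a minimal NumPy illustration (not from the source) using a hypothetical two-layer network with a tanh hidden layer and a squared-error loss; the variable names `W1`, `W2`, `delta1`, `delta2` are illustrative choices, with $\boldsymbol{\delta}_l$ computed recursively and each weight gradient formed as the outer product $\boldsymbol{\delta}_l \mathbf{h}_{l-1}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network: h1 = tanh(W1 x), y = W2 h1 (biases omitted).
x = rng.standard_normal(3)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
target = rng.standard_normal(2)

# Forward pass: compute layer by layer, storing intermediate activations.
h1 = np.tanh(W1 @ x)
y = W2 @ h1
loss = 0.5 * np.sum((y - target) ** 2)

# Backward pass: delta_l is the error signal at layer l.
delta2 = y - target                     # dL/dy for 0.5 * ||y - t||^2
dW2 = np.outer(delta2, h1)              # dL/dW2 = delta2 h1^T
delta1 = (W2.T @ delta2) * (1 - h1**2)  # chain rule through tanh's Jacobian
dW1 = np.outer(delta1, x)               # dL/dW1 = delta1 x^T
```

Note how the backward pass reuses `h1` from the forward pass: storing activations is what makes the recursion for $\boldsymbol{\delta}_l$ cheap.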
The elegance of backpropagation lies in its efficiency. A naive finite-difference approach would require $O(P)$ forward passes to estimate the gradient with respect to each of $P$ parameters. Backpropagation computes the entire gradient in a single backward pass, costing only about twice as much as a forward pass, by reusing intermediate derivatives via dynamic programming. Modern deep learning frameworks implement backpropagation via automatic differentiation, handling arbitrary architectures without manual derivation. Without backpropagation, training deep networks at scale would be computationally impractical.
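The cost contrast above can be demonstrated directly. The sketch below (an illustration, not from the source) uses a hypothetical one-layer least-squares loss: backpropagation produces the full gradient from one backward computation, while the naive finite-difference estimate needs one extra forward pass per parameter, and the two agree numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: loss(W) = 0.5 * ||W x - t||^2.
x = rng.standard_normal(3)
W = rng.standard_normal((2, 3))
t = rng.standard_normal(2)

def loss(W):
    return 0.5 * np.sum((W @ x - t) ** 2)

# Backpropagation: the whole gradient in one backward pass.
grad_bp = np.outer(W @ x - t, x)

# Naive alternative: one perturbed forward pass per parameter, O(P) passes.
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy()
        Wp[i, j] += eps
        grad_fd[i, j] = (loss(Wp) - loss(W)) / eps

# The two gradients agree to within finite-difference error.
assert np.allclose(grad_bp, grad_fd, atol=1e-4)
```

For this $2 \times 3$ weight matrix the naive loop already takes six forward passes; for a model with millions of parameters, the gap makes the naive approach unusable.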
Related terms: Chain Rule, Gradient Descent, Neural Network
Discussed in:
- Chapter 9: Neural Networks — Backpropagation
- Chapter 3: Calculus — The Chain Rule
Also defined in: Textbook of AI, Textbook of Medical AI