Chain Rule, Glossary, Textbook of AI

The chain rule of calculus gives the derivative of a composition of functions. For scalar functions $f$ and $g$,

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x).$$
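As a quick sanity check, the scalar formula can be compared against a finite-difference estimate. The sketch below is illustrative only: the choices $f(u) = \sin u$ and $g(x) = x^2$, the evaluation point, and the helper names are assumptions, not part of the glossary entry.

```python
import math

def g(x):
    return x * x              # inner function g(x) = x^2 (illustrative choice)

def g_prime(x):
    return 2.0 * x

def f(u):
    return math.sin(u)        # outer function f(u) = sin(u) (illustrative choice)

def f_prime(u):
    return math.cos(u)

def chain_rule_derivative(x):
    # d/dx f(g(x)) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

def central_difference(fn, x, eps=1e-6):
    # Symmetric finite-difference estimate of fn'(x)
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

x0 = 1.3
analytic = chain_rule_derivative(x0)
numeric = central_difference(lambda x: f(g(x)), x0)
print(analytic, numeric)      # the two values should agree to several decimal places
```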

For multivariable functions $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^k$, the chain rule says that the Jacobian of the composition is the product of Jacobians:

$$J_{g \circ f}(x) = J_g(f(x)) \cdot J_f(x).$$
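The Jacobian identity can be checked numerically in the same way. The following is a minimal sketch with hand-picked maps $f: \mathbb{R}^2 \to \mathbb{R}^3$ and $g: \mathbb{R}^3 \to \mathbb{R}^2$; the specific functions and the helpers `matmul` and `numerical_jacobian` are assumptions made for illustration. Note that the shapes compose as $(2 \times 3)(3 \times 2) = 2 \times 2$.

```python
import math

def f(x):
    # f: R^2 -> R^3 (illustrative choice)
    return [x[0] * x[1], math.sin(x[0]), x[1] ** 2]

def jac_f(x):
    # 3 x 2 Jacobian of f, derived by hand
    return [[x[1], x[0]],
            [math.cos(x[0]), 0.0],
            [0.0, 2.0 * x[1]]]

def g(y):
    # g: R^3 -> R^2 (illustrative choice)
    return [y[0] + y[1] * y[2], y[0] * y[2]]

def jac_g(y):
    # 2 x 3 Jacobian of g, derived by hand
    return [[1.0, y[2], y[1]],
            [y[2], 0.0, y[0]]]

def matmul(a, b):
    # Plain nested-list matrix product
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def numerical_jacobian(fn, x, eps=1e-6):
    # Central differences, one input coordinate at a time
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        cols.append([(p - m) / (2.0 * eps) for p, m in zip(fn(xp), fn(xm))])
    # cols[j][i] holds d(out_i)/d(x_j); transpose into row-major form
    return [[cols[j][i] for j in range(len(x))] for i in range(len(cols[0]))]

x0 = [0.7, -1.2]
analytic = matmul(jac_g(f(x0)), jac_f(x0))            # J_g(f(x)) . J_f(x)
numeric = numerical_jacobian(lambda x: g(f(x)), x0)   # Jacobian of g o f
print(analytic)
print(numeric)
```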

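A neural network is a composition of layers $f_L \circ f_{L-1} \circ \cdots \circ f_1$. Its loss $L = \ell(f_L(\cdots f_1(x)))$, viewed as a function of any layer's parameters $\theta_l$, has a gradient that can be computed by repeated application of the chain rule. (Here $L$ on its own denotes the loss, while the subscript in $f_L$ is the number of layers, and $f_l$ also stands for the output of layer $l$.)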

$$\frac{\partial L}{\partial \theta_l} = \frac{\partial L}{\partial f_L} \cdot \frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_{l+1}}{\partial f_l} \cdot \frac{\partial f_l}{\partial \theta_l}.$$

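Evaluating this product from scratch for each layer repeats work: the expression for layer $l$ contains one Jacobian factor for every layer above it, so treating all layers independently requires a number of Jacobian products that grows quadratically with network depth. Backpropagation computes the same gradients at a cost that is a small constant multiple of the forward pass. It computes each intermediate quantity $\delta_l = \partial L / \partial f_l$ once, working backward from the output via $\delta_l = \delta_{l+1} \cdot \partial f_{l+1} / \partial f_l$, and reuses $\delta_l$ for every parameter in layer $l$.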
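The reuse of $\delta_l$ can be sketched on a toy network. The example below assumes a chain of scalar layers $h_l = \tanh(w_l h_{l-1})$ with a squared-error loss; this architecture, the weights, and the function names are hypothetical choices for illustration. It counts how many Jacobian-factor multiplications each approach performs.

```python
import math

def forward(weights, x):
    # hs[0] is the input; hs[l] = tanh(w_l * hs[l-1]) for l = 1..n
    hs = [x]
    for w in weights:
        hs.append(math.tanh(w * hs[-1]))
    return hs

def loss_fn(weights, x, y):
    return 0.5 * (forward(weights, x)[-1] - y) ** 2

def backprop(weights, x, y):
    hs = forward(weights, x)
    grads = [0.0] * len(weights)
    delta = hs[-1] - y                      # d(loss)/d(h_n)
    steps = 0
    for l in range(len(weights), 0, -1):
        local = 1.0 - hs[l] ** 2            # derivative of tanh at layer l
        grads[l - 1] = delta * local * hs[l - 1]   # d(loss)/d(w_l)
        delta = delta * local * weights[l - 1]     # now d(loss)/d(h_{l-1})
        steps += 1
    return grads, steps

def naive(weights, x, y):
    hs = forward(weights, x)
    n = len(weights)
    grads, steps = [], 0
    for l in range(1, n + 1):
        delta = hs[-1] - y
        for k in range(n, l, -1):           # rebuild d(loss)/d(h_l) from scratch
            delta *= (1.0 - hs[k] ** 2) * weights[k - 1]
            steps += 1
        grads.append(delta * (1.0 - hs[l] ** 2) * hs[l - 1])
    return grads, steps

def finite_difference_grads(weights, x, y, eps=1e-6):
    # Independent check of the gradients via central differences
    out = []
    for i in range(len(weights)):
        wp, wm = list(weights), list(weights)
        wp[i] += eps
        wm[i] -= eps
        out.append((loss_fn(wp, x, y) - loss_fn(wm, x, y)) / (2.0 * eps))
    return out

weights = [0.5, -1.1, 0.8, 1.3, -0.7, 0.9]
bp_grads, bp_steps = backprop(weights, 0.6, 0.2)
naive_grads, naive_steps = naive(weights, 0.6, 0.2)
numeric_grads = finite_difference_grads(weights, 0.6, 0.2)
print(bp_steps, naive_steps)
```

Both functions return the same gradients. With six layers the naive version performs 15 backward-factor multiplications against 6 for backpropagation, and in this sketch the gap grows as $n(n-1)/2$ versus $n$.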

Modern automatic-differentiation frameworks (PyTorch, JAX, TensorFlow) apply the chain rule over a computation graph: every primitive operation supplies a vector-Jacobian product (VJP) function, and the framework composes these in reverse order during the backward pass. This lets researchers write any function built from differentiable primitives and obtain its gradient without manual derivation.
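The idea of composing per-primitive VJPs in reverse can be illustrated with a toy scalar implementation. This is a sketch of the general technique only, not how PyTorch, JAX, or TensorFlow are implemented; the names `Var`, `tanh`, and `backward` are hypothetical, and only addition, multiplication, and tanh between `Var` objects are supported.

```python
import math

class Var:
    """Scalar node in a toy computation graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # tuple of (parent Var, VJP function) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Var(self.value + other.value,
                   ((self, lambda g: g), (other, lambda g: g)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, lambda g: g * other.value),
                    (other, lambda g: g * self.value)))

def tanh(x):
    t = math.tanh(x.value)
    # d tanh(z)/dz = 1 - tanh(z)^2
    return Var(t, ((x, lambda g: g * (1.0 - t * t)),))

def backward(output):
    # Depth-first post-order lists every node after its inputs
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse order: each node is processed after everything that consumes it
    for node in reversed(order):
        for parent, vjp in node.parents:
            parent.grad += vjp(node.grad)

# One neuron: y = tanh(w * x + b)
w, x, b = Var(0.5), Var(2.0), Var(-0.3)
y = tanh(w * x + b)
backward(y)
print(w.grad, x.grad, b.grad)
```

Gradients are accumulated with `+=` so that a value used in several places receives the sum of the contributions from each use.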

The chain rule extends to higher-order derivatives via Faà di Bruno's formula and to non-Euclidean manifolds via the manifold's differential structure. Both extensions are relevant to higher-order optimisation methods (Newton's method, natural gradients) and Riemannian neural networks.
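As a concrete instance, the second-order case of Faà di Bruno's formula for scalar functions follows from differentiating the first-order chain rule once more and applying the product rule:

$$\frac{d^2}{dx^2} f(g(x)) = f''(g(x)) \cdot g'(x)^2 + f'(g(x)) \cdot g''(x).$$

For example, taking $f(u) = \sin u$ and $g(x) = x^2$ (an illustrative choice) gives $-4x^2 \sin(x^2) + 2\cos(x^2)$.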

Interactive

One neuron, forward pass and backward pass. Forward pass produces a value; backprop sends gradients back through the same graph.

The chain rule on a computation graph. Gradients multiply backward along the edges of a small graph from output to input.

Gradients flow backward through the layers. Forward pass produces a loss. Reverse pass propagates the gradient through every layer, multiplying local Jacobians.


Related terms: Backpropagation, Gradient

Discussed in:

Chapter 10: Training & Optimisation, Backpropagation
