Glossary

Chain Rule

The Chain Rule is the single most important result in calculus for machine learning. It tells us how to differentiate a composition of functions. In its simplest form, if $y = f(g(x))$, then $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$. The derivative of the composition is the product of the derivatives, each evaluated at the appropriate input. The rule extends naturally to compositions of any depth: $\frac{d}{dx}f(g(h(x))) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$.
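The rule is easy to check numerically. A minimal sketch, using the hypothetical composition $f(u) = \sin(u)$, $g(x) = x^2$, so that $\frac{dy}{dx} = \cos(x^2) \cdot 2x$, compared against a central finite difference:

```python
import math

# Chain rule for y = f(g(x)) with f(u) = sin(u), g(x) = x**2:
#   dy/dx = f'(g(x)) * g'(x) = cos(x**2) * 2*x
def f(u):
    return math.sin(u)

def g(x):
    return x ** 2

def dydx(x):
    return math.cos(g(x)) * 2 * x  # f'(g(x)) * g'(x)

# Central finite-difference check of the analytic derivative.
x, h = 1.3, 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(abs(dydx(x) - numeric) < 1e-6)
```

The same check works for deeper compositions: each extra layer simply contributes one more factor to the product.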

Why is the chain rule so important for AI? Because a neural network is a deeply nested composition of functions: the input is multiplied by a weight matrix, passed through a nonlinearity, multiplied by another matrix, passed through another nonlinearity, and so on. Computing the gradient of the loss with respect to any particular weight requires differentiating through every subsequent layer. The chain rule provides the mathematical machinery to do so systematically.
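To make this concrete, here is a sketch for a hypothetical one-hidden-unit network, all scalars: $h = \tanh(w_1 x)$, $y = w_2 h$, $L = (y - t)^2$. The gradient of $L$ with respect to the first weight $w_1$ passes through every later stage, one chain-rule factor per stage:

```python
import math

# h = tanh(w1 * x); y = w2 * h; L = (y - t)**2
def loss_and_grad_w1(w1, w2, x, t):
    h = math.tanh(w1 * x)
    y = w2 * h
    L = (y - t) ** 2
    # Chain rule: dL/dw1 = dL/dy * dy/dh * dh/dw1
    dL_dy = 2 * (y - t)
    dy_dh = w2
    dh_dw1 = (1 - h ** 2) * x  # tanh'(u) = 1 - tanh(u)**2
    return L, dL_dy * dy_dh * dh_dw1

# Finite-difference check of the chain-rule gradient.
w1, w2, x, t = 0.5, -1.2, 0.8, 0.3
step = 1e-6
Lp, _ = loss_and_grad_w1(w1 + step, w2, x, t)
Lm, _ = loss_and_grad_w1(w1 - step, w2, x, t)
_, grad = loss_and_grad_w1(w1, w2, x, t)
print(abs(grad - (Lp - Lm) / (2 * step)) < 1e-5)
```

In a real network each scalar factor becomes a Jacobian-vector product, but the structure of the computation is identical.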

Backpropagation is simply the chain rule applied algorithmically to the computational graph of a neural network. During the backward pass, gradients are computed layer by layer, starting from the output and working backward, reusing intermediate derivatives via dynamic programming. This makes gradient computation cost only about twice as much as a forward pass—a fact that makes training deep networks with billions of parameters computationally tractable. Without the chain rule, deep learning as we know it would not exist.

Related terms: Backpropagation, Derivative, Gradient

Also defined in: Textbook of AI