3.5 The chain rule for vector-valued functions

If you remember only one slogan from this chapter, make it this: backpropagation is just the chain rule, applied carefully and repeatedly. Train a neural network with a billion parameters and what the computer is really doing, deep down, is the same chain rule you met in school applied a few hundred times in a row. The trick is that schoolroom calculus deals with one number going in and one number coming out, whereas a neural network passes vectors of hundreds or thousands of numbers from layer to layer. To handle that we need a slightly bigger version of the chain rule: the multi-variable, multi-output version expressed using objects called Jacobians. Once we have it, the whole of backpropagation will fall out almost mechanically.

Where §3.4 covered scalar-output functions of vectors, this section steps up: many inputs, many outputs, and a clean rule for composing two such functions. §3.6 draws the corresponding picture as a computational graph, and §3.7 turns it into backpropagation.

Symbols Used Here
- $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, a vector-valued function
- $\mathbf{J}_{\mathbf{f}}(\mathbf{x})$, the Jacobian: the $m \times n$ matrix of partial derivatives
- $\mathbf{g} \circ \mathbf{f}$, the composition $\mathbf{g}(\mathbf{f}(\mathbf{x}))$
- $\partial f_i / \partial x_j$, entry $(i, j)$ of the Jacobian

Recap of the scalar chain rule

Before we move into vectors, let us remind ourselves of the version everyone meets first. Suppose we have two functions of a single variable, $f$ and $g$, and we want the derivative of the composition $h(x) = g(f(x))$. The chain rule says

$$ h'(x) = g'(f(x)) \cdot f'(x). $$

In words: the rate at which $h$ changes with $x$ is the rate at which $g$ responds to its own input, evaluated at $f(x)$, multiplied by the rate at which $f$ responds to $x$. Two derivatives, multiplied along the chain.

A small concrete check. Let $f(x) = x^2$ and $g(u) = \sin u$, so $h(x) = \sin(x^2)$. Then $f'(x) = 2x$ and $g'(u) = \cos u$, so $h'(x) = \cos(x^2) \cdot 2x$. We did not need to know any trigonometric identities to write that down; we just multiplied two simple derivatives at the right points.

A second example to make the pattern stick. Take $f(x) = 3x + 1$ and $g(u) = u^2$, giving $h(x) = (3x + 1)^2$. Direct expansion gives $h(x) = 9x^2 + 6x + 1$ and so $h'(x) = 18x + 6$. The chain rule does the same job without expanding: $h'(x) = g'(f(x)) \cdot f'(x) = 2(3x+1) \cdot 3 = 18x + 6$. The chain rule wins here because we never had to multiply out the square. In a deep network we would never expand the equivalent product even once; there are simply too many terms.
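If you want to see the two routes agree numerically, here is a small sketch in Python with NumPy; the finite-difference helper below is written just for this check, not taken from any library.

```python
import numpy as np

def numerical_derivative(h, x, eps=1e-6):
    """Central finite difference: approximates h'(x)."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

x = 1.3  # any test point

# Example 1: h(x) = sin(x^2); the chain rule says h'(x) = cos(x^2) * 2x.
print(numerical_derivative(lambda t: np.sin(t**2), x))
print(np.cos(x**2) * 2 * x)

# Example 2: h(x) = (3x + 1)^2; the chain rule says h'(x) = 18x + 6.
print(numerical_derivative(lambda t: (3 * t + 1)**2, x))
print(18 * x + 6)
```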

Hold on to that pattern. The vector chain rule will look almost the same, with one important change: the multiplication will be matrix multiplication instead of ordinary multiplication, and the order in which we do the multiplications will turn out to matter enormously for efficiency.

The Jacobian

When a function takes a vector in and produces a vector out, a single derivative is not enough. We need a derivative for every (output coordinate, input coordinate) pair. Stacking those into a rectangle gives us the Jacobian.

For a function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix at the point $\mathbf{x}$ is

$$ \mathbf{J}_{\mathbf{f}}(\mathbf{x}) = \begin{pmatrix} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 & \cdots & \partial f_1/\partial x_n \\ \partial f_2/\partial x_1 & \partial f_2/\partial x_2 & \cdots & \partial f_2/\partial x_n \\ \vdots & \vdots & \ddots & \vdots \\ \partial f_m/\partial x_1 & \partial f_m/\partial x_2 & \cdots & \partial f_m/\partial x_n \end{pmatrix}. $$

The shape is $m \times n$: one row for each output coordinate, one column for each input coordinate. The entry in row $i$, column $j$ is $\partial f_i / \partial x_j$, the partial derivative of the $i$-th output with respect to the $j$-th input.

The Jacobian is the natural generalisation of the two notions of derivative we already have. If $m = n = 1$, the Jacobian is a $1 \times 1$ matrix, which is just a single number, the ordinary scalar derivative. If $m = 1$ but $n$ is general, the Jacobian is a $1 \times n$ row, which is the gradient of a scalar-valued function written as a row vector. So gradients and derivatives are special cases of the Jacobian, and the chain rule we are about to write down will quietly contain everything we already know.

A worked example will fix the picture. Take $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2$ defined by

$$ \mathbf{f}(x_1, x_2) = (x_1^2 + x_2,\ x_1 x_2). $$

To build the Jacobian we differentiate each component with respect to each input. The first component is $f_1 = x_1^2 + x_2$, so $\partial f_1/\partial x_1 = 2x_1$ and $\partial f_1/\partial x_2 = 1$. The second component is $f_2 = x_1 x_2$, so $\partial f_2/\partial x_1 = x_2$ and $\partial f_2/\partial x_2 = x_1$. Stacking,

$$ \mathbf{J}_{\mathbf{f}}(\mathbf{x}) = \begin{pmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{pmatrix}. $$

At the particular point $(x_1, x_2) = (1, 2)$ this becomes

$$ \mathbf{J}_{\mathbf{f}}(1, 2) = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}. $$

That single matrix is the best linear approximation to $\mathbf{f}$ near the point $(1, 2)$. If we nudge the input by a small vector $\Delta\mathbf{x}$, the output changes, to first order, by $\mathbf{J}_{\mathbf{f}}(1, 2)\, \Delta\mathbf{x}$.
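A numerical sketch confirms both the matrix and the nudge claim; the `numerical_jacobian` helper is written just for this illustration, not taken from any library.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1**2 + x2, x1 * x2])

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian, built one column (one input coordinate) at a time."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((func(x + dx) - func(x - dx)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([1.0, 2.0])
print(numerical_jacobian(f, x0))          # approximately [[2, 1], [2, 1]]

# First-order prediction for a small nudge versus the true change in the output.
dx = np.array([0.01, 0.02])
print(f(x0 + dx) - f(x0))                 # actual change
print(numerical_jacobian(f, x0) @ dx)     # J * dx, the linear approximation
```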

The vector-valued chain rule

Now we put two such functions together. Suppose $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ and $\mathbf{g}: \mathbb{R}^m \to \mathbb{R}^p$, and form the composition $\mathbf{h} = \mathbf{g} \circ \mathbf{f}$, which means $\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{f}(\mathbf{x}))$. The vector-valued chain rule says

$$ \mathbf{J}_{\mathbf{h}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \cdot \mathbf{J}_{\mathbf{f}}(\mathbf{x}). $$

The Jacobian of the composition is the matrix product of the Jacobians, with the outer one written on the left.

Let us check that the shapes line up. The inner Jacobian $\mathbf{J}_{\mathbf{f}}$ is $m \times n$ because $\mathbf{f}$ takes $n$ inputs to $m$ outputs. The outer Jacobian $\mathbf{J}_{\mathbf{g}}$ is $p \times m$ because $\mathbf{g}$ takes $m$ inputs to $p$ outputs. A $p \times m$ matrix multiplied by an $m \times n$ matrix gives a $p \times n$ matrix, which is exactly what we need: $\mathbf{h}$ takes $n$ inputs to $p$ outputs, so its Jacobian is $p \times n$. The inner dimension $m$, the size of the vector that travels from $\mathbf{f}$ to $\mathbf{g}$, is exactly what gets cancelled by the matrix product, just as it should be.

Notice also where the two Jacobians are evaluated. The inner one, $\mathbf{J}_{\mathbf{f}}$, is evaluated at $\mathbf{x}$, the original input. The outer one, $\mathbf{J}_{\mathbf{g}}$, is evaluated at $\mathbf{f}(\mathbf{x})$, the value that $\mathbf{f}$ has handed up the chain. This is the same idea as in the scalar case: each derivative is taken at the point its function actually sees.

A worked example. Let $\mathbf{f}: \mathbb{R} \to \mathbb{R}^2$ be $\mathbf{f}(x) = (x,\ x^2)$ and let $\mathbf{g}: \mathbb{R}^2 \to \mathbb{R}$ be $\mathbf{g}(\mathbf{u}) = u_1 + u_2$. The composition is $h(x) = g(f(x)) = x + x^2$, whose ordinary derivative is $h'(x) = 1 + 2x$.

Now via the chain rule. The Jacobian of $\mathbf{f}$ takes a single input to two outputs, so it is $2 \times 1$:

$$ \mathbf{J}_{\mathbf{f}}(x) = \begin{pmatrix} 1 \\ 2x \end{pmatrix}. $$

The Jacobian of $\mathbf{g}$ takes two inputs to one output, so it is $1 \times 2$:

$$ \mathbf{J}_{\mathbf{g}}(\mathbf{u}) = \begin{pmatrix} 1 & 1 \end{pmatrix}. $$

Multiplying gives a $1 \times 1$ result:

$$ \mathbf{J}_{\mathbf{h}}(x) = \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 2x \end{pmatrix} = 1 \cdot 1 + 1 \cdot 2x = 1 + 2x. $$

Both routes give the same answer, $1 + 2x$, as they must. The vector chain rule is doing exactly the same job as the scalar chain rule, but it has a bigger book-keeping system that scales up to thousands of inputs and outputs.
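Here is the same worked example in code, with the two Jacobians written out by hand and multiplied; this is just a sketch, and the test point is arbitrary.

```python
import numpy as np

x = 0.7  # any test point

J_f = np.array([[1.0], [2 * x]])   # Jacobian of f(x) = (x, x^2): a 2x1 column
J_g = np.array([[1.0, 1.0]])       # Jacobian of g(u) = u1 + u2: a 1x2 row

# Chain rule: outer Jacobian on the left, inner Jacobian on the right.
J_h = J_g @ J_f
print(J_h)          # [[1 + 2x]] as a 1x1 matrix
print(1 + 2 * x)    # derivative of h(x) = x + x^2, computed directly
```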

Two-layer linear net example

We are now ready to walk through a tiny piece of a neural network. Suppose we have an input vector $\mathbf{x} \in \mathbb{R}^n$, weight matrices $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times n}$ and $\mathbf{W}^{(2)} \in \mathbb{R}^{m \times h}$, and the two-step computation

$$ \mathbf{a}^{(1)} = \mathbf{W}^{(1)} \mathbf{x}, \qquad \mathbf{y} = \mathbf{W}^{(2)} \mathbf{a}^{(1)}. $$

The first step is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^h$, and the second is a linear map from $\mathbb{R}^h$ to $\mathbb{R}^m$. We want the Jacobian of $\mathbf{y}$ with respect to $\mathbf{x}$.

For a linear map of the form $\mathbf{u} \mapsto \mathbf{A}\mathbf{u}$, the Jacobian is just $\mathbf{A}$ itself, because each output is a fixed linear combination of inputs and the partial derivatives are exactly the entries of $\mathbf{A}$. So the Jacobian of $\mathbf{a}^{(1)}$ with respect to $\mathbf{x}$ is $\mathbf{W}^{(1)}$, and the Jacobian of $\mathbf{y}$ with respect to $\mathbf{a}^{(1)}$ is $\mathbf{W}^{(2)}$. By the chain rule,

$$ \mathbf{J}_{\mathbf{y}}(\mathbf{x}) = \mathbf{W}^{(2)} \, \mathbf{W}^{(1)}, $$

a single $m \times n$ matrix that summarises the whole two-step computation. Two linear layers compose to a linear layer; that is one reason we need non-linearities.
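To see the same fact numerically, here is a sketch with small random matrices; the sizes, names, and the finite-difference helper are all made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 4, 3, 2
W1 = rng.normal(size=(h, n))
W2 = rng.normal(size=(m, h))

def net(x):
    return W2 @ (W1 @ x)   # two linear layers, no activation

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian, one column per input coordinate."""
    cols = []
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((func(x + dx) - func(x - dx)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = rng.normal(size=n)
print(np.allclose(numerical_jacobian(net, x0), W2 @ W1))  # True
```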

So let us add one. Replace the first step with

$$ \mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x}, \qquad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}), $$

where $\sigma$ is an element-wise activation function such as the sigmoid or ReLU. Element-wise means that the $i$-th coordinate of the output depends only on the $i$-th coordinate of the input: $a^{(1)}_i = \sigma(z^{(1)}_i)$.

What is the Jacobian of an element-wise function? The derivative of $a^{(1)}_i$ with respect to $z^{(1)}_j$ is zero whenever $i \neq j$, because changing one coordinate of the input does not affect a different coordinate of the output. When $i = j$, the derivative is $\sigma'(z^{(1)}_i)$. The Jacobian is therefore the diagonal matrix

$$ \mathbf{J}_{\sigma}(\mathbf{z}^{(1)}) = \mathrm{diag}\bigl(\sigma'(\mathbf{z}^{(1)})\bigr), $$

with the values $\sigma'(z^{(1)}_1), \sigma'(z^{(1)}_2), \ldots$ along the diagonal and zeros everywhere else. This is a useful pattern: every element-wise layer in a network contributes a diagonal Jacobian, which is cheap to store (we only need a vector of slopes) and cheap to multiply by.
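A quick sketch with the sigmoid as the element-wise activation (the helper functions are written for this illustration) shows that the Jacobian really is diagonal, with the slopes on the diagonal.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([0.5, -1.0, 2.0])

# Finite-difference Jacobian of the element-wise map, one column per input coordinate.
eps = 1e-6
J = np.stack([(sigmoid(z + eps * e) - sigmoid(z - eps * e)) / (2 * eps)
              for e in np.eye(z.size)], axis=1)

print(np.round(J, 6))
print(np.allclose(J, np.diag(sigmoid_prime(z))))  # True: off-diagonal entries are zero
```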

Composing the two steps, the linear map and then the activation, gives the Jacobian of the post-activation $\mathbf{a}^{(1)}$ with respect to $\mathbf{x}$:

$$ \mathbf{J}_{\mathbf{a}^{(1)}}(\mathbf{x}) = \mathrm{diag}\bigl(\sigma'(\mathbf{z}^{(1)})\bigr) \, \mathbf{W}^{(1)}. $$

If we add a second linear layer with its own activation on top, we get one more pair of factors. Real networks just keep stacking factors of this kind, alternating weight matrices and diagonal activation Jacobians. The chain rule says: to differentiate the whole network, multiply all of those factors together in the right order.
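As a check on that formula, here is a sketch, again with made-up sizes and a finite-difference helper written for the illustration, comparing the analytic Jacobian $\mathrm{diag}\bigl(\sigma'(\mathbf{z}^{(1)})\bigr)\,\mathbf{W}^{(1)}$ with a numerical one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 4, 3
W1 = rng.normal(size=(h, n))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def layer(x):
    return sigmoid(W1 @ x)   # z = W1 x, then the element-wise activation

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian, one column per input coordinate."""
    cols = []
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((func(x + dx) - func(x - dx)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = rng.normal(size=n)
z1 = W1 @ x0
analytic = np.diag(sigmoid_prime(z1)) @ W1
print(np.allclose(numerical_jacobian(layer, x0), analytic))  # True
```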

Why backprop is "reverse-mode"

Here is the punchline that makes large-scale deep learning possible. In every machine-learning problem we eventually take a scalar loss $\mathcal{L}$, a single number that says how badly the network did. Because the output is one-dimensional, the final Jacobian, the gradient of $\mathcal{L}$, is a row vector. And once one of the factors in our long chain rule product is a row vector, the order in which we evaluate the product matters a great deal.

Picture a network of the form

$$ \mathcal{L} = g\bigl(\mathbf{f}_L(\mathbf{f}_{L-1}(\ldots \mathbf{f}_1(\mathbf{x}) \ldots))\bigr), $$

with intermediate layer widths $d_1, d_2, \ldots, d_L$. The chain rule writes the gradient of $\mathcal{L}$ with respect to $\mathbf{x}$ as a product of Jacobians, with a row vector at the very left and matrices everywhere else. Matrix multiplication is associative, so we are free to evaluate the product from either end.

Forward mode starts at the input end of the chain: we take the Jacobian of the first layer and multiply it on the left by each subsequent Jacobian in turn. At every stage we are carrying a full matrix with one column for every input coordinate (and in training, the role of the input is played by the millions of parameters). That intermediate matrix is gigantic and has to be pushed through every layer; the work and memory grow with the input dimension, which is a disaster.

Reverse mode starts at the loss end of the chain. We begin with the row gradient $\partial \mathcal{L}/\partial \mathbf{f}_L$, a row vector of length $d_L$. We multiply it on the right by the Jacobian of $\mathbf{f}_L$, getting another row vector of length $d_{L-1}$. We keep going. At every stage we hold a row vector whose length is the size of one layer, never the size of the whole input, and each multiplication costs $O(d_{\ell+1} d_\ell)$, the same scale as the corresponding forward computation.

Reverse mode is what backpropagation is. The "back" in the name is exactly this evaluation order: we start at the loss and propagate gradients backwards, layer by layer, towards the input. In §3.6 we will draw this as a graph and in §3.7 we will turn the recipe into code.

To make the saving concrete, imagine a network with input dimension $n = 10^6$, ten hidden layers each of width $10^3$, and a scalar loss at the end. Forward mode would force us to carry an intermediate matrix of size $10^3 \times 10^6$, a billion numbers, between every pair of layers. Reverse mode carries a row vector of length $10^3$, a thousand numbers, at every intermediate stage; only the very last multiplication produces the length-$10^6$ gradient with respect to the input, which any method has to output anyway. Same final answer, roughly a million times less memory and a million times less arithmetic at each layer. That is the difference between training and not training, between a research project that works and one that runs out of memory before it begins.
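Here is a sketch of the two evaluation orders with deliberately small stand-in sizes; the layer maps are kept linear so each Jacobian is just a weight matrix, and all names and dimensions are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, L = 50, 20, 5                      # small stand-ins for 10^6, 10^3, and 10 layers
Ws = [rng.normal(size=(d, n))] + [rng.normal(size=(d, d)) for _ in range(L - 1)]
w_loss = rng.normal(size=d)              # gradient of the scalar loss w.r.t. the last layer

# Forward mode: accumulate full (d x n) Jacobian matrices, starting at the input end.
J = Ws[0]
for W in Ws[1:]:
    J = W @ J                            # a (d x n) matrix held at every stage
grad_forward = w_loss @ J

# Reverse mode: accumulate a row vector, starting at the loss end.
v = w_loss
for W in reversed(Ws):
    v = v @ W                            # a row vector held at every stage
grad_reverse = v

print(np.allclose(grad_forward, grad_reverse))  # True: same gradient, very different cost
```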

What you should take away

  1. The vector chain rule says $\mathbf{J}_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x}))\, \mathbf{J}_{\mathbf{f}}(\mathbf{x})$, a matrix product of Jacobians, evaluated at the right points.
  2. The Jacobian of $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ is an $m \times n$ matrix of partial derivatives; gradients (one output) and ordinary derivatives (one in, one out) are special cases.
  3. A linear layer $\mathbf{u} \mapsto \mathbf{A}\mathbf{u}$ has Jacobian $\mathbf{A}$, and an element-wise activation has Jacobian $\mathrm{diag}(\sigma'(\mathbf{z}))$, so a neural network's overall Jacobian is just a product of weight matrices and diagonal slope matrices.
  4. Because the loss is a scalar, the chain-rule product can be evaluated right-to-left while only ever holding a row vector; this is reverse mode, and it is what makes training large networks tractable.
  5. Every gradient routine you will ever call in a deep-learning library, whether loss.backward() in PyTorch or jax.grad in JAX, is a careful, automated implementation of exactly this rule.
