2.9 Matrix calculus and vector derivatives
Training a neural network amounts to nudging millions, sometimes billions, of numbers in the right direction. The "right direction" is decided by a single tool: the gradient. The gradient tells us, for every parameter at once, whether it should be raised or lowered, and by how much, in order to make the model's predictions a little less wrong. Computing this gradient is what backpropagation does, and the language in which it does it is matrix calculus. There is no way around the topic. Anyone who wishes to understand how a network actually learns, as opposed to merely watching the loss curve trend downwards, must be comfortable with derivatives that take vectors and matrices as both their inputs and their outputs.
The good news is that the rules look much more frightening than they really are. The notation is dense, the symbols multiply, and a beginner can be forgiven for thinking that an entirely new kind of mathematics is involved. It is not. Matrix calculus is ordinary single-variable calculus, applied component by component, with a tidy bookkeeping system that lets us bundle up many ordinary derivatives into a single matrix or vector. Once a small handful of identities have been internalised, the rest is mechanical. The aim of this section is to introduce those identities and to show, with worked numerical examples, exactly how they are used in the gradient calculations that drive every modern model.
This section is the differential half of the linear-algebra story we have been building. §2.4 to §2.7 supplied the structure: vectors, matrices, the four fundamental subspaces, eigenvalues, the singular value decomposition. §2.9 supplies the calculus that lives on top of that structure. Chapter 3 will revisit the same ideas from the optimisation side, recasting them as the engine of gradient descent. Chapter 9, on backpropagation, will use the rules below at every layer of every network. From this section onwards, almost every later chapter will assume that you can read a gradient like $\partial \mathcal{L}/\partial \mathbf{W}$ and know what shape it has, what its entries mean, and how to derive it.
Gradients of common scalar functions
Begin with the simplest case: a function $f$ that takes a vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)^\top$ and returns a single real number. The gradient of $f$ at $\mathbf{x}$, written $\nabla_{\mathbf{x}} f$, is the vector whose $i$-th entry is the partial derivative $\partial f / \partial x_i$. It has the same shape as $\mathbf{x}$ itself, which is the most useful convention for machine learning because it lets us write the parameter update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} f$ without any acrobatic transposing. Geometrically, the gradient points in the direction of steepest increase of $f$, and its length tells us how steep that increase is. For minimisation we move in the opposite direction, which is exactly what gradient descent does.
A handful of formulas covers most of what we shall ever need. Each can be checked by writing the function out in components and differentiating in the ordinary way, but it is quicker to memorise the patterns.
The linear function. If $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} = \sum_i a_i x_i$ for a constant vector $\mathbf{a}$, then $\nabla_{\mathbf{x}} f = \mathbf{a}$. The reasoning is immediate: the partial derivative with respect to $x_j$ is the coefficient of $x_j$ in the sum, which is $a_j$. Worked: take $f(\mathbf{x}) = 2x_1 + 3x_2$. Then $\partial f/\partial x_1 = 2$ and $\partial f/\partial x_2 = 3$, so $\nabla f = (2, 3)^\top$, exactly the constant vector $\mathbf{a} = (2, 3)^\top$ that defines $f$.
The quadratic form. If $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A}\mathbf{x}$ for a constant matrix $\mathbf{A}$, then $\nabla_{\mathbf{x}} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. When $\mathbf{A}$ is symmetric (and in machine learning it usually is: covariance matrices, Hessians, and kernel matrices are all symmetric), this collapses to the cleaner $\nabla_{\mathbf{x}} f = 2\mathbf{A}\mathbf{x}$. Worked: take $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$, which is symmetric. Then $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A}\mathbf{x} = x_1^2 + 4 x_1 x_2 + 3 x_2^2$. The cross-term carries a factor of $4$ rather than $2$ because the off-diagonal entries appear twice in the sum, once above the diagonal and once below. Differentiating directly: $\partial f/\partial x_1 = 2x_1 + 4x_2$ and $\partial f/\partial x_2 = 4x_1 + 6x_2$. Comparing to $2\mathbf{A}\mathbf{x} = (2x_1 + 4x_2, \; 4x_1 + 6x_2)^\top$ confirms the rule.
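The rule is easy to confirm numerically. Below is a minimal NumPy sketch, reusing the matrix from the worked example; the test point is an arbitrary choice for illustration. The analytic gradient $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ is compared against a central-difference estimate.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])                 # the symmetric matrix from the worked example
x = np.array([0.5, -1.5])                  # arbitrary test point

analytic = (A + A.T) @ x                   # general rule; equals 2 A x here
numeric = numerical_gradient(lambda v: v @ A @ v, x)
print(analytic)                            # [-5. -7.]
print(numeric)                             # matches to ~1e-9
```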
The squared norm. If $f(\mathbf{x}) = \|\mathbf{x}\|_2^2 = \mathbf{x}^\top \mathbf{x}$, then $\nabla_{\mathbf{x}} f = 2\mathbf{x}$. This is simply the quadratic-form rule with $\mathbf{A} = \mathbf{I}$. The squared norm appears in every $\ell^2$ regulariser ("weight decay"), in mean-squared-error losses, and in the kinetic-energy term of many physical models, so this is one to know by heart.
A scalar function of an inner product. If $f(\mathbf{x}) = g(\mathbf{a}^\top \mathbf{x})$, where $g$ is an ordinary single-variable function, then by the chain rule $\nabla_{\mathbf{x}} f = g'(\mathbf{a}^\top \mathbf{x}) \cdot \mathbf{a}$. The pattern is: differentiate the outer scalar function in the usual way and leave the inner derivative $\mathbf{a}$ as it is. This is the structure of every linear-classifier gradient. For logistic regression, $g$ is the negative log wrapped around the sigmoid; for a single neuron with weights $\mathbf{a}$, $g$ is whatever activation that neuron uses. The same five-symbol formula recurs throughout backpropagation.
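As a concrete instance of the pattern, here is a small sketch in which $g$ is taken to be the sigmoid; the particular $g$, $\mathbf{a}$, and $\mathbf{x}$ are purely illustrative choices. The analytic gradient $g'(\mathbf{a}^\top \mathbf{x})\,\mathbf{a}$ should agree with a finite-difference check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([2.0, -1.0, 0.5])             # fixed weight vector (illustrative)
x = np.array([0.3, 0.7, -0.2])             # input point (illustrative)

z = a @ x
analytic = sigmoid(z) * (1.0 - sigmoid(z)) * a   # g'(a^T x) * a, with g = sigmoid

eps = 1e-6
numeric = np.array([
    (sigmoid(a @ (x + eps * e)) - sigmoid(a @ (x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])
print(analytic)
print(numeric)                              # the two agree to ~1e-10
```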
It is worth pausing on what the gradient actually does for us. At a point $\mathbf{x}_0$, $\nabla f(\mathbf{x}_0)$ is the vector of slopes along each coordinate axis. The directional derivative along any unit vector $\mathbf{u}$ is then $\mathbf{u}^\top \nabla f(\mathbf{x}_0)$. By the Cauchy–Schwarz inequality this is largest when $\mathbf{u}$ is parallel to the gradient, and most negative when it is anti-parallel. That is the formal reason gradient descent works: stepping by $-\eta \nabla f$ is, locally, the steepest possible decrease of $f$ for a step of fixed length.
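That claim is easy to probe empirically: sample many random unit directions, compute the directional derivative $\mathbf{u}^\top \nabla f$ along each, and observe that none of them beats the normalised gradient direction. A minimal sketch, reusing the quadratic form from the worked example above (the test point is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0], [2.0, 3.0]])
x0 = np.array([0.5, -1.5])
grad = 2 * A @ x0                           # gradient of x^T A x at x0 (A symmetric)

best = -np.inf
for _ in range(10_000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                  # random unit direction
    best = max(best, u @ grad)              # directional derivative along u

print(best)                                 # creeps up towards, but never exceeds, ...
print(np.linalg.norm(grad))                 # ... the value attained along grad/||grad||
```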
Jacobians
What if the output is itself a vector? Suppose $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, so $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x}))^\top$. Each component $f_i$ is itself a scalar function with its own gradient, a row of $n$ partial derivatives. Stacking these $m$ rows on top of one another gives the Jacobian matrix $\mathbf{J}_{\mathbf{f}}(\mathbf{x})$, an $m \times n$ matrix whose $(i, j)$ entry is $\partial f_i/\partial x_j$.
Worked example. Let $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $\mathbf{f}(\mathbf{x}) = (x_1^2 + x_2, \; x_1 x_2)^\top$. The first component is $f_1 = x_1^2 + x_2$, with partial derivatives $\partial f_1/\partial x_1 = 2x_1$ and $\partial f_1/\partial x_2 = 1$. The second is $f_2 = x_1 x_2$, with partial derivatives $\partial f_2/\partial x_1 = x_2$ and $\partial f_2/\partial x_2 = x_1$. Stacking: $$\mathbf{J}_{\mathbf{f}}(\mathbf{x}) = \begin{pmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{pmatrix}.$$ At the point $\mathbf{x} = (1, 2)^\top$, the Jacobian evaluates to $\mathbf{J} = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$. Notice that this matrix happens to be singular at this particular point: the rows are identical, so its determinant is zero. That tells us $\mathbf{f}$ is locally degenerate near $(1, 2)^\top$: small input perturbations can only push the output along a one-dimensional sliver. Information about the structure of $\mathbf{f}$ is encoded directly in the algebra of $\mathbf{J}$.
The Jacobian is best understood as the best linear approximation of $\mathbf{f}$ near $\mathbf{x}$. Just as a single-variable derivative gives the slope of the tangent line, the Jacobian gives the matrix of the tangent linear map. The first-order Taylor expansion is $$\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}_{\mathbf{f}}(\mathbf{x})\, \boldsymbol{\delta}$$ for small displacements $\boldsymbol{\delta}$. If you push the input by a tiny vector $\boldsymbol{\delta}$, the output is pushed by approximately $\mathbf{J} \boldsymbol{\delta}$. Sensitivity analysis, error propagation, and the implicit-function theorem all rest on this single observation.
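A short sketch ties the worked example and the Taylor expansion together: build the Jacobian of $\mathbf{f}(\mathbf{x}) = (x_1^2 + x_2, \; x_1 x_2)^\top$ at $(1, 2)^\top$, confirm that its determinant vanishes, and check that $\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\delta}$ for a small displacement (the particular $\boldsymbol{\delta}$ below is an arbitrary choice).

```python
import numpy as np

def f(x):
    """f(x) = (x1^2 + x2, x1 * x2)."""
    return np.array([x[0] ** 2 + x[1], x[0] * x[1]])

def jacobian(x):
    return np.array([[2.0 * x[0], 1.0],
                     [x[1],       x[0]]])

x = np.array([1.0, 2.0])
J = jacobian(x)
print(np.linalg.det(J))                     # 0.0: the Jacobian is singular at (1, 2)

delta = np.array([1e-3, -2e-3])             # small, arbitrary displacement
exact = f(x + delta)
linear = f(x) + J @ delta                   # first-order Taylor prediction
print(np.abs(exact - linear).max())         # ~2e-6, i.e. O(||delta||^2)
```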
The Jacobian also gives us the chain rule for vector-valued functions, which is the working definition of backpropagation. Let $\mathbf{g}: \mathbb{R}^m \to \mathbb{R}^p$ and $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, and form the composition $\mathbf{h} = \mathbf{g} \circ \mathbf{f}$, that is, $\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{f}(\mathbf{x}))$. Then $$\mathbf{J}_{\mathbf{h}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x}))\, \mathbf{J}_{\mathbf{f}}(\mathbf{x}).$$ Jacobians multiply, in the order in which the underlying maps are applied. This is the reason a deep network can be differentiated at all. A network of $L$ layers is a composition of $L$ functions $\mathbf{f}^{(L)} \circ \cdots \circ \mathbf{f}^{(1)}$, and its Jacobian is just the product of the layer-wise Jacobians. Each factor in this product can be enormous (think of an $m \times n$ matrix with $m, n$ in the thousands), but in practice we never form the matrices explicitly; backpropagation computes only the matrix-vector products it actually needs, which is what makes the whole enterprise tractable.
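The multiplication rule can be verified in a few lines. In the sketch below the outer map $\mathbf{g}$ is invented purely for illustration, while the inner map is the $\mathbf{f}$ from the worked example; the finite-difference Jacobian of the composition should match the product of the two analytic Jacobians.

```python
import numpy as np

def f(x):                                   # inner map, from the worked example
    return np.array([x[0] ** 2 + x[1], x[0] * x[1]])

def g(y):                                   # outer map, invented for illustration
    return np.array([np.sin(y[0]), y[0] * y[1]])

def J_f(x):
    return np.array([[2.0 * x[0], 1.0], [x[1], x[0]]])

def J_g(y):
    return np.array([[np.cos(y[0]), 0.0], [y[1], y[0]]])

def J_numeric(h, x, eps=1e-6):
    """Finite-difference Jacobian: column j holds dh/dx_j."""
    cols = [(h(x + eps * e) - h(x - eps * e)) / (2 * eps) for e in np.eye(x.size)]
    return np.stack(cols, axis=1)

x = np.array([0.4, -1.1])                   # arbitrary evaluation point
analytic = J_g(f(x)) @ J_f(x)               # Jacobians multiply, outer map first
numeric = J_numeric(lambda v: g(f(v)), x)
print(np.abs(analytic - numeric).max())     # ~1e-9
```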
A vital special case: when $\mathbf{f}$ outputs a single scalar ($m = 1$), the Jacobian is a $1 \times n$ row vector and is essentially the gradient (transposed). Gradients and Jacobians are not different objects: the gradient is just the Jacobian of a real-valued function, dressed as a column vector to match the shape of $\mathbf{x}$.
Gradients with respect to matrices
When the parameter is itself a matrix $\mathbf{X}$, exactly the same idea applies, only the bookkeeping is denser. The matrix gradient $\nabla_{\mathbf{X}} f$ is the matrix whose $(i, j)$ entry is $\partial f / \partial X_{ij}$, with the same shape as $\mathbf{X}$. There is nothing conceptually new, every entry is just an ordinary partial derivative, but writing the gradient as a single matrix lets us apply it directly in updates like $\mathbf{X} \leftarrow \mathbf{X} - \eta \nabla_{\mathbf{X}} f$, which is precisely the form that gradient descent on a weight matrix takes.
A small library of identities will carry us through almost every situation we shall meet. These can all be proved by writing out the trace as a sum and differentiating component by component.
- $\nabla_{\mathbf{X}} \text{tr}(\mathbf{X}) = \mathbf{I}$. The trace is the sum of the diagonal entries, so its derivative is $1$ on the diagonal and zero elsewhere.
- $\nabla_{\mathbf{X}} \text{tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^\top$. The trace picks out a particular linear combination of the entries of $\mathbf{X}$, and the coefficients form $\mathbf{A}^\top$.
- $\nabla_{\mathbf{X}} \text{tr}(\mathbf{X}^\top \mathbf{A} \mathbf{X}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{X}$. This is the matrix analogue of the quadratic-form rule for vectors; for symmetric $\mathbf{A}$ it reduces to $2\mathbf{A}\mathbf{X}$.
- $\nabla_{\mathbf{X}} \log\det(\mathbf{X}) = (\mathbf{X}^\top)^{-1}$. This identity appears in normalising flows, in Gaussian likelihoods, and anywhere a change-of-variables formula is needed.
- $\nabla_{\mathbf{X}} \|\mathbf{X}\|_F^2 = 2\mathbf{X}$. The Frobenius norm squared is the sum of squares of every entry, so the gradient is twice the matrix itself.
Worked example. Let $f(\mathbf{X}) = \text{tr}(\mathbf{X}^\top \mathbf{X}) = \|\mathbf{X}\|_F^2 = \sum_{i,j} X_{ij}^2$. The simplest proof is by direct differentiation: $\partial f / \partial X_{ij} = 2 X_{ij}$, so $\nabla f = 2\mathbf{X}$. This agrees with both the third identity (with $\mathbf{A} = \mathbf{I}$, giving $(\mathbf{I} + \mathbf{I})\mathbf{X} = 2\mathbf{X}$) and the fifth identity (the Frobenius rule). The calculation is a useful sanity check: when several rules give the same answer, your bookkeeping is internally consistent.
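All five identities can be spot-checked the same way, by finite differences over matrix entries. The sketch below (the matrices are random and purely illustrative) checks the $\text{tr}(\mathbf{A}\mathbf{X})$ rule and the $\log\det$ rule; the others follow the identical recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def matrix_gradient(f, X, eps=1e-6):
    """Central differences with respect to every entry of X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)     # shifted so that det(X) stays positive

# d/dX tr(A X) = A^T
g1 = matrix_gradient(lambda M: np.trace(A @ M), X)
print(np.abs(g1 - A.T).max())                     # ~1e-10

# d/dX log det(X) = (X^T)^{-1}
g2 = matrix_gradient(lambda M: np.log(np.linalg.det(M)), X)
print(np.abs(g2 - np.linalg.inv(X.T)).max())      # ~1e-9
```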
These five identities cover the gradient calculations behind ordinary least squares, ridge regression, principal component analysis, Gaussian-process likelihoods, Lasso (where the $\ell^1$ norm calls for a subgradient instead), and the matrix-valued layers of a neural network. Most of the matrix-calculus problems you will meet in practice can be reduced to a combination of these five, plus the chain rule.
The chain rule for matrix calculus
The chain rule is what assembles all the identities above into the single algorithm we call backpropagation. The key idea is that gradients flow backwards through a computation: starting from the loss at the output, partial derivatives are propagated layer by layer towards the input, with each step requiring nothing more than a matrix-vector product and an outer product.
Take a minimal example, a two-layer linear network with no nonlinearity, so we can see the rule unobscured. The forward pass is $$\mathbf{a}^{(1)} = \mathbf{W}^{(1)} \mathbf{x}, \quad \mathbf{y} = \mathbf{W}^{(2)} \mathbf{a}^{(1)},$$ and the loss is squared error against a target $\mathbf{t}$, $$\mathcal{L} = \tfrac{1}{2} \|\mathbf{y} - \mathbf{t}\|_2^2.$$ The parameters we wish to update are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. The chain rule lets us compute their gradients in a sequence of small steps.
Step one: the derivative of the loss with respect to the network's output. Since $\mathcal{L} = \tfrac{1}{2}(\mathbf{y} - \mathbf{t})^\top (\mathbf{y} - \mathbf{t})$, the squared-norm rule (with $\mathbf{x}$ replaced by $\mathbf{y} - \mathbf{t}$) gives $$\frac{\partial \mathcal{L}}{\partial \mathbf{y}} = \mathbf{y} - \mathbf{t}.$$ This is the error signal at the output: a vector pointing from the prediction towards the target. Everything else in the backward pass is a transformation of this single quantity.
Step two: the gradient with respect to the second weight matrix. Because $\mathbf{y} = \mathbf{W}^{(2)} \mathbf{a}^{(1)}$, the dependence of $\mathcal{L}$ on $\mathbf{W}^{(2)}$ is linear, and the chain rule produces an outer product: $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = (\mathbf{y} - \mathbf{t})\, (\mathbf{a}^{(1)})^\top.$$ This is a matrix the same shape as $\mathbf{W}^{(2)}$. The pattern (error vector on the left, input to the layer on the right, joined by the outer product) is general: every weight-matrix gradient in a feed-forward network has this rank-one structure. It is also the reason the backward pass stays affordable at the parameter scale of modern models: the gradient of a matrix with millions of entries is built from one outer product per training example, never from anything more expensive.
Step three: backpropagate the error to the previous layer's activations. Here we need the gradient of the loss with respect to $\mathbf{a}^{(1)}$, treating $\mathbf{W}^{(2)}$ as fixed: $$\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}} = (\mathbf{W}^{(2)})^\top (\mathbf{y} - \mathbf{t}).$$ The transpose appears because the Jacobian of $\mathbf{W}^{(2)} \mathbf{a}^{(1)}$ with respect to $\mathbf{a}^{(1)}$ is $\mathbf{W}^{(2)}$ itself, and the chain rule for row-vector gradients introduces a transpose to match shapes. Practically, the transposed-times-vector operation is what carries the error signal one layer back.
Step four: the gradient with respect to the first weight matrix. The error has now arrived at $\mathbf{a}^{(1)}$, and since $\mathbf{a}^{(1)} = \mathbf{W}^{(1)} \mathbf{x}$, we apply the same outer-product trick as in step two: $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = (\mathbf{W}^{(2)})^\top (\mathbf{y} - \mathbf{t})\, \mathbf{x}^\top.$$ And we are done. Two outer products, one transposed-times-vector, one elementary subtraction. That is the full backward pass for a two-layer linear network, and it is structurally identical to the backward pass for a hundred-layer transformer, the only difference being a longer chain of intermediate signals between the loss and the input.
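The four steps translate directly into a few lines of NumPy. In the sketch below the dimensions and the random data are arbitrary illustrations; the backward pass is written exactly as derived above, and one entry of $\partial \mathcal{L}/\partial \mathbf{W}^{(1)}$ is checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input (illustrative sizes throughout)
t = rng.normal(size=2)                      # target
W1 = rng.normal(size=(3, 4))                # first-layer weights
W2 = rng.normal(size=(2, 3))                # second-layer weights

def loss(W1, W2):
    a1 = W1 @ x
    y = W2 @ a1
    return 0.5 * np.sum((y - t) ** 2)

# Forward pass
a1 = W1 @ x
y = W2 @ a1

# Backward pass: the four steps from the text
dL_dy = y - t                               # step one: error at the output
dL_dW2 = np.outer(dL_dy, a1)                # step two: outer product
dL_da1 = W2.T @ dL_dy                       # step three: transpose carries error back
dL_dW1 = np.outer(dL_da1, x)                # step four: outer product again

# Finite-difference check on a single entry of W1
eps = 1e-6
E = np.zeros_like(W1)
E[1, 2] = eps
numeric = (loss(W1 + E, W2) - loss(W1 - E, W2)) / (2 * eps)
print(dL_dW1[1, 2], numeric)                # agree to ~1e-8
```

Adding a nonlinearity between the layers would only insert an element-wise multiplication by the activation's derivative between steps three and four.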
Two takeaways, both of which generalise. First, outer products give the gradient with respect to a weight matrix: whenever a weight matrix appears in a layer, its gradient is the outer product of the error signal at the layer's output and the input to that layer. Second, transposed-times-vector backpropagates the error to the previous layer: whenever you cross a linear map on the way back, you multiply the error by the transposed weight matrix. These two operations, together with element-wise multiplications by activation derivatives, exhaust the entire backward pass of a feed-forward network.
The Hessian
For a scalar function $f(\mathbf{x}): \mathbb{R}^n \to \mathbb{R}$, second derivatives form a matrix called the Hessian, $\mathbf{H}_f(\mathbf{x})$, with entries $H_{ij} = \partial^2 f / \partial x_i \, \partial x_j$. By Schwarz's theorem on the equality of mixed partial derivatives, the Hessian of any twice-continuously-differentiable function is symmetric: $H_{ij} = H_{ji}$. This symmetry is more than cosmetic: it means we can apply every tool we built in §2.6 (eigenvalues, diagonalisation, quadratic-form analysis) directly to the Hessian.
Worked example. Take $f(\mathbf{x}) = x_1^2 + 2 x_1 x_2 + 3 x_2^2$. The first derivatives are $$\nabla f = (2x_1 + 2x_2, \; 2x_1 + 6x_2)^\top.$$ Differentiating each component again gives the four second derivatives: $\partial^2 f/\partial x_1^2 = 2$, $\partial^2 f/\partial x_2^2 = 6$, and $\partial^2 f/\partial x_1 \partial x_2 = \partial^2 f/\partial x_2 \partial x_1 = 2$. Arranging these in a matrix: $$\mathbf{H} = \begin{pmatrix} 2 & 2 \\ 2 & 6 \end{pmatrix}.$$ Symmetric, as promised. Its eigenvalues are the roots of $(2 - \lambda)(6 - \lambda) - 4 = \lambda^2 - 8\lambda + 8 = 0$, namely $\lambda = 4 \pm 2\sqrt{2}$, both strictly positive. That positivity has a geometric meaning: at every point, $f$ curves upwards in every direction, so its only stationary point (the origin) is a strict local minimum.
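The numbers are easy to reproduce. A minimal sketch builds the Hessian by differencing the gradient (for a quadratic the finite difference is essentially exact) and compares its eigenvalues with $4 \pm 2\sqrt{2}$.

```python
import numpy as np

def grad(x):
    """Gradient of f(x) = x1^2 + 2 x1 x2 + 3 x2^2."""
    return np.array([2.0 * x[0] + 2.0 * x[1], 2.0 * x[0] + 6.0 * x[1]])

# Hessian column j = d(grad)/dx_j; constant for a quadratic, so exact up to rounding
eps = 1e-6
x0 = np.zeros(2)
H = np.stack([(grad(x0 + eps * e) - grad(x0 - eps * e)) / (2 * eps) for e in np.eye(2)],
             axis=1)
print(H)                                        # [[2. 2.] [2. 6.]]
print(np.linalg.eigvalsh(H))                    # [1.1716 6.8284]
print(4 - 2 * np.sqrt(2), 4 + 2 * np.sqrt(2))   # the same values, 4 -/+ 2*sqrt(2)
```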
The Hessian's eigenvalues describe the local curvature of the loss surface. All positive: a strict local minimum, a bowl. All negative: a strict local maximum, an inverted bowl. Mixed signs: a saddle, curving up in some directions and down in others. Some near zero: a "valley" in which $f$ is almost flat along certain directions, the bane of steepest-descent methods because the gradient gives little guidance there. The condition number of $\mathbf{H}$, the ratio of largest to smallest eigenvalue, predicts how poorly gradient descent will behave: on a highly anisotropic surface the iterates zig-zag.
Newton's method exploits the Hessian directly. The update $\mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}^{-1} \nabla f$ rescales the step so that, in the local quadratic approximation, the next iterate lands exactly at the bottom of the bowl. Where ordinary gradient descent shuffles its way along the floor of a long valley, Newton's method jumps in a single bound. The catch is that for a network with $n$ parameters, $\mathbf{H}$ has $n^2$ entries; for a model of even modest size this is far too many to store, let alone to invert. Modern deep learning therefore uses approximations: K-FAC approximates the curvature with per-layer Kronecker factors; natural-gradient methods replace $\mathbf{H}$ with the Fisher information matrix; quasi-Newton methods such as L-BFGS build a low-rank approximation from a sequence of gradient differences. Each is a different compromise between fidelity to the true curvature and the cost of computing it. Even when we cannot afford to use the Hessian, simply thinking about it (is the loss landscape steep or flat in the direction of the latest update?) is one of the most reliable ways of choosing learning rates and batch sizes that actually work.
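On a quadratic such as the example above, the contrast is easy to see: gradient descent needs many small steps, while a single Newton step lands at the minimum. A minimal sketch, in which the starting point, step count, and learning rate are arbitrary choices:

```python
import numpy as np

H = np.array([[2.0, 2.0],
              [2.0, 6.0]])                  # Hessian of f(x) = x1^2 + 2 x1 x2 + 3 x2^2

def grad(x):
    return H @ x                            # since f(x) = 0.5 x^T H x

x_gd = np.array([3.0, -2.0])                # arbitrary starting point
for _ in range(25):                         # gradient descent: many small steps
    x_gd = x_gd - 0.1 * grad(x_gd)

x_nt = np.array([3.0, -2.0])
x_nt = x_nt - np.linalg.solve(H, grad(x_nt))    # one Newton step

print(x_gd)                                 # still visibly away from the minimum
print(x_nt)                                 # [0. 0.]: at the minimum in one step
```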
What you should take away
- The gradient of a scalar function packages every partial derivative into a vector of the same shape as the input, and points in the direction of steepest increase. Memorise the four scalar identities (linear, quadratic form, squared norm, scalar-of-inner-product) and you can derive most loss-function gradients from scratch.
- The Jacobian generalises the gradient to vector-valued functions, giving the best linear approximation near a point. The chain rule for compositions becomes the rule that Jacobians multiply, which is the engine of backpropagation through deep networks.
- Matrix gradients are bookkeeping matrices the same shape as the matrix parameter, and a small library of trace identities, including $\nabla_{\mathbf{X}} \log\det(\mathbf{X}) = (\mathbf{X}^\top)^{-1}$, covers most situations encountered in practice.
- The backward pass for any feed-forward network is built from two operations: outer products produce gradients with respect to weight matrices, and transposed-times-vector multiplications carry the error signal one layer back. Both fall straight out of the chain rule.
- The Hessian is the symmetric matrix of second derivatives, and its eigenvalues describe the curvature of the loss landscape: positive everywhere means a local minimum, mixed signs mean a saddle. Computing the Hessian directly is infeasible for large networks, so practitioners rely on approximations like K-FAC, natural gradients, or L-BFGS, but a qualitative sense of curvature is essential for choosing learning rates, batch sizes, and adaptive optimisers.