3.12 A small zoo of common gradient identities
When you read a deep-learning paper for the first time, you may feel that the authors are pulling gradients out of thin air. They are not. They are reusing a small set of identities so familiar that nobody bothers to derive them on the page any more. The same handful of patterns recur on almost every line: the gradient of a quadratic form, the gradient of a log-determinant, the gradient of a softmax composed with cross-entropy. If you know these patterns by heart, a derivation that looks like a wall of symbols becomes a few short steps you can almost finish in your head. If you do not, every paper feels like wading through fog.
This section is a reference catalogue. You do not need to memorise every line on a first reading. You need to know that the catalogue exists, recognise its members when they appear, and come back to look something up when a derivation stalls. Each identity is a consequence of the chain rule and reverse-mode automatic differentiation that you met in §§3.5 to 3.7; nothing here is genuinely new. What is new is having all the most useful results gathered in one place, written in a uniform notation, with a hint as to where each one shows up in practice. We will refer back to this catalogue throughout the rest of the book: when we derive backpropagation in Chapter 9, when we discuss likelihoods in Chapter 5, and when we open the bonnet on a normalising flow in Chapter 14.
A small note on motivation before we begin. Memorising identities sounds tedious, and there is a real risk that beginners come to treat the symbols as magic incantations. Resist that. Each identity has a one-line proof, and if you ever feel uncertain about one of them, the cheapest thing to do is sit down with a small example, say, two-by-two matrices and two-dimensional vectors, and check it by hand. The point of having the identities ready to hand is not to skip understanding; it is to free you from re-deriving the same five-line argument every time you read a new paper.
Linear and quadratic forms
Linear and quadratic forms are the simplest expressions that involve a vector and have a scalar output. They turn up everywhere, from the dot product in a single neuron to the squared error in least-squares regression. Four identities cover almost every situation you will meet.
The first is the gradient of a linear form. If $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}$, then $\nabla_{\mathbf{x}} f = \mathbf{a}$. The proof is one line: writing the dot product out as $\sum_i a_i x_i$, the partial derivative with respect to $x_j$ picks out the single coefficient $a_j$. As a worked example, take $f = 2x_1 + 3x_2$. Differentiating with respect to $x_1$ gives $2$ and with respect to $x_2$ gives $3$, so the gradient is $(2, 3)^\top$, which is precisely the coefficient vector. The geometric reading is just as simple: the gradient of a linear function is a constant vector pointing in the direction of steepest ascent, with length equal to the rate of climb. The function is a tilted plane and the gradient is its slope.
The second identity is the gradient of a quadratic form. If $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$, then $\nabla_{\mathbf{x}} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. When $\mathbf{A}$ is symmetric (and in machine learning it almost always is: covariance matrices, kernel matrices and Hessians are all symmetric), this collapses to $2\mathbf{A}\mathbf{x}$. The proof writes the form as a double sum $\sum_{i,j} A_{ij} x_i x_j$ and differentiates with respect to a single component $x_k$; one term comes from the $i=k$ slice and another from the $j=k$ slice, giving $(\mathbf{A}\mathbf{x})_k + (\mathbf{A}^\top \mathbf{x})_k$.
The third identity is a special case of the second: $\nabla_{\mathbf{x}} \|\mathbf{x}\|_2^2 = 2\mathbf{x}$. Here $\|\mathbf{x}\|_2^2$ is the squared Euclidean length of the vector, equal to $\mathbf{x}^\top \mathbf{x}$ and so equal to $\mathbf{x}^\top \mathbf{I} \mathbf{x}$ with the identity matrix in the middle. Setting $\mathbf{A} = \mathbf{I}$ in the previous identity gives $2\mathbf{I}\mathbf{x} = 2\mathbf{x}$.
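If you want to convince yourself of these first three identities without pen and paper, a few lines of NumPy suffice. The sketch below compares each analytic gradient against a central finite difference; the helper numerical_grad and the specific sizes are choices made here for illustration, not anything from a library.

```python
import numpy as np

rng = np.random.default_rng(0)

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))   # deliberately not symmetric
x = rng.standard_normal(n)

# Identity 1: gradient of a^T x is a.
assert np.allclose(numerical_grad(lambda v: a @ v, x), a)

# Identity 2: gradient of x^T A x is (A + A^T) x.
assert np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)

# Identity 3: gradient of ||x||^2 is 2x.
assert np.allclose(numerical_grad(lambda v: v @ v, x), 2 * x)
```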
The fourth is the workhorse of least squares. If $f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$, then $\nabla_{\mathbf{x}} f = 2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b})$. The cleanest derivation lets $\mathbf{y} = \mathbf{A}\mathbf{x} - \mathbf{b}$, applies the chain rule to $\|\mathbf{y}\|^2$, and notes that the Jacobian of $\mathbf{y}$ with respect to $\mathbf{x}$ is just $\mathbf{A}$. Setting this gradient to zero produces the normal equations $\mathbf{A}^\top \mathbf{A}\mathbf{x} = \mathbf{A}^\top \mathbf{b}$, which is the closed-form solution to ordinary least-squares regression. Almost every linear model in this book inherits something from this identity.
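The least-squares identity admits the same kind of spot-check, and it is worth seeing the normal equations fall out numerically too. This is a minimal sketch with illustrative sizes; the comparison against np.linalg.lstsq is just a sanity test, not a recommendation to solve least squares via the normal equations in production.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

f = lambda v: np.sum((A @ v - b) ** 2)        # ||Av - b||^2
grad = 2 * A.T @ (A @ x - b)                  # the identity

# Spot-check one coordinate by central finite differences.
eps, e0 = 1e-6, np.eye(n)[0]
assert np.isclose((f(x + eps * e0) - f(x - eps * e0)) / (2 * eps), grad[0])

# Setting the gradient to zero gives the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0])
```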
Trace and determinant
Once we move from vector inputs to matrix inputs, we need gradients with respect to entire matrices. A few identities cover most cases. The trace is a particularly civilised function to differentiate, because it is linear and treats every diagonal entry equally.
The first three identities are about the trace. The gradient of $\text{tr}(\mathbf{X})$ with respect to $\mathbf{X}$ is the identity matrix $\mathbf{I}$, because the trace simply sums the diagonal entries and so changes by exactly one when any diagonal entry changes by one. The gradient of $\text{tr}(\mathbf{A}\mathbf{X})$ is $\mathbf{A}^\top$, a result one derives by writing the trace as $\sum_{i,j} A_{ij} X_{ji}$ and differentiating component-wise. The gradient of $\text{tr}(\mathbf{X}^\top \mathbf{A} \mathbf{X})$ is $(\mathbf{A} + \mathbf{A}^\top)\mathbf{X}$, the matrix analogue of the quadratic-form identity from the previous subsection.
The next two identities involve the determinant, and they are essential whenever the model contains a Gaussian likelihood or a normalising flow. The log-determinant has the clean gradient $\nabla_{\mathbf{X}} \log\det(\mathbf{X}) = (\mathbf{X}^\top)^{-1}$, often written more compactly as $\mathbf{X}^{-\top}$. This identity is the backbone of every Gaussian likelihood you will meet in Chapter 5: when you take the log of a multivariate normal density, a $\log\det \boldsymbol{\Sigma}$ term appears, and its gradient with respect to the covariance matrix is $\boldsymbol{\Sigma}^{-1}$. The same identity drives the change-of-variables formula that makes normalising flows tractable. The closely related identity for the determinant itself is $\nabla_{\mathbf{X}} \det(\mathbf{X}) = \det(\mathbf{X}) (\mathbf{X}^\top)^{-1}$, which is the previous identity rescaled by the determinant.
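These matrix identities are just as easy to check numerically, by perturbing one entry of $\mathbf{X}$ at a time. In the sketch below, $\mathbf{X}$ is built to be positive definite so that the log-determinant is well defined; the helper matrix_grad is an illustrative name chosen here, not library code.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)                  # positive definite, so log det is well defined

def matrix_grad(f, X, eps=1e-6):
    """Entry-by-entry central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i, j in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[i, j] = eps
        G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# Gradient of tr(AX) is A^T.
assert np.allclose(matrix_grad(lambda M: np.trace(A @ M), X), A.T)

# Gradient of log det X is the inverse transpose.
assert np.allclose(matrix_grad(lambda M: np.linalg.slogdet(M)[1], X), np.linalg.inv(X).T)

# Gradient of det X is det(X) times the inverse transpose.
assert np.allclose(matrix_grad(np.linalg.det, X), np.linalg.det(X) * np.linalg.inv(X).T)
```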
One clean proof of both results uses the elegant identity $\log\det \mathbf{X} = \text{tr}\log \mathbf{X}$, valid for positive-definite $\mathbf{X}$, which converts a determinant into a trace and so reduces a non-trivial calculation to the trace identities above. You do not need to remember the proof, but it is worth knowing the trick exists, because many matrix-calculus derivations in physics and statistics start with it.
Activation derivatives
Neural networks are alternations of linear maps and elementwise nonlinear functions called activations. The gradient through any activation is just the elementwise derivative, so we only need to know a small number of one-dimensional derivatives to do the backward pass through any activation layer.
The sigmoid is the original. With $\sigma(z) = 1/(1+e^{-z})$, the derivative satisfies the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This is convenient in code because once we have computed the sigmoid output during the forward pass, the derivative is a single multiplication of the cached output by one minus itself. The maximum value of this derivative is $0.25$, attained at $z = 0$, and it tends rapidly to zero for either large positive or large negative $z$. This vanishing of the derivative for large inputs is one face of the vanishing-gradient problem that doomed early deep networks before ReLU.
The hyperbolic tangent is sigmoid's better-behaved cousin. Its derivative is $\tanh'(z) = 1 - \tanh^2(z)$, with maximum value $1$ at $z = 0$. Because that peak is $1$ rather than sigmoid's $0.25$, and because tanh outputs are centred on zero in $[-1, 1]$ rather than sitting in $[0, 1]$, gradients flow somewhat more freely than through sigmoid, which is why tanh dominated recurrent networks in the 1990s.
The rectified linear unit, ReLU, has perhaps the simplest derivative in all of machine learning. Defined as $\text{ReLU}(z) = \max(0, z)$, its derivative is the indicator $\mathbb{1}[z > 0]$, one for positive inputs, zero for non-positive. The derivative at $z = 0$ is technically undefined, but in practice frameworks set it to zero and life goes on. ReLU's piecewise-linear shape gives it a constant derivative of one wherever it is active, which is the secret of its success: gradients pass through unchanged, with no vanishing.
GELU, the activation favoured by modern transformers, has a more elaborate derivative that is not closed-form in elementary functions, and in practice every framework computes it via autograd. The function is approximately $0.5 \, z \, [1 + \tanh(\sqrt{2/\pi}(z + 0.044715 z^3))]$, and the derivative inherits this shape, smoother than ReLU but with similar large-input behaviour.
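The caching trick is easiest to see in code. The sketch below is a minimal NumPy illustration of forward and backward functions for sigmoid, tanh and ReLU, with the forward pass returning whatever the backward pass will need; it is not how any particular framework organises its kernels.

```python
import numpy as np

def sigmoid_forward(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s, s                          # output, plus the cache the backward pass needs

def sigmoid_backward(grad_out, s):
    return grad_out * s * (1.0 - s)      # sigma'(z) = sigma(z)(1 - sigma(z)), from the cached output

def tanh_forward(z):
    t = np.tanh(z)
    return t, t

def tanh_backward(grad_out, t):
    return grad_out * (1.0 - t ** 2)     # tanh'(z) = 1 - tanh(z)^2

def relu_forward(z):
    return np.maximum(0.0, z), (z > 0)   # cache the activity mask

def relu_backward(grad_out, mask):
    return grad_out * mask               # 1 where active, 0 elsewhere (and 0 at z = 0 by convention)

z = np.array([-2.0, 0.0, 3.0])
out, mask = relu_forward(z)
print(relu_backward(np.ones_like(z), mask))   # [0. 0. 1.]
```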
Softmax and cross-entropy
The composition of softmax with cross-entropy is the single most-used calculation in classifiers. Almost every model you will meet that predicts a discrete class, from logistic regression in Chapter 7 through to the next-token prediction inside GPT, performs exactly this composition on its output layer.
The softmax converts a vector of real-valued scores, called logits, into a probability distribution. Its definition is $\text{softmax}(\mathbf{z})_i = e^{z_i}/\sum_j e^{z_j}$. The exponentials guarantee positivity and the denominator forces the values to sum to one. Differentiating component-wise, the Jacobian of softmax is $\partial s_i/\partial z_j = s_i (\delta_{ij} - s_j)$, sometimes written as $\text{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$.
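For a small vector the Jacobian can be materialised explicitly, which makes the formula concrete even though no real network ever forms this matrix. A minimal NumPy sketch, checking one column against finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(4)

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

s = softmax(z)
J = np.diag(s) - np.outer(s, s)          # Jacobian: ds_i/dz_j = s_i (delta_ij - s_j)

eps, e0 = 1e-6, np.eye(4)[0]
col0 = (softmax(z + eps * e0) - softmax(z - eps * e0)) / (2 * eps)
assert np.allclose(col0, J[:, 0])
```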
The cross-entropy loss with a one-hot target $\mathbf{y}$ is $\mathcal{L} = -\sum_i y_i \log s_i$. Substituting the softmax Jacobian into the chain rule and exploiting that the target sums to one, the gradient of the composite loss with respect to the logits collapses to the famously clean expression
$$ \frac{\partial \mathcal{L}}{\partial z_i} = s_i - y_i. $$
The error signal at each logit is simply predicted probability minus true label. There is no leftover softmax derivative, no Kronecker delta, no division, just a subtraction. This single line is the reason every classifier's backward pass is so cheap to write.
Frameworks fuse softmax and cross-entropy into a single primitive (PyTorch's cross_entropy, TensorFlow's softmax_cross_entropy_with_logits) for two reasons. The first is that the gradient is so simple that there is no point computing the softmax derivative separately and then composing. The second is numerical: when a logit is very large, $e^{z_i}$ overflows, but the fused operation can be rearranged to subtract the maximum logit before exponentiating, avoiding the overflow entirely. Always pass raw logits to a cross-entropy primitive, never softmax outputs.
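In outline, the fused computation shifts by the maximum logit, takes the loss through a log-sum-exp, and returns the gradient as probabilities minus the one-hot target. The sketch below is a minimal NumPy rendering of that idea for a single example, not the implementation inside PyTorch or TensorFlow.

```python
import numpy as np

def softmax_cross_entropy_with_grad(logits, one_hot_target):
    """Loss and gradient with respect to the logits, computed stably for one example."""
    shifted = logits - logits.max()                        # avoid overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log softmax via log-sum-exp
    loss = -np.sum(one_hot_target * log_probs)
    grad = np.exp(log_probs) - one_hot_target              # the famous s - y
    return loss, grad

logits = np.array([2.0, -1.0, 1000.0])   # a huge logit: exponentiating it naively would overflow
y = np.array([0.0, 0.0, 1.0])
loss, grad = softmax_cross_entropy_with_grad(logits, y)
# loss is essentially 0 and grad is essentially [0, 0, 0]: the model is certain and correct.
```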
The same pattern recurs for sigmoid combined with binary cross-entropy: the gradient at the logit is again predicted minus true. Whenever a non-linearity and a loss are matched in this way, softmax with cross-entropy, sigmoid with binary cross-entropy, identity with mean-squared error, the composite gradient simplifies dramatically. This is no coincidence: each pair is the maximum-likelihood gradient for an exponential-family distribution, and the cancellation falls out of the family's structure.
Gradients through layer normalisation
Layer normalisation is a standard component of modern architectures, including every transformer in this book. Given an input vector $\mathbf{x}$, layer norm computes the mean $\mu$ and standard deviation $\sigma$ along the feature dimension, and produces the normalised output $\hat{\mathbf{x}} = (\mathbf{x} - \mu)/\sigma$. The forward pass is straightforward; the backward pass is fiddly. Each input component appears directly in its own numerator, shifts every numerator through $\mu$, and rescales every entry through $\sigma$, so the Jacobian of $\hat{\mathbf{x}}$ with respect to $\mathbf{x}$ has off-diagonal terms that couple every output to every input.
In practice no human writes this gradient by hand. We delegate it to autograd, which records the forward computation as a graph and runs reverse-mode AD through it automatically. The same comment applies to batch normalisation, RMSNorm and most other normalisation layers: get the forward pass right, trust the framework with the backward pass, and check by finite differences if you suspect a bug.
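The finite-difference check itself is only a few lines. Assuming PyTorch is available, the sketch below compares autograd's gradient of a small scalar loss through layer normalisation against a central finite difference in one coordinate; the loss and the sizes are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(d, dtype=torch.double, requires_grad=True)
w = torch.randn(d, dtype=torch.double)      # fixed weights, just to make a non-trivial scalar loss

def loss_fn(v):
    return (w * torch.nn.functional.layer_norm(v, (d,))).sum()

loss_fn(x).backward()                       # autograd's gradient lands in x.grad

eps = 1e-6
e0 = torch.zeros(d, dtype=torch.double)
e0[0] = eps
fd = (loss_fn(x.detach() + e0) - loss_fn(x.detach() - e0)) / (2 * eps)
print(torch.allclose(x.grad[0], fd, atol=1e-6))   # should print True
```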
Common patterns
A small handful of patterns govern the gradients of every layer in a neural network, and they are worth absorbing because they let you predict the shape of any backward pass before you even derive it. The first is the outer-product pattern that defines the gradient of a linear layer. For a layer with weight matrix $\mathbf{W}^{(\ell)}$, error signal $\boldsymbol{\delta}^{(\ell)}$ and incoming activation $\mathbf{a}^{(\ell-1)}$, the weight gradient is the outer product $\boldsymbol{\delta}^{(\ell)} (\mathbf{a}^{(\ell-1)})^\top$. This is the equation of backpropagation, and you will see it written a hundred times in Chapter 9.
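In code the pattern is one line per tensor. The sketch below runs a single linear layer forward and backward for one example; the shapes are illustrative and delta stands in for the error signal arriving from the layer above.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_out = 3, 2
W = rng.standard_normal((n_out, n_in))
a_prev = rng.standard_normal(n_in)        # incoming activation a^(l-1)

z = W @ a_prev                            # forward: this layer's pre-activation
delta = rng.standard_normal(n_out)        # error signal delta^(l) arriving from above

dW = np.outer(delta, a_prev)              # weight gradient: the outer product, shape (n_out, n_in)
da_prev = W.T @ delta                     # gradient passed down to the previous layer
```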
The second pattern is that the backward pass through a transpose is itself a transpose. If the forward computation is $\mathbf{Y} = \mathbf{X}^\top$, then the gradient with respect to $\mathbf{X}$ is the gradient with respect to $\mathbf{Y}$ transposed: $\bar{\mathbf{X}} = \bar{\mathbf{Y}}^\top$. Reshape behaves similarly: the backward pass through a reshape just reshapes the gradient back to the original shape, with no arithmetic at all.
The third pattern is the broadcast rule. When a forward operation broadcasts a smaller tensor across a larger one (adding a bias vector to every row of a matrix, for instance), the backward pass sums the gradient over the broadcast dimensions to recover a tensor of the original smaller shape. The intuition is that broadcasting reuses the same element many times, and each use contributes independently to the loss; gradient accumulation across uses is summation.
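The broadcast rule in code, for the bias of a linear layer applied to a whole batch (a minimal NumPy sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(5)
batch, n_out = 4, 2
b = rng.standard_normal(n_out)
Z = rng.standard_normal((batch, n_out))

out = Z + b                                  # forward: b is broadcast across the batch dimension
d_out = rng.standard_normal((batch, n_out))  # incoming gradient, same shape as the output

db = d_out.sum(axis=0)                       # backward: sum over the broadcast (batch) dimension
assert db.shape == b.shape
```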
The fourth pattern concerns indexing. If the forward operation selects a small number of entries from a larger tensor, say, looking up word embeddings by index, the backward pass produces a gradient that is zero everywhere except at the indexed positions, where it equals the incoming gradient. Implemented efficiently this is a sparse scatter rather than a dense add.
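And the indexing pattern, for an embedding lookup. The sketch below uses np.add.at for the scatter-add into a dense zero matrix; real frameworks keep the result sparse, but the accumulation rule for repeated indices is the same.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab, dim = 10, 4
E = rng.standard_normal((vocab, dim))     # embedding table
idx = np.array([2, 7, 2])                 # token indices; note the repeated index

emb = E[idx]                              # forward: gather rows, shape (3, dim)
d_emb = rng.standard_normal((3, dim))     # incoming gradient

dE = np.zeros_like(E)
np.add.at(dE, idx, d_emb)                 # backward: scatter-add; repeated indices accumulate
assert np.allclose(dE[2], d_emb[0] + d_emb[2])   # row 2 was used twice, so its gradient is a sum
```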
These four patterns together cover the backward pass through almost every standard layer in modern deep learning, and recognising them speeds up the reading of any architecture diagram considerably.
What you should take away
The catalogue is small. Roughly fifteen identities cover the bulk of every gradient calculation in machine learning; once they are familiar, derivations in papers become readable rather than mysterious.
The least-squares gradient $2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b})$ is the workhorse of linear modelling and produces the normal equations when set to zero.
The softmax-plus-cross-entropy gradient simplifies to predicted minus true, $s_i - y_i$, which is why frameworks fuse these two operations into one primitive and why classifier backward passes are so cheap.
Activation derivatives are elementwise and should be cached during the forward pass for cheap reuse during the backward pass; sigmoid's $\sigma(1-\sigma)$ and tanh's $1-\tanh^2$ are the canonical examples.
For anything more elaborate than these patterns (layer norm, attention, fancy normalising flows), trust autograd. Hand derivation is for understanding; the framework is for production.