3.4 Partial derivatives and gradients
Almost every story in modern machine learning ends with the same sentence: we adjusted the parameters until the loss got smaller. That sentence is doing a great deal of work. The model in question may have a hundred million parameters, or a hundred billion, and the loss is a single number that we want as small as possible. The natural question is: when we change one parameter slightly, by how much does the loss change? When we change all of them at once, in some chosen direction, what happens? These are questions about rates of change, and rates of change are exactly what calculus is for.
In §3.3 we met the derivative of a function of one variable. That single number told us the slope of the function, how fast it rises or falls as we nudge the input. The same idea applies in higher dimensions, but with one important upgrade. When the function depends on many inputs, there is no single slope. Instead, there is a slope along each axis, and these slopes assemble into a vector called the gradient. The gradient points in the direction in which the function increases most steeply. Its negative points the other way. That negative direction is the one we follow whenever we train a neural network: down the slope of the loss surface, one small step at a time. This section explains exactly what those words mean.
In this section the function still produces a single number, a scalar, but it can take a whole vector as input. That is the right setting for a loss function, which maps a parameter vector to a single real number measuring how badly the model is doing.
Partial derivatives
A partial derivative is the derivative of a multi-variable function taken with respect to one of its variables, while every other variable is held still. The notation looks intimidating, that curly $\partial$ instead of the usual $d$, but the idea is unfussy. Imagine you are standing on a hillside. The height of the ground depends on your east–west position $x_1$ and your north–south position $x_2$. The partial derivative $\partial f/\partial x_1$ asks: if I take a small step due east, leaving $x_2$ unchanged, how fast does the height change? The partial $\partial f/\partial x_2$ asks the same question for a step due north. The two answers are usually different, and that is fine. The hill does not have a single slope; it has a slope in every direction.
Formally, the partial derivative of $f$ with respect to $x_i$ at the point $\mathbf{x}$ is
$$\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h},$$
where $\mathbf{e}_i$ is the unit vector along axis $i$, that is, the vector with a $1$ in slot $i$ and zeros everywhere else. The expression $\mathbf{x} + h \mathbf{e}_i$ is just $\mathbf{x}$ with its $i$-th coordinate nudged by $h$. The limit is the same kind of object you saw in §3.3, only now applied to a single coordinate.
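The limit translates directly into a numerical estimate: nudge coordinate $i$ by a small $h$ and form the difference quotient. Here is a minimal sketch in Python; the test function and the step size are illustrative choices, not anything prescribed above.

```python
import numpy as np

def partial_fd(f, x, i, h=1e-6):
    """Finite-difference estimate of the partial derivative of f w.r.t. x_i at x."""
    e = np.zeros_like(x)
    e[i] = 1.0                        # the unit vector e_i along axis i
    return (f(x + h * e) - f(x)) / h  # the difference quotient from the definition

# An arbitrary smooth function of two variables, for illustration.
f = lambda x: np.sin(x[0]) * x[1]**2

x = np.array([0.5, 2.0])
print(partial_fd(f, x, 0))   # ~ cos(0.5) * 2^2    ~ 3.510
print(partial_fd(f, x, 1))   # ~ 2 * 2 * sin(0.5)  ~ 1.918
```

Shrinking $h$ further pushes the estimate toward the true partial, exactly as the limit promises, until floating-point round-off takes over.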
Computing partial derivatives is mechanical. You decide which variable is "live" and which are "dead", then you differentiate as if the dead ones were ordinary numbers, applying every rule from §3.3, the power rule, the product rule, the chain rule. There is nothing new to learn beyond that small trick of the imagination.
A small notational point. Some authors write $f_{x_i}$ instead of $\partial f/\partial x_i$, and others write $D_i f$. They all mean the same thing. The curly $\partial$ is purely a flag to remind us that other variables are present and being held fixed; it is not a different kind of derivative. If $f$ depended on only one variable, the partial would coincide exactly with the ordinary derivative $df/dx$, and we could go back to using $d$. The reason the symbol changed is just that, with several variables in play, the bookkeeping needs to be unambiguous.
A worked example will make this concrete. Take
$$f(x_1, x_2) = x_1^2 + 3 x_1 x_2 + x_2^3.$$
To find $\partial f/\partial x_1$, treat $x_2$ as a constant. The first term $x_1^2$ differentiates to $2 x_1$. The second term $3 x_1 x_2$ is a constant ($3 x_2$) times $x_1$, so it differentiates to $3 x_2$. The third term $x_2^3$ has no $x_1$ in it, so it is a constant and its derivative is zero. Adding these up,
$$\frac{\partial f}{\partial x_1} = 2 x_1 + 3 x_2.$$
For $\partial f/\partial x_2$, swap the roles. Now $x_1$ is constant. The first term has no $x_2$, so it contributes zero. The second term $3 x_1 x_2$ differentiates to $3 x_1$. The third term $x_2^3$ differentiates to $3 x_2^2$. So
$$\frac{\partial f}{\partial x_2} = 3 x_1 + 3 x_2^2.$$
At the point $(1, 2)$, plug the numbers in: $\partial f/\partial x_1 = 2(1) + 3(2) = 2 + 6 = 8$, and $\partial f/\partial x_2 = 3(1) + 3(2)^2 = 3 + 12 = 15$. Two slopes, two directions, two numbers.
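An automatic-differentiation library reports the same two numbers. A quick sketch using PyTorch (assuming it is installed), with the function written exactly as above:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
f = x[0]**2 + 3*x[0]*x[1] + x[1]**3   # the worked example f(x1, x2)
f.backward()                          # fills x.grad with the partial derivatives
print(x.grad)                         # tensor([ 8., 15.])
```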
The gradient vector
If we collect every partial derivative into a single object, we get the gradient:
$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \, \frac{\partial f}{\partial x_2}, \, \ldots, \, \frac{\partial f}{\partial x_n} \right).$$
The symbol $\nabla$ is read "nabla" or simply "del". The gradient is a vector in $\mathbb{R}^n$, the same shape as the input $\mathbf{x}$. If your parameters fit in a vector of length one billion, your gradient also has length one billion, with one slope per parameter. That is precisely how PyTorch and every other automatic-differentiation library will report it.
The geometric meaning of the gradient is the single most useful fact in optimisation. The gradient $\nabla f(\mathbf{x})$ points in the direction in which $f$ is rising most steeply at the point $\mathbf{x}$. Its length $\|\nabla f(\mathbf{x})\|$ tells you how steeply: it is the steepest rate of climb. The negative gradient $-\nabla f(\mathbf{x})$ therefore points in the direction in which $f$ falls most steeply, and that is the direction in which gradient descent steps. Training a neural network is, at its core, repeated tiny steps down the negative gradient of the loss.
Continuing our example, at the point $(1, 2)$ we computed the partials $\partial f/\partial x_1 = 8$ and $\partial f/\partial x_2 = 15$, so the gradient is
$$\nabla f(1, 2) = (8, 15)^\top.$$
If we wanted to increase $f$ as quickly as possible from this point, we would move in the direction $(8, 15)$. To decrease $f$, we move in the opposite direction $(-8, -15)$. Often we normalise this to a unit vector, dividing by its length: the unit vector pointing in the direction of fastest decrease is $-(8, 15)/\sqrt{8^2 + 15^2} = -(8, 15)/17$. The factor $17$ here is no accident: it is the norm of the gradient, and it will appear again in the next subsection.
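In code the normalisation is a one-liner; a quick numpy check of these numbers:

```python
import numpy as np

g = np.array([8.0, 15.0])      # the gradient at (1, 2)
norm = np.linalg.norm(g)       # 17.0
descent_dir = -g / norm        # unit vector of fastest decrease
print(norm, descent_dir)       # 17.0 [-0.470... -0.882...]
```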
Directional derivatives
The partial derivatives only tell us the rate of change along the coordinate axes. What if we want to know the rate of change in some other direction, say, north-east, or along some skew vector that mixes several coordinates at once? That quantity is the directional derivative.
If $\mathbf{u}$ is a unit vector (a vector of length $1$), then the directional derivative of $f$ at $\mathbf{x}$ in the direction $\mathbf{u}$ is
$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x})^\top \mathbf{u} = \|\nabla f(\mathbf{x})\| \, \cos\theta,$$
where $\theta$ is the angle between $\mathbf{u}$ and $\nabla f$. The first equality is a dot product of the gradient with the chosen direction. The second equality follows because, for any two vectors, the dot product equals the product of their norms times the cosine of the angle between them; since $\mathbf{u}$ has norm $1$, only $\|\nabla f\|$ remains.
This compact formula tells us a great deal:
- When $\theta = 0$, $\cos\theta = 1$, and $D_{\mathbf{u}} f = \|\nabla f\|$. The maximum rate of change is $\|\nabla f\|$, achieved when $\mathbf{u}$ is aligned with $\nabla f$. This is the formal proof that the gradient points in the direction of steepest ascent.
- When $\theta = 90°$, $\cos\theta = 0$, and $D_{\mathbf{u}} f = 0$. Directions perpendicular to the gradient give no first-order change in $f$. These are the directions that keep you on a level set, or contour line.
- When $\theta = 180°$, $\cos\theta = -1$, and $D_{\mathbf{u}} f = -\|\nabla f\|$. The minimum rate is achieved opposite to the gradient, the direction of steepest descent.
Back to the example. At $(1, 2)$ the gradient is $(8, 15)$. Choose $\mathbf{u} = (1, 0)$, the unit vector pointing along the $x_1$-axis. Then $D_{\mathbf{u}} f = (8)(1) + (15)(0) = 8$, exactly the partial derivative with respect to $x_1$, as it should be. Choose $\mathbf{u} = (0, 1)$ instead and you get $15$, the partial with respect to $x_2$. The maximum rate of change at this point, achieved along $\nabla f / \|\nabla f\| = (8, 15)/17$, is $\|\nabla f\| = \sqrt{64 + 225} = \sqrt{289} = 17$. No coordinate direction is steeper than that.
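The same numbers fall out of a few dot products. A short numpy sketch of the checks in this paragraph:

```python
import numpy as np

g = np.array([8.0, 15.0])                 # gradient of f at (1, 2)

def directional(g, u):
    """Directional derivative D_u f = grad(f) . u, for a unit vector u."""
    return g @ u

print(directional(g, np.array([1.0, 0.0])))           # 8.0, the partial w.r.t. x1
print(directional(g, np.array([0.0, 1.0])))           # 15.0, the partial w.r.t. x2
print(directional(g, g / np.linalg.norm(g)))          # 17.0, the steepest rate of climb
print(directional(g, np.array([-15.0, 8.0]) / 17.0))  # 0.0, tangent to the level set
```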
Level sets and orthogonality
A level set of $f$ is the collection of all points where $f$ takes a particular value $c$, written $\{\mathbf{x} : f(\mathbf{x}) = c\}$. On a contour map of a hill, the level sets are the contour lines, the curves of constant elevation. For a function of three variables, the level sets are surfaces; in higher dimensions, wherever the gradient is nonzero, the level set through a point is a smooth $(n-1)$-dimensional surface.
A simple but powerful theorem ties level sets to gradients: at every point, the gradient $\nabla f$ is orthogonal (perpendicular) to the level set passing through that point.
The proof is a one-liner. Walk along the level set. Because $f$ is constant on the level set, the directional derivative along any tangent direction must be zero. But the directional derivative is $\nabla f^\top \mathbf{u}$, so the gradient has zero dot product with every tangent, which is exactly what it means for the gradient to be perpendicular to the level set.
This orthogonality is doing serious work in the background of many modern ideas. It is why a contour plot of a loss surface, with arrows for the gradient drawn on it, always shows arrows piercing the contours at right angles. It is the geometric reason why the method of Lagrange multipliers works: at the optimum of a constrained problem, the gradient of the objective lines up with a combination of the gradients of the constraints. It is the foundation of the KKT conditions for inequality-constrained optimisation. And it is the starting intuition for gradient flows on manifolds, for instance, training a neural network whose weights are constrained to lie on a sphere or in some other curved space.
There is a related and equally useful fact about contour spacing. If two adjacent contours correspond to values $c$ and $c + \Delta c$, the perpendicular distance between them at a point is approximately $\Delta c / \|\nabla f\|$. Where the gradient is large, the contours are bunched tightly together, a steep slope; where the gradient is small, the contours are spread out, a gentle plain. This is the geometry behind the visual intuition that loss surfaces have valleys and ridges, and it explains why optimisation can stall on flat regions long before the loss has truly converged.
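The orthogonality claim can also be checked numerically: a small step perpendicular to the gradient changes $f$ only at second order in the step size, while a step along the gradient changes it at first order. A sketch using the example function of this section:

```python
import numpy as np

f = lambda x: x[0]**2 + 3*x[0]*x[1] + x[1]**3
x = np.array([1.0, 2.0])
g = np.array([8.0, 15.0])                  # gradient at (1, 2)
tangent = np.array([-15.0, 8.0]) / 17.0    # unit vector perpendicular to g

for h in [1e-1, 1e-2, 1e-3]:
    along_grad = f(x + h * g / 17.0) - f(x)  # shrinks like h   (first order)
    along_tang = f(x + h * tangent) - f(x)   # shrinks like h^2 (second order)
    print(h, along_grad, along_tang)
```

Each tenfold reduction in $h$ cuts the change along the gradient by about ten and the change along the tangent by about a hundred.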
Worked example: gradient of a quadratic loss
Linear regression gives us an opportunity to compute a gradient that you will see again and again. The loss is
$$f(\mathbf{w}) = \tfrac{1}{2} \| \mathbf{X} \mathbf{w} - \mathbf{y} \|^2 = \tfrac{1}{2} (\mathbf{X}\mathbf{w} - \mathbf{y})^\top (\mathbf{X}\mathbf{w} - \mathbf{y}),$$
where $\mathbf{X}$ is the data matrix (rows = examples, columns = features), $\mathbf{w}$ is the parameter vector, and $\mathbf{y}$ is the vector of targets. The factor of $\tfrac{1}{2}$ is conventional and merely makes the gradient cleaner.
Differentiating with respect to $\mathbf{w}$, using the matrix-calculus shortcut from §2.9, gives
$$\nabla_{\mathbf{w}} f = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}).$$
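Before using this formula it is worth confirming it numerically once, if only to build trust in the matrix-calculus shortcut. A small sketch comparing the matrix expression with a central-difference estimate on random data (the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # 5 examples, 3 features
y = rng.normal(size=5)
w = rng.normal(size=3)

loss = lambda w: 0.5 * np.sum((X @ w - y)**2)

analytic = X.T @ (X @ w - y)    # the formula derived above

# Central-difference gradient, one coordinate at a time.
h = 1e-6
numeric = np.array([(loss(w + h*e) - loss(w - h*e)) / (2*h) for e in np.eye(3)])

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```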
Setting this gradient to zero (because at a minimum of a smooth function the gradient must vanish) yields the normal equations:
$$\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}.$$
This is the closed-form least-squares solution. It is one of the few cases in machine learning where you can just solve for the optimum instead of iterating. Most loss surfaces are not so kind, and that is why we will need gradient descent.
Numerical check. Let
$$\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2 \\ 5 \end{pmatrix}.$$
Compute the bits one by one. $\mathbf{X}^\top \mathbf{X} = \begin{pmatrix} 1 \cdot 1 + 1 \cdot 1 & 1 \cdot 1 + 1 \cdot 2 \\ 1 \cdot 1 + 2 \cdot 1 & 1 \cdot 1 + 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 3 & 5 \end{pmatrix}$. And $\mathbf{X}^\top \mathbf{y} = \begin{pmatrix} 1 \cdot 2 + 1 \cdot 5 \\ 1 \cdot 2 + 2 \cdot 5 \end{pmatrix} = (7, 12)^\top$. The normal equations now read
$$\begin{pmatrix} 2 & 3 \\ 3 & 5 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 12 \end{pmatrix}.$$
Two equations, two unknowns: $2 w_1 + 3 w_2 = 7$ and $3 w_1 + 5 w_2 = 12$. Multiply the first by $3$ and the second by $2$: $6 w_1 + 9 w_2 = 21$ and $6 w_1 + 10 w_2 = 24$. Subtract to get $w_2 = 3$, and back-substitute to get $w_1 = -1$. So
$$\mathbf{w} = (-1, 3)^\top.$$
A quick sanity check: the line $y = -1 + 3 x$ passes through $(1, 2)$ and $(2, 5)$, exactly the two data points encoded by $\mathbf{X}$ and $\mathbf{y}$. Two points, two unknowns, perfect fit. This is the same answer you would get from the two-point example in §3.1.
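The same arithmetic in numpy, for readers who would rather let the machine do the elimination:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 5.0])

w = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(w)        # [-1.  3.]
print(X @ w)    # [2. 5.]  -- the fit passes through both data points
```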
Sketch of gradient descent
When the loss surface does not admit a tidy closed-form minimum (and for almost any neural network, it does not), we minimise $f$ iteratively. The plainest method is gradient descent:
- Start at some initial point $\mathbf{x}_0$.
- Compute the gradient $\nabla f(\mathbf{x}_t)$.
- Update $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \nabla f(\mathbf{x}_t)$.
- Repeat until convergence (or until you run out of patience).
The step size $\eta > 0$ is the learning rate. Choose it small enough and each step actually decreases the loss; choose it too large and you overshoot, oscillate, or diverge. For a convex $f$ with a well-behaved (Lipschitz-continuous) gradient and a sufficiently small $\eta$, convergence to the global minimum is guaranteed. For the non-convex losses that dominate modern ML, no such guarantee exists, but the method works well in practice. The careful theory and the practical variants (momentum, RMSProp, Adam, learning-rate schedules) fill §3.9.
To see gradient descent in action, return to the regression example. Start at $\mathbf{w}_0 = (0, 0)$, with $\eta = 0.1$. Compute $\nabla f(\mathbf{w}_0) = \mathbf{X}^\top (\mathbf{X} \mathbf{w}_0 - \mathbf{y}) = -\mathbf{X}^\top \mathbf{y} = (-7, -12)^\top$. Step: $\mathbf{w}_1 = (0,0) - 0.1 \cdot (-7, -12) = (0.7, 1.2)$. Compute the new gradient and step again. The loss drops quickly at first, but the iterates then creep along a shallow valley of the surface, and only after a few hundred iterations do the parameters settle close to $(-1, 3)$, the same closed-form solution we derived above, recovered by following the negative gradient downhill. Every neural network you have ever heard of was trained by essentially this procedure, scaled up and refined.
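A minimal sketch of that loop in numpy, run on the same two-point data set (the print schedule is only for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 5.0])

w = np.zeros(2)    # w_0 = (0, 0)
eta = 0.1          # learning rate

for t in range(500):
    grad = X.T @ (X @ w - y)   # gradient of the quadratic loss at the current w
    w = w - eta * grad         # one step down the negative gradient
    if t + 1 in (1, 10, 100, 500):
        print(t + 1, w)

# Step 1 gives (0.7, 1.2), as computed above; by a few hundred steps
# w is very close to the closed-form solution (-1, 3).
```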
What you should take away
- A partial derivative is a single-variable derivative in disguise: hold every other variable fixed and differentiate in the usual way.
- The gradient $\nabla f$ collects all partial derivatives into a vector, the same shape as the input. For a loss with a billion parameters, the gradient also has a billion entries.
- The gradient points in the direction of steepest ascent; its negative points in the direction of steepest descent. Its norm is the steepest rate of change.
- Directional derivatives read off the rate of change along any unit vector: $D_{\mathbf{u}} f = \nabla f^\top \mathbf{u}$. The gradient is orthogonal to level sets, which is the geometric backbone of optimisation under constraints.
- Gradient descent is the simplest possible algorithm that uses these ideas: at each step, compute the gradient and move a small distance in its negative direction. Almost every modern training loop is a refinement of this one line.