Gradient, Glossary, Textbook of AI

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ at a point $x$ is the vector of partial derivatives:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right).$$
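As a rough illustration of the definition, the sketch below approximates each partial derivative with a central difference and collects the results into a vector. The names `numerical_gradient` and `example_f`, the step `h`, and the test function $f(x_1, x_2) = x_1^2 + 3 x_1 x_2$ are assumptions chosen for this example, not part of the entry.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one partial derivative at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h   # nudge only coordinate i upward
        x_minus[i] -= h  # and downward
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def example_f(x):
    # Hypothetical test function: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]


# Analytic gradient is (2*x1 + 3*x2, 3*x1), i.e. (8, 3) at the point (1, 2).
grad = numerical_gradient(example_f, [1.0, 2.0])
print(grad)
```

Finite differences are useful mainly as a sanity check; in practice, gradients of large models are usually obtained by automatic differentiation instead.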

Wherever $\nabla f(x) \neq 0$, the gradient points in the direction of steepest ascent of $f$: moving from $x$ in the direction $\nabla f(x)$ increases $f$ most rapidly. Conversely, $-\nabla f(x)$ is the direction of steepest descent, which is why gradient-descent optimisation steps in the negative gradient direction.
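A minimal sketch of the steepest-descent idea follows, assuming a simple quadratic $f(x_1, x_2) = (x_1 - 1)^2 + 2(x_2 + 3)^2$ with its gradient written out by hand. The function names, the step size, and the number of steps are illustrative choices only.

```python
def grad_f(x):
    # Analytic gradient of the assumed example f(x1, x2) = (x1 - 1)^2 + 2*(x2 + 3)^2
    return [2 * (x[0] - 1), 4 * (x[1] + 3)]


def gradient_descent(grad, x0, step_size=0.1, num_steps=200):
    """Repeatedly step in the negative gradient direction."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# The minimiser of the example function is (1, -3).
x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)
```

The fixed step size works here because the example is a well-conditioned quadratic; other functions may need a smaller step or a schedule.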

For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the analogous object is the Jacobian matrix $J_f \in \mathbb{R}^{m \times n}$, whose $(i,j)$ entry is $\partial f_i / \partial x_j$. For the second-order derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian $H_f \in \mathbb{R}^{n \times n}$ collects the second partial derivatives $\partial^2 f / \partial x_i \partial x_j$.
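To make the shapes concrete, the following sketch estimates a Jacobian and a Hessian by finite differences. Both test functions, the helper names, and the step sizes are assumptions made for this example.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Return J as a list of m rows, each with n entries d f_i / d x_j."""
    m = len(f(x))
    n = len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


def numerical_hessian(f, x, h=1e-4):
    """Return H as an n-by-n list of second partials of a scalar function f."""
    n = len(x)

    def shifted(i, si, j, sj):
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)

    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


def vector_f(x):
    # Hypothetical map R^2 -> R^2 with Jacobian [[x2, x1], [1, 2*x2]]
    return [x[0] * x[1], x[0] + x[1] ** 2]


def scalar_f(x):
    # Hypothetical scalar function with constant Hessian [[2, 3], [3, 0]]
    return x[0] ** 2 + 3 * x[0] * x[1]


jac = numerical_jacobian(vector_f, [2.0, 3.0])   # close to [[3, 2], [1, 6]]
hess = numerical_hessian(scalar_f, [1.0, 2.0])   # close to [[2, 3], [3, 0]]
print(jac)
print(hess)
```

Note that the estimated Hessian is symmetric, as expected for a function with continuous second partial derivatives.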

In machine learning, $f$ is typically a loss function and $x$ is the vector of parameters; backpropagation is the algorithm that computes $\nabla f$ efficiently. The norm $\|\nabla f\|$ measures how steep the loss surface is locally: a large gradient norm means rapid change, a small one means the surface is nearly flat. Gradient clipping caps $\|\nabla f\|$ at a threshold to prevent exploding-gradient instabilities.
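One common form of gradient clipping rescales the gradient so that its Euclidean norm does not exceed the threshold, leaving its direction unchanged. The sketch below assumes that variant; the name `clip_by_norm` and the example values are illustrative.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already within the threshold: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]


# A gradient of norm 5 is rescaled to norm 1, keeping its direction.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
# A gradient already below the threshold passes through untouched.
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(clipped, unchanged)
```

Other schemes clip each component separately, which can change the direction of the update; which one is appropriate depends on the training setup.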

The directional derivative along a unit vector $u$ is the inner product of $u$ with the gradient: $D_u f(x) = u \cdot \nabla f(x)$. The level sets of $f$ (curves or surfaces of constant value) are everywhere perpendicular to the gradient. For sufficiently smooth $f$, the gradient also gives the linearisation $f(x + \delta) = f(x) + \delta \cdot \nabla f(x) + O(\|\delta\|^2)$, the basis of every first-order optimisation method.
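The directional derivative and the second-order size of the linearisation error can be checked numerically. This sketch assumes the example function $f(x_1, x_2) = x_1^2 + 3 x_1 x_2$, the point $(1, 2)$, and the unit vector $u = (0.6, 0.8)$, all chosen purely for illustration.

```python
def f(x):
    # Assumed example function: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]


def grad_f(x):
    # Its analytic gradient: (2*x1 + 3*x2, 3*x1)
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x = [1.0, 2.0]
u = [0.6, 0.8]  # unit vector, since 0.6^2 + 0.8^2 = 1

# Directional derivative D_u f(x) = u . grad f(x)
directional = dot(u, grad_f(x))


def linearisation_error(t):
    """Gap between f(x + t*u) and its first-order approximation."""
    delta = [t * ui for ui in u]
    x_new = [xi + di for xi, di in zip(x, delta)]
    return abs(f(x_new) - (f(x) + dot(delta, grad_f(x))))


# Halving the step length should cut the error by about a factor of four.
err_large = linearisation_error(1e-2)
err_small = linearisation_error(5e-3)
print(directional, err_large / err_small)
```

The factor of roughly four when the step is halved is what an $O(\|\delta\|^2)$ remainder predicts.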

Interactive

One neuron: forward pass and backward pass. The forward pass produces a value; backpropagation sends gradients back through the same graph.

The gradient as a vector field. Tiny arrows on a contour plot point uphill, in the direction of steepest ascent at every location.

Partial derivatives slice a surface. Hold one variable fixed and measure the slope along the other. The two partials together make up the gradient.

Related terms: Chain Rule, Gradient Descent, Backpropagation, Jacobian, Hessian

Discussed in:

Chapter 3: Calculus, Calculus