The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ at a point $x$ is the vector of partial derivatives:
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right).$$
When $\nabla f(x) \neq 0$, the gradient points in the direction of steepest ascent of $f$: moving from $x$ in the direction of $\nabla f(x)$ increases $f$ most rapidly. Conversely, $-\nabla f(x)$ is the direction of steepest descent, which is why gradient-descent optimisation moves in the negative gradient direction.
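As a minimal sketch of the definition, the gradient can be approximated numerically with central differences. The function `f`, the helper `numerical_gradient`, the point `x`, and the step sizes are all illustrative assumptions, not part of any particular library.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        # Partial derivative with respect to x_i.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(x):
    # Hypothetical example: f(x1, x2) = x1^2 + 3*x2^2,
    # whose exact gradient is (2*x1, 6*x2).
    return x[0] ** 2 + 3 * x[1] ** 2


x = [1.0, 2.0]
grad = numerical_gradient(f, x)  # close to the exact gradient (2, 12)

# A small step along -grad lowers f; a small step along +grad raises it.
step = 0.01
x_descent = [xi - step * gi for xi, gi in zip(x, grad)]
x_ascent = [xi + step * gi for xi, gi in zip(x, grad)]
```

Here `f(x_descent)` comes out smaller than `f(x)` and `f(x_ascent)` larger, which matches the steepest-descent and steepest-ascent interpretation. In practice, automatic differentiation is usually preferred over finite differences for accuracy and cost.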
For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the analogous object is the Jacobian matrix $J_f \in \mathbb{R}^{m \times n}$, whose $(i,j)$ entry is $\partial f_i / \partial x_j$. For second-order derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian $H_f \in \mathbb{R}^{n \times n}$ contains the second partial derivatives $\partial^2 f / \partial x_i \partial x_j$.
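To make the $m \times n$ layout of the Jacobian concrete, the following sketch builds it column by column with central differences. The map `g` and the helper `numerical_jacobian` are hypothetical names chosen for illustration.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m-by-n Jacobian of f at x; entry [i][j] is d f_i / d x_j."""
    m = len(f(x))
    n = len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        f_plus = f(forward)
        f_minus = f(backward)
        # Perturbing x_j fills column j of the Jacobian.
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


def g(x):
    # Hypothetical map from R^2 to R^3.
    return [x[0] * x[1], x[0] ** 2, 3.0 * x[1]]


J = numerical_jacobian(g, [2.0, 5.0])
# The exact Jacobian at (2, 5) is [[5, 2], [4, 0], [0, 3]].
```

Each row of `J` is the gradient of one output component, so for $m = 1$ the Jacobian reduces to the gradient written as a row.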
In machine learning, $f$ is typically a loss function and $x$ is the vector of parameters; computing $\nabla f$ efficiently is what backpropagation does. The norm $\|\nabla f\|$ measures how steep the loss surface is locally: a large gradient norm means rapid change, a small one means the surface is nearly flat. Gradient clipping caps $\|\nabla f\|$ at a threshold to prevent exploding-gradient instabilities.
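One common form of gradient clipping rescales the whole gradient when its Euclidean norm exceeds the threshold, which preserves its direction. The sketch below assumes that variant; the function name `clip_by_norm` and the example values are illustrative, and deep-learning frameworks ship their own equivalents.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        # Already within the threshold: return an unchanged copy.
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


large = [30.0, 40.0]  # Euclidean norm 50
clipped = clip_by_norm(large, max_norm=5.0)  # same direction, norm about 5

small = [0.3, 0.4]  # Euclidean norm 0.5
unchanged = clip_by_norm(small, max_norm=5.0)  # left as-is
```

Clipping by norm differs from clipping each component independently, which can change the direction of the update.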
The directional derivative along a unit vector $u$ is the inner product of $u$ with the gradient: $D_u f(x) = u \cdot \nabla f(x)$. The level sets of $f$ (curves or surfaces of constant value) are everywhere perpendicular to the gradient. The gradient also satisfies the linearisation $f(x + \delta) = f(x) + \delta \cdot \nabla f(x) + O(\|\delta\|^2)$ for sufficiently smooth $f$, which is the basis of every first-order optimisation method.
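These identities can be checked numerically on a small example. The sketch below assumes a hypothetical quadratic `f` with a hand-derived gradient `grad_f`. It compares $u \cdot \nabla f(x)$ with a finite-difference slope along $u$, and measures how the linearisation error shrinks as $\delta$ is halved.

```python
def f(x):
    # Hypothetical example: f(x1, x2) = x1^2 + 3*x1*x2.
    return x[0] ** 2 + 3 * x[0] * x[1]


def grad_f(x):
    # Exact gradient of the example above, derived by hand.
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x = [1.0, 2.0]
g = grad_f(x)

# Directional derivative along the unit vector u = (0.6, 0.8).
u = [0.6, 0.8]
directional = dot(u, g)

# Finite-difference slope of f along u, for comparison.
t = 1e-5
x_plus = [xi + t * ui for xi, ui in zip(x, u)]
x_minus = [xi - t * ui for xi, ui in zip(x, u)]
finite_diff = (f(x_plus) - f(x_minus)) / (2 * t)


def linearisation_error(scale):
    """Gap between f(x + delta) and its first-order approximation."""
    delta = [scale, scale]
    x_new = [xi + di for xi, di in zip(x, delta)]
    return abs(f(x_new) - (f(x) + dot(delta, g)))


err_big = linearisation_error(0.1)
err_small = linearisation_error(0.05)
```

For this quadratic, halving $\delta$ reduces the error by a factor of about four, which is consistent with the $O(\|\delta\|^2)$ remainder.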
Related terms: Chain Rule, Gradient Descent, Backpropagation, Jacobian, Hessian
Discussed in:
- Chapter 3: Calculus