Tiny arrows on a contour plot point uphill, showing the direction of steepest ascent at every location.
From the chapter: Chapter 3: Calculus
Glossary: gradient, partial derivative
Transcript
A function of two variables draws a surface. Slice it horizontally and you get contour lines, like the rings on a topographic map.
At every point on this map we can ask: which direction goes uphill the fastest?
The answer is the gradient: the vector of partial derivatives. It points perpendicular to the contour line, away from low values and toward high ones.
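A minimal numerical sketch of that idea, using a made-up example surface f(x, y) = x² + y² that is not part of the video: the gradient is the vector of partial derivatives, and its dot product with the direction along the contour line comes out (numerically) zero.

```python
import numpy as np

def f(x, y):
    # Hypothetical example surface (an assumption, not from the transcript): a simple bowl.
    return x**2 + y**2

def grad(x, y, h=1e-6):
    # Gradient = vector of partial derivatives, estimated by central differences.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

g = grad(1.0, 2.0)                 # points uphill: here roughly (2, 4)
tangent = np.array([-g[1], g[0]])  # direction along the contour line
print(g, np.dot(g, tangent))       # dot product ~ 0: gradient is perpendicular to the contour
```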
We draw this vector at a grid of points. Long arrows where the surface is steep. Short arrows where it is nearly flat.
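One way to produce that picture, sketched here with NumPy and Matplotlib for the same hypothetical bowl f(x, y) = x² + y²: evaluate the gradient on a grid and let a quiver plot draw an arrow at each point, longer where the slope is larger.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of sample points over the plane.
x = np.linspace(-2, 2, 21)
y = np.linspace(-2, 2, 21)
X, Y = np.meshgrid(x, y)

# Hypothetical surface (assumption, not from the transcript) and its gradient.
Z = X**2 + Y**2
U, V = 2 * X, 2 * Y        # partial derivatives df/dx, df/dy

plt.contour(X, Y, Z, levels=10)   # the topographic rings
plt.quiver(X, Y, U, V)            # arrows: long where steep, short where nearly flat
plt.gca().set_aspect("equal")
plt.show()
```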
Now look at the saddle. The gradient flips direction as we cross it, pointing toward the saddle along one axis and away from it along the other. Look at the minimum. The arrows all point outward, away from the bottom of the basin. Look at the maximum. The arrows all point inward, toward the peak.
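To make the three cases concrete, here is a small check with hypothetical example surfaces (my choices, not the transcript's): a bowl x² + y² for the minimum, its negative for the maximum, and x² − y² for the saddle.

```python
import numpy as np

def num_grad(f, x, y, h=1e-6):
    # Central-difference gradient of a scalar function of two variables.
    return np.array([(f(x + h, y) - f(x - h, y)) / (2 * h),
                     (f(x, y + h) - f(x, y - h)) / (2 * h)])

bowl   = lambda x, y: x**2 + y**2      # minimum at the origin
peak   = lambda x, y: -(x**2 + y**2)   # maximum at the origin
saddle = lambda x, y: x**2 - y**2      # saddle at the origin

print(num_grad(bowl, 0.5, 0.5))     # points outward, away from the basin's bottom
print(num_grad(peak, 0.5, 0.5))     # points inward, toward the peak
print(num_grad(saddle,  0.5, 0.0))  # on this side of the saddle ...
print(num_grad(saddle, -0.5, 0.0))  # ... the arrow flips to the opposite direction
```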
Gradient descent walks against this field. It steps in the direction of the negative gradient, moving downhill until it reaches a place where the gradient is zero.
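A bare-bones sketch of that walk, again on the hypothetical bowl x² + y², with a step size chosen purely for illustration: repeatedly step along the negative gradient until the gradient is (numerically) zero.

```python
import numpy as np

def grad(p):
    # Gradient of the assumed example bowl f(x, y) = x^2 + y^2.
    return 2 * p

p = np.array([1.5, -2.0])   # arbitrary starting point
lr = 0.1                    # step size (learning rate), chosen for illustration

while np.linalg.norm(grad(p)) > 1e-8:
    p = p - lr * grad(p)    # step against the gradient field: downhill

print(p)                    # ends at (approximately) the minimum, [0, 0]
```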
Stationary points come in three flavours: minima, maxima, and saddles. A neural network's loss surface has trillions of them.
The gradient does not know which is which. It only knows uphill, locally. Everything in optimisation is built on this one fact.
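A quick illustration of that last point, using the analytic gradients of the same three example surfaces: at the minimum, the maximum, and the saddle, the gradient is exactly zero, so the gradient alone cannot tell the three apart.

```python
import numpy as np

# Analytic gradients of the three hypothetical surfaces, evaluated at their stationary point.
grad_bowl   = lambda x, y: np.array([ 2 * x,  2 * y])   # minimum of  x^2 + y^2
grad_peak   = lambda x, y: np.array([-2 * x, -2 * y])   # maximum of -(x^2 + y^2)
grad_saddle = lambda x, y: np.array([ 2 * x, -2 * y])   # saddle of   x^2 - y^2

for name, g in [("minimum", grad_bowl), ("maximum", grad_peak), ("saddle", grad_saddle)]:
    print(name, g(0.0, 0.0))   # all three print [0. 0.]: locally, they look the same
```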