Plain SGD oscillates across the walls; momentum smooths the path along the valley floor.
From Chapter 10: Training & Optimisation
Glossary: momentum
Transcript
A loss surface with a long, narrow valley. The valley is gentle along its length, steep across its width.
Plain stochastic gradient descent. At each step, follow the negative gradient. Across the valley, the gradient is enormous; the step jumps to the other wall. Then back. The trajectory bounces side to side, making slow progress along the valley.
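A minimal sketch of that bouncing, assuming an illustrative quadratic valley with loss 0.5*(x^2 + 50*y^2). All constants here are made up for the demo, not taken from the chapter.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical valley loss 0.5 * (x**2 + 50 * y**2):
    # gentle along x (the valley floor), steep along y (the walls).
    return np.array([1.0 * w[0], 50.0 * w[1]])

w = np.array([-10.0, 1.0])   # start far down the valley, up on one wall
lr = 0.035                   # large enough that the steep axis overshoots

for step in range(10):
    w = w - lr * grad(w)     # plain SGD: step along the negative gradient
    print(step, w.round(3))
# y flips sign every step (wall to wall) while x only creeps toward 0.
```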
Add momentum. Maintain a velocity vector. At each step, decay the velocity, typically by a factor of 0.9, and add the negative-gradient step. Then move along the velocity.
The velocity averages many recent gradients. Across the valley, the gradients alternate sign and cancel. Along the valley, they all point the same way and add up.
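The same sketch with a velocity term, in heavy-ball form. The 0.9 decay is from the transcript; the valley and step size are the same illustrative constants as above.

```python
import numpy as np

def grad(w):
    # Same hypothetical valley: gentle along x, steep along y.
    return np.array([1.0 * w[0], 50.0 * w[1]])

w = np.array([-10.0, 1.0])
v = np.zeros(2)                    # velocity, starting at rest
lr, beta = 0.035, 0.9              # 0.9 decay per the transcript; lr illustrative

for step in range(10):
    v = beta * v - lr * grad(w)    # decay the old velocity, add the negative-gradient step
    w = w + v                      # move along the velocity
    print(step, w.round(3))
# The alternating cross-valley gradients keep v[1] from building up, while the
# consistent along-valley gradients let v[0] grow (toward roughly 1/(1-beta)
# times a single step), so w covers the length of the valley far faster than
# the plain-SGD loop above over the same number of steps.
```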
The result. The trajectory cuts through the oscillations and rolls smoothly along the valley floor. Convergence is dramatically faster.
Watch a 2D contour plot with both methods running side by side. SGD bounces erratically. Momentum coasts.
Nesterov's accelerated gradient is a refinement: take the gradient at the lookahead position, where momentum will carry you next. Slightly faster in theory, often used in practice.
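A sketch of the lookahead form (Sutskever-style), reusing the same made-up valley and constants:

```python
import numpy as np

def grad(w):
    return np.array([1.0 * w[0], 50.0 * w[1]])   # same hypothetical valley

w = np.array([-10.0, 1.0])
v = np.zeros(2)
lr, beta = 0.035, 0.9

for step in range(10):
    lookahead = w + beta * v             # where the current velocity is about to carry us
    v = beta * v - lr * grad(lookahead)  # gradient evaluated at the lookahead point
    w = w + v
    print(step, w.round(3))
```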
Momentum is the simplest trick built from gradient history, and it is still a first-order method. RMSProp adds adaptive per-parameter scaling; Adam combines the two. The intuition starts here.
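To show how the two pieces combine, here is the standard Adam update on the same made-up valley; the step size is illustrative, the rest follows Kingma & Ba.

```python
import numpy as np

def grad(w):
    return np.array([1.0 * w[0], 50.0 * w[1]])   # same hypothetical valley

w = np.array([-10.0, 1.0])
m = np.zeros(2)          # first moment: the momentum-style running average
s = np.zeros(2)          # second moment: running average of squared gradients
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 11):
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # velocity (momentum part)
    s = b2 * s + (1 - b2) * g * g    # per-parameter scale (RMSProp part)
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero initialisation
    s_hat = s / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    print(t, w.round(3))
```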