Glossary

Gradient Descent

Gradient descent is the foundational optimisation algorithm of machine learning. To minimise a differentiable loss $L(\theta)$ over parameters $\theta \in \mathbb{R}^d$, the algorithm iteratively updates

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is the gradient of $L$ at $\theta_t$. The negative gradient is the direction of steepest descent.

For convex $L$ with L-Lipschitz gradient, gradient descent with $\eta = 1/L$ converges at rate $O(1/T)$, the loss decreases as $1/T$ after $T$ steps. With strong convexity, convergence becomes geometric: $O(\rho^T)$ for some $\rho < 1$. Neural-network losses are non-convex, but empirically gradient descent and its variants reach good solutions.

Variants: Stochastic gradient descent (SGD): replace the full-batch gradient with a single-example or mini-batch estimate, dramatically reducing per-step cost at the price of gradient noise. Momentum: accumulate a velocity vector $v_{t+1} = \mu v_t - \eta \nabla L(\theta_t)$, $\theta_{t+1} = \theta_t + v_{t+1}$, with $\mu \in [0,1)$. Smooths gradient noise and accelerates progress in narrow valleys. Nesterov momentum: evaluate the gradient at the look-ahead point $\theta_t + \mu v_t$ rather than $\theta_t$. Adam: per-parameter adaptive learning rates from running estimates of the first and second moments of the gradient. Adagrad / RMSProp: variants of adaptive learning rate.

Learning-rate schedules are critical for deep networks: linear warm-up over the first thousand steps, cosine decay over training, or step decay at fixed milestones. The effective learning rate at a given step matters as much as the optimiser choice.

For very large models, gradient accumulation (sum gradients over many micro-batches before stepping) and gradient clipping (cap the gradient norm at a threshold) are essential engineering details.

Interactive

Gradient descent on a quadratic bowl. A ball rolls down a quadratic surface as the learning rate changes.

Video

Related terms: Stochastic Gradient Descent, Adam, Backpropagation, Chain Rule, Learning Rate

Discussed in:

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.