Gradient descent is the foundational optimisation algorithm of machine learning. To minimise a differentiable loss $L(\theta)$ over parameters $\theta \in \mathbb{R}^d$, the algorithm iteratively updates
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
where $\eta > 0$ is the learning rate and $\nabla L(\theta_t)$ is the gradient of $L$ at $\theta_t$. The negative gradient is the direction of steepest descent.
For convex $L$ with L-Lipschitz gradient, gradient descent with $\eta = 1/L$ converges at rate $O(1/T)$, the loss decreases as $1/T$ after $T$ steps. With strong convexity, convergence becomes geometric: $O(\rho^T)$ for some $\rho < 1$. Neural-network losses are non-convex, but empirically gradient descent and its variants reach good solutions.
Variants: Stochastic gradient descent (SGD): replace the full-batch gradient with a single-example or mini-batch estimate, dramatically reducing per-step cost at the price of gradient noise. Momentum: accumulate a velocity vector $v_{t+1} = \mu v_t - \eta \nabla L(\theta_t)$, $\theta_{t+1} = \theta_t + v_{t+1}$, with $\mu \in [0,1)$. Smooths gradient noise and accelerates progress in narrow valleys. Nesterov momentum: evaluate the gradient at the look-ahead point $\theta_t + \mu v_t$ rather than $\theta_t$. Adam: per-parameter adaptive learning rates from running estimates of the first and second moments of the gradient. Adagrad / RMSProp: variants of adaptive learning rate.
Learning-rate schedules are critical for deep networks: linear warm-up over the first thousand steps, cosine decay over training, or step decay at fixed milestones. The effective learning rate at a given step matters as much as the optimiser choice.
For very large models, gradient accumulation (sum gradients over many micro-batches before stepping) and gradient clipping (cap the gradient norm at a threshold) are essential engineering details.
Interactive
Video
Related terms: Stochastic Gradient Descent, Adam, Backpropagation, Chain Rule, Learning Rate
Discussed in:
- Chapter 10: Training & Optimisation, Training Optimisation