Gradient Descent is the optimisation algorithm that makes modern machine learning possible. Starting from an initial parameter vector $\boldsymbol{\theta}$, it repeatedly updates the parameters by taking a small step in the direction of the negative gradient: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} L$, where $\eta$ is the learning rate—a positive scalar controlling step size. Each update reduces the loss locally, and with a suitably small $\eta$, repeated iteration converges to a (local) minimum.
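The update rule can be sketched on a toy quadratic loss; the target vector, learning rate, and iteration count below are illustrative choices, not part of the definition:

```python
import numpy as np

# Gradient descent on the toy loss L(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target). Target and eta are example values.
target = np.array([3.0, -2.0])

def grad(theta):
    return 2.0 * (theta - target)

eta = 0.1               # learning rate
theta = np.zeros(2)     # initial parameter vector
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * grad L

print(theta)  # close to [3, -2]
```

Because this loss is convex, each step shrinks the distance to the minimum by a constant factor; on non-convex deep-learning losses the same update only guarantees local progress.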
The learning rate is the most important hyperparameter in deep learning. Too large and the updates overshoot, causing the loss to oscillate or diverge; too small and convergence is painfully slow. Learning rate schedules—strategies that adjust $\eta$ during training, such as step decay, cosine annealing, or warmup followed by decay—are essential for achieving good final performance.
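The three schedules named above can be sketched as simple functions of the step count; the base rate, decay factor, and step boundaries are arbitrary example values, not recommendations:

```python
import math

def step_decay(step, base_lr=0.1, drop=0.1, every=30):
    # Multiply the learning rate by `drop` every `every` steps.
    return base_lr * (drop ** (step // every))

def cosine_annealing(step, total_steps, base_lr=0.1, min_lr=0.0):
    # Anneal smoothly from base_lr to min_lr along a half cosine wave.
    t = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

def warmup_then_decay(step, warmup=10, total_steps=100, base_lr=0.1):
    # Linear warmup for `warmup` steps, then cosine annealing to zero.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    return cosine_annealing(step - warmup, total_steps - warmup, base_lr)
```

In practice the scheduled value simply replaces the constant $\eta$ in the update rule at each step.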
In its pure form, gradient descent computes the gradient over the entire training set (batch gradient descent). This is prohibitively expensive for large datasets. Stochastic gradient descent (SGD) instead estimates the gradient from a single random example, or more commonly a mini-batch of 32 to 512 examples. The resulting gradient is noisy but in expectation correct, and the noise actually helps: it enables the optimiser to escape shallow local minima and tends to find flatter minima that generalise better. Adaptive variants such as Adam, AdaGrad, and RMSProp further improve convergence by giving each parameter its own effective learning rate.
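A mini-batch SGD loop can illustrate the noisy-but-unbiased gradient estimate described above; the synthetic least-squares data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Mini-batch SGD on synthetic least-squares data: each step uses the
# gradient of the mean squared error over one shuffled mini-batch,
# a noisy but (in expectation) correct estimate of the full gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
eta, batch_size = 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= eta * grad

print(w)  # approximately true_w
```

Adaptive methods such as Adam keep the same loop structure but rescale `grad` per parameter using running moment estimates before the update.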
Related terms: Stochastic Gradient Descent, Adam, Learning Rate, Gradient, Loss Function
Discussed in:
- Chapter 3: Calculus — Gradient Descent
- Chapter 10: Training & Optimisation — Stochastic Gradient Descent
Also defined in: Textbook of AI, Textbook of Medical AI