For minimising $f(\theta) = \frac{1}{N} \sum_{n=1}^N f_n(\theta)$ (sum of per-example losses), stochastic gradient descent (SGD) uses gradient estimates from random subsets:
$$\theta_{t+1} = \theta_t - \eta_t \nabla f_{\mathcal{B}_t}(\theta_t)$$
where $\mathcal{B}_t$ is a mini-batch of $B$ examples drawn uniformly at random and $\nabla f_{\mathcal{B}_t} = \frac{1}{B} \sum_{n \in \mathcal{B}_t} \nabla f_n$.
The mini-batch gradient is unbiased: $\mathbb{E}_{\mathcal{B}}[\nabla f_{\mathcal{B}}(\theta)] = \nabla f(\theta)$. Variance scales as $\sigma^2 / B$ where $\sigma^2$ is the per-example gradient variance.
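A minimal sketch of the update in NumPy, on a toy least-squares problem (the problem, step size, and batch size are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f_n(theta) = 0.5 * (x_n . theta - y_n)^2
X = rng.normal(size=(1000, 5))
true_theta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def minibatch_grad(theta, idx):
    # Average per-example gradient x_n (x_n . theta - y_n) over the sampled batch:
    # an unbiased estimate of the full gradient.
    residual = X[idx] @ theta - y[idx]
    return X[idx].T @ residual / len(idx)

theta = np.zeros(5)
lr, batch_size = 0.05, 32
for t in range(2000):
    # Draw a mini-batch uniformly at random, step against its average gradient.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    theta = theta - lr * minibatch_grad(theta, idx)
```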
Convergence (convex case). For $L$-smooth convex $f$, SGD with iterate averaging and step size $\eta_t \propto 1/\sqrt{t}$ (or a constant $\eta \propto 1/\sqrt{T}$ tuned to the horizon) satisfies $\mathbb{E}[f(\bar\theta_T) - f^*] = O(1/\sqrt{T})$; this is slower than full gradient descent's $O(1/T)$ because the gradient noise never vanishes, and with a fixed constant step size SGD only converges to a noise-dominated neighbourhood of the optimum. If $f$ is additionally strongly convex, decreasing step sizes $\eta_t = O(1/t)$ improve the rate to $O((\log T)/T)$.
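A standard bound (stated here only up to constants, for $L$-smooth convex $f$ with per-example gradient variance $\sigma^2$) clarifies the trade-off behind these rates:
$$\mathbb{E}[f(\bar\theta_T) - f^*] \;\lesssim\; \frac{\|\theta_0 - \theta^*\|^2}{\eta T} + \eta \sigma^2,$$
so a constant $\eta$ leaves an $O(\eta\sigma^2)$ noise floor, while choosing $\eta \propto 1/\sqrt{T}$ balances the two terms at $O(1/\sqrt{T})$.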
Convergence (non-convex case, e.g. neural networks). SGD is only guaranteed to approach a stationary point (e.g. $\min_{t \le T} \mathbb{E}\|\nabla f(\theta_t)\|^2 = O(1/\sqrt{T})$ for smooth $f$); there is no guarantee of global optimality. Empirically, the flat minima SGD tends to find (regions where the loss varies gently) generalise better than sharp minima, a property often attributed to SGD's gradient noise acting as an implicit regulariser.
Momentum (Polyak 1964) accelerates SGD by maintaining a velocity vector:
$$v_{t+1} = \mu v_t + (1 - \mu) \nabla f_{\mathcal{B}_t}(\theta_t)$$ $$\theta_{t+1} = \theta_t - \eta v_{t+1}$$
With $\mu \approx 0.9$, momentum smooths gradient noise and accelerates progress along narrow valleys, where consecutive gradients align. On strongly convex quadratics, heavy-ball momentum improves the condition-number dependence from $O(\kappa)$ to $O(\sqrt{\kappa})$ iterations; for general smooth convex $f$, the optimal accelerated $O(1/T^2)$ rate (versus $O(1/T)$ for plain gradient descent) is achieved by Nesterov's method below, not by the heavy-ball update.
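A sketch of the heavy-ball update above on an ill-conditioned quadratic (the quadratic and hyperparameters are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One heavy-ball update with EMA-style dampening, matching the equations
    above: v <- mu*v + (1-mu)*grad, theta <- theta - lr*v."""
    v = mu * v + (1.0 - mu) * grad
    theta = theta - lr * v
    return theta, v

# Toy quadratic f(theta) = 0.5 * theta^T A theta (assumed for illustration);
# the large eigenvalue ratio creates the narrow valley momentum helps with.
A = np.diag([10.0, 1.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, v = momentum_step(theta, v, A @ theta)
```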
Nesterov momentum evaluates the gradient at a look-ahead point:
$$v_{t+1} = \mu v_t + \nabla f_{\mathcal{B}_t}(\theta_t - \eta \mu v_t)$$ $$\theta_{t+1} = \theta_t - \eta v_{t+1}$$
In the smooth convex setting, Nesterov's method attains the optimal $O(1/T^2)$ rate, a guarantee that Polyak's heavy-ball momentum does not have in general.
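The corresponding sketch, evaluating the gradient at the look-ahead point $\theta_t - \eta\mu v_t$ (same illustrative quadratic as above):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov update matching the equations above: the gradient is taken
    at the look-ahead point theta - lr*mu*v rather than at theta itself."""
    lookahead = theta - lr * mu * v
    v = mu * v + grad_fn(lookahead)
    theta = theta - lr * v
    return theta, v

# Illustrative quadratic f(theta) = 0.5 * theta^T A theta, gradient A @ theta.
A = np.diag([10.0, 1.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, v = nesterov_step(theta, v, lambda th: A @ th)
```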
Learning-rate schedules are crucial in practice (a code sketch of all three follows the list):
- Linear warm-up for the first ~1000 steps, then cosine decay over the remaining $T$ steps: $\eta_t = \eta_0 \cdot \frac{1 + \cos(\pi t / T)}{2}$.
- Step decay: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$ at fixed milestones.
- Inverse-square-root: $\eta_t = \eta_0 / \sqrt{t}$, used (after warm-up) in the original Transformer paper.
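A sketch of the three schedules as plain functions of the step index (warm-up length, milestone, and total-step defaults are illustrative):

```python
import math

def cosine_with_warmup(step, base_lr, warmup_steps=1000, total_steps=100_000):
    """Linear warm-up, then cosine decay to zero over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def step_decay(step, base_lr, gamma=0.1, milestone=30_000):
    """Multiply the learning rate by gamma every `milestone` steps."""
    return base_lr * gamma ** (step // milestone)

def inverse_sqrt(step, base_lr):
    """eta_t = eta_0 / sqrt(t), with t counted from 1."""
    return base_lr / math.sqrt(max(step, 1))
```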
Gradient clipping: if the mini-batch gradient norm exceeds a threshold $c$, rescale the gradient to $c \cdot \nabla f_{\mathcal{B}} / \|\nabla f_{\mathcal{B}}\|$ so its norm is at most $c$, preventing exploding-gradient instabilities. Standard for training RNNs and large language models.
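A minimal sketch of clipping by global norm, where the threshold $c$ is the max_norm argument:

```python
import numpy as np

def clip_by_global_norm(grad, max_norm):
    """If the gradient norm exceeds max_norm, rescale it to have norm max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```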
Gradient accumulation: accumulate (average) gradients over $K$ micro-batches before applying a single update, simulating an effective batch size of $K \cdot B$. Essential when the desired batch size does not fit in GPU memory.
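A sketch of accumulation over $K$ micro-batches, assuming a hypothetical grad_fn(theta, batch) that returns the mini-batch gradient:

```python
import numpy as np

def accumulated_step(theta, micro_batches, grad_fn, lr):
    """Average gradients over K micro-batches, then take one SGD step,
    giving an effective batch size of K * (micro-batch size)."""
    accum = np.zeros_like(theta)
    for batch in micro_batches:
        accum += grad_fn(theta, batch)
    return theta - lr * accum / len(micro_batches)
```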
Related terms: Gradient Descent, Adam, Learning Rate
Discussed in:
- Chapter 10: Training & Optimisation, Training Optimisation