10.4 Convergence analysis of SGD
We sketch the two canonical results: convex and non-convex.
Convex case: $O(1/\sqrt{T})$
Assume $L$ is convex (not necessarily smooth) and the stochastic (sub)gradient estimate $\hat{g}$ is unbiased with bounded second moment, $\mathbb{E}\|\hat{g}\|^2 \le G^2$. Let $\bar\theta_T = \tfrac{1}{T} \sum_{t=0}^{T-1} \theta_t$ be the iterate average. Then with constant step $\eta = D/(G\sqrt{T})$, where $D = \|\theta_0 - \theta^\star\|$,
$$\mathbb{E}[L(\bar\theta_T)] - L(\theta^\star) \le \frac{D G}{\sqrt{T}}.$$
The proof uses the standard "potential function" argument. Let $\Phi_t = \|\theta_t - \theta^\star\|^2$. Then
$$\Phi_{t+1} = \|\theta_t - \eta \hat{g}_t - \theta^\star\|^2 = \Phi_t - 2\eta (\theta_t - \theta^\star)^\top \hat{g}_t + \eta^2 \|\hat{g}_t\|^2.$$
Taking expectations, conditioning on $\theta_t$ (so that $\mathbb{E}[\hat{g}_t \mid \theta_t]$ is a subgradient of $L$ at $\theta_t$), and using convexity gives $\mathbb{E}[(\theta_t - \theta^\star)^\top \hat{g}_t] \ge \mathbb{E}[L(\theta_t)] - L(\theta^\star)$.
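Rearranging the one-step recursion, summing over $t = 0, \dots, T-1$, and dropping $\Phi_T \ge 0$ then gives
$$\sum_{t=0}^{T-1}\big(\mathbb{E}[L(\theta_t)] - L(\theta^\star)\big) \le \frac{\Phi_0}{2\eta} + \frac{\eta}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{g}_t\|^2 \le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} = DG\sqrt{T},$$
using $\eta = D/(G\sqrt{T})$ in the last step. Dividing by $T$ and applying Jensen's inequality, $\mathbb{E}[L(\bar\theta_T)] \le \tfrac{1}{T}\sum_t \mathbb{E}[L(\theta_t)]$, yields the stated bound.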
The rate is $O(1/\sqrt{T})$, slower than the $O(1/T)$ that full-batch gradient descent attains on smooth convex objectives. This gap is the fundamental cost of stochasticity, and it is unavoidable: any method that uses only stochastic first-order information faces an $\Omega(1/\sqrt{T})$ lower bound (Nemirovski and Yudin 1983).
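As an empirical sanity check, here is a minimal numerical sketch of the $O(1/\sqrt{T})$ rate. The toy objective, starting point, and constants are illustrative assumptions, not part of the analysis above: we minimize the non-smooth convex $L(\theta) = \tfrac{1}{n}\sum_i |\theta - z_i|$ with single-sample subgradients, for which $G = 1$.

```python
import numpy as np

# Toy problem (illustrative assumption): L(theta) = mean_i |theta - z_i|,
# minimized at the median of z. The single-sample subgradient
# sign(theta - z_i) is unbiased and satisfies E||g_hat||^2 <= G^2 = 1.
rng = np.random.default_rng(0)
z = rng.standard_normal(1001)
theta_star = np.median(z)                  # exact minimizer of L
L = lambda th: np.mean(np.abs(th - z))

theta0 = 4.0
D, G = abs(theta0 - theta_star), 1.0       # D known here only because the toy is solvable

for T in (100, 1_000, 10_000, 100_000):
    eta = D / (G * np.sqrt(T))             # constant step from the theorem
    theta, theta_bar = theta0, 0.0
    for _ in range(T):
        theta_bar += theta / T             # average of theta_0 .. theta_{T-1}
        zi = z[rng.integers(z.size)]       # one sample -> unbiased subgradient
        theta -= eta * np.sign(theta - zi)
    gap = L(theta_bar) - L(theta_star)
    print(f"T={T:6d}  gap={gap:.4f}  DG/sqrt(T)={D * G / np.sqrt(T):.4f}")
```

The printed suboptimality gap should track the theoretical bound $DG/\sqrt{T}$ up to constants.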
Non-convex case: $\|\nabla L\| \to 0$
For $\beta$-smooth, possibly non-convex $L$, suppose the stochastic gradients are unbiased with variance at most $\sigma^2/B$, where $B$ is the minibatch size. Then SGD with step $\eta = c/\sqrt{T}$ achieves
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla L(\theta_t)\|^2 = O\!\left(\frac{1}{\sqrt{T}}\right).$$
So the average squared gradient norm decays at rate $1/\sqrt{T}$. The proof uses the descent lemma applied in expectation:
$$\mathbb{E}[L(\theta_{t+1})] \le \mathbb{E}[L(\theta_t)] - \eta\, \mathbb{E}\|\nabla L(\theta_t)\|^2 + \frac{\beta\eta^2}{2}\left(\mathbb{E}\|\nabla L(\theta_t)\|^2 + \frac{\sigma^2}{B}\right),$$
then telescoping: summing over $t$, rearranging, and assuming $\eta \le 1/\beta$ (so that $1 - \beta\eta/2 \ge 1/2$) gives
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla L(\theta_t)\|^2 \le \frac{2\,(L(\theta_0) - L^\star)}{\eta T} + \frac{\beta\eta\sigma^2}{B},$$
where $L^\star = \inf_\theta L(\theta)$. With $\eta = c/\sqrt{T}$, both terms are $O(1/\sqrt{T})$. With a constant step size, however, the variance term becomes a noise floor proportional to $\eta\sigma^2/B$ below which the average squared gradient norm does not fall. This noise floor is why we typically anneal the learning rate or increase the batch size as training progresses.
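To see the noise floor numerically, the sketch below runs constant-step minibatch SGD on a toy smooth objective (a quadratic, so in fact convex, but the variance floor is the same phenomenon the bound describes); the objective, step sizes, and batch sizes are illustrative assumptions, not from the text.

```python
import numpy as np

# Toy setup (illustrative assumption): L(theta) = mean_i 0.5*(theta - z_i)^2
# is beta-smooth with beta = 1; single-example gradients have variance
# sigma^2 = Var(z), so a minibatch of size B has gradient variance sigma^2/B.
rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)
mu = z.mean()                                    # minimizer of L

def avg_sq_grad(eta, B, T=20_000, theta0=3.0):
    theta, tail = theta0, []
    for t in range(T):
        batch = rng.choice(z, size=B)
        theta -= eta * (theta - batch.mean())    # minibatch SGD step
        if t >= T // 2:                          # measure after burn-in
            tail.append((theta - mu) ** 2)       # ||grad L(theta)||^2
    return np.mean(tail)

# The plateau scales like eta * sigma^2 / B: halving eta (or multiplying
# B by 10) lowers the measured floor proportionally.
for eta in (0.1, 0.05):
    for B in (1, 10):
        print(f"eta={eta:<5} B={B:<3} avg ||grad||^2 ~ {avg_sq_grad(eta, B):.2e}")
```

Shrinking $\eta$ or growing $B$ lowers the plateau, which is exactly the annealing prescription above.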