Newton's method for optimisation (the root-finding form of the same idea is called Newton–Raphson) uses second-order information to take more accurate steps than gradient descent. To minimise $f(\theta)$:
$$\theta_{t+1} = \theta_t - [\nabla^2 f(\theta_t)]^{-1} \nabla f(\theta_t)$$
Derivation: at $\theta_t$, the second-order Taylor expansion of $f$ is
$$f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \frac{1}{2} (\theta - \theta_t)^\top \nabla^2 f(\theta_t) (\theta - \theta_t)$$
Setting the gradient of this quadratic to zero gives $\nabla^2 f(\theta_t)(\theta - \theta_t) = -\nabla f(\theta_t)$; solving for $\theta$ yields the Newton update above. Equivalently, the method is Newton–Raphson applied to the stationarity condition $\nabla f(\theta) = 0$.
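To make the update concrete, here is a minimal sketch in JAX (the test function and starting point are hypothetical choices for illustration). It solves $Hd = g$ rather than inverting $H$, and printing the gradient norm shows it roughly squaring each iteration:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision, to see the digit-doubling

# Hypothetical strictly convex test function: minimised at theta = 0,
# since grad f = exp(theta) - 1 and the Hessian diag(exp(theta)) is positive definite.
def f(theta):
    return jnp.sum(jnp.exp(theta) - theta)

grad_f = jax.grad(f)
hess_f = jax.hessian(f)

theta = jnp.array([2.0, -1.5, 0.7])
for t in range(6):
    g = grad_f(theta)
    H = hess_f(theta)
    d = jnp.linalg.solve(H, g)   # solve H d = g; never form H^{-1} explicitly
    theta = theta - d
    print(t, float(jnp.linalg.norm(grad_f(theta))))  # norm roughly squares each step
```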
Convergence: near a strict local minimum (with positive definite Hessian), Newton's method converges quadratically: the number of correct digits roughly doubles each iteration. Gradient descent, by contrast, converges linearly, with $O(\rho^t)$ error for some $\rho < 1$ that depends on the conditioning of the problem.
Pros:
- Quadratic local convergence, extremely fast near the optimum.
- Affine-invariant: the iterates are unchanged under affine reparameterisations of $\theta$, so no feature scaling or preconditioning is needed.
- Self-tuning: no learning rate to choose.
Cons:
- The Hessian has $O(n^2)$ entries to compute and costs $O(n^3)$ to factor or invert. Prohibitive for $n > 10^4$.
- Hessian storage is $O(n^2)$ memory. Prohibitive for $n > 10^5$.
- The Hessian must be positive definite for the Newton step to be a descent direction. Near saddle points (common in deep learning) the Hessian is indefinite, and the pure Newton step is attracted to the saddle rather than repelled, as the sketch below shows.
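A two-line check of the saddle issue on a hypothetical toy function: $f(x, y) = x^2 - y^2$ has its only stationary point at the origin, a saddle, and the pure Newton step jumps straight to it.

```python
import jax
import jax.numpy as jnp

# Toy saddle: indefinite Hessian diag(2, -2), no minimum anywhere.
def f(p):
    x, y = p
    return x**2 - y**2

g = jax.grad(f)
H = jax.hessian(f)

p = jnp.array([3.0, 4.0])
step = jnp.linalg.solve(H(p), g(p))
print(p - step)   # [0. 0.] -- the Newton step lands exactly on the saddle
```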
Quasi-Newton methods approximate the Hessian inverse iteratively without explicit second derivatives:
BFGS (Broyden–Fletcher–Goldfarb–Shanno): updates an approximate inverse Hessian with a rank-two correction at every step. $O(n^2)$ memory, $O(n^2)$ per step. Practical for $n \leq 10^4$.
L-BFGS (limited-memory BFGS): stores only the last $m$ pairs of parameter changes and gradient changes ($m \approx 10$) and reconstructs the action of the inverse Hessian from them. $O(mn)$ memory and per-step cost. Practical for $n \leq 10^7$.
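The heart of L-BFGS is the two-loop recursion, which applies the approximate inverse Hessian to a gradient using only the stored pairs. A minimal sketch (the function name and the absence of safeguards are simplifications of mine, not the textbook algorithm in full):

```python
import jax.numpy as jnp

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: apply the approximate inverse Hessian to `grad`
    using only the stored (s_k, y_k) = (parameter-change, gradient-change)
    pairs. Sketch: production code also checks the curvature y_k^T s_k > 0."""
    rhos = [1.0 / jnp.dot(y, s) for s, y in zip(s_list, y_list)]
    q = grad
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * jnp.dot(s, q)
        alphas.append(a)
        q = q - a * y
    # Initial scaling gamma = s^T y / y^T y from the most recent pair.
    s, y = s_list[-1], y_list[-1]
    r = (jnp.dot(s, y) / jnp.dot(y, y)) * q
    # Second loop: oldest pair to newest (alphas were collected newest-first).
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * jnp.dot(y, r)
        r = r + (a - b) * s
    return r  # approximates H^{-1} grad; the step is theta - step_size * r
```

Only the $m$ stored vector pairs and a handful of dot products are touched, which is where the $O(mn)$ memory and per-step cost come from.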
Conjugate gradient: another iterative approach, often used as a sub-solver inside Newton's method (for solving the linear system $H d = g$ for the Newton direction $d$ without forming or inverting $H$).
Hessian-vector products $Hv$ can be computed in $O(n)$ time, roughly the cost of a couple of gradient evaluations, via Pearlmutter's trick (forward-mode AD applied to reverse-mode AD). This enables Hessian-free Newton methods that scale to large neural networks.
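In JAX, Pearlmutter's trick is a one-liner: forward-mode `jvp` over the reverse-mode `grad`. Combined with the conjugate-gradient sub-solver mentioned above, the Newton direction never touches an explicit Hessian. A sketch (the fixed iteration count and lack of preconditioning are simplifications):

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Pearlmutter's trick: forward-mode jvp applied to the reverse-mode
    # gradient yields H v without ever materialising H.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def newton_direction_cg(f, x, g, iters=20):
    """Solve H d = g by conjugate gradient using only Hessian-vector
    products (sketch: no preconditioning or convergence tolerance)."""
    d = jnp.zeros_like(x)
    r = g            # residual g - H d, with d = 0 initially
    p = r
    for _ in range(iters):
        Hp = hvp(f, x, p)
        alpha = jnp.dot(r, r) / jnp.dot(p, Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        beta = jnp.dot(r_new, r_new) / jnp.dot(r, r)
        p = r_new + beta * p
        r = r_new
    return d         # the Newton update is then x - d
```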
In modern deep learning:
- Practical neural-network training rarely uses Newton's method because the Hessian is too large and indefinite at saddle points.
- K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher information matrix as a Kronecker product, enabling natural-gradient-style updates.
- Shampoo, SOAP: approximate second-order methods with structured Hessian approximations. Compete with Adam in some regimes.
- First-order methods (SGD, Adam, AdamW) dominate for neural network training due to their scalability. Second-order methods remain dominant for classical convex problems (logistic regression, SVMs at moderate scale).
Trust-region methods replace the raw Newton step with the minimiser of the quadratic model subject to $\|\theta - \theta_t\| \leq \Delta$, expanding $\Delta$ when the model predicts the actual decrease well and shrinking it when it does not. More robust than pure Newton, particularly far from the optimum.
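A minimal sketch of that accept/expand/shrink loop (the function name and thresholds are conventional choices, not from this text; a real implementation solves the constrained subproblem with dogleg or Steihaug-CG rather than clipping the Newton step):

```python
import jax
import jax.numpy as jnp

def trust_region_step(f, x, delta, eta=0.1):
    """One trust-region iteration. Crude sketch: the subproblem is 'solved'
    by clipping the Newton step to the radius, and an indefinite Hessian
    is not guarded against."""
    g = jax.grad(f)(x)
    H = jax.hessian(f)(x)
    s = -jnp.linalg.solve(H, g)                  # unconstrained Newton step
    if jnp.linalg.norm(s) > delta:               # enforce ||s|| <= delta
        s = s * (delta / jnp.linalg.norm(s))
    predicted = -(jnp.dot(g, s) + 0.5 * jnp.dot(s, H @ s))  # m(0) - m(s)
    actual = f(x) - f(x + s)
    rho = actual / predicted                     # how well the model predicted
    if rho < 0.25:
        delta = 0.25 * delta                     # poor model: shrink the region
    elif rho > 0.75 and jnp.isclose(jnp.linalg.norm(s), delta):
        delta = 2.0 * delta                      # good model at the boundary: expand
    if rho > eta:                                # accept only if f improved enough
        x = x + s
    return x, delta
```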
Related terms: Gradient Descent, Hessian, Convex Optimisation
Discussed in:
- Chapter 10: Training & Optimisation