Newton's method for optimisation (the root-finding form of the same idea is called Newton–Raphson) uses second-order information to take more accurate steps than gradient descent. To minimise $f(\theta)$:
$$\theta_{t+1} = \theta_t - [\nabla^2 f(\theta_t)]^{-1} \nabla f(\theta_t)$$
Derivation: at $\theta_t$, the second-order Taylor expansion of $f$ is
$$f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \frac{1}{2} (\theta - \theta_t)^\top \nabla^2 f(\theta_t) (\theta - \theta_t)$$
Setting the gradient of this quadratic to zero gives $\nabla^2 f(\theta_t)(\theta - \theta_t) = -\nabla f(\theta_t)$; solving for $\theta$ yields the Newton update above. Equivalently, the method is Newton–Raphson applied to the stationarity condition $\nabla f(\theta) = 0$.
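To make the update concrete, here is a minimal sketch in JAX (the test function and starting point are hypothetical choices for illustration). It solves $Hd = g$ rather than inverting $H$, and printing the gradient norm shows it roughly squaring each iteration:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision, to see the digit-doubling

# Hypothetical strictly convex test function: minimised at theta = 0,
# since grad f = exp(theta) - 1 and the Hessian diag(exp(theta)) is positive definite.
def f(theta):
    return jnp.sum(jnp.exp(theta) - theta)

grad_f = jax.grad(f)
hess_f = jax.hessian(f)

theta = jnp.array([2.0, -1.5, 0.7])
for t in range(6):
    g = grad_f(theta)
    H = hess_f(theta)
    d = jnp.linalg.solve(H, g)   # solve H d = g; never form H^{-1} explicitly
    theta = theta - d
    print(t, float(jnp.linalg.norm(grad_f(theta))))  # norm roughly squares each step
```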
Convergence: near a strict local minimum (with positive definite Hessian), Newton's method converges quadratically: the number of correct digits roughly doubles each iteration. Gradient descent, by contrast, converges linearly, with $O(\rho^t)$ error for some $\rho < 1$ that depends on the conditioning of the problem.
Pros:
- Quadratic local convergence, extremely fast near the optimum.
- Affine-invariant: the iterates are unchanged under affine reparameterisations of $\theta$, so no feature scaling or preconditioning is needed.
- Self-tuning: no learning rate to choose.
Cons:
- The Hessian has $O(n^2)$ entries to compute and costs $O(n^3)$ to factor or invert. Prohibitive for $n > 10^4$.
- Hessian storage is $O(n^2)$ memory. Prohibitive for $n > 10^5$.
- The Hessian must be positive definite for the Newton step to be a descent direction. Near saddle points (common in deep learning) the Hessian is indefinite, and the pure Newton step is attracted to the saddle rather than repelled, as the sketch below shows.
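A two-line check of the saddle issue on a hypothetical toy function: $f(x, y) = x^2 - y^2$ has its only stationary point at the origin, a saddle, and the pure Newton step jumps straight to it.

```python
import jax
import jax.numpy as jnp

# Toy saddle: indefinite Hessian diag(2, -2), no minimum anywhere.
def f(p):
    x, y = p
    return x**2 - y**2

g = jax.grad(f)
H = jax.hessian(f)

p = jnp.array([3.0, 4.0])
step = jnp.linalg.solve(H(p), g(p))
print(p - step)   # [0. 0.] -- the Newton step lands exactly on the saddle
```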
Quasi-Newton methods approximate the Hessian inverse iteratively without explicit second derivatives:
BFGS (Broyden–Fletcher–Goldfarb–Shanno): updates an approximate inverse Hessian with a rank-two correction at every step. $O(n^2)$ memory, $O(n^2)$ per step. Practical for $n \leq 10^4$.
L-BFGS (limited-memory BFGS): stores only the last $m$ pairs of parameter changes and gradient changes ($m \approx 10$) and reconstructs the action of the inverse Hessian from them. $O(mn)$ memory and per-step cost. Practical for $n \leq 10^7$.
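The heart of L-BFGS is the two-loop recursion, which applies the approximate inverse Hessian to a gradient using only the stored pairs. A minimal sketch (the function name and the absence of safeguards are simplifications of mine, not the textbook algorithm in full):

```python
import jax.numpy as jnp

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: apply the approximate inverse Hessian to `grad`
    using only the stored (s_k, y_k) = (parameter-change, gradient-change)
    pairs. Sketch: production code also checks the curvature y_k^T s_k > 0."""
    rhos = [1.0 / jnp.dot(y, s) for s, y in zip(s_list, y_list)]
    q = grad
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * jnp.dot(s, q)
        alphas.append(a)
        q = q - a * y
    # Initial scaling gamma = s^T y / y^T y from the most recent pair.
    s, y = s_list[-1], y_list[-1]
    r = (jnp.dot(s, y) / jnp.dot(y, y)) * q
    # Second loop: oldest pair to newest (alphas were collected newest-first).
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * jnp.dot(y, r)
        r = r + (a - b) * s
    return r  # approximates H^{-1} grad; the step is theta - step_size * r
```

Only the $m$ stored vector pairs and a handful of dot products are touched, which is where the $O(mn)$ memory and per-step cost come from.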
Conjugate gradient: another iterative approach, often used as a sub-solver inside Newton's method (for solving the linear system $H d = g$ for the Newton direction $d$ without forming or inverting $H$).
Hessian-vector products $Hv$ can be computed in $O(n)$ time, roughly the cost of a couple of gradient evaluations, via Pearlmutter's trick (forward-mode AD applied to reverse-mode AD). This enables Hessian-free Newton methods that scale to large neural networks.
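In JAX, Pearlmutter's trick is a one-liner: forward-mode `jvp` over the reverse-mode `grad`. Combined with the conjugate-gradient sub-solver mentioned above, the Newton direction never touches an explicit Hessian. A sketch (the fixed iteration count and lack of preconditioning are simplifications):

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Pearlmutter's trick: forward-mode jvp applied to the reverse-mode
    # gradient yields H v without ever materialising H.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def newton_direction_cg(f, x, g, iters=20):
    """Solve H d = g by conjugate gradient using only Hessian-vector
    products (sketch: no preconditioning or convergence tolerance)."""
    d = jnp.zeros_like(x)
    r = g            # residual g - H d, with d = 0 initially
    p = r
    for _ in range(iters):
        Hp = hvp(f, x, p)
        alpha = jnp.dot(r, r) / jnp.dot(p, Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        beta = jnp.dot(r_new, r_new) / jnp.dot(r, r)
        p = r_new + beta * p
        r = r_new
    return d         # the Newton update is then x - d
```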
In modern deep learning:
- Practical neural-network training rarely uses Newton's method because the Hessian is too large and indefinite at saddle points.
- K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher information matrix as a Kronecker product, enabling natural-gradient-style updates.
- Shampoo, SOAP: approximate second-order methods with structured Hessian approximations. Compete with Adam in some regimes.
- First-order methods (SGD, Adam, AdamW) dominate for neural network training due to their scalability. Second-order methods remain dominant for classical convex problems (logistic regression, SVMs at moderate scale).
Trust-region methods replace the raw Newton step with the minimiser of the quadratic model subject to $\|\theta - \theta_t\| \leq \Delta$, expanding $\Delta$ when the model predicts the actual decrease well and shrinking it when it does not. More robust than pure Newton, particularly far from the optimum.
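A minimal sketch of that accept/expand/shrink loop (the function name and thresholds are conventional choices, not from this text; a real implementation solves the constrained subproblem with dogleg or Steihaug-CG rather than clipping the Newton step):

```python
import jax
import jax.numpy as jnp

def trust_region_step(f, x, delta, eta=0.1):
    """One trust-region iteration. Crude sketch: the subproblem is 'solved'
    by clipping the Newton step to the radius, and an indefinite Hessian
    is not guarded against."""
    g = jax.grad(f)(x)
    H = jax.hessian(f)(x)
    s = -jnp.linalg.solve(H, g)                  # unconstrained Newton step
    if jnp.linalg.norm(s) > delta:               # enforce ||s|| <= delta
        s = s * (delta / jnp.linalg.norm(s))
    predicted = -(jnp.dot(g, s) + 0.5 * jnp.dot(s, H @ s))  # m(0) - m(s)
    actual = f(x) - f(x + s)
    rho = actual / predicted                     # how well the model predicted
    if rho < 0.25:
        delta = 0.25 * delta                     # poor model: shrink the region
    elif rho > 0.75 and jnp.isclose(jnp.linalg.norm(s), delta):
        delta = 2.0 * delta                      # good model at the boundary: expand
    if rho > eta:                                # accept only if f improved enough
        x = x + s
    return x, delta
```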
Related terms: Gradient Descent, Hessian, Convex Optimisation
Discussed in:
- Chapter 10: Training & Optimisation