3.10 Higher-order derivatives, the Hessian, and Newton's method

Up to this point our optimisation toolkit has rested on a single object: the gradient, the vector of first derivatives that points uphill. Gradient descent, which we met in §3.9, takes a small step in the opposite direction and repeats. It is simple, cheap, and good enough to train almost every neural network in production today. But it is also, in a precise sense, the laziest sensible thing one can do. It uses only the slope of the loss surface, ignoring how that slope is itself changing. A good walker, descending a hill in fog, does not just feel which way is downhill; she also feels whether the ground is curving sharply or flattening out, and adjusts her stride accordingly. Methods that do the same, that look at the second derivative as well as the first, are called second-order methods, and on well-behaved problems they converge dramatically faster than gradient descent.

§3.4 introduced the gradient $\nabla f$, the vector of first partial derivatives. §3.9 used the gradient to drive iterative descent. §3.10 introduces the Hessian, the matrix of second partial derivatives, and the canonical second-order step, Newton's method. The main reason second-order methods do not currently dominate deep learning is one of scale: a Hessian for a billion-parameter network has $10^{18}$ entries, and we cannot store it, never mind invert it. But cheap approximations to the Hessian are everywhere in modern practice: Adam's per-parameter learning rates, K-FAC, Shampoo, and natural gradient are all built on the ideas of this section, and Chapter 10 returns to them when we discuss optimisation for large-scale training. Treat this section as the calculus foundation those approximations rest on.

Symbols Used Here
  • $f(\mathbf{x})$: scalar function of a vector
  • $\nabla f$: gradient (vector of first partial derivatives)
  • $\mathbf{H}$: Hessian matrix, $H_{ij} = \partial^2 f / \partial x_i \partial x_j$
  • $\boldsymbol{\theta}$: parameters being optimised
  • $\mathbf{F}$: Fisher information matrix

The Hessian

The Hessian of a scalar function $f : \mathbb{R}^n \to \mathbb{R}$ is the $n \times n$ matrix whose entries are the second partial derivatives of $f$. We write

$$ \mathbf{H}_{ij} \;=\; \frac{\partial^2 f}{\partial x_i \, \partial x_j}. $$

If the gradient is the vector that tells us in which direction $f$ increases fastest, the Hessian is the matrix that tells us how that direction is itself bending. Each diagonal entry $H_{ii}$ is the curvature of $f$ along axis $i$: large positive means a steep upward bowl in that direction, large negative means a steep downward bowl, near zero means a flat valley. Each off-diagonal entry $H_{ij}$ with $i \neq j$ measures how the slope along axis $i$ changes as we move along axis $j$, i.e. how the two directions couple.

A small but important fact is that the Hessian is symmetric, $H_{ij} = H_{ji}$, whenever the second partial derivatives are continuous. This is Schwarz's theorem: the order of partial differentiation does not matter under mild smoothness assumptions, which hold for the loss functions we meet in machine learning. Symmetry has practical consequences. A real symmetric matrix has real eigenvalues and an orthogonal eigenbasis, so the local geometry of $f$ decomposes neatly into independent curvatures along principal axes. We will rely on this structure repeatedly.

Let us work an example. Consider

$$ f(x_1, x_2) \;=\; x_1^2 + 2 x_1 x_2 + 3 x_2^2. $$

The gradient is obtained by differentiating with respect to each variable in turn:

$$ \nabla f \;=\; \begin{pmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{pmatrix} \;=\; \begin{pmatrix} 2 x_1 + 2 x_2 \\ 2 x_1 + 6 x_2 \end{pmatrix}. $$

To get the Hessian, we differentiate each entry of the gradient again, with respect to each variable in turn. The top-left entry is $\partial / \partial x_1 (2 x_1 + 2 x_2) = 2$. The top-right is $\partial / \partial x_2 (2 x_1 + 2 x_2) = 2$. The bottom-left is $\partial / \partial x_1 (2 x_1 + 6 x_2) = 2$. The bottom-right is $\partial / \partial x_2 (2 x_1 + 6 x_2) = 6$. Assembling,

$$ \mathbf{H} \;=\; \begin{pmatrix} 2 & 2 \\ 2 & 6 \end{pmatrix}. $$

Notice that the Hessian is symmetric, as predicted, and that it is constant: it does not depend on $x_1$ or $x_2$. That is because $f$ is a quadratic function. For non-quadratic $f$, such as the loss surface of a neural network, the Hessian varies from point to point, and we have to evaluate it afresh wherever we want to use it.
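
Hand-derived Hessians are easy to get wrong, and a central-difference check is a cheap sanity test. The sketch below, which assumes NumPy and uses an illustrative helper name, compares a finite-difference estimate against the analytic result for the example above.

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + 2*x1*x2 + 3*x2^2
    return x[0]**2 + 2*x[0]*x[1] + 3*x[1]**2

def numerical_hessian(f, x, eps=1e-5):
    """Central-difference estimate of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + eps*e_i + eps*e_j) - f(x + eps*e_i - eps*e_j)
                       - f(x - eps*e_i + eps*e_j) + f(x - eps*e_i - eps*e_j)) / (4 * eps**2)
    return H

x0 = np.array([1.0, -2.0])        # any point works: the Hessian of a quadratic is constant
print(numerical_hessian(f, x0))   # ~ [[2, 2], [2, 6]], matching the analytic result
```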

Eigenvalues of the Hessian = curvature

The Hessian, being a real symmetric matrix, can be diagonalised by an orthogonal change of basis. Its eigenvalues describe the principal curvatures of $f$ at the point under consideration, and their signs classify the geometry:

  • All eigenvalues positive: $f$ curves upwards in every direction, and we are at a (strict) local minimum. The Hessian is then called positive definite, written $\mathbf{H} \succ 0$.
  • All eigenvalues negative: $f$ curves downwards in every direction; we are at a local maximum, and $\mathbf{H} \prec 0$.
  • Mixed positive and negative eigenvalues: $f$ curves up in some directions and down in others; we are at a saddle point.
  • Some eigenvalues exactly zero: the Hessian is degenerate along those directions, and second-order information is not enough to classify the critical point. Higher-order terms in the Taylor expansion are needed.

This classification is the multivariable analogue of the second-derivative test from school calculus, where $f''(x) > 0$ at a critical point means a minimum and $f''(x) < 0$ means a maximum. In several variables, "positive" and "negative" become "positive definite" and "negative definite", with the saddle case appearing as a genuinely new possibility.

For the example above with $\mathbf{H} = \begin{pmatrix} 2 & 2 \\ 2 & 6 \end{pmatrix}$, the eigenvalues solve

$$ \det(\mathbf{H} - \lambda \mathbf{I}) \;=\; (2 - \lambda)(6 - \lambda) - 4 \;=\; \lambda^2 - 8\lambda + 8 \;=\; 0, $$

giving $\lambda = 4 \pm 2\sqrt{2}$, i.e. approximately $6.83$ and $1.17$. Both are positive, so this Hessian is positive definite; the function has a local minimum at the only critical point, the origin. In this case it is also the global minimum, because the function is a quadratic with positive-definite Hessian.

The ratio of the largest to smallest eigenvalue, $6.83 / 1.17 \approx 5.83$, is called the condition number of the Hessian. It is a single scalar that tells us how stretched the loss landscape is in different directions. A condition number close to one means the contours are nearly circular and gradient descent will move smoothly towards the minimum. A large condition number means the contours are long thin ellipses; gradient descent zig-zags inefficiently across the narrow valley while making slow progress along its length. The condition number controls the speed of first-order optimisation, and reducing it, by preconditioning, is one of the main jobs of a modern optimiser.
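
As a quick numerical check of these figures, assuming NumPy, the eigenvalues and condition number of this Hessian can be computed directly; `eigvalsh` exploits the symmetry.

```python
import numpy as np

H = np.array([[2.0, 2.0],
              [2.0, 6.0]])

eigvals = np.linalg.eigvalsh(H)   # ascending order for symmetric matrices
print(eigvals)                    # ~ [1.17, 6.83], i.e. 4 -/+ 2*sqrt(2)
print(np.all(eigvals > 0))        # True: positive definite, so a local minimum
print(eigvals[-1] / eigvals[0])   # condition number, ~ 5.83
```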

Newton's method

Where gradient descent uses only $\nabla f$ to choose its step, Newton's method uses both $\nabla f$ and $\mathbf{H}$. The update rule is

$$ \boldsymbol{\theta}_{t+1} \;=\; \boldsymbol{\theta}_t \;-\; \mathbf{H}^{-1} \nabla f. $$

Geometrically, what is going on is this. Around the current point, we approximate $f$ by its second-order Taylor expansion, which is a multivariate quadratic. A quadratic has a unique minimum (provided its Hessian is positive definite), and that minimum can be solved for in closed form. Newton's method jumps directly to it. The next iteration repeats the procedure with a fresh quadratic centred on the new point. Where gradient descent feels its way downhill in cautious small steps, Newton's method fits a parabolic bowl to the local landscape and leaps to its bottom.
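
To see where the update comes from, write the quadratic model down and minimise it. With $\mathbf{g} = \nabla f(\boldsymbol{\theta}_t)$ and $\mathbf{H}$ both evaluated at $\boldsymbol{\theta}_t$, the second-order Taylor expansion is

$$ m(\boldsymbol{\theta}) \;=\; f(\boldsymbol{\theta}_t) \;+\; \mathbf{g}^\top (\boldsymbol{\theta} - \boldsymbol{\theta}_t) \;+\; \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_t)^\top \mathbf{H} \, (\boldsymbol{\theta} - \boldsymbol{\theta}_t), $$

and setting its gradient to zero, $\nabla m(\boldsymbol{\theta}) = \mathbf{g} + \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}_t) = 0$, gives $\boldsymbol{\theta} = \boldsymbol{\theta}_t - \mathbf{H}^{-1} \mathbf{g}$, which is exactly the Newton update.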

The theoretical payoff is enormous. Near a strict local minimum where the Hessian is positive definite, Newton's method converges quadratically: the distance to the optimum is squared at each step, so the number of correct digits roughly doubles per iteration. Plain gradient descent only manages linear convergence, the error shrinks by a constant factor per step. A handful of Newton iterations can match what would take thousands of gradient steps.

The catch is cost. Forming the Hessian for an $n$-dimensional problem requires $O(n^2)$ second derivatives. Inverting it (or, equivalently, solving the linear system $\mathbf{H} \mathbf{d} = -\nabla f$ for the search direction $\mathbf{d}$) costs $O(n^3)$ in general. For a logistic regression with $n = 100$ coefficients, that is trivial. For a neural network with $n = 10^9$ parameters, both numbers are astronomical: the Hessian alone would occupy $10^{18}$ entries. This is why pure Newton is the workhorse of classical statistics: it is what fits generalised linear models, logistic regression, and many maximum-likelihood problems in scipy and statsmodels, but it is unusable at the scale of modern deep learning.

A worked one-dimensional example will fix the idea. Take $f(\theta) = (\theta - 3)^2$, a simple quadratic with minimum at $\theta = 3$. Its first derivative is $f'(\theta) = 2(\theta - 3)$ and its second is $f''(\theta) = 2$, constant. Starting from $\theta_0 = 0$, the gradient is $f'(0) = -6$ and the Hessian (in one dimension, just a scalar) is $2$. The Newton update is

$$ \theta_1 \;=\; \theta_0 - \frac{f'(\theta_0)}{f''(\theta_0)} \;=\; 0 - \frac{-6}{2} \;=\; 3. $$

We arrive at the exact minimum in one step. This is not a coincidence. For any quadratic function in any number of variables, Newton's method converges in a single iteration, because the second-order Taylor expansion of a quadratic is the quadratic itself. For non-quadratic $f$, the first iteration is approximate, but the approximation rapidly improves as we get close to the optimum.
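
The same one-step behaviour appears in the two-variable example from earlier. The sketch below, assuming NumPy, takes a single Newton step on $f(x_1, x_2) = x_1^2 + 2 x_1 x_2 + 3 x_2^2$, solving $\mathbf{H} \mathbf{d} = -\nabla f$ rather than forming $\mathbf{H}^{-1}$ explicitly, which is the numerically preferred route.

```python
import numpy as np

def grad_f(x):
    # gradient of f(x1, x2) = x1^2 + 2*x1*x2 + 3*x2^2
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 6*x[1]])

H = np.array([[2.0, 2.0],
              [2.0, 6.0]])              # constant Hessian of the quadratic

theta = np.array([5.0, -7.0])           # arbitrary starting point
d = np.linalg.solve(H, -grad_f(theta))  # Newton direction: solve H d = -grad
theta = theta + d
print(theta)                            # [0, 0]: the exact minimum, reached in one step
```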

Quasi-Newton: BFGS, L-BFGS

Between the cheapness of gradient descent and the expense of true Newton sit the quasi-Newton methods. The idea is to build up an approximation to the Hessian, or, more usefully, to its inverse, using only gradients evaluated at successive iterates. Each step, we observe how the gradient changed in response to the parameter change, and update our estimate of the curvature accordingly. No second derivatives are ever computed.

The most famous quasi-Newton method is BFGS (named for Broyden, Fletcher, Goldfarb, and Shanno, who independently proposed equivalent updates in 1970). BFGS maintains an explicit $n \times n$ approximation to $\mathbf{H}^{-1}$, updated by a rank-two correction at each step. It enjoys most of Newton's fast convergence in practice, without ever needing to compute or invert a Hessian. The price is the $O(n^2)$ memory for the approximation, fine for $n = 1000$, painful for $n = 10^7$, impossible for $n = 10^9$.

L-BFGS ("limited-memory BFGS") fixes that. Instead of storing the full $n \times n$ matrix, it keeps only the last few, typically five to twenty, pairs of parameter and gradient differences, and reconstructs the action of the inverse Hessian on a vector from those. Memory drops to $O(n)$ rather than $O(n^2)$. L-BFGS is the default optimiser for many classical convex problems (scipy.optimize.minimize uses it as standard), and it occasionally appears for small-to-medium deep learning problems where batch optimisation is appropriate. For mainstream stochastic deep learning it has largely lost out to Adam and its descendants, partly because L-BFGS does not handle gradient noise gracefully.

Why second-order methods aren't standard for deep learning

Modern neural networks have between $10^9$ and $10^{12}$ parameters. The Hessian of such a network is a square matrix of that side length, with up to $10^{24}$ entries. We cannot write it down, store it, or invert it on any conceivable hardware. Even L-BFGS, with its $O(n)$ memory, requires keeping dozens of gradient-sized history vectors, the equivalent of dozens of complete copies of the model. The combination of stochastic mini-batch gradients (which inject noise into any second-order estimate), enormous parameter count, and the need to update many times per second has kept first-order methods firmly dominant.

What modern optimisers do instead is approximate second-order information cheaply:

  • Adam maintains a running estimate of the squared gradient for each parameter (a crude stand-in for the diagonal of the Hessian) and uses it to set a per-parameter learning rate. This amounts to assuming the Hessian is diagonal and ignoring all cross-coupling: crude, but cheap and surprisingly effective. A minimal sketch of this idea appears below.
  • K-FAC (Kronecker-factored Approximate Curvature) approximates the Hessian as block-diagonal across layers, with each block written as a Kronecker product of two much smaller matrices. The Kronecker factorisation reduces both storage and inversion costs by orders of magnitude per layer.
  • Shampoo maintains layer-wise second-moment statistics with full preconditioning along each tensor mode, sitting between diagonal and full second-order.
  • Natural gradient uses the Fisher information matrix $\mathbf{F}$, which for probabilistic models coincides with the expected Hessian of the negative log-likelihood. The natural-gradient update $\mathbf{F}^{-1} \nabla f$ corrects for the geometry of the parameter space, and is the parent of K-FAC, TRPO, and several modern reinforcement-learning algorithms.

These methods all sacrifice fidelity to the true Hessian for tractability, and in return recover a portion of Newton's fast convergence. Chapter 10 returns to them in detail.
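
To make the diagonal-approximation idea concrete, here is a minimal sketch of an Adam-style update in NumPy. It is illustrative only: production implementations add per-layer handling, weight decay, and other details, and the hyperparameter values below are simply the common defaults.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update. The running average of squared gradients, v,
    acts as a crude diagonal curvature estimate that rescales each coordinate."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: diagonal 'curvature' proxy
    m_hat = m / (1 - beta1**t)                # bias correction (t counts from 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on the quadratic from earlier: many small preconditioned steps,
# where Newton needed exactly one.
theta, m, v = np.array([5.0, -7.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = np.array([2*theta[0] + 2*theta[1], 2*theta[0] + 6*theta[1]])
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)   # much closer to the minimum at [0, 0] than the starting point
```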

Saddle points and the loss landscape

There is one more reason second-order methods do not transfer cleanly from textbook optimisation to deep learning: in high dimensions the geometry is genuinely strange. In one or two dimensions, minima and maxima make up a healthy share of the critical points of a generic function. In several thousand dimensions, the situation reverses. For a critical point to be a local minimum, all $n$ eigenvalues of the Hessian must be positive. If the eigenvalue signs are roughly random, the probability of all $n$ being positive falls off exponentially with $n$. The consequence: in a deep network, almost every critical point of the loss is a saddle, and true local minima are exponentially rare.
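
The "roughly random signs" heuristic is easy to probe numerically. The sketch below, assuming NumPy, draws random symmetric Gaussian matrices as a stand-in for Hessians at generic critical points and counts how often every eigenvalue comes out positive; the fraction collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_positive_definite(n, trials=2000):
    """Fraction of random symmetric matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                        # symmetrise
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

for n in (1, 2, 3, 5, 8):
    print(n, fraction_positive_definite(n))
# the fraction falls off rapidly with n; by n ~ 10 it is essentially zero
```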

This matters in two ways. First, plain gradient descent slows down dramatically near saddles, because the gradient becomes small. The stochastic noise of mini-batch SGD usually pushes the optimiser past saddles within reasonable wall-clock time, which is part of why SGD works so well empirically. Second, naive Newton's method actively moves towards saddles: the update $-\mathbf{H}^{-1} \nabla f$ is attracted to any critical point, regardless of whether the Hessian's eigenvalues are positive. Pure Newton on a non-convex loss can therefore stall at a saddle that gradient descent would escape. Practical second-order methods address this with safeguards such as trust regions, eigenvalue clipping, or Levenberg–Marquardt damping, but the fact that saddles are the typical critical point of a deep-learning loss is one of the deeper reasons the field defaults to first-order methods.
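
One such safeguard is simple enough to sketch: add a damping term to the Hessian before solving, in the spirit of Levenberg–Marquardt, so that the modified matrix is positive definite and the step shrinks towards a (scaled) gradient step as the damping grows. The snippet below is a minimal illustration, assuming NumPy; the function name and parameters are placeholders, not a production algorithm.

```python
import numpy as np

def damped_newton_step(grad, H, lam=1e-2):
    """Newton step with Levenberg-Marquardt-style damping.

    Solves (H + shift * I) d = -grad. A large shift makes the step resemble
    a scaled gradient step; a small shift recovers the pure Newton step."""
    n = len(grad)
    eigmin = np.linalg.eigvalsh(H)[0]
    # if H has negative curvature (e.g. near a saddle), shift it up until it is positive definite
    shift = max(lam, lam - eigmin)
    return np.linalg.solve(H + shift * np.eye(n), -grad)
```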

What you should take away

  1. The Hessian $\mathbf{H}$ is the matrix of second partial derivatives of a scalar function and is symmetric whenever the second derivatives are continuous (Schwarz's theorem).
  2. The eigenvalues of the Hessian classify a critical point: all positive means a minimum, all negative a maximum, mixed signs a saddle, and zeros a degenerate case requiring higher-order tests.
  3. Newton's method uses $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}^{-1} \nabla f$ to leap to the minimum of a local quadratic model. Its quadratic convergence is dramatically faster than gradient descent's linear convergence, but the $O(n^2)$ storage and $O(n^3)$ inversion costs make it impractical for $n$ beyond a few thousand.
  4. Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradients across iterations and dominate classical convex optimisation, but do not handle the noise and scale of stochastic deep learning well.
  5. Modern deep-learning optimisers such as Adam, K-FAC, Shampoo, and natural gradient are best understood as cheap approximations to second-order information, recovering some of Newton's speed at a fraction of its cost. The full story of how they work, and why high-dimensional loss landscapes are dominated by saddles rather than minima, is taken up in Chapter 10.
