Glossary

Hessian

The Hessian of a twice-differentiable scalar function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x$ is the $n \times n$ matrix of second partial derivatives:

$$\bigl( \nabla^2 f(x) \bigr)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$

It is symmetric whenever the second partials are continuous (Schwarz's / Clairaut's theorem). Named after the 19th-century German mathematician Otto Hesse, the Hessian captures the local curvature of $f$: the second-order Taylor expansion is

$$f(x + \delta) \approx f(x) + \delta^\top \nabla f(x) + \tfrac{1}{2} \, \delta^\top \nabla^2 f(x) \, \delta,$$

so the Hessian is the analogue of the second derivative for multivariable functions.
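
As a concrete illustration (not part of the definition above), the sketch below uses JAX to build the Hessian of a small example function, check its symmetry, and compare the second-order Taylor expansion against the exact value; the function `f` and the point `x` are arbitrary choices for the example.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision, so the comparison is meaningful

# Arbitrary example function f : R^3 -> R
def f(x):
    return jnp.sin(x[0]) * x[1] ** 2 + x[0] * jnp.exp(x[2])

x = jnp.array([0.3, -1.2, 0.5])
g = jax.grad(f)(x)        # gradient, shape (3,)
H = jax.hessian(f)(x)     # Hessian, shape (3, 3)

print(jnp.allclose(H, H.T))   # True: mixed partials commute (Schwarz's theorem)

# Second-order Taylor expansion for a small step delta
delta = 1e-3 * jnp.array([1.0, -2.0, 0.5])
quadratic_model = f(x) + delta @ g + 0.5 * delta @ H @ delta
print(float(f(x + delta)), float(quadratic_model))   # agree to roughly O(||delta||^3)
```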

Geometry

The eigenvalues of the Hessian classify the local geometry of $f$ at a critical point (where $\nabla f = 0$):

  • All eigenvalues positive → strictly positive definite Hessian → strict local minimum.
  • All eigenvalues negative → negative definite → strict local maximum.
  • Mixed signs → indefinite → saddle point.
  • Some zero eigenvalues → degenerate critical point requiring higher-order analysis.
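
A quick numerical check of this classification on an arbitrary example, $f(x, y) = x^2 - y^2$, whose only critical point is the origin:

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2; it is constant, so this is also its value at the origin.
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigvalsh: eigenvalues of a symmetric matrix

if np.all(eigvals > 0):
    kind = "strict local minimum"
elif np.all(eigvals < 0):
    kind = "strict local maximum"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point"
else:
    kind = "degenerate critical point (higher-order analysis needed)"

print(eigvals, "->", kind)   # [-2.  2.] -> saddle point
```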

The condition number of the Hessian (the ratio of its largest to smallest eigenvalue) controls how anisotropic the loss landscape is: a poorly conditioned Hessian produces long, narrow valleys that defeat first-order methods.
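
To illustrate (with an arbitrary toy quadratic, not any particular loss), gradient descent on $f(x) = \tfrac{1}{2} x^\top H x$ with condition number 100 must take a step size set by the largest eigenvalue, and then crawls along the direction of the smallest one:

```python
import numpy as np

H = np.diag([1.0, 100.0])        # eigenvalues 1 and 100
print(np.linalg.cond(H))         # condition number = 100

x = np.array([1.0, 1.0])
lr = 1.0 / 100.0                 # step size limited by the largest eigenvalue

for _ in range(200):
    x = x - lr * (H @ x)         # gradient of the quadratic is H x

print(x)   # ~[0.134, 0.0]: the stiff direction converged in one step,
           # the flat direction still retains ~13% of its error after 200 steps
```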

In optimisation

Newton's method uses the inverse Hessian to take curvature-corrected steps:

$$\theta_{t+1} = \theta_t - \bigl[ \nabla^2 f(\theta_t) \bigr]^{-1} \nabla f(\theta_t),$$

achieving quadratic convergence near a strict local minimum. Quasi-Newton methods (BFGS, L-BFGS, DFP, SR1) approximate the Hessian (or its inverse) iteratively from successive gradient evaluations, recovering most of Newton's speed without paying the full second-derivative cost.
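
For illustration only, here is a bare Newton iteration on the Rosenbrock function (a standard test problem, not something taken from the text), using JAX for the derivatives; a practical implementation would add a line search or damping to guard against indefinite or near-singular Hessians.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def f(x):
    # Rosenbrock function: minimised at (1, 1), with a long curved valley.
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

grad_f = jax.grad(f)
hess_f = jax.hessian(f)

theta = jnp.array([-1.2, 1.0])
for _ in range(20):
    g = grad_f(theta)
    H = hess_f(theta)
    # Newton step: solve H @ step = g instead of forming the explicit inverse.
    theta = theta - jnp.linalg.solve(H, g)

print(theta)   # close to the minimiser (1, 1) well before the 20 iterations are up
```

Library quasi-Newton implementations are readily available, e.g. `scipy.optimize.minimize(f, x0, method="BFGS")` or `method="L-BFGS-B"`.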

In deep learning

For a model with $n$ parameters, the full Hessian costs $O(n^2)$ memory and $O(n^2)$ to $O(n^3)$ compute, which is prohibitive when $n > 10^4$ and entirely infeasible at the $n = 10^9$–$10^{12}$ scale of modern neural networks. Two workarounds dominate practice:

  • Hessian–vector products $H v$ can be computed in $O(n)$ time without ever materialising the full matrix, via Pearlmutter's trick (forward-mode automatic differentiation applied on top of backward-mode). This enables conjugate-gradient solves of $H v = g$ for use in Hessian-free optimisation (Martens, 2010); a minimal code sketch follows this list.
  • Block-diagonal and Kronecker-factored approximations such as K-FAC (Martens and Grosse, 2015) and Shampoo (Gupta et al., 2018), as well as natural-gradient methods that use the Fisher information as a Hessian surrogate, are tractable second-order methods that scale to large neural networks.
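
A minimal sketch of the Hessian-vector product idea in JAX (the loss `f`, parameters `w`, and direction `v` below are arbitrary stand-ins): applying forward-mode `jax.jvp` to the reverse-mode gradient gives $H v$ at roughly the cost of a few gradient evaluations, without ever forming $H$.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def f(w):
    # Arbitrary scalar "loss" over a parameter vector w.
    return jnp.sum(jnp.tanh(w) ** 2) + 0.5 * jnp.sum(w ** 2)

def hvp(f, w, v):
    # Pearlmutter-style trick: forward-mode differentiation of the gradient map.
    # jax.jvp returns (grad(f)(w), directional derivative of grad(f) along v) = (g, H v).
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

w = jnp.linspace(-1.0, 1.0, 5)
v = jnp.ones_like(w)

print(hvp(f, w, v))
print(jax.hessian(f)(w) @ v)   # explicit check, feasible only for tiny examples like this
```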

The empirical Hessian eigenvalue distribution of trained neural networks is heavily skewed: a small number of large positive eigenvalues sit on top of a near-zero bulk, with negative eigenvalues making the Hessian indefinite even at points of low loss. This implies that the curvature is effectively low-rank, characterises the manifold of loss-equivalent solutions through the Hessian's null space, and motivates much of the modern theory of why over-parameterised networks generalise.
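
As an illustration of how such spectra are probed in practice, Hessian-vector products can be fed to a Lanczos eigensolver to estimate extreme eigenvalues without materialising the matrix. The sketch below does this with SciPy on a toy function (an arbitrary stand-in, not a network loss) whose Hessian, $2 a a^\top - 0.1\,\mathrm{diag}(\cos x)$, has one large positive eigenvalue over a near-zero, partly negative bulk:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n = 300
a = rng.standard_normal(n)
x0 = rng.standard_normal(n)

def grad(x):
    # Gradient of the toy loss f(x) = (a @ x)**2 + 0.1 * sum(cos(x)).
    return 2.0 * (a @ x) * a - 0.1 * np.sin(x)

def hvp(v, eps=1e-5):
    # Symmetric finite-difference Hessian-vector product at x0; H is never formed.
    return (grad(x0 + eps * v) - grad(x0 - eps * v)) / (2.0 * eps)

H_op = LinearOperator((n, n), matvec=hvp, dtype=np.float64)

top = eigsh(H_op, k=5, which="LA", return_eigenvectors=False, tol=1e-4)
low = eigsh(H_op, k=1, which="SA", return_eigenvectors=False, tol=1e-4)
print("largest:", top)        # one outlier near 2 * ||a||**2, the rest close to zero
print("most negative:", low)  # a small negative value from the oscillatory term
```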

Related terms: Gradient, Chain Rule, Fisher Information
