The Fisher information of a parametric model $p_\theta(x)$ at parameter $\theta$ is
$$\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x) \nabla_\theta \log p_\theta(x)^\top\right]$$
For scalar $\theta$: $\mathcal{I}(\theta) = \mathbb{E}[(\partial_\theta \log p_\theta(x))^2]$.
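For example, for a single Bernoulli observation with $p_\theta(x) = \theta^x (1-\theta)^{1-x}$, the score is $\partial_\theta \log p_\theta(x) = x/\theta - (1-x)/(1-\theta) = (x-\theta)/(\theta(1-\theta))$, so
$$\mathcal{I}(\theta) = \mathbb{E}\!\left[\left(\frac{x - \theta}{\theta(1-\theta)}\right)^{\!2}\right] = \frac{\theta(1-\theta)}{\theta^2(1-\theta)^2} = \frac{1}{\theta(1-\theta)},$$
which is largest near $\theta = 0$ or $\theta = 1$, where a single observation is most informative.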
Equivalent formulation (under regularity conditions):
$$\mathcal{I}(\theta) = -\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta^2 \log p_\theta(x)\right]$$
The negative expected Hessian of the log-likelihood.
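The equivalence follows from the pointwise identity
$$\nabla_\theta^2 \log p_\theta(x) = \frac{\nabla_\theta^2 p_\theta(x)}{p_\theta(x)} - \nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^\top,$$
together with $\mathbb{E}\!\left[\nabla_\theta^2 p_\theta(x)/p_\theta(x)\right] = \nabla_\theta^2 \int p_\theta(x)\,dx = 0$, where exchanging differentiation and integration is what the regularity conditions guarantee.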
The Fisher information measures the expected curvature of the log-likelihood around $\theta$: how sharply the log-likelihood responds to changes in $\theta$, and hence how informative a single observation is about the true parameter.
Cramér–Rao lower bound: for any unbiased estimator $\hat\theta$ of $\theta$,
$$\mathrm{Var}(\hat\theta) \geq \mathcal{I}(\theta)^{-1}$$
The inverse Fisher information is the minimum achievable variance, a theoretical limit on estimation accuracy (for vector $\theta$ the inequality holds in the positive-semidefinite ordering). The maximum-likelihood estimator attains this bound asymptotically, making it asymptotically efficient.
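Fisher information is additive over independent observations, so for $N$ i.i.d. samples the bound reads $\mathrm{Var}(\hat\theta) \geq (N\mathcal{I}(\theta))^{-1}$. For example, for $x_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, $\mathcal{I}(\mu) = 1/\sigma^2$ per observation; the sample mean has variance exactly $\sigma^2/N$ and therefore attains the Cramér–Rao bound.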
MLE asymptotic distribution: as $N \to \infty$ (with $\mathcal{I}(\theta)$ the Fisher information of a single observation),
$$\sqrt{N}(\hat\theta_{\mathrm{MLE}} - \theta) \xrightarrow{d} \mathcal{N}\!\left(0, \mathcal{I}(\theta)^{-1}\right)$$
This justifies confidence intervals and hypothesis tests based on Fisher information.
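Concretely, this gives the Wald interval: an approximate 95% confidence interval for a scalar $\theta$ is
$$\hat\theta_{\mathrm{MLE}} \pm \frac{1.96}{\sqrt{N\,\mathcal{I}(\hat\theta_{\mathrm{MLE}})}},$$
with the unknown $\mathcal{I}(\theta)$ replaced by its plug-in estimate at the MLE.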
In modern AI / ML:
Natural gradient (Amari 1998) uses the Fisher information matrix as a metric on parameter space. The natural gradient update is
$$\theta_{t+1} = \theta_t - \eta \mathcal{I}(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)$$
The update is invariant to smooth reparameterisations of $\theta$ (unlike vanilla gradient descent), and Amari showed it is asymptotically Fisher-efficient for online learning. For large models the exact Fisher matrix is too expensive to form and invert, motivating the approximations described below.
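First, a minimal NumPy sketch of the exact update, assuming the Fisher matrix is estimated by Monte Carlo from per-sample score vectors (the function and argument names are illustrative, not from any particular library):

```python
import numpy as np

def natural_gradient_step(theta, grad_loss, scores, lr=0.1, damping=1e-3):
    """One natural-gradient update (illustrative sketch).

    theta:     (D,) current parameters
    grad_loss: (D,) gradient of the loss at theta
    scores:    (N, D) per-sample scores grad_theta log p_theta(x_n),
               with x_n sampled from the model
    """
    N, D = scores.shape
    # Monte Carlo Fisher estimate: average outer product of scores.
    fisher = scores.T @ scores / N
    # Damping keeps the linear system well conditioned, as is standard.
    step = np.linalg.solve(fisher + damping * np.eye(D), grad_loss)
    return theta - lr * step
```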
K-FAC (Kronecker-Factored Approximate Curvature) (Martens & Grosse 2015): approximates each layer's block of the Fisher matrix as a Kronecker product of two much smaller factors, which makes the required inversion tractable. Used in some large-scale ML systems.
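A sketch of the resulting preconditioning for one dense layer, assuming the standard factors $A = \mathbb{E}[aa^\top]$ over layer inputs and $G = \mathbb{E}[gg^\top]$ over backpropagated output gradients; since $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, inverting the big block reduces to two small inversions (shapes and names here are illustrative):

```python
import numpy as np

def kfac_precondition(grad_W, acts, grads_out, damping=1e-3):
    """K-FAC-style preconditioning for one dense layer (illustrative).

    grad_W:    (d_out, d_in) loss gradient for the layer's weight matrix
    acts:      (N, d_in)  layer inputs a_n
    grads_out: (N, d_out) backpropagated gradients g_n at the layer output

    With the layer's Fisher block approximated as A kron G,
    F^{-1} vec(grad_W) corresponds to G^{-1} grad_W A^{-1}.
    """
    N = acts.shape[0]
    A = acts.T @ acts / N + damping * np.eye(acts.shape[1])
    G = grads_out.T @ grads_out / N + damping * np.eye(grads_out.shape[1])
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```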
Empirical Fisher: replace the expectation under the model with an empirical average over data:
$$\hat{\mathcal{I}}(\theta) = \frac{1}{N} \sum_n \nabla \log p_\theta(x_n) \nabla \log p_\theta(x_n)^\top$$
Often used as a cheaper, more practical stand-in for the true Fisher, though the two are not the same object and can differ away from the optimum.
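The distinction is clearest in the supervised setting $p_\theta(y \mid x)$: the true Fisher averages over labels drawn from the model, while the empirical Fisher plugs in the observed labels. For logistic regression both have closed forms, as in this illustrative sketch:

```python
import numpy as np

def fishers_logreg(theta, X, y):
    """True vs. empirical Fisher for logistic regression (illustrative).

    X: (N, D) inputs; y: (N,) labels in {0, 1}.
    Per-example score w.r.t. theta: (y_n - p_n) x_n, p_n = sigmoid(x_n . theta).
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    # True Fisher: label expectation E_{y ~ Bernoulli(p)}[(y - p)^2] = p(1 - p).
    true_F = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
    # Empirical Fisher: use the observed labels instead.
    scores = (y - p)[:, None] * X
    emp_F = scores.T @ scores / len(y)
    return true_F, emp_F
```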
Information geometry: views parameter space as a Riemannian manifold with the Fisher matrix as metric. Provides a geometric framework for understanding statistical estimation.
EWC (Elastic Weight Consolidation) (Kirkpatrick et al. 2017) for continual learning: penalises changes to weights that were important for previous tasks, with importance measured by the diagonal of the Fisher information; this mitigates catastrophic forgetting.
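A minimal sketch of the EWC objective, assuming a diagonal Fisher estimate computed at the previous task's solution (variable names are illustrative):

```python
import numpy as np

def ewc_loss(task_loss, theta, theta_old, fisher_diag, lam=1.0):
    """EWC-regularised objective (illustrative sketch).

    task_loss:   scalar loss on the current task at theta
    theta_old:   (D,) parameters learned on the previous task
    fisher_diag: (D,) diagonal Fisher at theta_old; large entries mark
                 weights the previous task depends on strongly
    lam:         strength of the consolidation penalty
    """
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)
    return task_loss + penalty
```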
The Fisher information also underlies Bayesian model selection (BIC), influence functions for understanding training-example impact, and information-geometric analyses of optimisation in over-parameterised models.
Related terms: Maximum Likelihood Estimation, Hessian
Discussed in:
- Chapter 5: Statistics