The Jacobian matrix generalises the gradient and the derivative to vector-valued functions of vector arguments. If $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at a point $\mathbf{x}$, the Jacobian $J_\mathbf{f}(\mathbf{x})$ is the $m \times n$ matrix whose $(i, j)$ entry is
$$J_{ij} = \frac{\partial f_i}{\partial x_j}.$$
Each row of the Jacobian is the gradient of one output component with respect to all inputs; each column records how every output responds to a change in one input. The Jacobian encodes the best linear approximation to $\mathbf{f}$ near $\mathbf{x}$: it is the matrix by which a small change $\Delta \mathbf{x}$ in the input is multiplied to produce, to first order, the change $\Delta \mathbf{f} \approx J\,\Delta\mathbf{x}$ in the output. The matrix is named after the German mathematician Carl Gustav Jacob Jacobi (1804–1851), who developed the theory of Jacobian determinants in the 1830s and 1840s.
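The definition and the first-order approximation can be checked numerically. The following is a minimal sketch using only the Python standard library; the test function `f`, the step size `eps` and the helper names are illustrative assumptions, not part of any particular library.

```python
import math

def f(x):
    """Hypothetical example map f: R^2 -> R^2."""
    x1, x2 = x
    return [x1 ** 2 * x2, 5.0 * x1 + math.sin(x2)]

def analytic_jacobian(x):
    """Hand-derived Jacobian of f: entry (i, j) is d f_i / d x_j."""
    x1, x2 = x
    return [[2.0 * x1 * x2, x1 ** 2],
            [5.0, math.cos(x2)]]

def numerical_jacobian(func, x, eps=1e-6):
    """Approximate the m x n Jacobian of func at x by central differences."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            # Column j: how every output responds to a change in input j.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return J

def matvec(J, v):
    """Matrix-vector product J v for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]

x0 = [1.0, 2.0]
J_exact = analytic_jacobian(x0)
J_num = numerical_jacobian(f, x0)

# First-order check: f(x0 + dx) - f(x0) should be close to J dx for small dx.
dx = [1e-3, -2e-3]
x_moved = [a + b for a, b in zip(x0, dx)]
df_true = [a - b for a, b in zip(f(x_moved), f(x0))]
df_lin = matvec(J_exact, dx)

print(J_num)
print(df_true, df_lin)
```

The gap between `df_true` and `df_lin` shrinks quadratically as `dx` shrinks, which is what "best linear approximation" means in practice.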
The Jacobian plays a starring role in the multivariate chain rule: if $\mathbf{h} = \mathbf{g} \circ \mathbf{f}$, then $J_\mathbf{h}(\mathbf{x}) = J_\mathbf{g}(\mathbf{f}(\mathbf{x})) \cdot J_\mathbf{f}(\mathbf{x})$. Backpropagation in a deep neural network is the iterated application of this rule: each layer contributes a Jacobian, and the gradient of the scalar loss with respect to the parameters is the product of these Jacobians, accumulated from the output backwards. In practice, modern automatic differentiation systems never form the full Jacobian explicitly: for an $n$-input, $m$-output function it has $mn$ entries, often hundreds of millions or more in modern networks. Instead they compute vector–Jacobian products $\mathbf{v}^\top J$ (reverse mode, used when $m \ll n$, the typical case for scalar losses) or Jacobian–vector products $J\mathbf{v}$ (forward mode, used when $n \ll m$), each at a small constant multiple of the cost of a single function evaluation. The Hessian of a scalar function is the Jacobian of its gradient, and the Gauss–Newton matrix $J^\top J$ is a positive-semidefinite approximation to the Hessian of a least-squares loss.
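To make the distinction between the two modes concrete, here is a hand-written sketch for a two-layer composition $h = g \circ f$, using only the Python standard library. The layers (an elementwise sine followed by a sum of squares) and all function names are assumptions chosen for illustration; real automatic differentiation systems generate the equivalent of `f_vjp` and `f_jvp` automatically.

```python
import math

# Layer 1: f(x) = sin(x) elementwise, so J_f = diag(cos(x)) is n x n.
def f(x):
    return [math.sin(xi) for xi in x]

def f_jvp(x, t):
    """Forward mode: J_f(x) t, computed without building the n x n matrix."""
    return [math.cos(xi) * ti for xi, ti in zip(x, t)]

def f_vjp(x, v):
    """Reverse mode: v^T J_f(x), again just an elementwise product."""
    return [vi * math.cos(xi) for xi, vi in zip(x, v)]

# Layer 2: g(y) = sum(y_i^2), a scalar "loss", so J_g = 2 y^T is 1 x n.
def g(y):
    return sum(yi * yi for yi in y)

def g_jvp(y, t):
    """Forward mode through g: J_g(y) t is a single number."""
    return sum(2.0 * yi * ti for yi, ti in zip(y, t))

def g_vjp(y, v):
    """Reverse mode through g: v is a scalar because g has one output."""
    return [v * 2.0 * yi for yi in y]

def grad_reverse(x):
    """Full gradient of h = g(f(x)) in one backward sweep: (1 * J_g) J_f."""
    y = f(x)              # forward pass, keeping the intermediate value
    v = g_vjp(y, 1.0)     # seed with d(loss)/d(loss) = 1
    return f_vjp(x, v)

def directional_forward(x, t):
    """Directional derivative J_h(x) t in one forward sweep: J_g (J_f t)."""
    return g_jvp(f(x), f_jvp(x, t))

x = [0.3, -1.2, 2.0]
n = len(x)

grad = grad_reverse(x)  # one sweep yields all n entries

# Forward mode needs one sweep per input, each seeded with a basis vector.
grad_fwd = [
    directional_forward(x, [1.0 if i == j else 0.0 for i in range(n)])
    for j in range(n)
]

print(grad)
print(grad_fwd)
```

For this scalar-output example the reverse sweep recovers the whole gradient at once, whereas forward mode takes `n` sweeps to obtain the same numbers, which is why reverse mode is preferred when $m \ll n$.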
When $m = n$ the Jacobian is square, and the absolute value of its determinant, $|\det J_\mathbf{f}|$, measures how the function locally scales volumes. It appears in the multivariate change-of-variables formula for probability densities: if $\mathbf{x} = \mathbf{f}(\mathbf{z})$ with $\mathbf{f}$ invertible and differentiable, then $p_x(\mathbf{x}) = p_z(\mathbf{f}^{-1}(\mathbf{x})) \cdot |\det \partial \mathbf{f}^{-1}/\partial \mathbf{x}|$. Normalising flows exploit this formula directly to build flexible invertible neural networks whose densities are tractable; the design of architectures such as NICE, Real NVP and Glow is largely the design of transformations with triangular or otherwise tractable Jacobian determinants. Jacobians also appear elsewhere across AI: in robotics and control they relate joint velocities to end-effector velocities; in sensitivity analysis they quantify how outputs depend on inputs, a fundamental quantity for uncertainty propagation, adversarial robustness, and the analysis of neural tangent kernels.
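The change-of-variables formula can be illustrated with a toy two-dimensional map whose Jacobian is triangular, so that the determinant is just the product of the diagonal entries. This is a sketch under stated assumptions: the affine map, its parameters `A1`, `A2`, `C`, `B1`, `B2` and the standard normal base density are all hypothetical choices, far simpler than the learned transformations used in actual flows.

```python
import math

# Hypothetical parameters of a lower-triangular affine map x = f(z):
#   x1 = A1 * z1 + B1
#   x2 = C * z1 + A2 * z2 + B2
A1, A2, C, B1, B2 = 2.0, 0.5, 1.5, 0.5, -1.0

def forward(z):
    """Map a base sample z to data space."""
    z1, z2 = z
    return (A1 * z1 + B1, C * z1 + A2 * z2 + B2)

def inverse(x):
    """Invert the map one coordinate at a time, as triangularity allows."""
    x1, x2 = x
    z1 = (x1 - B1) / A1
    z2 = (x2 - B2 - C * z1) / A2
    return (z1, z2)

def log_abs_det_inverse_jacobian():
    """log|det d f^{-1}/dx|: the inverse Jacobian is triangular with
    diagonal (1/A1, 1/A2), so the determinant is their product."""
    return -(math.log(abs(A1)) + math.log(abs(A2)))

def log_p_z(z):
    """Log-density of a 2D standard normal base distribution."""
    return -0.5 * (z[0] ** 2 + z[1] ** 2) - math.log(2.0 * math.pi)

def log_p_x(x):
    """log p_x(x) = log p_z(f^{-1}(x)) + log|det d f^{-1}/dx|."""
    return log_p_z(inverse(x)) + log_abs_det_inverse_jacobian()

x = forward((0.3, -0.7))
print(log_p_x(x))
```

Because the map here is constant-coefficient, the log-determinant term does not depend on `x`; in a coupling-layer flow the diagonal entries would themselves be functions of part of the input, but the triangular structure keeps the determinant equally cheap.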
Related terms: Gradient, Hessian, Backpropagation, Chain Rule, Normalising Flow
Discussed in:
- Chapter 6: ML Fundamentals, Mathematical Foundations