Partial Derivative, Glossary, Textbook of AI

A partial derivative of a multivariable function $f(x_1, \ldots, x_n)$ with respect to $x_i$, written $\frac{\partial f}{\partial x_i}$ or $\partial_i f$, is the derivative taken while treating all other variables as constants. It measures the instantaneous rate of change of $f$ along the $i$-th coordinate axis alone, and is defined formally as the limit

$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}.$$
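
As a small worked instance of this limit, take $g(x, y) = x^2 y$ (a function chosen here purely for illustration, not one used elsewhere in this entry). Holding $y$ fixed and perturbing only $x$ gives

$$\frac{\partial g}{\partial x}(x, y) = \lim_{h \to 0} \frac{(x + h)^2 y - x^2 y}{h} = \lim_{h \to 0} \frac{2xhy + h^2 y}{h} = \lim_{h \to 0} \left(2xy + hy\right) = 2xy.$$

Throughout the calculation $y$ behaves exactly like a constant coefficient.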

Partial derivatives are computed using the same rules as ordinary derivatives (the chain rule, product rule and quotient rule), applied to one variable at a time while the others are held fixed. For example, if $f(x, y) = x^2 y + \sin(xy)$, then $\frac{\partial f}{\partial x} = 2xy + y\cos(xy)$ while $\frac{\partial f}{\partial y} = x^2 + x\cos(xy)$.
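
One way to sanity-check hand-computed partials such as these is to compare them with a finite-difference approximation of the limit definition. The sketch below does this in plain Python. The helper name `central_difference`, the step size `h` and the evaluation point are illustrative choices, not part of any library.

```python
import math

def f(x, y):
    # The example function from the text: f(x, y) = x^2 y + sin(xy)
    return x**2 * y + math.sin(x * y)

def df_dx(x, y):
    # Analytic partial with respect to x (y treated as a constant)
    return 2 * x * y + y * math.cos(x * y)

def df_dy(x, y):
    # Analytic partial with respect to y (x treated as a constant)
    return x**2 + x * math.cos(x * y)

def central_difference(func, point, i, h=1e-6):
    """Approximate the partial derivative of func with respect to
    coordinate i at point, perturbing only that coordinate."""
    forward = list(point)
    backward = list(point)
    forward[i] += h
    backward[i] -= h
    return (func(*forward) - func(*backward)) / (2 * h)

point = (1.3, -0.7)  # arbitrary evaluation point (an assumption)
approx_dx = central_difference(f, point, 0)
approx_dy = central_difference(f, point, 1)

print("df/dx:", df_dx(*point), "vs finite difference:", approx_dx)
print("df/dy:", df_dy(*point), "vs finite difference:", approx_dy)
```

The central difference is used instead of the one-sided quotient in the definition because its truncation error shrinks like $h^2$ rather than $h$. Both converge to the same partial derivative as $h \to 0$.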

History

Partial derivatives were introduced by Alexis Clairaut and Leonhard Euler in the first half of the eighteenth century in their work on hydrodynamics and continuum mechanics. The symbol $\partial$, a stylised cursive "d", was used by the Marquis de Condorcet in 1770, adopted for partial derivatives by Adrien-Marie Legendre in 1786, and popularised by Carl Gustav Jacob Jacobi in 1841 to distinguish partial from total differentiation. Clairaut's theorem, one of the foundational results, states that mixed partials commute, $\partial^2 f / \partial x \partial y = \partial^2 f / \partial y \partial x$, whenever the second partial derivatives of $f$ are continuous.

The gradient and Hessian

Stacking the partial derivatives of a scalar-valued $f: \mathbb{R}^n \to \mathbb{R}$ into a column vector yields the gradient:

$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)^\top.$$

Stacking second partials yields the Hessian matrix $H_{ij} = \partial^2 f / \partial x_i \partial x_j$, which by Clairaut's theorem is symmetric whenever the second partial derivatives are continuous. For a vector-valued $f: \mathbb{R}^n \to \mathbb{R}^m$, the $m \times n$ matrix of partial derivatives $J_{ij} = \partial f_i / \partial x_j$ is the Jacobian.
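
To make the three objects concrete, the following sketch assembles a gradient, a Hessian and a Jacobian from finite-difference estimates of the individual partial derivatives. It is an illustration under stated assumptions, not a recommended numerical method: the test functions `f` and `g`, the helper names and the step sizes are all hypothetical choices made for this example.

```python
def f(x):
    # Hypothetical scalar test function: f(x0, x1) = x0^2 * x1 + x1^3
    return x[0] ** 2 * x[1] + x[1] ** 3

def g(x):
    # Hypothetical vector-valued test function from R^2 to R^3
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]

def shifted(x, shifts):
    """Return a copy of x with each (index, amount) shift applied."""
    y = list(x)
    for i, amount in shifts:
        y[i] += amount
    return y

def gradient(func, x, h=1e-5):
    # One central difference per coordinate, stacked into a list
    return [(func(shifted(x, [(i, h)])) - func(shifted(x, [(i, -h)]))) / (2 * h)
            for i in range(len(x))]

def hessian(func, x, h=1e-4):
    # H[i][j] approximates d^2 f / dx_i dx_j with a four-point stencil
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pp = func(shifted(x, [(i, h), (j, h)]))
            pm = func(shifted(x, [(i, h), (j, -h)]))
            mp = func(shifted(x, [(i, -h), (j, h)]))
            mm = func(shifted(x, [(i, -h), (j, -h)]))
            H[i][j] = (pp - pm - mp + mm) / (4 * h * h)
    return H

def jacobian(func, x, h=1e-5):
    # columns[j][i] approximates d f_i / d x_j; transpose to get J[i][j]
    columns = []
    for j in range(len(x)):
        up, down = func(shifted(x, [(j, h)])), func(shifted(x, [(j, -h)]))
        columns.append([(a - b) / (2 * h) for a, b in zip(up, down)])
    return [list(row) for row in zip(*columns)]

x0 = [1.0, 2.0]
grad = gradient(f, x0)   # analytic value: [2*x0*x1, x0^2 + 3*x1^2] = [4, 13]
H = hessian(f, x0)       # analytic value: [[2*x1, 2*x0], [2*x0, 6*x1]]
J = jacobian(g, x0)      # analytic value: [[x1, x0], [1, 1], [0, 2*x1]]
print(grad, H, J, sep="\n")
```

Note that the four-point stencil treats $i$ and $j$ interchangeably, so the symmetry of the numerical Hessian is built in and is not an independent confirmation of Clairaut's theorem. The symmetry of the analytic Hessian in the comments is the real illustration.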

Modern relevance: machine learning

In machine learning, every parameter in a model has its own partial derivative of the loss function, and collectively these form the gradient used by optimisers such as stochastic gradient descent and Adam. For a neural network with hundreds of millions of weights, estimating each partial derivative separately (for example, by perturbing one weight at a time and re-evaluating the loss) would require a separate forward computation per weight and would be hopelessly inefficient; backpropagation instead obtains them all from one forward pass followed by one backward pass, propagating partial derivatives backward through the computational graph using the chain rule. Modern automatic differentiation systems (PyTorch's autograd, JAX's grad, TensorFlow's GradientTape) automate this entirely, requiring users to specify only the forward computation.
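
The idea of propagating partial derivatives backward through a computational graph can be sketched in a few dozen lines of plain Python. The `Value` class and the `sin` and `backward` helpers below are hypothetical, written only for this illustration; they are not the API of PyTorch, JAX or TensorFlow, and real frameworks operate on tensors with many more operations.

```python
import math

class Value:
    """A scalar node in a computational graph (hypothetical minimal class)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                 # accumulates d(output)/d(self)
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # Product rule: d(ab)/da = b and d(ab)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

def sin(v):
    return Value(math.sin(v.data), (v,), (math.cos(v.data),))

def backward(output):
    """Fill in .grad for every node that output depends on."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order gives a topological ordering of the graph
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad

# Forward computation of f(x, y) = x^2 y + sin(xy), the example used earlier
x = Value(1.3)
y = Value(-0.7)
f = x * x * y + sin(x * y)
backward(f)
print("df/dx =", x.grad, " df/dy =", y.grad)
```

A single call to `backward` yields both partial derivatives at once, and the cost of the backward sweep is proportional to the number of operations in the forward computation rather than to the number of inputs.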

Partial derivatives also arise in probability theory, where joint densities are differentiated with respect to individual variables, and in optimisation theory, where the Karush–Kuhn–Tucker (KKT) conditions use partial derivatives of the Lagrangian to characterise constrained optima. The score function $\nabla_\theta \log p(x; \theta)$ is a vector of partial derivatives with respect to the parameters that underpins maximum likelihood estimation and variational inference; the closely related score $\nabla_x \log p(x)$, taken with respect to the data rather than the parameters, underlies modern score-based diffusion models. A working command of partial derivatives is a prerequisite for understanding every deep-learning framework's autograd system, every variational inference derivation, and every theoretical analysis of gradient-based learning.
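
As a concrete instance of a parameter score, consider a univariate Gaussian with mean $\mu$ and standard deviation $\sigma$. Differentiating $\log p(x; \mu, \sigma)$ gives $\partial/\partial\mu = (x - \mu)/\sigma^2$ and $\partial/\partial\sigma = ((x - \mu)^2 - \sigma^2)/\sigma^3$. The sketch below, which uses a small made-up data set, checks that the summed score vanishes at the maximum likelihood estimates, which is the first-order condition that maximum likelihood estimation solves.

```python
import math

def log_density(x, mu, sigma):
    # Log of the Gaussian density N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def score(x, mu, sigma):
    """Partial derivatives of log_density with respect to mu and sigma."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return d_mu, d_sigma

data = [2.1, 1.9, 3.0, 2.4, 2.6]  # hypothetical observations

# Closed-form maximum likelihood estimates for a Gaussian
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

# Sum the per-observation scores at the estimates
total_d_mu = sum(score(x, mu_hat, sigma_hat)[0] for x in data)
total_d_sigma = sum(score(x, mu_hat, sigma_hat)[1] for x in data)
print("summed score at the MLE:", total_d_mu, total_d_sigma)
```

Both sums are zero up to floating-point rounding, whereas evaluating them at any other parameter values gives a non-zero vector that points toward higher likelihood.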

Interactive

The gradient as a vector field. Small arrows on a contour plot point uphill, in the direction of steepest ascent at each location.

Partial derivatives slice a surface. Hold one variable fixed and measure the slope along the other; the two partials together form the gradient.

Related terms: Gradient, Backpropagation, Jacobian, Hessian, Chain Rule

Discussed in:

Chapter 6: ML Fundamentals, Calculus for Machine Learning
