3.11 A brief look at the calculus of variations
So far in this chapter we have asked, again and again, the same kind of question: given a function $f$ that takes a vector $\mathbf{x}$ and returns a number, which choice of $\mathbf{x}$ makes $f(\mathbf{x})$ as small (or as large) as possible? Gradient descent, Newton's method, the Hessian: all of the machinery in §§3.4–3.10 is built to answer that finite-dimensional question. The calculus of variations is the natural next step. It asks the same question one level up: instead of choosing a vector to minimise a number, we choose an entire function to minimise a number.
This sounds abstract, and in a sense it is, but the payoff is immediate. The evidence lower bound (ELBO) used to train variational autoencoders, the free-energy objectives that appear in modern probabilistic models, the policy objectives in some reinforcement-learning algorithms, and the maximum-entropy derivations of softmax and Gaussian distributions are all variational arguments. You will not need to solve any new differential equations to read the rest of this book, but you will need to recognise the shape of the argument when it appears. That recognition is what this section is for.
This section sketches the infinite-dimensional cousin of the finite-dimensional calculus from §§3.4–3.10. The language reappears as a working tool in the chapters on probabilistic models (variational inference, VAEs, diffusion) and reinforcement learning.
From vectors to functions
In ordinary calculus we minimise $f(\mathbf{x})$ over $\mathbf{x} \in \mathbb{R}^n$. The unknown is a list of $n$ numbers; the objective is a real-valued function of those numbers; finding the minimum is a finite-dimensional problem. In the calculus of variations the unknown is itself a function, a recipe that turns inputs into outputs, and the objective is a functional $\mathcal{F}[q]$, a real number that depends on the whole function $q$. The square brackets are a small but useful piece of notation: they remind us that the input is not a number or a vector but an entire function, with infinitely many degrees of freedom.
Three classical problems make the idea concrete.
Shortest path between two points. Suppose you have to draw a curve from $(a, q(a))$ to $(b, q(b))$ in the plane. The length of any candidate curve $q$ is the functional $$\mathcal{F}[q] = \int_a^b \sqrt{1 + q'(x)^2}\, dx.$$ The unknown is the function $q$, with the endpoints fixed. The minimiser, unsurprisingly, is the straight line between them, but notice that the question itself is variational: we are searching the space of all admissible curves, not the space of vectors.
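To see the functional as an ordinary number you can compute, here is a minimal numerical sketch (assuming NumPy; the helper `curve_length` and the candidate curves are illustrative choices, not part of the text) that discretises $\mathcal{F}[q]$ and evaluates it on a few curves sharing the same endpoints:

```python
import numpy as np

def curve_length(q, a=0.0, b=1.0, n=10_000):
    """Approximate F[q] = ∫_a^b sqrt(1 + q'(x)^2) dx by the length of a fine polyline."""
    x = np.linspace(a, b, n)
    y = q(x)
    return np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2))

# Candidate curves from (0, 0) to (1, 1): the straight line and two detours.
candidates = {
    "straight": lambda x: x,
    "parabola": lambda x: x ** 2,
    "wiggle":   lambda x: x + 0.1 * np.sin(2 * np.pi * x),
}
for name, q in candidates.items():
    print(f"{name:9s} length ≈ {curve_length(q):.5f}")
# The straight line gives the smallest value, sqrt(2) ≈ 1.41421, as expected.
```

Each candidate is one point in the infinite-dimensional search space; the functional assigns each a single number.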
The brachistochrone. In 1696 Johann Bernoulli posed a famous puzzle: a bead slides without friction down a wire from one fixed point to a lower fixed point, pulled by gravity. What shape of wire gets the bead from start to finish in the shortest time? The naive guess is a straight line, but a steeper initial drop builds up speed early and wins. The answer turns out to be a cycloid, the curve traced by a point on the rim of a rolling wheel. The brachistochrone problem launched the calculus of variations as a discipline; it is the first place where mathematicians realised that minimising over curves was a coherent and powerful idea.
The maximum-entropy distribution given mean and variance. Suppose you know that a random variable has mean $\mu$ and variance $\sigma^2$, but nothing else. Of all probability densities $q$ consistent with that information, which one is, in a precise sense, the least committal? The answer is the one that maximises the entropy $\mathcal{H}[q] = -\int q(x) \log q(x)\, dx$ subject to the constraints $\int q = 1$, $\int x\, q(x)\, dx = \mu$, and $\int x^2\, q(x)\, dx = \mu^2 + \sigma^2$. The unique maximiser is the Gaussian $\mathcal{N}(\mu, \sigma^2)$. This is one of several reasons the Gaussian is the default distribution in machine learning: among all densities matching a given mean and variance, it is the one that adds no further structure.
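A quick numerical check (a sketch using the standard closed-form entropies of three familiar densities, all matched to the same mean and variance) shows the Gaussian coming out on top:

```python
import numpy as np

sigma = 1.0  # the three densities below share the same mean and this variance

# Differential entropies in nats, from the standard closed-form expressions.
h_gaussian = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
h_laplace  = 1.0 + np.log(np.sqrt(2.0) * sigma)    # Laplace with scale b = sigma / sqrt(2)
h_uniform  = np.log(np.sqrt(12.0) * sigma)         # Uniform with width w = sigma * sqrt(12)

print(f"Gaussian {h_gaussian:.4f}   Laplace {h_laplace:.4f}   Uniform {h_uniform:.4f}")
# Gaussian 1.4189   Laplace 1.3466   Uniform 1.2425 — the Gaussian has the largest entropy.
```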
Each of these is a calculus-of-variations problem. The unknown is a function; the objective is a number; the search is over an infinite-dimensional space.
The Euler–Lagrange equation
To solve a finite-dimensional minimisation problem we set the gradient to zero. The variational analogue is the Euler–Lagrange equation. For a functional of the form $$\mathcal{F}[q] = \int_a^b L(x, q, q')\, dx,$$ where $L$, the Lagrangian, is some smooth function of the input $x$, the value $q(x)$, and the derivative $q'(x)$, any minimising $q$ must satisfy $$\frac{\partial L}{\partial q} - \frac{d}{dx}\frac{\partial L}{\partial q'} = 0.$$
The derivation, sketched roughly, mirrors the finite-dimensional argument. Perturb $q$ by a small amount $\varepsilon\, \eta(x)$, where $\eta$ vanishes at the endpoints. Compute $\mathcal{F}[q + \varepsilon\, \eta]$, expand to first order in $\varepsilon$, integrate by parts to remove $\eta'$, and demand that the linear coefficient vanish for every admissible $\eta$. The fundamental lemma of the calculus of variations then forces the integrand multiplying $\eta$ to be zero pointwise, and that integrand is precisely the Euler–Lagrange expression.
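The argument can also be checked numerically. The sketch below (assumptions: NumPy, a discretised length functional, and the particular perturbation $\eta(x) = \sin \pi x$, which vanishes at both endpoints) estimates $\tfrac{d}{d\varepsilon}\mathcal{F}[q + \varepsilon\eta]$ at $\varepsilon = 0$; it is essentially zero at the straight line and clearly nonzero at a non-optimal curve:

```python
import numpy as np

def length(y, x):
    """Discretised F[q]: length of the polyline through the points (x_i, y_i)."""
    return np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2))

x = np.linspace(0.0, 1.0, 20_001)
eta = np.sin(np.pi * x)          # admissible perturbation: vanishes at both endpoints
eps = 1e-5

def first_variation(q_vals):
    """Central-difference estimate of d F[q + eps*eta] / d(eps) at eps = 0."""
    return (length(q_vals + eps * eta, x) - length(q_vals - eps * eta, x)) / (2 * eps)

print("at the straight line:", first_variation(x))       # ≈ 0: the first variation vanishes
print("at the parabola     :", first_variation(x ** 2))  # clearly nonzero: not a minimiser
```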
The equation is an ordinary differential equation for $q(x)$. For the shortest-path Lagrangian $L = \sqrt{1 + q'^2}$ it reduces to $q'' = 0$, whose solutions are straight lines. For the brachistochrone it gives the cycloid. The same mechanical recipe works in every classical example: write the Lagrangian, take the two partial derivatives, subtract the total derivative, and set the result to zero. For machine learning, the Euler–Lagrange equation matters less as a tool for calculation than as a reminder that "set the functional derivative to zero" is a well-defined operation with a concrete answer.
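If you want to watch the recipe carried out symbolically, SymPy ships an `euler_equations` helper that applies exactly this formula. A short sketch (an illustration, assuming SymPy is installed) recovers $q'' = 0$ for the shortest-path Lagrangian:

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
q = sp.Function('q')

# Shortest-path Lagrangian L(x, q, q') = sqrt(1 + q'(x)^2).
L = sp.sqrt(1 + q(x).diff(x) ** 2)

# euler_equations forms dL/dq - d/dx (dL/dq') = 0 mechanically.
eq, = euler_equations(L, [q(x)], [x])
print(sp.simplify(eq))
# The left-hand side is -q''(x) / (1 + q'(x)^2)^(3/2), so the equation holds
# exactly when q''(x) = 0: the minimisers are straight lines.
```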
Application: the ELBO
Variational inference is the workhorse application in machine learning. Many probabilistic models contain a latent variable $z$, for example, the unobserved category of a data point, or the compressed code in a variational autoencoder. Given an observation $x$, Bayes' rule tells us that the posterior $p(z \mid x) = p(x, z) / p(x)$ describes everything we know about $z$ after seeing $x$. In useful models the marginal $p(x) = \int p(x, z)\, dz$ is intractable, and so the posterior itself is intractable. We cannot compute it.
Variational inference replaces the intractable posterior with a tractable approximation $q(z)$ chosen from a friendly family of distributions, and asks: which $q$ is closest, in the Kullback–Leibler sense, to the true posterior? A short calculation, starting from the definition of $\mathrm{KL}(q\,\Vert\, p(\,\cdot \mid x))$, yields the identity $$\log p(x) = \mathrm{ELBO}[q] + \mathrm{KL}\big(q(z)\,\Vert\, p(z \mid x)\big),$$ where the Evidence Lower Bound is the functional $$\mathrm{ELBO}[q] = \mathbb{E}_{q}[\log p(x, z)] + \mathcal{H}[q].$$
Since the KL term is non-negative and the left-hand side does not depend on $q$, maximising the ELBO over $q$ is the same as minimising the KL divergence to the true posterior. We have replaced an intractable Bayesian inversion with an optimisation problem over a function.
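The decomposition is easy to verify on a model small enough that nothing is intractable. The sketch below (a toy discrete model with three latent values; all the numbers are made up for illustration) checks that $\log p(x) = \mathrm{ELBO}[q] + \mathrm{KL}$ holds for an arbitrary choice of $q$:

```python
import numpy as np

# Toy discrete model: latent z ∈ {0, 1, 2}, one fixed observation x.
p_z         = np.array([0.5, 0.3, 0.2])    # prior p(z)
p_x_given_z = np.array([0.1, 0.4, 0.7])    # likelihood p(x | z) of the observed x

p_xz        = p_z * p_x_given_z            # joint p(x, z)
p_x         = p_xz.sum()                   # evidence p(x), tractable in this tiny model
p_z_given_x = p_xz / p_x                   # exact posterior

q = np.array([0.2, 0.5, 0.3])              # any valid approximating distribution over z

elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))   # E_q[log p(x, z)] + H[q]
kl   = np.sum(q * np.log(q / p_z_given_x))                # KL(q || p(z | x))

print(np.log(p_x), elbo + kl)   # the two numbers agree, whatever q we picked
```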
The optimisation is, in principle, variational: we are choosing the function $q$ that maximises a functional. In practice we shrink the search space by restricting $q$ to a parametric family $q(z; \phi)$, a Gaussian, say, with mean and covariance produced by a neural network, and optimise over the parameter vector $\phi$ using ordinary gradient descent. The variational nature of the problem nonetheless shapes the derivation: the ELBO appears, and decomposes into a reconstruction term plus an entropy term, precisely because we set up the problem as minimisation over functions before collapsing to parameters.
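As a sketch of the collapse to parameters, the example below uses a hypothetical conjugate toy model, chosen purely so that the exact posterior is known in closed form: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x \mid z) = \mathcal{N}(z, 1)$, observation $x = 2$, whose posterior is $\mathcal{N}(1, 0.5)$. Gradient ascent on the closed-form ELBO of the family $q(z; \phi) = \mathcal{N}(m, s^2)$ recovers exactly that posterior:

```python
import numpy as np

x = 2.0                      # the single observation
m, log_s = 0.0, 0.0          # variational parameters phi = (mean, log std) of q = N(m, s^2)
lr = 0.05

for _ in range(2000):
    s = np.exp(log_s)
    # Closed-form ELBO gradients for this model:
    # ELBO(m, s) = E_q[log p(z)] + E_q[log p(x|z)] + H[q]
    #            = -0.5 (m^2 + s^2) - 0.5 ((x - m)^2 + s^2) + log s + const.
    grad_m     = (x - m) - m        # d ELBO / d m
    grad_log_s = 1.0 - 2.0 * s**2   # d ELBO / d log s (chain rule through s = exp(log s))
    m     += lr * grad_m
    log_s += lr * grad_log_s

print(m, np.exp(log_s) ** 2)   # ≈ 1.0 and 0.5, matching the exact posterior N(1, 0.5)
```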
Application: maximum entropy
A second pattern that recurs throughout AI is the maximum-entropy derivation. The setup is always the same: we seek a probability distribution $q$ that satisfies a list of moment constraints (the integrals of certain features against $q$ must equal given target values), and among all such distributions we want the least informative one, that is, the one with the largest entropy $\mathcal{H}[q]$. The Lagrangian combines the entropy with each constraint, weighted by a multiplier; setting the functional derivative $\delta \mathcal{L}/\delta q$ to zero and solving gives an exponential family of the form $q(x) \propto \exp\big(\sum_k \lambda_k\, T_k(x)\big)$. The multipliers $\lambda_k$ are then fixed by the constraints.
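The recipe can be run end to end on a finite support, where the functional derivative becomes an ordinary gradient. The sketch below (an illustration assuming SciPy's `brentq` root-finder; the support $\{0,\dots,5\}$ and target mean are arbitrary choices) fixes a single multiplier $\lambda$ so that one mean constraint holds, and the resulting $q$ has exactly the predicted exponential-family shape:

```python
import numpy as np
from scipy.optimize import brentq

xs = np.arange(6)        # support {0, ..., 5}
target_mean = 1.5        # the single moment constraint: E[X] = 1.5

def mean_given_lambda(lam):
    """Mean of q(x) ∝ exp(lam * x), the exponential-family form the Lagrangian predicts."""
    w = np.exp(lam * xs)
    q = w / w.sum()
    return q @ xs

# Fix the multiplier so the constraint is satisfied (the mean is monotone in lambda).
lam = brentq(lambda l: mean_given_lambda(l) - target_mean, -10.0, 10.0)
q = np.exp(lam * xs)
q /= q.sum()

print("lambda ≈", round(lam, 4))
print("q =", np.round(q, 4), "  mean =", round(q @ xs, 4))
# Among all distributions on {0,...,5} with mean 1.5, this softmax-shaped q maximises entropy.
```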
This is the deepest reason exponential families are everywhere in machine learning. The Gaussian comes from a mean-and-variance constraint. The Bernoulli and categorical come from probability-summing-to-one constraints with feature indicators. The softmax classifier is the maximum-entropy distribution over class labels given a linear scoring function, which is why cross-entropy is its natural loss. The Boltzmann distribution in energy-based models is the maximum-entropy distribution at fixed expected energy. The argument is variational, the answer is exponential, and the same machinery powers reasoning about partition functions, free energy, and equilibrium throughout the literature.
What you should take away
- The calculus of variations is ordinary calculus with one substitution: the unknown is a function, not a vector, and the objective is a functional, not a function.
- The Euler–Lagrange equation $\partial L/\partial q - (d/dx)(\partial L/\partial q') = 0$ is the variational analogue of "set the gradient to zero" and gives an ordinary differential equation for the optimal $q$.
- Variational inference recasts intractable Bayesian posteriors as the optimisation of the ELBO over a function $q$; in practice we restrict to a parametric family and use ordinary gradient descent.
- Maximum-entropy arguments derive exponential-family distributions, including the Gaussian, the categorical and the softmax, as the unique least-informative choices given moment constraints.
- You do not need to solve variational problems by hand to read the rest of this book, but recognising the shape of the argument (a functional of an unknown function, with the functional derivative set to zero) will make later chapters on probabilistic and energy-based models read as natural extensions of the calculus you already know.