14.8 Energy-based models
An energy-based model, or EBM, is the generative architecture that takes the most direct route to defining a probability distribution and pays the steepest computational price for it. The recipe is one line of physics. Pick any scalar function $E_\theta(\mathbf{x})$, a neural network whose output is a single number, and declare that the probability of $\mathbf{x}$ is proportional to $\exp(-E_\theta(\mathbf{x}))$. Configurations with low energy are likely; configurations with high energy are unlikely; the trough of the energy surface marks the regions where the model believes the data live. The proportionality, however, hides a beast. To turn $\exp(-E_\theta(\mathbf{x}))$ into a normalised density we must divide by the partition function $Z_\theta = \int \exp(-E_\theta(\mathbf{x}))\, d\mathbf{x}$, an integral over the whole input space. For images, $\mathbf{x}$ lives in $\mathbb{R}^{3 \times 1024 \times 1024}$ and $E_\theta$ is a deep convnet. The integral is, for any practical purpose, uncomputable.
This is the central tension. Energy-based models are conceptually the cleanest of the families in this chapter (no encoder, no decoder, no invertibility constraint, no adversary, no Markov chain over latent variables). Just one network that scores configurations. But every operation we want to perform (evaluating a likelihood, drawing a sample, training by maximum likelihood) requires either $Z_\theta$ or samples from $p_\theta$, and both are hard. The story of EBMs is the story of clever ways to avoid computing $Z_\theta$ while still extracting a useful gradient signal from data.
Where do they sit in the landscape of this chapter? Section 14.7 closed with normalising flows, which solve the partition-function problem by construction: each layer is invertible and has a tractable Jacobian, so $Z = 1$ exactly. EBMs go the opposite way, accepting that $Z$ is intractable in exchange for unrestricted architectures. Section 14.9 will introduce diffusion models, which, as we shall see at the end of this section, are essentially a hierarchical EBM trained by score matching at many noise levels, with a sampler that looks suspiciously like Langevin dynamics. The score-matching idea introduced here is therefore not a historical curiosity. It is the technical backbone of the most important generative architecture of the 2020s.
The model
The defining equation of an energy-based model is
$$p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \qquad Z_\theta = \int \exp(-E_\theta(\mathbf{x}))\, d\mathbf{x}.$$
The minus sign is a convention borrowed from statistical mechanics, where $E$ is the physical energy of a configuration and $\exp(-E/kT)$ is its Boltzmann weight. Setting $kT = 1$ gives the form above. There is nothing thermodynamic about a generative image model; the analogy is purely formal. But the vocabulary has stuck, and "energy landscape" is a useful mental picture: imagine $E_\theta$ as a hilly terrain over the input space, with valleys at training images and ridges separating them.
The numerator $\exp(-E_\theta(\mathbf{x}))$ is straightforward: run a forward pass of the network, negate the scalar output, exponentiate. The denominator is the problem. $Z_\theta$ is an integral over a space whose dimension is the number of pixels, or the number of tokens, or whatever the model takes as input. For a $32 \times 32$ greyscale image the integration domain has 1024 dimensions; for a $1024 \times 1024$ colour image, over three million. No quadrature rule survives. Naive Monte Carlo estimation draws samples uniformly from the input space, but in high dimensions almost all of the mass of $\exp(-E_\theta)$ is concentrated in a region whose volume is a vanishing fraction of the whole, a region uniform samples essentially never hit, so the variance of the estimator is astronomical.
The intractability matters because it propagates. Without $Z_\theta$ we cannot evaluate $p_\theta(\mathbf{x})$, only the unnormalised log-density $-E_\theta(\mathbf{x})$, which lets us compare two configurations but not assign absolute probabilities. We cannot compute the log-likelihood of held-out data, so we cannot report a clean number for "this model achieves $X$ nats per dimension on CIFAR-10" the way an autoregressive model can. We cannot sample directly: there is no sequence of conditional distributions to walk along. Everything has to be done by Markov chain Monte Carlo, which mixes slowly when $E_\theta$ has many local minima, exactly the regime we trained it into. EBMs are, in this sense, a deliberate trade: maximum architectural freedom in exchange for losing every closed-form quantity that other generative families take for granted.
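To make the objects concrete, here is a minimal sketch of an EBM, assuming PyTorch; the class name `EnergyNet`, the MLP architecture, and all sizes are illustrative choices rather than anything the text prescribes.

```python
# A minimal sketch of the definition above (illustrative architecture and sizes).
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps an input x to a single scalar energy E_theta(x)."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)   # one scalar energy per input

energy = EnergyNet()
x = torch.randn(8, 1024)                 # a batch of flattened 32x32 inputs
log_p_unnorm = -energy(x)                # -E_theta(x): comparable across inputs,
                                         # but log Z_theta is unknown, so not a log-probability
```

Nothing constrains the architecture: any network with a scalar output will do, which is exactly the freedom the family trades the partition function for.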
Training methods
Three strategies sidestep $Z_\theta$ during training: two classical ones, and a third that has now absorbed most of the field.
The first is contrastive divergence (Hinton, 2002), which works directly with the maximum-likelihood gradient. The gradient of the log-likelihood of one data point is
$$\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) + \mathbb{E}_{\mathbf{x}' \sim p_\theta}[\nabla_\theta E_\theta(\mathbf{x}')].$$
The first term is a single forward-and-backward pass on the data point: push the energy of $\mathbf{x}$ down. The second term is the expectation of the energy gradient under the model itself: push the energy of typical model samples up. Maximum likelihood is therefore a tug-of-war: pull data into the valleys, push everything else out. The hard part is the expectation, which requires samples from $p_\theta$. CD-$k$ approximates these by running $k$ steps of MCMC starting from the data (sketched in code below); even $k = 1$ works surprisingly well in practice. Persistent contrastive divergence (Tieleman, 2008) maintains a long-running chain across mini-batches. CD trained the deep belief networks of the mid-2000s, the work that put deep learning on the map before convolutional networks took over.
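A sketch of one CD-$k$ parameter update in the spirit of the description above, assuming PyTorch and the `EnergyNet` sketch from earlier; the chain length, step size, and optimiser are illustrative choices, and the inner loop is the same Langevin update discussed under Sampling below.

```python
# Sketch of a CD-k update: pull data energy down, push negative-sample energy up.
import torch

def cd_k_step(energy, x_data, optimizer, k=1, step_size=1e-2):
    # Negative samples: k short Langevin steps started from the data (plain CD-k).
    # Persistent CD would instead start from a buffer of previous chain states.
    x_neg = x_data.clone().detach()
    for _ in range(k):
        x_neg.requires_grad_(True)
        grad = torch.autograd.grad(energy(x_neg).sum(), x_neg)[0]
        x_neg = (x_neg - step_size * grad
                 + (2 * step_size) ** 0.5 * torch.randn_like(x_neg)).detach()

    # Minimising this loss follows the maximum-likelihood gradient estimate.
    loss = energy(x_data).mean() - energy(x_neg).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (illustrative):
# opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
# loss = cd_k_step(energy, x_batch, opt, k=1)
```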
The second is score matching (Hyvärinen, 2005), which avoids the partition function by changing what we ask the model to learn. Define the score as $s_\theta(\mathbf{x}) = \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) = -\nabla_\mathbf{x} E_\theta(\mathbf{x})$. The crucial point is that the score is the gradient with respect to the input, not the parameters, and because $\log p_\theta(\mathbf{x}) = -E_\theta(\mathbf{x}) - \log Z_\theta$, the constant $\log Z_\theta$ disappears upon differentiating with respect to $\mathbf{x}$. Match the model score to the data score and there is no partition function to estimate:
$$J_{\text{SM}}(\theta) = \tfrac{1}{2}\mathbb{E}_{p_{\text{data}}}\big[\|s_\theta(\mathbf{x}) - \nabla_\mathbf{x} \log p_{\text{data}}(\mathbf{x})\|^2\big].$$
We do not have $\nabla_\mathbf{x} \log p_{\text{data}}$ (that would be the answer we want), but Hyvärinen's integration-by-parts identity rewrites the objective in terms of the model alone:
$$J_{\text{SM}}(\theta) = \mathbb{E}_{p_{\text{data}}}\big[\tfrac{1}{2}\|s_\theta(\mathbf{x})\|^2 + \mathrm{tr}(\nabla_\mathbf{x} s_\theta(\mathbf{x}))\big] + \text{const}.$$
The trace of the input Jacobian is expensive in high dimensions, which limited the practical reach of pure score matching for years.
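For intuition, here is what the rewritten objective looks like in code, assuming PyTorch and the `EnergyNet` sketch above; the explicit loop over input dimensions that forms the Jacobian trace is precisely the cost just mentioned.

```python
# Sketch of exact score matching via the Hyvarinen identity (low dimensions only).
import torch

def model_score(energy, x):
    # s_theta(x) = -grad_x E_theta(x), keeping the graph for second derivatives
    x = x.requires_grad_(True)
    return -torch.autograd.grad(energy(x).sum(), x, create_graph=True)[0]

def score_matching_loss(energy, x):
    x = x.clone().detach()
    s = model_score(energy, x)                  # shape (batch, dim)
    # tr(ds/dx): one extra backward pass per input dimension -- the expensive part.
    trace = 0.0
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return (0.5 * (s ** 2).sum(dim=1) + trace).mean()
```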
The third strategy, denoising score matching (Vincent, 2011), is the breakthrough. Instead of matching the score on clean data, perturb $\mathbf{x}$ with Gaussian noise of scale $\sigma$ to get $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\boldsymbol{\epsilon}$ and ask the network to match the score of the noised density. The regression target can be taken to be the conditional score, which has a closed form, $\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = (\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2$: it points from $\tilde{\mathbf{x}}$ back towards $\mathbf{x}$, the trace term vanishes, and the objective collapses to a plain regression on noise. This is the same equation diffusion models train against, as the next section makes explicit.
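The loss is a few lines, assuming PyTorch and a hypothetical network `score_net` that outputs a score vector directly (for an EBM it would be $-\nabla_\mathbf{x} E_\theta$); the noise scale is an illustrative choice.

```python
# Sketch of the denoising score matching loss at a single noise level sigma.
import torch

def dsm_loss(score_net, x, sigma=0.1):
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    target = (x - x_tilde) / sigma ** 2      # closed-form score of q_sigma(x_tilde | x)
    pred = score_net(x_tilde)
    return 0.5 * ((pred - target) ** 2).sum(dim=1).mean()
```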
Sampling
Once an EBM is trained, drawing a sample means descending the energy landscape while injecting enough randomness to explore. Langevin dynamics is the canonical recipe:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, \nabla_\mathbf{x} E_\theta(\mathbf{x}_t) + \sqrt{2\eta}\, \boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}(0, \mathbf{I}).$$
The deterministic drift $-\eta \nabla_\mathbf{x} E_\theta$ rolls $\mathbf{x}$ downhill on the energy surface; the stochastic kick $\sqrt{2\eta}\, \boldsymbol{\epsilon}_t$ keeps it from getting stuck. The relative scale of the two terms is fixed by the discretisation of the underlying stochastic differential equation: $\eta$ for the drift and $\sqrt{2\eta}$ for the noise. Under that scaling, the chain is the discretised Langevin SDE, whose stationary distribution is exactly $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$. Run it long enough and the samples are distributed according to the model.
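A sketch of the update loop, assuming PyTorch and the `EnergyNet` sketch from earlier; the step size and chain length are illustrative and in practice need tuning.

```python
# Sketch of (unadjusted) Langevin dynamics on a trained energy network.
import torch

def langevin_sample(energy, x_init, n_steps=1000, eta=1e-2):
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]   # drift: -eta * grad below
        x = (x - eta * grad + (2 * eta) ** 0.5 * torch.randn_like(x)).detach()
    return x

# Usage (illustrative): start the chain from noise.
# samples = langevin_sample(energy, torch.randn(16, 1024))
```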
In practice the chain mixes badly. Modern EBMs have multimodal energy landscapes: one valley per data mode or dataset cluster, sometimes thousands of them, and Langevin moves locally. Crossing a ridge between two valleys takes time exponential in the ridge height. A chain initialised in one mode rarely visits another within a reasonable budget. Several remedies exist. Replica-exchange (parallel tempering) runs multiple chains at different effective temperatures and swaps configurations between them, letting hotter chains tunnel through ridges and cooler chains exploit valleys. Annealed Langevin, introduced for score-based models by Song and Ermon (2019), starts at high noise where the landscape is smooth and gradually reduces the noise, exactly the sampler diffusion models will use. Metropolis-adjusted Langevin (MALA) corrects the discretisation bias by accepting or rejecting each proposed step.
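As one example of the annealing remedy, here is a sketch in the spirit of Song and Ermon (2019), assuming PyTorch and a hypothetical noise-conditional score network `score_net(x, sigma)`; the geometric noise schedule and the step-size rule are illustrative assumptions rather than prescriptions.

```python
# Sketch of annealed Langevin sampling over a decreasing sequence of noise levels.
import torch

@torch.no_grad()
def annealed_langevin(score_net, x_init, sigmas, steps_per_level=100, eps=2e-5):
    x = x_init.clone()
    for sigma in sigmas:                           # coarse (large sigma) -> fine (small sigma)
        step = eps * (sigma / sigmas[-1]) ** 2     # shrink the step with the noise level
        for _ in range(steps_per_level):
            x = (x + step * score_net(x, sigma)
                 + (2 * step) ** 0.5 * torch.randn_like(x))
    return x

# Usage (illustrative): a geometric schedule from sigma = 1.0 down to 0.01.
# samples = annealed_langevin(score_net, torch.randn(16, 1024), torch.logspace(0, -2, 10))
```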
A practical note: the same gradient $\nabla_\mathbf{x} E_\theta$ used for sampling is also used inside the training loop. Each minibatch step typically interleaves a short MCMC chain to refresh negative samples. EBM training is therefore much slower per step than VAE or GAN training, and sensitive to a long list of nuisance hyperparameters: chain length, step size, replay buffer policy, regularisation on $E$ to prevent it from running off to $-\infty$ on data points.
Connection to diffusion
The connection that recasts the whole field is this: diffusion models are energy-based models, trained by denoising score matching at many noise scales, with annealed Langevin dynamics as the sampler. Each level of a diffusion model defines an implicit energy $E_t(\mathbf{x}) = -\log p_t(\mathbf{x})$ for the noise-corrupted distribution at time $t$, and the learned network outputs the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, equivalently, $-\nabla_\mathbf{x} E_t(\mathbf{x})$. Reverse-time sampling is Langevin descent on a sequence of these energies, from coarse and smooth at high noise to sharp and detailed at low noise. The coarse-to-fine schedule sidesteps the mixing problem that crippled single-level EBMs: when the landscape is smoothed by enough noise, all modes are reachable; as the noise anneals away, the chain refines towards a single well.
Song and Ermon (2019) made the equivalence explicit in "Generative Modeling by Estimating Gradients of the Data Distribution" and in the score-based SDE framework that grew from it (§14.13). DDPM and score-based generative models converge to the same algorithm from two starting points, variational lower bounds on one side, score matching on the other, and both views illuminate it.
Where EBMs live in 2026
Pure energy-based models are, today, mostly an object of theoretical and research interest. Diffusion has eaten their image-generation niche, normalising flows handle density estimation where invertibility is bearable, and diffusion-style score networks trained at multiple noise scales have absorbed most of the empirical advantages EBMs once promised. JEM (Joint Energy-based Models, Grathwohl et al., 2020) showed a classifier could double as an EBM, and EBMs remain attractive in inverse-problem and out-of-distribution settings where having a calibrated $E$ surface is genuinely useful. But the score-matching idea itself is everywhere: in DDPM, in score SDEs, in protein design, in robot policy learning. The framework persists by having dissolved into the rest of the field.
What you should take away
- An energy-based model defines $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$ with $E_\theta$ any scalar neural network; the partition function $Z_\theta$ is intractable in high dimensions.
- Maximum likelihood training (contrastive divergence) requires samples from $p_\theta$, obtained by short MCMC chains; this works but is slow and brittle.
- Score matching avoids $Z_\theta$ entirely by matching $\nabla_\mathbf{x} \log p$ instead of $\log p$, since the partition constant vanishes upon differentiation with respect to $\mathbf{x}$.
- Sampling is performed by Langevin dynamics, which mixes poorly when the energy landscape is multimodal, a weakness solved by annealing the noise level.
- Diffusion models are EBMs in disguise: denoising score matching across many noise scales, sampled by annealed Langevin. The conceptual debt is large even though pure EBMs are no longer the workhorse architecture.