An energy-based model (EBM) assigns each input $x$ a scalar energy $E_\theta(x) \in \mathbb{R}$ via a parametric function (typically a neural network). The energy is converted to a probability via the Boltzmann distribution
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-E_\theta(x)) \, dx.$$
Low energy means high probability. The partition function $Z(\theta)$ is the normalising constant that makes $p_\theta$ a valid probability distribution; it is typically intractable to compute in high dimensions, and dealing with it is the central computational difficulty of EBMs.
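A minimal sketch in PyTorch of this setup (the MLP architecture, input dimension and hidden width are illustrative assumptions, not part of any particular paper): the network maps an input to a scalar energy, and $-E_\theta(x)$ then serves as an unnormalised log-density.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps an input x to a scalar energy E_theta(x)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)          # shape: (batch,)

energy = EnergyNet()
x = torch.randn(8, 2)                           # toy batch
log_p_unnorm = -energy(x)                       # log p_theta(x) up to the unknown log Z(theta)
```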
Historical roots
EBMs trace back to statistical mechanics: physical systems at thermal equilibrium occupy states with probability proportional to $\exp(-E/kT)$. Hopfield networks (1982) and Boltzmann machines (Hinton & Sejnowski, 1985) imported this view into AI, giving energy interpretations to associative memory and stochastic neural networks. Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (Hinton, 2006) launched the deep-learning era using EBM machinery.
LeCun and colleagues popularised the modern framing in "A Tutorial on Energy-Based Learning" (LeCun et al., 2006), arguing that EBMs unify generative and discriminative learning under a single mathematical lens.
Training methods that sidestep $Z$
Several techniques avoid computing the intractable partition function:
- Contrastive Divergence (Hinton, 2002): truncated Markov-chain Monte Carlo for approximate maximum likelihood. The $k$-step CD update pushes energy down on data and up on samples obtained by $k$ steps of MCMC initialised at the data (see the sketch after this list).
- Score Matching (Hyvärinen, 2005): match the score $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ to the data score, sidestepping $Z$ because $\nabla_x \log Z = 0$. Sliced and denoising score matching make this scalable.
- Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010): logistic regression that distinguishes data from a known noise distribution, treating $\log Z$ as a learnable scalar.
- Adversarial Training: pit the EBM against a generator network; GANs can be re-derived as implicit EBMs.
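A minimal sketch of the CD-style update referenced above, assuming the `EnergyNet`/`energy` objects from the first sketch; the step size, noise scale and $k$ are illustrative. Negatives come from a few Langevin steps (the update given in the Sampling section) initialised at the data, and the loss pushes energy down on data and up on the negatives.

```python
import torch

def langevin_negatives(energy, x, k=10, eta=1e-2):
    """k Langevin steps initialised at the data batch (gradients w.r.t. x only)."""
    x = x.clone().detach()
    for _ in range(k):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * eta * grad + eta ** 0.5 * torch.randn_like(x)).detach()
    return x

def cd_loss(energy, x_data, k=10):
    """CD-style objective: mean energy on data minus mean energy on negatives."""
    x_neg = langevin_negatives(energy, x_data, k=k)
    return energy(x_data).mean() - energy(x_neg).mean()

# one toy update step (the random batch stands in for real data)
opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
loss = cd_loss(energy, torch.randn(8, 2))
opt.zero_grad(); loss.backward(); opt.step()
```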
Sampling
Sampling from an EBM typically uses Markov-chain Monte Carlo. The standard method is Langevin dynamics:
$$x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\theta(x_t) + \sqrt{\eta} \, z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
which combines a gradient-descent step on energy with Gaussian noise. Without the noise, the chain would collapse to local energy minima; with appropriate noise, it samples from $p_\theta$ in the limit. Hamiltonian Monte Carlo and annealed importance sampling offer better mixing at higher cost.
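A minimal sketch of this update, again assuming the `energy` network from the first example; the step size and chain length are illustrative and in practice need per-model tuning.

```python
import torch

def langevin_sample(energy, n=64, dim=2, steps=200, eta=1e-2):
    """Langevin dynamics from a Gaussian initialisation:
    x_{t+1} = x_t - (eta/2) * grad_x E(x_t) + sqrt(eta) * z_t
    """
    x = torch.randn(n, dim)
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * eta * grad + eta ** 0.5 * torch.randn_like(x)).detach()
    return x

samples = langevin_sample(energy)   # approximate samples from p_theta
```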
Examples
- Hopfield networks and modern Hopfield networks (Ramsauer et al., 2020).
- Boltzmann machines and Restricted Boltzmann Machines.
- Deep EBMs (Du & Mordatch, 2019): convolutional or transformer networks trained via Langevin sampling.
- Score-based generative models (Song & Ermon, 2019; Song et al., 2021).
- Diffusion models: trained at multiple noise levels, they are continuous EBMs in disguise.
Connection to diffusion
The 2020-2024 wave of generative-model success made the EBM-diffusion connection explicit. Diffusion models can be reformulated as score-based models that match $\nabla_x \log p_\sigma(x)$ at multiple noise levels $\sigma$. This score-based view is a continuous EBM perspective and brings VAEs, normalising flows, GANs and diffusion under one energy/score umbrella. Sampling from a diffusion model is essentially Langevin-style SDE integration over a sequence of decreasing noise levels.
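A minimal sketch of denoising score matching at several noise levels, the kind of objective behind score-based and diffusion models. Here `score_net(x, sigma)` is a hypothetical noise-conditional network (any module taking $(x, \sigma)$ and returning a tensor shaped like $x$), and the geometric noise schedule is an illustrative assumption; sampling would then run the Langevin update from the Sampling section across the $\sigma$ values in decreasing order.

```python
import torch

def dsm_loss(score_net, x, sigmas):
    """Denoising score matching averaged over a set of noise levels.

    For x_noisy = x + sigma * eps, the score of the perturbation kernel is
    -(x_noisy - x) / sigma**2 = -eps / sigma, which the network learns to match.
    """
    losses = []
    for sigma in sigmas:
        eps = torch.randn_like(x)
        x_noisy = x + sigma * eps
        target = -eps / sigma
        # score_net is a hypothetical noise-conditional network (not defined here)
        pred = score_net(x_noisy, sigma)
        # weight by sigma**2 so every noise level contributes comparably
        losses.append((sigma ** 2) * ((pred - target) ** 2).sum(dim=-1).mean())
    return sum(losses) / len(sigmas)

sigmas = torch.exp(torch.linspace(0, -4, 10))   # geometric schedule, 1.0 down to ~0.018
```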
Why EBMs matter
EBMs are conceptually attractive because they place no constraints on the form of $E_\theta$: any neural network can serve as an energy function, whereas autoregressive models, normalising flows and VAEs each impose architectural restrictions to keep the likelihood tractable. The cost is an intractable normalising constant, but score-based methods have largely solved this in practice. EBMs also provide a clean framework for out-of-distribution detection: high-energy inputs are anomalies.
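A small sketch of the OOD idea, assuming the trained `energy` network from the first example: flag test points whose energy exceeds a quantile of in-distribution (training) energies. The quantile threshold is one common heuristic, not a fixed rule.

```python
import torch

@torch.no_grad()
def flag_anomalies(energy, x_train, x_test, quantile=0.99):
    """Flag test points whose energy exceeds the q-th quantile of training energies."""
    threshold = torch.quantile(energy(x_train), quantile)
    return energy(x_test) > threshold   # boolean mask: True = likely out-of-distribution
```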
Related terms: Boltzmann Machine, Hopfield Network, Score Matching, Diffusion Model, Generative Adversarial Network
Discussed in:
- Chapter 9: Neural Networks, Generative Models