An energy-based model (EBM) assigns each input $x$ a scalar energy $E_\theta(x) \in \mathbb{R}$ via a parametric function (typically a neural network). The energy is converted to a probability via the Boltzmann distribution
$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-E_\theta(x)) \, dx.$$
Low energy means high probability. The partition function $Z(\theta)$ is the normalising constant that makes $p_\theta$ a valid probability distribution; it is typically intractable to compute in high dimensions, and dealing with it is the central computational difficulty of EBMs.
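A minimal sketch in PyTorch of this setup (the MLP architecture, input dimension and hidden width are illustrative assumptions, not part of any particular paper): the network maps an input to a scalar energy, and $-E_\theta(x)$ then serves as an unnormalised log-density.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps an input x to a scalar energy E_theta(x)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)          # shape: (batch,)

energy = EnergyNet()
x = torch.randn(8, 2)                           # toy batch
log_p_unnorm = -energy(x)                       # log p_theta(x) up to the unknown log Z(theta)
```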
Historical roots
EBMs trace back to statistical mechanics: physical systems at thermal equilibrium occupy states with probability proportional to $\exp(-E/kT)$. Hopfield networks (1982) and Boltzmann machines (Hinton & Sejnowski, 1985) imported this view into AI, giving energy interpretations to associative memory and stochastic neural networks. Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (Hinton, 2006) launched the deep-learning era using EBM machinery.
LeCun and colleagues popularised the modern framing in "A Tutorial on Energy-Based Learning" (LeCun et al., 2006), arguing that EBMs unify generative and discriminative learning under a single mathematical lens.
Training methods that sidestep $Z$
Several techniques avoid computing the intractable partition function:
- Contrastive Divergence (Hinton, 2002): truncated Markov-chain Monte Carlo for approximate maximum likelihood. The $k$-step CD update pushes energy down on data and up on samples obtained by $k$ steps of MCMC initialised at the data (see the sketch after this list).
- Score Matching (Hyvärinen, 2005): match the score $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ to the data score, sidestepping $Z$ because $\nabla_x \log Z = 0$. Sliced and denoising score matching make this scalable.
- Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010): logistic regression that distinguishes data from a known noise distribution, treating $\log Z$ as a learnable scalar.
- Adversarial Training: pit the EBM against a generator network; GANs can be re-derived as implicit EBMs.
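A minimal sketch of the CD-style update referenced above, assuming the `EnergyNet`/`energy` objects from the first sketch; the step size, noise scale and $k$ are illustrative. Negatives come from a few Langevin steps (the update given in the Sampling section) initialised at the data, and the loss pushes energy down on data and up on the negatives.

```python
import torch

def langevin_negatives(energy, x, k=10, eta=1e-2):
    """k Langevin steps initialised at the data batch (gradients w.r.t. x only)."""
    x = x.clone().detach()
    for _ in range(k):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * eta * grad + eta ** 0.5 * torch.randn_like(x)).detach()
    return x

def cd_loss(energy, x_data, k=10):
    """CD-style objective: mean energy on data minus mean energy on negatives."""
    x_neg = langevin_negatives(energy, x_data, k=k)
    return energy(x_data).mean() - energy(x_neg).mean()

# one toy update step (the random batch stands in for real data)
opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
loss = cd_loss(energy, torch.randn(8, 2))
opt.zero_grad(); loss.backward(); opt.step()
```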
Sampling
Sampling from an EBM typically uses Markov-chain Monte Carlo. The standard method is Langevin dynamics:
$$x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\theta(x_t) + \sqrt{\eta} \, z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
which combines a gradient-descent step on energy with Gaussian noise. Without the noise, the chain would collapse to local energy minima; with appropriate noise, it samples from $p_\theta$ in the limit. Hamiltonian Monte Carlo and annealed importance sampling offer better mixing at higher cost.
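A minimal sketch of this update, again assuming the `energy` network from the first example; the step size and chain length are illustrative and in practice need per-model tuning.

```python
import torch

def langevin_sample(energy, n=64, dim=2, steps=200, eta=1e-2):
    """Langevin dynamics from a Gaussian initialisation:
    x_{t+1} = x_t - (eta/2) * grad_x E(x_t) + sqrt(eta) * z_t
    """
    x = torch.randn(n, dim)
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * eta * grad + eta ** 0.5 * torch.randn_like(x)).detach()
    return x

samples = langevin_sample(energy)   # approximate samples from p_theta
```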
Examples
- Hopfield networks and modern Hopfield networks (Ramsauer et al., 2020).
- Boltzmann machines and Restricted Boltzmann Machines.
- Deep EBMs (Du & Mordatch, 2019): convolutional or transformer networks trained via Langevin sampling.
- Score-based generative models (Song & Ermon, 2019; Song et al., 2021).
- Diffusion models: trained at multiple noise levels, they are continuous EBMs in disguise.
Connection to diffusion
The 2020-2024 wave of generative-model success made the EBM-diffusion connection explicit. Diffusion models can be reformulated as score-based models that match $\nabla_x \log p_\sigma(x)$ at multiple noise levels $\sigma$. This score-based view is a continuous EBM perspective and brings VAEs, normalising flows, GANs and diffusion under one energy/score umbrella. Sampling from a diffusion model is essentially Langevin-style SDE integration over a sequence of decreasing noise levels.
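A minimal sketch of denoising score matching at several noise levels, the kind of objective behind score-based and diffusion models. Here `score_net(x, sigma)` is a hypothetical noise-conditional network (any module taking $(x, \sigma)$ and returning a tensor shaped like $x$), and the geometric noise schedule is an illustrative assumption; sampling would then run the Langevin update from the Sampling section across the $\sigma$ values in decreasing order.

```python
import torch

def dsm_loss(score_net, x, sigmas):
    """Denoising score matching averaged over a set of noise levels.

    For x_noisy = x + sigma * eps, the score of the perturbation kernel is
    -(x_noisy - x) / sigma**2 = -eps / sigma, which the network learns to match.
    """
    losses = []
    for sigma in sigmas:
        eps = torch.randn_like(x)
        x_noisy = x + sigma * eps
        target = -eps / sigma
        # score_net is a hypothetical noise-conditional network (not defined here)
        pred = score_net(x_noisy, sigma)
        # weight by sigma**2 so every noise level contributes comparably
        losses.append((sigma ** 2) * ((pred - target) ** 2).sum(dim=-1).mean())
    return sum(losses) / len(sigmas)

sigmas = torch.exp(torch.linspace(0, -4, 10))   # geometric schedule, 1.0 down to ~0.018
```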
Why EBMs matter
EBMs are conceptually attractive because they place no constraints on the form of $E_\theta$: any neural network can serve as an energy function, whereas autoregressive models, normalising flows and VAEs each impose architectural restrictions to keep the likelihood tractable. The cost is an intractable normalising constant, but score-based methods have largely solved this in practice. EBMs also provide a clean framework for out-of-distribution detection: high-energy inputs are anomalies.
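A small sketch of the OOD idea, assuming the trained `energy` network from the first example: flag test points whose energy exceeds a quantile of in-distribution (training) energies. The quantile threshold is one common heuristic, not a fixed rule.

```python
import torch

@torch.no_grad()
def flag_anomalies(energy, x_train, x_test, quantile=0.99):
    """Flag test points whose energy exceeds the q-th quantile of training energies."""
    threshold = torch.quantile(energy(x_train), quantile)
    return energy(x_test) > threshold   # boolean mask: True = likely out-of-distribution
```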
Related terms: Boltzmann Machine, Hopfield Network, Score Matching, Diffusion Model, Generative Adversarial Network
Discussed in:
- Chapter 9: Neural Networks, Generative Models