Glossary

Energy-Based Model

An energy-based model (EBM) assigns each input $x$ a scalar energy $E_\theta(x) \in \mathbb{R}$ via a parametric function (typically a neural network). The energy is converted to a probability via the Boltzmann distribution

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-E_\theta(x)) \, dx.$$

Low energy means high probability. The partition function $Z(\theta)$ is the normalising constant that makes $p_\theta$ a valid probability distribution; it is typically intractable to compute in high dimensions, and dealing with it is the central computational difficulty of EBMs.
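In one dimension on a discretised grid, the partition function can be approximated by a simple sum, which makes the definition concrete. A minimal sketch, assuming a hypothetical quadratic toy energy and NumPy:

```python
import numpy as np

# Hypothetical toy energy; a real EBM would use a neural network here.
def energy(x):
    return 0.5 * x**2

xs = np.linspace(-5.0, 5.0, 1001)
dx = xs[1] - xs[0]

unnorm = np.exp(-energy(xs))      # exp(-E(x)), unnormalised
Z = unnorm.sum() * dx             # partition function via Riemann sum
p = unnorm / Z                    # valid density on the grid

# Low energy <-> high probability: the density peaks where energy is minimal.
print(xs[np.argmax(p)])           # 0.0, the energy minimum
```

In high dimensions this sum becomes an intractable integral, which is exactly the difficulty described above.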

Historical roots

EBMs trace back to statistical mechanics: physical systems at thermal equilibrium occupy states with probability proportional to $\exp(-E/kT)$. Hopfield networks (1982) and Boltzmann machines (Hinton & Sejnowski, 1985) imported this view into AI, giving energy interpretations to associative memory and stochastic neural networks. Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (Hinton et al., 2006) launched the deep-learning era using EBM machinery.

LeCun and colleagues popularised the modern framing in "A Tutorial on Energy-Based Learning" (LeCun et al., 2006), arguing that EBMs unify generative and discriminative learning under a single mathematical lens.

Training methods that sidestep $Z$

Several techniques avoid computing the intractable partition function:

  • Contrastive Divergence (Hinton, 2002): truncated Markov-chain Monte Carlo for approximate maximum likelihood. The $k$-step CD update pushes energy down on data and up on samples obtained by $k$ steps of MCMC starting from data.
  • Score Matching (Hyvärinen, 2005): match the score $s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ to the data score, sidestepping $Z$ because $\nabla_x \log Z = 0$. Sliced and denoising score matching make this scalable.
  • Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010): logistic regression that distinguishes data from a known noise distribution, treating $\log Z$ as a learnable scalar.
  • Adversarial Training: pit the EBM against a generator network; GANs can be re-derived as implicit EBMs.
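For a Gaussian toy model, score matching even admits a closed-form solution, which makes visible how $Z$ drops out. A sketch under assumed conditions: a one-parameter energy family $E_\theta(x) = \theta x^2/2$ and data drawn from a zero-mean Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)   # data from N(0, sigma^2)

# Energy family E_theta(x) = theta * x^2 / 2 gives score s_theta(x) = -theta * x.
# Hyvarinen's objective J(theta) = E[ 0.5 * s_theta(x)^2 + s_theta'(x) ]
#                               = 0.5 * theta^2 * E[x^2] - theta,
# which never touches Z. Setting dJ/dtheta = 0 recovers the precision:
theta_hat = 1.0 / np.mean(x**2)

print(theta_hat)  # close to the true precision 1/sigma^2 = 0.25
```

In general $\theta$ parametrises a network and $J$ is minimised by gradient descent, but the objective still contains no partition function.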

Sampling

Sampling from an EBM typically uses Markov-chain Monte Carlo. The standard method is Langevin dynamics:

$$x_{t+1} = x_t - \frac{\eta}{2} \nabla_x E_\theta(x_t) + \sqrt{\eta} \, z_t, \qquad z_t \sim \mathcal{N}(0, I),$$

which combines a gradient-descent step on energy with Gaussian noise. Without the noise, the chain would collapse to local energy minima; with appropriate noise, it samples from $p_\theta$ in the limit. Hamiltonian Monte Carlo and annealed importance sampling offer better mixing at higher cost.
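The update above can be sketched directly. A minimal NumPy implementation for the hypothetical energy $E(x) = x^2/2$, whose target distribution is the standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(x):
    return x                       # gradient of the toy energy E(x) = x^2 / 2

eta = 0.01                         # step size; too large biases the chain
x = 5.0 * rng.normal(size=5000)    # 5000 parallel chains, deliberately far off

for _ in range(2000):
    z = rng.normal(size=x.shape)   # fresh Gaussian noise each step
    x = x - 0.5 * eta * grad_energy(x) + np.sqrt(eta) * z

# The chains should now approximate samples from N(0, 1).
print(round(float(x.mean()), 2), round(float(x.std()), 2))
```

Dropping the noise term turns this into plain gradient descent, which collapses every chain to $x = 0$ instead of sampling, illustrating the remark above.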

Examples

  • Hopfield networks and modern Hopfield networks (Ramsauer et al., 2020).
  • Boltzmann machines and Restricted Boltzmann Machines.
  • Deep EBMs (Du & Mordatch, 2019): convolutional or transformer networks trained via Langevin sampling.
  • Score-based generative models (Song & Ermon, 2019; Song et al., 2021).
  • Diffusion models: trained at multiple noise levels, they are continuous EBMs in disguise.

Connection to diffusion

The 2020-2024 wave of generative-model successes made the EBM-diffusion connection explicit. Diffusion models can be reformulated as score-based models that match $\nabla_x \log p_\sigma(x)$ at multiple noise levels $\sigma$. This score-matching view is a continuous EBM perspective, and it places VAEs, normalising flows, GANs and diffusion under one energy/score umbrella. Sampling from a diffusion model amounts to Langevin-style integration of an SDE over a sequence of decreasing noise levels.
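The decreasing-noise-levels idea can be illustrated with annealed Langevin dynamics on a toy bimodal target, where the smoothed score has a closed form. Everything here (the two-mode mixture, the noise schedule, the step-size rule) is an illustrative assumption; in a real diffusion model the smoothed score would come from a network trained at each noise level.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: 0.5*N(-4,1) + 0.5*N(+4,1). Convolving with N(0, sigma^2) keeps
# a closed-form score for the smoothed density p_sigma.
def smoothed_score(x, sigma):
    s2 = 1.0 + sigma**2
    w = 1.0 / (1.0 + np.exp(-8.0 * x / s2))   # responsibility of the +4 mode
    mu = -4.0 + 8.0 * w
    return (mu - x) / s2

sigmas = [3.0, 1.0, 0.3, 0.1]                 # decreasing noise schedule
x = 3.0 * rng.normal(size=4000)

for sigma in sigmas:
    eta = 0.05 * sigma**2                     # step size scaled to noise level
    for _ in range(300):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * eta * smoothed_score(x, sigma) + np.sqrt(eta) * z

# Samples should now cover both modes at roughly +/- 4.
print(round(float(np.mean(x > 0)), 2))
```

Starting at high noise lets the chains move freely between modes; the later, low-noise stages sharpen the samples without losing mode coverage.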

Why EBMs matter

EBMs are conceptually attractive because they place no constraints on the form of $E_\theta$: any neural network can be an energy function, whereas autoregressive models, normalising flows and VAEs each impose architectural restrictions to keep the likelihood tractable. The cost is an intractable normalising constant, but score-based methods have largely solved this in practice. EBMs also provide a clean framework for out-of-distribution detection: high-energy inputs are anomalies.
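The out-of-distribution idea can be sketched with a hypothetical energy, here the Mahalanobis distance under a Gaussian fit to training data, standing in for a learned $E_\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=(5000, 2))    # in-distribution data

# Assumed toy energy: squared Mahalanobis distance under a Gaussian fit.
mu, cov = train.mean(axis=0), np.cov(train, rowvar=False)
prec = np.linalg.inv(cov)

def energy(x):
    d = x - mu
    return 0.5 * np.einsum('...i,ij,...j->...', d, prec, d)

threshold = np.quantile(energy(train), 0.99)    # flag the top 1% as anomalous
ood = np.array([[6.0, 6.0]])                    # far from the training data
print(energy(ood) > threshold)                  # high energy -> flagged
```

The same recipe applies with a neural energy function: score new inputs by $E_\theta$ and flag those above a quantile of the training energies.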

Related terms: Boltzmann Machine, Hopfield Network, Score Matching, Diffusion Model, Generative Adversarial Network

Discussed in: Textbook of Usability · Textbook of Digital Health
