Glossary

Score Matching

Score matching trains a model to estimate the score function $\nabla_x \log p(x)$, the gradient of the log-density, rather than the density itself. The score function is invariant to multiplicative constants in $p$, so score matching avoids the need to compute the (often intractable) normalising constant.

The score-matching objective is to minimise the expected squared difference between model and true score:

$$J(\theta) = \frac{1}{2} \mathbb{E}_{x \sim p_\mathrm{data}}\!\left[\|s_\theta(x) - \nabla_x \log p_\mathrm{data}(x)\|^2\right]$$

The true score $\nabla \log p_\mathrm{data}$ is unknown, but Hyvärinen (2005) showed by integration by parts that this objective is equivalent (up to a constant) to

$$J(\theta) = \mathbb{E}_{x \sim p_\mathrm{data}}\!\left[\frac{1}{2} \|s_\theta(x)\|^2 + \mathrm{tr}(\nabla_x s_\theta(x))\right]$$

which involves only the model and samples from the data; no explicit form of $p_\mathrm{data}$ is required.
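As a concrete illustration, here is a minimal PyTorch sketch of this objective (the framework choice, the `score_model` callable, and its `(batch, dim) -> (batch, dim)` signature are assumptions, not from the original). The trace term is the expensive part: it needs one backward pass per input dimension.

```python
import torch

def hyvarinen_sm_loss(score_model, x):
    """Exact score-matching loss: E[ 0.5 * ||s(x)||^2 + tr(grad_x s(x)) ].

    The trace of the Jacobian is computed one diagonal entry at a time,
    which costs D backward passes for D-dimensional data.
    """
    x = x.detach().requires_grad_(True)           # (batch, dim)
    s = score_model(x)                            # (batch, dim)
    norm_term = 0.5 * (s ** 2).sum(dim=1)

    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):
        # d s_i / d x_i; summing over the batch gives one backward pass per dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        trace = trace + grad_i

    return (norm_term + trace).mean()
```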

Sliced score matching (Song et al. 2019) avoids the expensive trace-of-Jacobian computation by projecting the score onto random directions.
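A hedged sketch of the sliced variant, under the same assumed `score_model` interface: the trace term is replaced by a Hutchinson-style estimate $v^\top \nabla_x s_\theta(x)\, v$ over random directions $v$, so the cost no longer grows with the data dimension.

```python
import torch

def sliced_sm_loss(score_model, x, n_projections=1):
    """Sliced score matching: replaces tr(grad_x s(x)) with an estimate
    E_v[ v^T grad_x s(x) v ] using random projection vectors v.
    One extra backward pass per projection, regardless of dimension.
    """
    x = x.detach().requires_grad_(True)
    s = score_model(x)                                       # (batch, dim)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)                              # one random direction per sample
        sv = (s * v).sum()                                   # sum_b  v_b . s(x_b)
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
        trace_est = (grad_sv * v).sum(dim=1)                 # v^T (grad_x s) v
        # basic SSM uses (v^T s)^2 / 2; the variance-reduced variant uses ||s||^2 / 2
        loss = loss + 0.5 * (s * v).sum(dim=1) ** 2 + trace_est
    return (loss / n_projections).mean()
```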

Denoising score matching (Vincent 2011), the variant most relevant to modern AI, adds noise to the data and trains the model to predict the score of the noisy distribution:

$$J_\sigma(\theta) = \mathbb{E}_{x \sim p_\mathrm{data}, \tilde x \sim \mathcal{N}(x, \sigma^2 I)}\!\left[\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x | x)\|^2\right]$$

For Gaussian noise, $\nabla \log q_\sigma(\tilde x | x) = -(\tilde x - x)/\sigma^2$, so the objective becomes

$$J_\sigma(\theta) = \mathbb{E}\!\left[\Big\|s_\theta(\tilde x) + \frac{\tilde x - x}{\sigma^2}\Big\|^2\right]$$

i.e. the model is trained to predict, up to sign and a factor of $1/\sigma$, the noise that was added.
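A minimal sketch of the denoising objective at a single noise level, again assuming a PyTorch `score_model`; no second-order autograd is needed, which is what makes this variant cheap enough for large models.

```python
import torch

def denoising_sm_loss(score_model, x, sigma=0.1):
    """Denoising score matching at a single noise level sigma.

    Perturb the data with Gaussian noise and regress the model score
    onto the score of the perturbation kernel, -(x_tilde - x) / sigma^2.
    """
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    target = -(x_tilde - x) / sigma ** 2      # = -noise / sigma
    s = score_model(x_tilde)
    return ((s - target) ** 2).sum(dim=1).mean()
```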

Score-based generative models (Song & Ermon 2019) sample by Langevin dynamics: starting from noise, iterate

$$x_{t+1} = x_t + \frac{\delta}{2} s_\theta(x_t) + \sqrt{\delta} \, z_t, \quad z_t \sim \mathcal{N}(0, I)$$

which converges to samples from $p_\theta$ as $\delta \to 0$ and the number of steps grows.
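An illustrative (unadjusted) Langevin sampling loop under the same assumptions; the `n_steps` and `step_size` values are placeholders, not recommended settings.

```python
import torch

@torch.no_grad()
def langevin_sample(score_model, shape, n_steps=1000, step_size=1e-4):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * s(x) + sqrt(step) * z."""
    x = torch.randn(shape)                     # start from pure noise
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_model(x) + (step_size ** 0.5) * z
    return x
```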

Connection to diffusion models: Song & Ermon's score-based models and Ho et al.'s DDPM are mathematically equivalent: diffusion models perform denoising score matching at multiple noise levels, with the model predicting the noise (equivalently, the score) at each level. The unified perspective is given by Song et al.'s 2021 paper Score-Based Generative Modeling through Stochastic Differential Equations, which formalises both as different discretisations of the same continuous-time SDE.

The score-matching framework provides the mathematical foundation for modern diffusion models, score-based image generation, and energy-based language models.

Related terms: Diffusion Model, Energy-Based Model
