For an energy-based model $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$ with intractable partition function $Z(\theta) = \sum_x e^{-E_\theta(x)}$, the maximum-likelihood gradient is
$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}[\nabla_\theta E_\theta(x')]$$
The first term (the positive phase) is computed at the data point. The second term (the negative phase) requires expectations under the model, i.e., samples from $p_\theta$, which generally requires running an MCMC chain to convergence.
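To make the two phases concrete, here is a minimal NumPy sketch (a toy model, not from the source) of a discrete EBM small enough that $Z(\theta)$ can be enumerated, so the exact positive and negative phases are visible:

```python
import numpy as np

# Toy discrete EBM over x in {0,1}^3 with E_theta(x) = -theta . x,
# small enough to enumerate Z(theta) and compute the exact gradient.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
states = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def log_p(theta):
    logits = states @ theta                      # -E_theta(x) = theta . x
    return logits - np.logaddexp.reduce(logits)  # subtract log Z(theta)

x = np.array([1.0, 0.0, 1.0])                    # one observed data point

# Here -grad_theta E_theta(x) = x, so the gradient is x - E_{x'~p}[x'].
p = np.exp(log_p(theta))
positive_phase = x                               # evaluated at the data point
negative_phase = p @ states                      # exact model expectation
grad_log_lik = positive_phase - negative_phase
```

With more than a handful of states this enumeration is impossible, which is exactly the problem CD addresses.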
Contrastive divergence (CD-$k$) approximates the negative phase by truncating the Markov chain after $k$ Gibbs steps starting from the data:
$$x \to x^{(1)} \to x^{(2)} \to \ldots \to x^{(k)}$$
The update is then
$$\Delta \theta \propto -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(x^{(k)})$$
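In code, the truncation is just a short loop. The sketch below assumes hypothetical helpers `energy_grad(x, theta)` (returning $\nabla_\theta E_\theta(x)$) and `gibbs_step(x, theta)` (one MCMC transition); neither is part of any particular library:

```python
def cd_k_update(x_data, theta, energy_grad, gibbs_step, k=1, lr=1e-2):
    """One CD-k parameter update (sketch, assumed helper functions)."""
    x = x_data
    for _ in range(k):                 # truncated chain x -> x^(1) -> ... -> x^(k)
        x = gibbs_step(x, theta)
    # Delta theta proportional to -grad E(data) + grad E(x^(k))
    return theta + lr * (-energy_grad(x_data, theta) + energy_grad(x, theta))
```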
CD-1 (a single Gibbs step) is the most widely used version. The resulting gradient estimate is biased, and the CD update is not, in general, the gradient of any fixed objective function; nevertheless, the bias is often small in practice, and CD-1 reliably finds good approximate maximum-likelihood solutions. Strictly speaking, CD approximately minimises the contrastive divergence $\mathrm{KL}(p_0 \,\|\, p_\infty) - \mathrm{KL}(p_k \,\|\, p_\infty)$, where $p_k$ is the distribution after $k$ Gibbs steps, rather than the true KL divergence.
For a restricted Boltzmann machine (RBM), the bipartite structure makes Gibbs sampling efficient: the hidden units are conditionally independent given the visibles (and vice versa), so the chain alternates block samples of hidden-given-visible and visible-given-hidden. The CD-1 update for the weight $w_{ij}$ between visible unit $v_i$ and hidden unit $h_j$ is
$$\Delta w_{ij} = \eta (\langle v_i h_j \rangle_\mathrm{data} - \langle v_i h_j \rangle_\mathrm{recon})$$
where $\eta$ is the learning rate, $\langle \cdot \rangle_\mathrm{data}$ is the correlation with the visibles clamped to the data, and $\langle \cdot \rangle_\mathrm{recon}$ is the correlation after one Gibbs step (the "reconstruction").
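A minimal NumPy sketch of this update for a binary RBM (variable names are illustrative; following common practice, hidden probabilities rather than samples are used in the correlations to reduce variance):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_cd1_update(v0, W, b, c, lr=0.05, rng=np.random.default_rng()):
    """One CD-1 update for a binary RBM (sketch).
    v0: (n_vis,) data vector; W: (n_vis, n_hid) weights; b, c: visible/hidden biases."""
    # Positive phase: hidden activations with visibles clamped to the data.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(b + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)
    # <v_i h_j>_data - <v_i h_j>_recon
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```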
Persistent contrastive divergence (PCD) (Tieleman, 2008) maintains the negative-phase Markov chain across training updates rather than restarting it from the data each time. Because the chain is never reset, it can explore the model distribution more fully, and PCD often gives better samples and test likelihoods than vanilla CD.
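The change relative to CD-1 is only where the negative-phase chain starts; a sketch reusing `sigmoid` and the shapes from the RBM example above:

```python
def pcd_update(v0, v_persistent, W, b, c, lr=0.05, rng=np.random.default_rng()):
    """One PCD update (sketch): identical to CD-1 except that the
    negative-phase chain continues from the persistent state, not the data."""
    ph0 = sigmoid(c + v0 @ W)                # positive phase at the data, as before
    # Negative phase: advance the persistent "fantasy" chain by one Gibbs step.
    ph = sigmoid(c + v_persistent @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(b + W @ h)
    v_neg = (rng.random(pv.shape) < pv).astype(float)
    ph_neg = sigmoid(c + v_neg @ W)
    W += lr * (np.outer(v0, ph0) - np.outer(v_neg, ph_neg))
    b += lr * (v0 - v_neg)
    c += lr * (ph0 - ph_neg)
    return W, b, c, v_neg                    # v_neg is carried into the next update
```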
Parallel tempering CD (Desjardins et al., 2010) runs multiple chains at different temperatures and swaps states between them, improving exploration in models with multiple modes.
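The swap itself is a Metropolis step; a sketch, assuming an `energy` callable for the current model and inverse temperatures $\beta$ (so chain $m$ targets $p_m \propto e^{-\beta_m E}$):

```python
def maybe_swap(v_cold, v_hot, beta_cold, beta_hot, energy, rng):
    """Metropolis swap between two tempered chains (sketch). Accepts with
    probability min(1, exp((beta_cold - beta_hot) * (E(v_cold) - E(v_hot))))."""
    log_a = (beta_cold - beta_hot) * (energy(v_cold) - energy(v_hot))
    if np.log(rng.random()) < log_a:
        return v_hot, v_cold                 # exchange states between the chains
    return v_cold, v_hot
```

High-temperature chains cross energy barriers easily, and accepted swaps pass their discoveries down to the low-temperature chain that supplies the negative-phase statistics.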
Modern energy-based models, score-matching methods, and diffusion models are conceptual descendants of CD: all sidestep the partition function by contrasting data with perturbed or model-generated samples rather than computing $Z(\theta)$ explicitly.
Related terms: Contrastive Divergence, Boltzmann Machine, Restricted Boltzmann Machine, Score Matching
Discussed in:
- Chapter 9: Neural Networks, Generative Models