Contrastive divergence (CD) is an approximate maximum-likelihood learning algorithm for energy-based models, introduced by Geoffrey Hinton in 2002. The exact maximum-likelihood gradient requires an expectation under the model distribution, which in general means running a Markov chain to convergence, a computation that is infeasible for non-trivial models. CD instead truncates the Markov chain after a small number of steps (often just one), starting it from the data distribution rather than from equilibrium.
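For reference, the quantity CD approximates can be written out explicitly; the notation below (energy E_θ, partition function Z(θ)) is illustrative rather than taken from the entry itself. The second term, an expectation under the model distribution, is the one that requires sampling and that CD truncates.

```latex
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)},
\qquad
\frac{\partial}{\partial \theta} \log p_\theta(x)
  = -\frac{\partial E_\theta(x)}{\partial \theta}
    + \mathbb{E}_{x' \sim p_\theta}\!\left[ \frac{\partial E_\theta(x')}{\partial \theta} \right]
```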
Specifically, CD-k for a Boltzmann machine performs k Gibbs-sampling steps starting from a training example to obtain a "negative phase" sample, then computes the gradient as the difference between expectations under the data distribution and under the truncated chain. The resulting gradient estimate is biased, but in practice the bias is usually small enough that the algorithm converges to good approximate maximum-likelihood solutions.
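A minimal sketch of a CD-1 update for a binary restricted Boltzmann machine, assuming binary visible and hidden units and a plain NumPy setting; the names here (cd1_update, W, b_v, b_h) are illustrative, not drawn from any particular library.

```python
# Minimal CD-1 sketch for a binary restricted Boltzmann machine (NumPy only).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Sample binary units given their activation probabilities.
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_update(v0, W, b_v, b_h, lr=0.01):
    """One CD-1 parameter update from a minibatch of training vectors v0."""
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = sample_bernoulli(ph0)

    # Negative phase: one Gibbs step (hidden -> visible -> hidden).
    v1 = sample_bernoulli(sigmoid(h0 @ W.T + b_v))
    ph1 = sigmoid(v1 @ W + b_h)

    # Gradient estimate: <v h> under the data minus <v h> under the 1-step chain.
    n = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / n
    db_v = (v0 - v1).mean(axis=0)
    db_h = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, b_v + lr * db_v, b_h + lr * db_h

# Toy usage: 6 visible units, 4 hidden units, a batch of 8 random binary vectors.
W = 0.01 * rng.standard_normal((6, 4))
b_v, b_h = np.zeros(6), np.zeros(4)
batch = (rng.random((8, 6)) < 0.5).astype(float)
W, b_v, b_h = cd1_update(batch, W, b_v, b_h)
```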
CD-1, which uses a single Gibbs step, is the most widely used version. Its successful application to restricted Boltzmann machines enabled the deep belief network results of 2006 and the broader pre-training renaissance of the 2000s. Persistent contrastive divergence (Tieleman, 2008) and parallel tempering (Desjardins et al., 2010) refined the algorithm further.
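A brief sketch of the persistent-CD refinement mentioned above, reusing the sigmoid and sample_bernoulli helpers and the variables from the CD-1 sketch; pcd_update and the "fantasy particle" initialization are illustrative assumptions, not a reference implementation. The key difference from CD-k is that the negative-phase chain is not restarted at the data each update; a persistent set of samples is advanced a few Gibbs steps per update.

```python
def pcd_update(v0, v_neg, W, b_v, b_h, lr=0.01, k=1):
    ph0 = sigmoid(v0 @ W + b_h)                  # positive phase, as in CD-1
    for _ in range(k):                           # advance the persistent chain
        h = sample_bernoulli(sigmoid(v_neg @ W + b_h))
        v_neg = sample_bernoulli(sigmoid(h @ W.T + b_v))
    ph_neg = sigmoid(v_neg @ W + b_h)
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v_neg.T @ ph_neg) / n
    b_v = b_v + lr * (v0 - v_neg).mean(axis=0)
    b_h = b_h + lr * (ph0 - ph_neg).mean(axis=0)
    return W, b_v, b_h, v_neg                    # carry v_neg to the next update

# Fantasy particles persist across minibatches instead of restarting from data.
v_neg = sample_bernoulli(np.full((8, 6), 0.5))
W, b_v, b_h, v_neg = pcd_update(batch, v_neg, W, b_v, b_h)
```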
In the deep-learning era, CD has been displaced by direct backpropagation through differentiable architectures, but its conceptual descendants (score matching, denoising score matching, noise-contrastive estimation) underlie modern energy-based models and diffusion models.
Related terms: Boltzmann Machine, Restricted Boltzmann Machine, Deep Belief Network, Geoffrey Hinton
Discussed in:
- Chapter 9: Neural Networks, Generative Models