Glossary

Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is a Boltzmann machine whose connections are restricted to a bipartite graph: every connection runs between a visible layer $\mathbf{v} \in \{0,1\}^m$ and a hidden layer $\mathbf{h} \in \{0,1\}^n$, with no within-layer connections. The joint distribution is the Boltzmann (Gibbs) distribution

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp(-E(\mathbf{v}, \mathbf{h})),$$

with energy

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^\top \mathbf{v} - \mathbf{b}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h},$$

where $\mathbf{a}, \mathbf{b}$ are bias vectors, $W$ a weight matrix and $Z = \sum_{\mathbf{v},\mathbf{h}} \exp(-E)$ the partition function.
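The definition translates directly into code. The following is a minimal sketch (illustrative function and variable names, not from any particular library) that evaluates the energy and unnormalised probability, and computes $Z$ by brute-force enumeration for a tiny model, which also shows why $Z$ is intractable in general: the sum has $2^{m+n}$ terms.

```python
import numpy as np
from itertools import product

def energy(v, h, a, b, W):
    """E(v, h) = -a^T v - b^T h - v^T W h for binary vectors v, h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def unnormalised_p(v, h, a, b, W):
    """exp(-E(v, h)); dividing by the partition function Z gives p(v, h)."""
    return np.exp(-energy(v, h, a, b, W))

# Tiny RBM: m = 3 visible units, n = 2 hidden units.
rng = np.random.default_rng(0)
m, n = 3, 2
a, b = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(size=(m, n))

# Brute-force partition function: feasible only because 2**(m+n) is small here.
Z = sum(unnormalised_p(np.array(v_), np.array(h_), a, b, W)
        for v_ in product([0, 1], repeat=m)
        for h_ in product([0, 1], repeat=n))

v, h = np.array([1, 0, 1]), np.array([0, 1])
print(unnormalised_p(v, h, a, b, W) / Z)  # p(v, h)
```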

Why "restricted" makes inference tractable

The bipartite structure has a crucial consequence: given the visible layer, the hidden units are conditionally independent, and vice versa:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \sum_i W_{ij} v_i), \quad p(v_i = 1 \mid \mathbf{h}) = \sigma(a_i + \sum_j W_{ij} h_j),$$

where $\sigma$ is the logistic sigmoid. Block Gibbs sampling therefore alternates between the two layers, updating every unit of one layer in a single parallel step, which is far cheaper than the unit-by-unit Gibbs chain required by an unrestricted Boltzmann machine.
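A minimal sketch of one block Gibbs step, assuming $W$ is an $m \times n$ NumPy array and $\mathbf{v}, \mathbf{h}$ are binary row vectors (the helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, b, W, rng):
    """p(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ij); all hidden units sampled in parallel."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, a, W, rng):
    """p(v_i = 1 | h) = sigma(a_i + sum_j W_ij h_j); all visible units sampled in parallel."""
    p = sigmoid(a + h @ W.T)
    return (rng.random(p.shape) < p).astype(float)

def block_gibbs_step(v, a, b, W, rng):
    """One alternating step of the chain: v -> h -> v'."""
    h = sample_h_given_v(v, b, W, rng)
    v_new = sample_v_given_h(h, a, W, rng)
    return v_new, h

# Example with random parameters.
rng = np.random.default_rng(1)
m, n = 4, 3
a, b, W = rng.normal(size=m), rng.normal(size=n), rng.normal(size=(m, n))
v0 = rng.integers(0, 2, size=m).astype(float)
v1, h1 = block_gibbs_step(v0, a, b, W, rng)
```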

Contrastive divergence

The full maximum-likelihood gradient requires sampling from the model distribution, which is intractable. Geoffrey Hinton's contrastive divergence (CD-$k$) algorithm (2002) gave RBMs a fast, approximate learning rule by replacing the long Gibbs chain with just $k$ steps (often $k = 1$) starting from each training example. The update is

$$\Delta W_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}},$$

the difference between empirical and reconstructed correlations. Persistent CD (PCD) maintains a persistent Markov chain across updates for better gradient estimation.
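A sketch of a single CD-1 update on one training vector, assuming the same parameter shapes as above. One common design choice (following Hinton's practical recommendations) is to use hidden probabilities rather than binary samples in the negative-phase statistics to reduce sampling noise; this is an illustrative sketch, not a reference implementation.

```python
import numpy as np

def cd1_update(v0, a, b, W, rng, lr=0.01):
    """One CD-1 parameter update from a single binary training vector v0."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One Gibbs step back down and up again: the "reconstruction".
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)

    # <v h>_data - <v h>_recon, plus the corresponding bias gradients.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b

# Hypothetical usage: loop cd1_update over the rows of a binary data matrix.
```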

Deep belief networks and the deep-learning revival

The 2006 paper by Hinton, Osindero and Teh, A Fast Learning Algorithm for Deep Belief Nets, stacked RBMs to form deep belief networks (DBNs): each layer's hidden units became the next layer's visible units, and the stack was trained greedily, one layer at a time. This unsupervised pretraining of deep architectures, followed by a supervised fine-tuning pass, was widely credited with launching the modern deep-learning era; a sketch of the greedy stacking loop is given below. Salakhutdinov and Hinton's deep Boltzmann machines (2009) extended the approach.
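To illustrate the greedy layer-wise idea only, here is a sketch in which train_rbm and hidden_probs are hypothetical helpers (for example, a CD-1 trainer and the $p(\mathbf{h} \mid \mathbf{v})$ computation from the sketches above); they are not functions from the original paper or any specific library.

```python
def train_dbn(data, layer_sizes, train_rbm, hidden_probs):
    """Greedy layer-wise pretraining: each RBM's hidden activations
    become the 'visible' data for the next RBM in the stack."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        params = train_rbm(x, n_hidden)    # hypothetical RBM trainer (e.g. CD-1)
        layers.append(params)
        x = hidden_probs(x, params)        # propagate the data up one layer
    return layers
```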

Modern relevance

RBMs and DBNs were eventually displaced as the building blocks of choice by purely supervised training of deep convolutional networks, the path demonstrated by AlexNet (2012) and dominant since, powered by ReLU activations, dropout, batch normalisation and large GPU-accelerated datasets. RBMs remain important historically and conceptually: the energy-based, sampling-based view of learning that they embody resurfaces in modern diffusion models, score-based generative models, EBMs for language and even physics-inspired sampling schemes such as annealed importance sampling. The quantum many-body physics literature also uses RBMs as ansätze for wavefunctions (neural-network quantum states).

Related terms: Boltzmann Machine, Deep Belief Network, Contrastive Divergence, Energy-Based Model, Diffusion Model
