Glossary

PAC-Bayes

PAC-Bayes unifies probably approximately correct (PAC) learning with Bayesian-style posterior reasoning. Introduced by McAllester (1999), the framework places a data-independent prior $P$ over a hypothesis class $\mathcal{H}$ and considers any data-dependent posterior $Q$. Rather than bounding the risk of a single hypothesis, PAC-Bayes bounds the expected risk under $Q$.

The classical McAllester bound states that for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over an i.i.d. sample of size $N$,

$$\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat R(h)] + \sqrt{\frac{D_\mathrm{KL}(Q \,\|\, P) + \log\frac{2\sqrt N}{\delta}}{2N}}$$

where $R(h)$ is the true risk, $\hat R(h)$ the empirical risk, and $D_\mathrm{KL}(Q \,\|\, P)$ the Kullback-Leibler divergence between posterior and prior. The bound holds simultaneously for every $Q$, so we may select $Q$ after seeing the data.
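
For concreteness, here is a minimal Python sketch of evaluating this square-root bound from its ingredients; the function name and the example numbers are illustrative, not taken from any particular library.

```python
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """Square-root PAC-Bayes bound on the expected true risk under Q.

    emp_risk : expected empirical risk under the posterior Q
    kl       : D_KL(Q || P), in nats
    n        : number of i.i.d. training examples
    delta    : confidence parameter in (0, 1)
    """
    complexity = math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))
    return emp_risk + complexity

# Example: 5% empirical risk, KL of 500 nats, 60,000 examples, 95% confidence.
print(mcallester_bound(0.05, 500.0, 60_000, 0.05))  # ~0.115
```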

Tighter variants include the Seeger-Langford bound, which uses the binary-KL function $\mathrm{kl}(p \,\|\, q) = p \log(p/q) + (1-p)\log((1-p)/(1-q))$ and gives

$$\mathrm{kl}\big(\mathbb{E}_{h \sim Q}[\hat R(h)] \,\big\|\, \mathbb{E}_{h \sim Q}[R(h)]\big) \leq \frac{D_\mathrm{KL}(Q \,\|\, P) + \log\frac{2\sqrt N}{\delta}}{N}.$$

For low empirical risk, this binary-KL form is substantially tighter than the square-root form.
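
In practice the bound on $\mathbb{E}_{h \sim Q}[R(h)]$ is obtained by numerically inverting the binary KL. A small sketch (helper names are ours) using bisection, which works because $\mathrm{kl}(p \,\|\, q)$ is increasing in $q$ for $q \geq p$:

```python
import math

def binary_kl(p, q):
    """kl(p || q) for Bernoulli parameters, clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_inverse_upper(p, rhs, tol=1e-9):
    """Largest q in [p, 1] such that kl(p || q) <= rhs, found by bisection."""
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(p, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def seeger_bound(emp_risk, kl, n, delta):
    rhs = (kl + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, rhs)

# Same ingredients as before: ~0.084, versus ~0.115 from the square-root form.
print(seeger_bound(0.05, 500.0, 60_000, 0.05))
```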

Why PAC-Bayes matters for deep learning. Standard uniform-convergence bounds (VC, Rademacher) are vacuous for modern networks because their capacity measures scale with parameter count, far exceeding $N$. Dziugaite and Roy (2017) showed that by training a stochastic neural network to minimise the PAC-Bayes bound directly, treating $Q$ as a Gaussian over the weights, one obtains non-vacuous generalisation guarantees on MNIST: certified risk bounds below 20% for networks with millions of parameters. This was the first time an explicit, computable bound came close to observed performance for over-parameterised networks.
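
The shape of the objective involved can be sketched as follows. This is a simplification rather than the authors' implementation, assuming a diagonal-Gaussian posterior $Q$ and prior $P$ over the weight vector; the function names are ours.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """D_KL(Q || P) between diagonal Gaussians over the weights, in nats."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def pac_bayes_objective(emp_risk, mu_q, sigma_q, mu_p, sigma_p, n, delta):
    """Empirical risk plus the square-root complexity term: the quantity
    minimised over (mu_q, sigma_q) when training the stochastic network."""
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    return emp_risk + np.sqrt((kl + np.log(2 * np.sqrt(n) / delta)) / (2 * n))
```

In the full procedure the empirical-risk term is estimated with weights sampled from $Q$ and the whole expression is minimised by gradient descent; the sketch above only shows the quantity being minimised.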

Choice of prior. The prior $P$ must be independent of the data used to evaluate the bound, but it may depend on the underlying distribution and on a held-out portion of the sample. Data-dependent priors (Parrado-Hernández et al., 2012; Dziugaite et al., 2021) split the sample, fit $P$ on one part, then evaluate the bound on the remainder. This dramatically reduces $D_\mathrm{KL}(Q \,\|\, P)$ and yields some of the tightest reported bounds.
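
Schematically, the sample-splitting recipe looks like this; every function name here is a hypothetical placeholder, not a published API.

```python
def bound_with_data_dependent_prior(data, fit_prior, fit_posterior, evaluate_bound, delta):
    """Sample splitting for a data-dependent prior (schematic).

    The prior is fit only on the first part of the sample, so it is
    statistically independent of the held-out part used to certify the bound.
    """
    split = len(data) // 2
    prior_data, bound_data = data[:split], data[split:]
    P = fit_prior(prior_data)        # e.g. a Gaussian centred at weights trained on prior_data
    Q = fit_posterior(data, P)       # the posterior may use all of the data
    return evaluate_bound(Q, P, bound_data, delta)  # N in the bound is now len(bound_data)
```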

Connections. Setting $Q = \delta_h$ (a point mass, with a prior that places discrete mass on $h$) recovers Occam-style bounds with $D_\mathrm{KL} = -\log P(h)$, the description length of $h$. Optimising the bound over Gaussian $Q$ recovers an objective closely related to variational inference with a free-energy interpretation: the empirical risk plays the role of the negative log-likelihood and $D_\mathrm{KL}(Q \,\|\, P)$ acts as the regulariser. PAC-Bayes thus provides a frequentist justification for Bayesian deep learning.
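
One way to make the free-energy connection concrete: for a fixed trade-off parameter $\lambda > 0$, the distribution minimising $\mathbb{E}_{h \sim Q}[\hat R(h)] + \frac{1}{\lambda N} D_\mathrm{KL}(Q \,\|\, P)$ over all $Q$ is the Gibbs posterior

$$Q^\ast(h) \propto P(h)\, \exp\!\big(-\lambda N \hat R(h)\big),$$

which coincides with the Bayesian posterior when $\lambda = 1$ and the loss is a negative log-likelihood.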

Limitations. Bounds remain loose for image classifiers on ImageNet and for transformer language models. The prior must be chosen carefully, and the stochastic predictor differs from the deterministic network actually deployed. Nevertheless, PAC-Bayes is the only generalisation theory to date that has produced quantitatively meaningful guarantees for trained deep networks.

Related terms: Rademacher Complexity, VC Dimension, PAC Learning, KL Divergence, Statistical Learning Theory
