Rademacher complexity quantifies how well a function class $\mathcal{F}$ can fit random noise on a sample. Given a sample $S = (x_1, \ldots, x_N)$ and i.i.d. Rademacher variables $\sigma_1, \ldots, \sigma_N$, each uniform on $\{-1, +1\}$, the empirical Rademacher complexity of $\mathcal{F}$ is
$$\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(x_n)\right]$$
and the Rademacher complexity is its expectation $\mathcal{R}_N(\mathcal{F}) = \mathbb{E}_S[\hat{\mathcal{R}}_S(\mathcal{F})]$. Intuitively, if $\mathcal{F}$ is rich enough to correlate strongly with arbitrary $\pm 1$ labels, it can also overfit, so $\mathcal{R}_N$ acts as a capacity measure.
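For a finite class the supremum can be evaluated directly and the expectation over $\sigma$ estimated by Monte Carlo. A minimal sketch, assuming the class is represented as a matrix of predictions $f(x_n)$ on the sample (the threshold classifiers are purely illustrative):

```python
import numpy as np

def empirical_rademacher(preds, num_draws=2000, seed=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    preds: array of shape (num_functions, N) holding f(x_n) for every
           function in a finite class evaluated on the sample S.
    """
    rng = np.random.default_rng(seed)
    _, N = preds.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)   # Rademacher signs
        total += np.max(preds @ sigma) / N        # sup over the class
    return total / num_draws

# Illustration: 1-D threshold classifiers sign(x - t) on a Gaussian sample.
x = np.random.default_rng(0).normal(size=50)
thresholds = np.linspace(-3.0, 3.0, 200)
preds = np.sign(x[None, :] - thresholds[:, None])   # shape (200, 50)
print(empirical_rademacher(preds, seed=1))          # small for this simple class
```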
Generalisation bound. Write $R(f)$ for the risk of $f$ and $\hat R(f)$ for its empirical risk on $S$. For any loss $\ell$ that is bounded in $[0, 1]$ and $L$-Lipschitz in its first argument, with probability at least $1 - \delta$ over the draw of $S$,
$$\sup_{f \in \mathcal{F}} \big| R(f) - \hat R(f) \big| \leq 2 L \, \mathcal{R}_N(\mathcal{F}) + \sqrt{\frac{\log(2/\delta)}{2N}}.$$
The complexity term is distribution-dependent through the expectation over $S$; replacing $\mathcal{R}_N(\mathcal{F})$ by the empirical quantity $\hat{\mathcal{R}}_S(\mathcal{F})$, which can be estimated from the sample at hand, yields a fully data-dependent bound at the cost of a slightly larger concentration term.
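Plugging illustrative numbers into the right-hand side gives a feel for its scale; the helper below is hypothetical and the values are arbitrary:

```python
import numpy as np

def deviation_bound(rademacher, lipschitz, n, delta):
    # Right-hand side of the uniform deviation bound stated above.
    return 2.0 * lipschitz * rademacher + np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# A class with Rademacher complexity roughly 0.01 at N = 10,000:
print(deviation_bound(rademacher=0.01, lipschitz=1.0, n=10_000, delta=0.05))  # ~0.034
```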
Comparison with VC dimension. For binary classifiers with VC dimension $d$, Sauer's lemma combined with Massart's finite-class lemma yields $\mathcal{R}_N(\mathcal{F}) \leq O(\sqrt{d \log(N/d)/N})$ (chaining removes the logarithm), recovering classical VC bounds. But Rademacher complexity is often strictly tighter: it adapts to the data distribution and to margin information. For linear predictors $x \mapsto \langle w, x \rangle$ with $\|w\|_2 \leq W$ and inputs of $\ell_2$ norm at most $B$, $\mathcal{R}_N \leq WB / \sqrt{N}$ regardless of input dimension; composed with the $1/\gamma$-Lipschitz margin loss this gives a bound of order $WB / (\gamma \sqrt{N})$, explaining why support-vector machines generalise despite high-dimensional features.
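The dimension-free bound for linear predictors follows in a few lines from Cauchy–Schwarz and Jensen's inequality; a sketch for the class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \leq W\}$:
$$\hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{N}\,\mathbb{E}_\sigma\!\left[\sup_{\|w\|_2 \leq W} \Big\langle w, \sum_{n=1}^{N} \sigma_n x_n \Big\rangle\right] = \frac{W}{N}\,\mathbb{E}_\sigma\left\|\sum_{n=1}^{N} \sigma_n x_n\right\|_2 \leq \frac{W}{N}\sqrt{\sum_{n=1}^{N} \|x_n\|_2^2} \leq \frac{WB}{\sqrt{N}},$$
where the inequality uses Jensen's inequality and the fact that $\mathbb{E}_\sigma\big\|\sum_n \sigma_n x_n\big\|_2^2 = \sum_n \|x_n\|_2^2$, since the cross terms vanish in expectation.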
Contraction inequality. Talagrand's contraction lemma gives $\mathcal{R}_N(\phi \circ \mathcal{F}) \leq L \cdot \mathcal{R}_N(\mathcal{F})$ for any $L$-Lipschitz $\phi$ with $\phi(0) = 0$, allowing complexity to propagate through the layers of a neural network. Applying it layer by layer ("peeling") yields norm-based bounds for deep networks in which each layer contributes a factor depending on the norm of its weight matrix.
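The inequality is easy to check numerically for a finite class; the sketch below uses $\tanh$, which is 1-Lipschitz with $\tanh(0) = 0$, on a randomly generated prediction matrix (both choices purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, num_fns, draws = 100, 500, 3000
preds = rng.normal(size=(num_fns, N))  # f(x_n) for a random finite class

def rad(p):
    # Monte Carlo estimate of the empirical Rademacher complexity of p's rows.
    sigmas = rng.choice([-1.0, 1.0], size=(draws, N))
    return np.mean(np.max(sigmas @ p.T, axis=1)) / N

# tanh is 1-Lipschitz, so contraction predicts rad(tanh(preds)) <= rad(preds).
print(rad(np.tanh(preds)), "<=", rad(preds))
```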
For neural networks. For a depth-$D$ network with weight matrices $W_1, \ldots, W_D$, 1-Lipschitz activations, and inputs of $\ell_2$ norm at most $B$, a simplified norm-based bound is
$$\mathcal{R}_N(\mathcal{F}) \leq \frac{B}{\sqrt N} \prod_{d=1}^{D} \|W_d\|_2$$
where $\|\cdot\|_2$ denotes the spectral norm and factors polynomial in the depth and width have been suppressed. Bartlett, Foster and Telgarsky (2017) sharpened this to a bound involving the product of spectral norms multiplied by a sum of per-layer terms measuring distance from reference matrices (typically the initialisation), yielding bounds that track the empirical generalisation gap for moderately sized networks but remain loose for ImageNet-scale models.
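Evaluating the simplified bound only requires the largest singular value of each weight matrix; a sketch with made-up layer shapes and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrices of a depth-3 fully connected network.
weights = [rng.normal(scale=0.1, size=(256, 784)),
           rng.normal(scale=0.1, size=(256, 256)),
           rng.normal(scale=0.1, size=(10, 256))]

B, N = 1.0, 50_000  # input norm bound and sample size (arbitrary)
spectral_norms = [np.linalg.norm(W, ord=2) for W in weights]  # largest singular values
bound = B / np.sqrt(N) * np.prod(spectral_norms)
print(spectral_norms, bound)
```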
Why it matters. Rademacher complexity provides the cleanest theoretical handle on margin-based generalisation. It motivates spectral-norm regularisation, Lipschitz-constrained training, and the analysis of adversarial robustness, where the relevant function class includes perturbations of the input. It also underlies modern proofs of generalisation for kernel methods, boosting, and early results in deep learning theory.
Related terms: VC Dimension, PAC-Bayes, Statistical Learning Theory, Regularisation, PAC Learning
Discussed in:
- Chapter 6: ML Fundamentals, Generalisation in Deep Learning