6.9 Curse of dimensionality

Many of our intuitions about geometry break in high dimensions, and these failures matter for ML.

Concentration of measure

In a $d$-dimensional unit ball, the ratio of the volume in the outer shell of width $\varepsilon$ to the total volume is $$ 1 - (1 - \varepsilon)^d \xrightarrow{d \to \infty} 1. $$ For $d = 1000$ and $\varepsilon = 0.01$, this is $1 - 0.99^{1000} \approx 1 - 4.3 \times 10^{-5}$, essentially 1. Almost all the mass of a high-dimensional ball lies in a thin shell near its surface, and almost all the mass of a high-dimensional Gaussian lies on a thin shell at radius $\sqrt{d}$.
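Both claims are easy to check numerically. The following sketch (illustrative only; the constants match the text) computes the shell fraction for $d = 1000$, $\varepsilon = 0.01$ and samples Gaussian norms to see them concentrate near $\sqrt{d}$:

```python
import numpy as np

d, eps = 1000, 0.01

# Fraction of the unit ball's volume in the outer shell of width eps
shell_fraction = 1 - (1 - eps) ** d
print(shell_fraction)  # ≈ 0.99996, essentially all of it

# Norms of standard Gaussian samples concentrate on a thin shell at sqrt(d)
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, d))
norms = np.linalg.norm(X, axis=1)
print(norms.mean())  # ≈ sqrt(1000) ≈ 31.6
print(norms.std())   # ≈ 0.7: tiny relative to the mean
```

Note the standard deviation of the norm stays $O(1)$ while the mean grows like $\sqrt{d}$, which is exactly the "thin shell" picture.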

A consequence: in high dimensions, all pairs of points are roughly the same distance apart. If $X, Y$ are independent draws from $\mathcal{N}(0, I_d)$, then $\|X - Y\|^2 / d \to 2$ in probability. The relative gap between the nearest and farthest neighbours of a query point shrinks to zero.
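A small simulation (an illustrative sketch, not a proof) confirms both statements: the normalised squared distance between independent Gaussian draws sits near 2, and the relative nearest-to-farthest gap for a random query collapses as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# ||X - Y||^2 / d for independent X, Y ~ N(0, I_d) converges to 2
d = 1000
X, Y = rng.standard_normal((2, 5000, d))
print((np.sum((X - Y) ** 2, axis=1) / d).mean())  # ≈ 2

# Relative gap between nearest and farthest neighbour of a query point
for d in (2, 10, 100, 1000):
    P = rng.standard_normal((2000, d))          # data set
    q = rng.standard_normal(d)                  # query
    dists = np.linalg.norm(P - q, axis=1)
    gap = (dists.max() - dists.min()) / dists.min()
    print(d, gap)  # gap shrinks steadily as d grows
```

At $d = 2$ the farthest point is many times farther than the nearest; by $d = 1000$ the relative gap is a small fraction, so "nearest" carries little information.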

This phenomenon, known as concentration of measure, is why distance-based methods (k-nearest neighbours, kernel methods with isotropic kernels, and other naive distance metrics) struggle in high-dimensional ambient spaces: every point looks roughly equally far from every query.

The manifold hypothesis

But high-dimensional ML works in practice. Why? Because real data is not uniform in its ambient space. A 256×256 RGB image of a face lives in $\mathbb{R}^{196{,}608}$, but the manifold of natural face images has a far smaller effective dimension, perhaps in the tens or hundreds. The same is true of speech, of natural language, and of nearly every other modality we care about.

The manifold hypothesis states that real-world high-dimensional data concentrates on a low-dimensional manifold embedded in the ambient space. Successful ML algorithms either exploit this structure explicitly (manifold learning, autoencoders) or implicitly (every successful deep network is, in effect, a learned coordinate chart on a data manifold).

The manifold hypothesis explains why deep models can generalise from millions of examples in spaces with hundreds of thousands of dimensions: the intrinsic dimension is much smaller than the ambient dimension. Pope et al. (2021) estimated the intrinsic dimensions of CIFAR-10, ImageNet, and CelebA to be around 25, 40, and 50 respectively, far below the pixel count.
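Pope et al. use nearest-neighbour-based estimators on real images; as a toy illustration of the same idea, the sketch below (all constants hypothetical) embeds data with a known 5-dimensional latent structure into a 200-dimensional ambient space and recovers the intrinsic dimension with PCA:

```python
import numpy as np

rng = np.random.default_rng(2)
n, ambient, intrinsic = 2000, 200, 5

# Latent low-dimensional coordinates, mapped linearly into the ambient space,
# plus a small amount of ambient noise
Z = rng.standard_normal((n, intrinsic))
A = rng.standard_normal((intrinsic, ambient))
X = Z @ A + 0.01 * rng.standard_normal((n, ambient))

# PCA: count components needed to explain 99% of the variance
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var = s ** 2 / (s ** 2).sum()
k = int(np.searchsorted(np.cumsum(var), 0.99)) + 1
print(k)  # recovers the intrinsic dimension, 5, not the ambient 200
```

PCA only detects linear structure; real image manifolds are curved, which is why practical estimators work from local neighbourhood statistics instead. But the headline is the same: the ambient dimension wildly overstates the data's complexity.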
