Glossary

Dimensionality Reduction

Dimensionality reduction maps high-dimensional data to lower-dimensional representations, facilitating visualisation, compression, noise removal, and downstream learning. The motivation is both practical (reducing memory and compute) and theoretical (the curse of dimensionality makes high-dimensional learning difficult). The manifold hypothesis posits that natural data, though represented in high-dimensional spaces, actually lies on or near lower-dimensional manifolds; dimensionality reduction attempts to uncover them.

Linear methods include PCA (Principal Component Analysis), which projects onto directions of maximum variance; LDA (Linear Discriminant Analysis), which uses class labels to maximise separation between classes; and Factor Analysis, which models observations as linear combinations of latent factors plus noise. PCA is the workhorse: it is simple and fast, and among all linear projections to a given number of dimensions it retains the most variance (equivalently, it minimises squared reconstruction error).

Nonlinear methods capture more complex structure. t-SNE (t-distributed Stochastic Neighbour Embedding) preserves local neighbourhoods and excels at visualising clusters but distorts global distances. UMAP (Uniform Manifold Approximation and Projection) offers similar visualisation quality, often with better preservation of global structure and faster computation. Autoencoders learn nonlinear encoder-decoder networks whose bottleneck representations serve as reduced-dimensional codes. Kernel PCA applies the kernel trick to PCA, performing linear PCA in an implicit feature space. The choice depends on the goal: PCA for speed and linearity, t-SNE/UMAP for visualisation, autoencoders for learned representations that generalise to new data.
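To make the kernel trick concrete, here is a hedged NumPy sketch of kernel PCA with an RBF kernel: it builds the kernel matrix, centres it in feature space, and uses the leading eigenvectors to embed the training points. The function name rbf_kernel_pca, the value of gamma, and the two-circle data are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_kernel_pca(X, k, gamma):
    """Embed the rows of X with kernel PCA using an RBF kernel (illustrative helper)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    K = np.exp(-gamma * D2)                          # RBF kernel matrix

    # Centre the kernel matrix, which centres the data in the implicit feature space.
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one

    eigvals, eigvecs = np.linalg.eigh(Kc)            # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]              # indices of the k largest
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]

    # Coordinates of the training points along the top-k kernel principal components.
    Z = eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
    return Z, eigvals

# Assumed toy data: two concentric circles, which no linear projection can separate.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radii * np.cos(theta), radii * np.sin(theta)])

Z, eigvals = rbf_kernel_pca(X, k=2, gamma=0.5)
```

The sketch only embeds the training points; projecting unseen points would additionally require evaluating the kernel between new and training data and applying the same centring.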


t-SNE unfolds high-dimensional clusters into 2D. Pairwise similarities in many dimensions become 2D positions that preserve neighbourhoods.
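The idea of turning pairwise similarities into 2D positions can be sketched as follows. The code computes Gaussian similarities in the original space and Student-t similarities in a 2D layout, then compares them with the KL divergence that t-SNE minimises. This is a simplified illustration, not a full implementation: it assumes one fixed bandwidth sigma rather than the per-point bandwidths t-SNE tunes through perplexity, and it scores two hand-made layouts instead of optimising one. All function names and the toy data are hypothetical.

```python
import numpy as np

def squared_distances(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def high_dim_affinities(X, sigma=1.0):
    # Assumption: a single fixed bandwidth; real t-SNE picks one per point via perplexity.
    P = np.exp(-squared_distances(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)             # conditional similarities p_{j|i}
    return (P + P.T) / (2.0 * X.shape[0])            # symmetrised joint similarities p_ij

def low_dim_affinities(Y):
    W = 1.0 / (1.0 + squared_distances(Y))           # heavy-tailed Student-t kernel
    np.fill_diagonal(W, 0.0)
    return W / W.sum()                               # joint similarities q_ij

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Assumed toy data: two well-separated clusters of 30 points each in 10-D.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
centres = np.array([0.0, 5.0])
X = centres[labels][:, None] + 0.3 * rng.normal(size=(60, 10))

P = high_dim_affinities(X)

Y_good = X[:, :2]                                    # layout that keeps neighbours together
Y_bad = Y_good[rng.permutation(60)]                  # same positions, assigned to the wrong points

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
```

The layout that keeps high-dimensional neighbours close in 2D yields the lower divergence; t-SNE proper starts from a random layout and moves the points by gradient descent to reduce this quantity.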

Related terms: Principal Component Analysis, Autoencoder, Curse of Dimensionality

