Glossary

Principal Component Analysis

Also known as: PCA

Principal Component Analysis (PCA) is a linear dimensionality-reduction technique that finds the orthogonal directions of maximum variance in a dataset. Given a centred data matrix $X \in \mathbb{R}^{N \times d}$ (rows are samples, columns are features, each column has zero mean) with sample covariance $S = \frac{1}{N} X^\top X$, the principal components are the eigenvectors $v_1, v_2, \ldots, v_d$ of $S$ ordered by decreasing eigenvalue $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. Equivalently, they are the right singular vectors of $X$ from its singular value decomposition $X = U \Sigma V^\top$, with $\lambda_i = \sigma_i^2 / N$.
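
The two routes can be checked against each other numerically. Below is a minimal NumPy sketch (the data and variable names such as `X`, `S`, and `evecs` are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                       # centre: each column now has zero mean

N = X.shape[0]
S = (X.T @ X) / N                            # sample covariance, 1/N convention as above

# Route 1: eigendecomposition of the covariance matrix.
evals, evecs = np.linalg.eigh(S)             # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder: lambda_1 >= ... >= lambda_d

# Route 2: SVD of the centred data matrix.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# The routes agree: lambda_i = sigma_i^2 / N, and the columns of V are the
# principal components (up to sign).
assert np.allclose(evals, sigma**2 / N)
assert np.allclose(np.abs(evecs), np.abs(Vt.T))
```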

The first principal component $v_1$ is the direction along which the projected data $X v_1$ has maximum variance; the second is the direction of maximum variance subject to being orthogonal to the first; and so on. Equivalently, PCA minimises reconstruction error: the rank-$k$ truncation $X_k = U_k \Sigma_k V_k^\top$, which equals the projection of $X$ onto the span of the top $k$ components, is the best rank-$k$ approximation to $X$ in both the Frobenius and spectral norms (the Eckart–Young–Mirsky theorem).
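
As a small illustration of this property, the sketch below (NumPy, illustrative names and data) builds the rank-$k$ truncation two ways and checks that the squared Frobenius error equals the sum of the discarded $\sigma_i^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
X = X - X.mean(axis=0)

k = 3
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

X_k = (U[:, :k] * sigma[:k]) @ Vt[:k]        # rank-k truncation U_k Sigma_k V_k^T
V_k = Vt[:k].T
X_proj = (X @ V_k) @ V_k.T                   # project onto top-k PCs, map back

assert np.allclose(X_k, X_proj)              # the two constructions coincide
# Eckart-Young: squared Frobenius error is the sum of discarded sigma_i^2.
assert np.isclose(np.linalg.norm(X - X_k, "fro") ** 2, np.sum(sigma[k:] ** 2))
```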

History

PCA was introduced by Karl Pearson in 1901 as a method of fitting lines and planes by minimising perpendicular distance, and rediscovered by Harold Hotelling in 1933 in the context of psychology and educational testing. The connection to the SVD, established by Eckart and Young in 1936, gave PCA its modern computational footing. The eigendecomposition view dominates statistical exposition; the SVD view dominates numerical practice because it avoids forming the covariance matrix explicitly.

Uses

  • Dimensionality reduction: project onto the top-$k$ components to retain most variance while reducing storage and downstream computation.
  • Data visualisation: a 2-D scatter plot of the top two PCs can reveal cluster structure and outliers.
  • Preprocessing: decorrelate features before downstream methods (regression, clustering).
  • Compression: low-rank approximation of high-dimensional data such as images and gene-expression matrices.
  • Whitening: divide each projected score by $\sqrt{\lambda_i}$ to obtain unit-variance, decorrelated features (see the sketch after this list).
  • Eigenfaces (Turk & Pentland 1991): apply PCA to vectorised face images for early face recognition.
  • Latent semantic analysis (LSA): truncated SVD of a term–document matrix, essentially PCA without the centring step.
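
A hedged sketch of the projection and whitening steps above (NumPy; the data and the choice $k = 4$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # correlated features
X = X - X.mean(axis=0)

N, k = X.shape[0], 4
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
lam = sigma**2 / N                      # eigenvalues of the sample covariance

Z = X @ Vt[:k].T                        # scores: projection onto the top-k PCs
Z_white = Z / np.sqrt(lam[:k])          # whitening: divide each score by sqrt(lambda_i)

# Sanity check: whitened scores are decorrelated with unit variance.
assert np.allclose(Z_white.T @ Z_white / N, np.eye(k), atol=1e-8)
```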

Variants

  • Kernel PCA (Schölkopf, Smola & Müller 1998): apply PCA in a feature space induced by a kernel, capturing non-linear structure (see the sketch after this list).
  • Sparse PCA: enforce sparsity in the loadings $v_i$ for interpretability.
  • Probabilistic PCA (Tipping & Bishop 1999): a Gaussian latent-variable generative model whose maximum-likelihood solution recovers PCA.
  • Robust PCA (Candès et al. 2009): decompose $X = L + E$ with low-rank $L$ and sparse outlier matrix $E$.
  • Incremental and randomised PCA scale to streaming or massive data.
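
A hedged sketch of kernel PCA with an RBF kernel, following the double-centring construction of Schölkopf et al.; `gamma`, $k$, and the dataset are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
gamma = 0.5                              # RBF bandwidth (illustrative)

# RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
sq = np.sum(X**2, axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# Double-centre K so the implicit feature map has zero mean.
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# Eigendecompose and normalise so each feature-space component has unit norm.
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]

k = 2
alphas = evecs[:, :k] / np.sqrt(evals[:k])   # expansion coefficients
Z = Kc @ alphas                              # non-linear PC scores of the data
```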

Modern alternatives

For non-linear structure, t-SNE (van der Maaten & Hinton 2008) and UMAP (McInnes et al. 2018) preserve local neighbourhoods and produce visually striking 2-D embeddings, although neither provides the simple linear out-of-sample map that PCA does. Autoencoders learn non-linear compressors with neural networks; a linear autoencoder with squared loss recovers the PCA subspace (Baldi & Hornik 1989), though its weights need not align with the individual components (a sketch follows below). PCA remains ubiquitous as a fast baseline whose only tuning choice is the number of components.
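
As an illustrative check of the linear-autoencoder claim (a sketch under simple assumptions, not a definitive implementation): gradient descent on $\lVert X - X W_1 W_2 \rVert_F^2$ should drive the reconstruction error toward the rank-$k$ PCA error.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # correlated data
X = X - X.mean(axis=0)

N, d, k = X.shape[0], X.shape[1], 2
lr = 1e-2
W1 = 0.1 * rng.normal(size=(d, k))        # encoder weights
W2 = 0.1 * rng.normal(size=(k, d))        # decoder weights

for _ in range(3000):
    E = X @ W1 @ W2 - X                   # reconstruction residual
    gW1 = X.T @ E @ W2.T / N              # gradients of ||E||_F^2 / (2N)
    gW2 = (X @ W1).T @ E / N
    W1 -= lr * gW1
    W2 -= lr * gW2

# Compare with the rank-k PCA reconstruction error; the two should be close.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pca_err = np.linalg.norm(X - (X @ Vt[:k].T) @ Vt[:k], "fro")
ae_err = np.linalg.norm(X - X @ W1 @ W2, "fro")
print(f"rank-{k} PCA error: {pca_err:.4f}   linear autoencoder error: {ae_err:.4f}")
```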

Related terms: Eigenvalue and Eigenvector, Singular Value Decomposition, Dimensionality Reduction, Autoencoder
