Glossary

PCA (mathematical detail)

Principal component analysis (PCA) finds the directions of maximum variance in a dataset. Given centred data $X \in \mathbb{R}^{N \times d}$ (each feature has zero mean), PCA solves:

$$\max_{w: \|w\| = 1} \frac{1}{N} \sum_n (w^\top x_n)^2 = \max_{\|w\|=1} w^\top S w$$

where $S = \frac{1}{N} X^\top X \in \mathbb{R}^{d \times d}$ is the sample covariance matrix.

Solution: by the Rayleigh-quotient characterisation of eigenvalues, the maximum is attained at $w = v_1$, the eigenvector of $S$ corresponding to its largest eigenvalue $\lambda_1$, and the maximum value is $\lambda_1$. Successive principal components are the next eigenvectors $v_2, v_3, \ldots$, each maximising variance subject to being orthogonal to all previous components.

Compactly: PCA = eigendecomposition of the sample covariance. With $S = V \Lambda V^\top$ (eigendecomposition), the principal components are the columns of $V$ ordered by decreasing eigenvalue, and the explained variance of component $k$ is $\lambda_k / \sum_j \lambda_j$.
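A minimal NumPy sketch of this route (the anisotropic toy data `X` is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])  # toy data
X = X - X.mean(axis=0)                  # centre each feature

S = X.T @ X / X.shape[0]                # sample covariance, d x d
lam, V = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]        # reorder by decreasing eigenvalue

explained = lam / lam.sum()             # explained-variance ratio per component
print(explained)
```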

Equivalent SVD formulation: with the SVD $X = U \Sigma V^\top$ of the centred data, the right singular vectors (columns of $V$) are the principal components, and the scaled squared singular values $\sigma_k^2 / N$ are the eigenvalues $\lambda_k$ of the covariance.
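Continuing the sketch above (assuming the same centred `X`, `lam`, and `V`), the equivalence can be checked numerically:

```python
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(sigma**2 / X.shape[0], lam))   # True: sigma_k^2 / N = lambda_k
print(np.allclose(np.abs(Vt.T), np.abs(V)))      # columns agree up to sign
```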

Dimensionality reduction: project onto the first $k$ components

$$Z = X V_k$$

where $V_k \in \mathbb{R}^{d \times k}$ contains the top $k$ eigenvectors and $Z \in \mathbb{R}^{N \times k}$ holds the component scores. Reconstruction error is minimised among all rank-$k$ projections by the Eckart–Young theorem:

$$\|X - X V_k V_k^\top\|_F^2 = \sum_{j=k+1}^{\min(N,d)} \sigma_j^2$$
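Still continuing the same sketch, the projection, reconstruction, and Eckart–Young identity can be verified directly:

```python
k = 2
Vk = V[:, :k]                       # top-k principal directions, d x k
Z = X @ Vk                          # scores: the reduced N x k data
X_hat = Z @ Vk.T                    # best rank-k reconstruction, N x d

err = np.linalg.norm(X - X_hat, "fro") ** 2
print(np.isclose(err, (sigma[k:] ** 2).sum()))   # True (Eckart-Young)
```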

Choosing $k$: a scree plot of the eigenvalues, a cumulative explained-variance threshold (e.g. retain 95%), or Bayesian PCA / parallel analysis for a model-selection-driven choice.
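For the cumulative-variance rule, a short computation on the `explained` ratios from the sketch above:

```python
cum = np.cumsum(explained)                   # cumulative explained variance
k95 = int(np.searchsorted(cum, 0.95)) + 1    # smallest k reaching 95%
print(k95, cum)
```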

Practical recipe (sketched in code after the list):

  1. Centre the data: $X \leftarrow X - \mathrm{mean}(X)$.
  2. Optionally standardise to unit variance per feature.
  3. Compute SVD of $X$ (more numerically stable than computing $X^\top X$ first).
  4. Project onto top $k$ singular vectors.
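A self-contained sketch of steps 1–4, taking the SVD route rather than forming $X^\top X$; the function name and random test data are illustrative only:

```python
import numpy as np

def pca_project(data, k, standardise=False):
    """Centre (and optionally standardise) the data, then project
    onto the top-k principal components via the SVD of X itself."""
    X = data - data.mean(axis=0)                         # 1. centre
    if standardise:
        X = X / X.std(axis=0)                            # 2. unit variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)     # 3. SVD of X
    return X @ Vt[:k].T                                  # 4. project onto top k

Z = pca_project(np.random.default_rng(1).normal(size=(100, 5)), k=2)
print(Z.shape)   # (100, 2)
```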

Computational notes: for $N \gg d$, eigendecomposition of $X^\top X$ ($d \times d$) is faster. For $d \gg N$, eigendecomposition of $X X^\top$ ($N \times N$) is faster (and gives the same singular values). For very large data, randomised SVD (Halko, Martinsson, Tropp 2011) computes the top $k$ singular components in $O(N d k)$ time with high accuracy.
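A sketch of the $d \gg N$ trick on synthetic data: eigendecompose the $N \times N$ Gram matrix $X X^\top$ and recover the top right singular vectors without ever forming the $d \times d$ covariance.

```python
N, d, k = 50, 10_000, 5
X = np.random.default_rng(2).normal(size=(N, d))
X = X - X.mean(axis=0)

G = X @ X.T                              # N x N Gram matrix (cheap: N << d)
mu, U = np.linalg.eigh(G)                # ascending eigenvalues mu = sigma^2
mu, U = mu[::-1][:k], U[:, ::-1][:, :k]  # keep the top-k eigenpairs
sigma = np.sqrt(mu)                      # singular values of X
V_top = X.T @ U / sigma                  # top-k right singular vectors, d x k
print(V_top.shape)                       # (10000, 5)
```

For the randomised route, one off-the-shelf implementation is `sklearn.utils.extmath.randomized_svd`.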

Variants:

  • Kernel PCA: PCA in a feature space defined by a kernel (sketched after this list).
  • Probabilistic PCA: a generative model whose MLE recovers PCA.
  • Sparse PCA: encourage sparsity in components.
  • Robust PCA: decompose $X = L + S$ into low-rank $L$ and sparse $S$ (handles outliers).
  • t-SNE / UMAP: non-linear alternatives that preserve local structure rather than global variance.
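As a brief illustration of one variant, a hedged kernel-PCA sketch using scikit-learn (assuming it is installed; the RBF kernel and `gamma` value are arbitrary choices for demonstration):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.default_rng(3).normal(size=(200, 4))
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
print(Z.shape)   # (200, 2)
```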

PCA remains one of the most widely used techniques for dimensionality reduction and exploratory data analysis.

Related terms: Eigendecomposition, Singular Value Decomposition, Principal Component Analysis
