2.8 Principal component analysis via SVD
Principal Component Analysis, almost always shortened to PCA, is the single most-used dimensionality-reduction technique in machine learning. The motivation is easy to state in plain English. We frequently find ourselves with data that lives in many dimensions: a photograph of a face flattened into a vector of 4,096 pixel intensities, a customer described by 200 behavioural features, a fragment of brain tissue characterised by the expression levels of 20,000 genes. We sense, intuitively, that the true number of independent things going on must be much smaller than 200 or 20,000. Pixels in neighbouring regions of a face move together; gene expression levels are governed by a handful of regulatory programmes; customer behaviours cluster around a few archetypes. PCA finds new axes, called principal components, along which the data varies most, lets us keep only the top few of them, and so collapses the original high-dimensional cloud into a much lower-dimensional representation: two or three coordinates when the goal is plotting, somewhat more when the goal is faster training or noise reduction.
In §2.7 we met the Singular Value Decomposition and saw that any matrix factorises as $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. PCA is the application of that decomposition to a data matrix, one row per example, one column per feature. The technique appears everywhere in modern AI and adjacent sciences: fMRI analyses use PCA to extract dominant patterns of cortical activation, single-cell genomics uses it to visualise tens of thousands of cells in a 2D scatter plot, finance uses it to identify the dominant risk factors driving a portfolio, and computer vision pipelines have used it for face recognition since the early 1990s. It also turns up as a building block inside more elaborate algorithms: whitening before independent component analysis, initialisation for neural-network training, anomaly detection by reconstruction error. Understanding PCA fluently is one of the highest-leverage investments an absolute beginner can make.
The intuition
Picture $N$ data points sitting inside $\mathbb{R}^D$. Each point might be a flattened photograph, where every coordinate records the brightness of one pixel, or a customer described by a list of behavioural numbers: frequency of visits, average basket size, time of day, and so on. The cloud of points is rarely a featureless ball. It almost always has shape: it is longer along some directions than others, and many of its $D$ axes carry redundant information because nearby pixels are correlated, or because behavioural features vary together. PCA asks a simple geometric question. If we could rotate our coordinate system to a new one, still orthogonal, still centred at the same origin, which orientation would line up best with the shape of the cloud?
The answer PCA gives is constructive. The first new axis, called the first principal component, points in the direction of greatest variance: it is the line along which the data is most spread out. The second principal component is forced to be orthogonal to the first and is then chosen to maximise variance subject to that constraint. The third is orthogonal to both and again maximises remaining variance, and so on, until we have $D$ new axes. Crucially, the variance tends to be concentrated in the leading axes: in real data the first axis often captures forty or fifty per cent of all the variance, the next one half of what is left, and so on, so that a handful of components account for nearly the whole shape. We then project every point onto only the first $k$ axes and discard the remaining $D - k$ coordinates, knowing that what we threw away was small.
A two-dimensional sketch makes the idea concrete. Scatter 1,000 points along the line $y = 0.8x$ and add a little Gaussian noise perpendicular to that line. Looked at in the original axes, the cloud is a tilted cigar. The first principal component is the long axis of the cigar, the line itself. The second principal component is the short axis, perpendicular to the first, capturing only the noise. The variance along PC1 is large; the variance along PC2 is tiny. If we keep only PC1 we have reduced two-dimensional data to one-dimensional data with negligible loss; if we plotted only the projected coordinate, the points would still tell us almost everything we needed about the original cloud.
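The sketch is easy to reproduce. Below is a minimal numpy version of the same experiment; the sample size, the noise level and all variable names are illustrative choices rather than anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.normal(size=n)                                  # position along the line y = 0.8x
along = np.array([1.0, 0.8]) / np.hypot(1.0, 0.8)       # unit vector along the line
perp = np.array([-0.8, 1.0]) / np.hypot(1.0, 0.8)       # unit vector perpendicular to it
X = np.outer(t, along) + 0.05 * np.outer(rng.normal(size=n), perp)

Xc = X - X.mean(axis=0)                                 # centre the cloud
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(Vt[0])                                            # ~ +/-(0.78, 0.62): the long axis of the cigar
print(s**2 / (n - 1))                                   # variance along PC1 >> variance along PC2
```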
The same picture generalises straightforwardly to $D$ dimensions. The cloud becomes a high-dimensional ellipsoid. The principal components are its axes, ordered from longest to shortest. The shortest axes are typically dominated by measurement noise or by redundant correlations. Keeping only the longest few captures the structure; throwing the rest away discards mostly noise. That is the entire intuitive content of PCA. Everything that follows, the mean-centring, the covariance matrix, the singular value decomposition, is the algebraic machinery for finding those axes efficiently and for measuring how much variance each one captures.
A related way of stating the same idea is in terms of reconstruction error. Suppose we are forced to describe each data point using only $k$ numbers instead of the original $D$. Which $k$ directions in $\mathbb{R}^D$ should we record the coordinates along, so that when we reconstitute the points by rebuilding from those $k$ coordinates we lose as little as possible? This is a least-squares optimisation over choices of subspace, and it has a clean answer: the optimal subspace is spanned by the top $k$ principal components, and the unavoidable reconstruction error, measured as the total squared error scaled by $1/(N-1)$, equals the sum of the discarded eigenvalues $\sum_{i=k+1}^D \sigma_i^2 / (N - 1)$. Maximising captured variance and minimising squared reconstruction error are exactly the same optimisation seen from two angles, which is why the formulae match the Eckart–Young theorem of §2.7 so cleanly.
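A quick numerical check of that equivalence, on arbitrary synthetic data (the shapes and the choice of $k$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(200, 6))
Xc -= Xc.mean(axis=0)                         # centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
Vk = Vt[:k].T                                 # D x k matrix of the top-k components
Xhat = Xc @ Vk @ Vk.T                         # project onto the subspace, then rebuild
print(np.sum((Xc - Xhat) ** 2))               # total squared reconstruction error ...
print(np.sum(s[k:] ** 2))                     # ... equals the sum of the discarded sigma_i^2
```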
The algorithm
The PCA algorithm has three steps, each short; the components it produces can also be reused later to project new points that were not in the original dataset.
Step 1: Centre the data. Compute the mean row $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}$, then subtract it from every row to obtain the centred matrix $\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^\top$. PCA is a statement about variance around the mean, so it requires zero-mean data; if we skip this step the first principal component will simply point at the mean of the cloud and tell us nothing about its shape. It is also good practice to scale each feature to unit variance when the original features have different units (kilograms versus millimetres, say), because PCA is sensitive to the scale on which each feature is measured.
Step 2: Take the SVD. Decompose $\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$. The columns of $\mathbf{V}$ (equivalently, the rows of $\mathbf{V}^\top$) are the principal components. They are unit vectors in $\mathbb{R}^D$, orthogonal to one another, and ordered so that $\mathbf{v}_1$ points along the direction of greatest variance, $\mathbf{v}_2$ along the next, and so on. The singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on the diagonal of $\boldsymbol{\Sigma}$ encode how much of the data is aligned with each component. Specifically, the variance along the $i$-th principal component is $\sigma_i^2 / (N - 1)$.
Step 3: Project. Pick a target dimensionality $k \ll D$ and form the $D \times k$ matrix $\mathbf{V}_k$ consisting of the first $k$ columns of $\mathbf{V}$. The projected data is $\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k$, an $N \times k$ matrix whose rows are the new low-dimensional coordinates. To reconstruct an approximation in the original space we apply the inverse rotation, $\hat{\mathbf{X}} = \mathbf{Z}\mathbf{V}_k^\top$, then add the mean back. The resulting $\hat{\mathbf{X}}$ is the best rank-$k$ approximation to $\tilde{\mathbf{X}}$ in the sense of squared error; this is the Eckart–Young theorem from §2.7.
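The three steps translate almost line for line into numpy. The following is a minimal sketch rather than a production implementation; the function names and the decision to return all the per-component variances are my own choices.

```python
import numpy as np

def pca_fit(X, k):
    """Steps 1 and 2: centre the data, take the SVD, keep the top-k components."""
    mean = X.mean(axis=0)                               # Step 1: mean row
    Xc = X - mean                                       # centred matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # Step 2: SVD
    variances = s**2 / (len(X) - 1)                     # variance along each component
    return mean, Vt[:k].T, variances                    # V_k has shape D x k

def pca_project(X, mean, Vk):
    """Step 3: low-dimensional coordinates Z = (X - mean) V_k; works for new points too."""
    return (X - mean) @ Vk

def pca_reconstruct(Z, mean, Vk):
    """Best rank-k approximation in the original space, with the mean added back."""
    return Z @ Vk.T + mean
```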
There is a second, equally valid recipe that goes via the covariance matrix. Form $\boldsymbol{\Sigma}_{\text{cov}} = \frac{1}{N - 1}\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$, a symmetric $D \times D$ matrix, and compute its eigendecomposition. The eigenvectors are the principal components, and the eigenvalues are the variances along each component. The two recipes give the same answer because $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}} = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^\top$, so the eigenvectors of the covariance matrix coincide with the right singular vectors of the centred data and the eigenvalues coincide with $\sigma_i^2 / (N - 1)$. Numerically, going via the SVD of $\tilde{\mathbf{X}}$ is preferred when $D$ is large because it avoids forming the $D \times D$ matrix $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$ explicitly, which can amplify rounding error.
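A short sanity check that the two recipes agree, on arbitrary synthetic data (the sizes are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(500, 8))
Xc -= Xc.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
cov = Xc.T @ Xc / (len(Xc) - 1)                         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                  # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # reorder to descending

print(np.allclose(eigvals, s**2 / (len(Xc) - 1)))       # True: eigenvalues match the variances
print(np.allclose(np.abs(eigvecs[:, 0]), np.abs(Vt[0])))  # True: first eigenvector matches PC1 up to sign
```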
Worked numerical example
Take five points in two dimensions: $$\mathbf{x}_1 = (2, 1), \quad \mathbf{x}_2 = (3, 2), \quad \mathbf{x}_3 = (4, 3), \quad \mathbf{x}_4 = (5, 4), \quad \mathbf{x}_5 = (6, 5).$$
The first thing to notice, by eye, is that all five points lie on the line $y = x - 1$. They are perfectly collinear. Whatever PCA does, it should detect this and tell us the data is intrinsically one-dimensional.
Centring. The mean is $\bar{\mathbf{x}} = (4, 3)$. Subtracting it from each row gives the centred points $(-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2)$. They now sit on the line $y = x$ through the origin.
Covariance matrix. Form $\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$: $$\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}} = \begin{pmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{pmatrix} = \begin{pmatrix} 10 & 10 \\ 10 & 10 \end{pmatrix}.$$ Dividing by $N - 1 = 4$ gives the sample covariance $$\boldsymbol{\Sigma}_{\text{cov}} = \begin{pmatrix} 2.5 & 2.5 \\ 2.5 & 2.5 \end{pmatrix}.$$
Eigenvalues and eigenvectors. The characteristic polynomial is $\det(\boldsymbol{\Sigma}_{\text{cov}} - \lambda \mathbf{I}) = (2.5 - \lambda)^2 - 2.5^2 = \lambda^2 - 5\lambda = \lambda(\lambda - 5)$, so the eigenvalues are $\lambda_1 = 5$ and $\lambda_2 = 0$. The eigenvector for $\lambda_1 = 5$ satisfies $(\boldsymbol{\Sigma}_{\text{cov}} - 5\mathbf{I})\mathbf{v} = \mathbf{0}$, which gives $-2.5 v_1 + 2.5 v_2 = 0$, so $v_1 = v_2$. Normalising yields $\mathbf{v}_1 = (1, 1)/\sqrt{2}$. The eigenvector for $\lambda_2 = 0$ is the orthogonal direction, $\mathbf{v}_2 = (1, -1)/\sqrt{2}$.
Variance explained. PC1 carries variance $\lambda_1 = 5$; PC2 carries variance $\lambda_2 = 0$. Total variance is $5$, so PC1 alone captures $5 / 5 = 100\%$ and PC2 captures $0\%$. The data is exactly one-dimensional, and PCA has found the line.
Projection. Project the centred points onto $\mathbf{v}_1$. The projection of a centred point $\tilde{\mathbf{x}}$ is $\tilde{\mathbf{x}}^\top \mathbf{v}_1$, a single number. For example, for $\mathbf{x}_5$ the centred coordinate is $(2, 2)$, and $(2, 2) \cdot (1, 1)/\sqrt{2} = 4/\sqrt{2} = 2\sqrt{2} \approx 2.83$. The five projected coordinates are $-2\sqrt{2}, -\sqrt{2}, 0, \sqrt{2}, 2\sqrt{2}$, and from these single numbers, together with the principal component direction, we can perfectly reconstruct the original 2D points.
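The whole worked example can be reproduced in a few lines of numpy; this is a sketch, and the signs of the singular vectors may come out flipped, which changes nothing about the subspace.

```python
import numpy as np

X = np.array([[2, 1], [3, 2], [4, 3], [5, 4], [6, 5]], dtype=float)
Xc = X - X.mean(axis=0)                       # the mean is (4, 3)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(s**2 / (len(X) - 1))                    # [5. 0.]: the eigenvalues lambda_1, lambda_2
print(Vt[0])                                  # +/-(1, 1)/sqrt(2): the first principal component
print(Xc @ Vt[0])                             # projected coordinates -2*sqrt(2), ..., 2*sqrt(2) (up to sign)
```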
Variance explained
The total variance in a centred dataset equals the sum of all eigenvalues of the covariance matrix, which equals $\frac{1}{N - 1}\sum_{i=1}^D \sigma_i^2$. The fraction captured by the first $k$ principal components is therefore $$\text{cumulative variance explained} = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^D \sigma_i^2}.$$ This single ratio is the standard yardstick for deciding how aggressive a reduction we can afford. Common practice is to pick the smallest $k$ for which the cumulative variance reaches some pre-set threshold: 90%, 95% and 99% are all popular choices, and the right value depends on the downstream task. For visualisation we always pick $k = 2$ or $3$, regardless of how much variance that captures; for compression or speed-up before a classifier we typically aim for 90–99%.
A useful diagnostic is the scree plot, named because it resembles the loose stones at the base of a mountain: plot $\sigma_i^2$ on the vertical axis against the component index $i$ on the horizontal axis. Real data usually shows a sharp drop followed by a long flat tail, and the elbow of that curve is a sensible place to truncate. Imagine a 1,000-dimensional dataset whose top eigenvalues are $\sigma_1^2 = 100$, $\sigma_2^2 = 50$, $\sigma_3^2 = 30$, then a tail that decays exponentially towards zero. Adding up the values, we might find that the first 10 components explain 95% of the variance and the next 990 between them explain only 5%. That is a strong signal: keep 10, discard 990, and most of the data has survived. In practice you should always inspect both the scree plot and the cumulative-variance curve before settling on $k$, because the two views show different things: the scree plot reveals where the elbow lies, while the cumulative curve tells you what fraction of the original variance you are keeping and, by implication, how much you are committing to throw away.
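A small helper that implements the cumulative-variance rule might look as follows; the function name and the default threshold are illustrative choices.

```python
import numpy as np

def choose_k(singular_values, threshold=0.95):
    """Smallest k whose cumulative variance explained reaches the threshold."""
    var = np.asarray(singular_values) ** 2            # proportional to per-component variance
    cumulative = np.cumsum(var) / var.sum()           # cumulative variance explained
    return int(np.searchsorted(cumulative, threshold) + 1)
```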
When PCA fails
PCA is powerful but it is far from a universal solution. A short catalogue of its failure modes will save the beginner from hard-to-diagnose mistakes.
Non-linear structure. PCA is fundamentally linear: it asks for the best linear subspace through the data. If your data lies on a curved manifold (the canonical example is a "Swiss roll", a 2D sheet rolled up like pastry inside 3D space), PCA simply collapses the manifold along the rolling axis and destroys the structure. The fixes are non-linear: kernel PCA, t-SNE and UMAP for visualisation, autoencoders and variational autoencoders for learned non-linear codes.
Outliers. Because PCA optimises squared error, a single anomalous point a long way from the rest of the cloud can drag the first principal component towards itself. Robust PCA variants exist, replacing the squared-error objective with something less sensitive, typically an L1 penalty, and decomposing the data into a low-rank component plus a sparse outlier component.
Discrete or binary data. PCA implicitly models continuous, roughly Gaussian features. For sparse binary data (one-hot encodings, presence/absence indicators in genomics, term-document matrices in text) the centring step actually destroys sparsity, and the resulting components are hard to interpret. Better tools for those settings include non-negative matrix factorisation, latent Dirichlet allocation and SVD without centring (truncated SVD, sometimes called latent semantic analysis).
High-dimensional small-sample regimes. When $D \gg N$, for example 20,000 genes measured on 30 patients, the empirical covariance matrix is rank-deficient and its top eigenvectors become unstable, often capturing peculiarities of the small sample rather than reproducible biology. Regularised or sparse variants of PCA, or simply collecting more samples, are the answer.
Where PCA appears in AI
A handful of representative uses gives the flavour.
Pre-processing for downstream models. A common pipeline reduces a high-dimensional input to a moderate dimensionality with PCA before passing it to a classifier or clustering algorithm. The downstream method runs faster, suffers less from the curse of dimensionality, and is sometimes more accurate because the discarded components were mostly noise.
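For concreteness, a hypothetical scikit-learn pipeline of this kind might look as follows; the component count, the classifier and the commented-out variable names are arbitrary placeholders, not a recommendation.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Reduce to 50 dimensions with PCA, then classify in the reduced space.
model = make_pipeline(PCA(n_components=50), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train); model.score(X_test, y_test)   # X_train, y_train, ... are placeholders
```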
Visualisation. Almost every paper in single-cell biology, ecology or unsupervised representation learning includes a 2D PCA scatter plot of its data. It is the simplest tool for asking "does this dataset have obvious structure?" and the answer is often visible at a glance.
Whitening. Transforming the data so that all principal components have unit variance is called whitening. It removes the dominant correlations between features and is a common pre-processing step before independent component analysis, before some adversarial training procedures, and historically before deep-network training (the "ZCA whitening" of CIFAR-10 in early convolutional-network work).
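A minimal PCA-whitening sketch, assuming a numpy data matrix with one row per example (the small epsilon, added to guard against division by zero on null components, is my own choice):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Rotate onto the principal components and rescale each to unit variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    stds = s / np.sqrt(len(X) - 1)            # standard deviation along each component
    return (Xc @ Vt.T) / (stds + eps)         # whitened coordinates
```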
Anomaly detection. Train PCA on a large sample of "normal" data, keep the top few components, and reconstruct new inputs through the projection-and-back operation. Normal points reconstruct almost perfectly because they lie close to the learned subspace; anomalies reconstruct poorly. The reconstruction error is a useful score.
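A hedged sketch of that scoring rule, written against the pca_fit-style helpers sketched in the algorithm section (mean and Vk come from a fit on the "normal" training data; the function name is my own):

```python
import numpy as np

def anomaly_scores(X_new, mean, Vk):
    """Squared reconstruction error of each row of X_new through the rank-k PCA."""
    Z = (X_new - mean) @ Vk                   # coordinates in the learned subspace
    X_hat = Z @ Vk.T + mean                   # reconstruction in the original space
    return np.sum((X_new - X_hat) ** 2, axis=1)   # large score = poorly reconstructed = likely anomaly
```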
Eigenfaces. In Turk and Pentland's 1991 face-recognition work, faces were represented as long pixel vectors and PCA was applied across a database of faces. The top principal components, reshaped back into images, looked like ghostly composite faces, the eigenfaces, and any new face could be identified by its handful of eigenface coefficients. The technique was the first practical demonstration that a high-dimensional perceptual problem could be tamed by linear projection.
What you should take away
- PCA finds new orthogonal axes, the principal components, along which the data varies most, and lets you project onto only the top few of them with minimal loss.
- The components are the right singular vectors of the centred data matrix; the variance along each component is $\sigma_i^2 / (N - 1)$, where $\sigma_i$ is the corresponding singular value.
- Always centre before applying PCA and consider standardising features whose units differ; PCA is sensitive to scale.
- Pick $k$ from a cumulative-variance threshold (90–99%) or from the elbow of the scree plot; for visualisation always pick $k = 2$ or $3$.
- PCA is linear: it cannot capture curved manifolds, is dragged around by outliers, and assumes roughly Gaussian continuous features. When those assumptions fail, reach for kernel PCA, t-SNE, UMAP, autoencoders, or non-negative matrix factorisation.