8.8 Principal component analysis
Principal Component Analysis (PCA) is the most widely used linear dimensionality-reduction technique in machine learning. If you ask any working data scientist what they reach for when a dataset has too many features to plot, too many features to model directly, or too many features that look correlated, PCA will be the first answer almost every time. It predates the digital computer: Pearson described it in 1901, Hotelling re-derived it in 1933, and it remains the default first step in countless analysis pipelines a century later.
We covered the linear-algebra derivation in §2.8. There we saw that PCA can be obtained in three equivalent ways: by maximising projected variance, by minimising reconstruction error, or by computing the singular value decomposition (SVD) of the centred data matrix. All three routes converge on the same answer, and the SVD route is the numerically stable one. We also met probabilistic PCA (PPCA), which gives PCA a likelihood and a generative interpretation, and discussed the sensitivity of PCA to feature scaling.
This section is different. It puts PCA back into the unsupervised-learning context where it actually gets used. The previous section, §8.7, treated spectral clustering, another eigenvector-based method, but applied to a similarity graph rather than to raw features. The next section, §8.9, generalises PCA to non-linear feature spaces via the kernel trick. The current section sits between those two: it focuses on the practical decisions that surround PCA in a real workflow: where to slot it into the pipeline, when to whiten the output, how to choose the target dimension, and how to recognise the situations where PCA simply will not work.
Quick recap
Given $n$ samples in $d$ dimensions stacked into a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, PCA finds a small number $k$ of orthogonal directions along which the data varies the most. The first principal component is the unit vector $\mathbf{w}_1$ that maximises the variance of the projection $\mathbf{X}\mathbf{w}_1$. The second principal component is the direction of greatest variance subject to being orthogonal to the first, and so on. The components are the top $k$ eigenvectors of the empirical covariance matrix $S = \tfrac{1}{n}\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}$, where $\tilde{\mathbf{X}}$ is the data with the column mean subtracted from each row.
In practice we compute PCA via the SVD of the centred data, $\tilde{\mathbf{X}} = U \Sigma V^\top$. The columns of $V$ are the principal components and the singular values $\sigma_i$ encode their importance: the variance captured by the $i$-th component is $\sigma_i^2 / n$. Keeping the top $k$ components and discarding the rest gives the projected data $\mathbf{Z} = \tilde{\mathbf{X}} V_{[:,:k]} \in \mathbb{R}^{n \times k}$, a lower-dimensional representation that is optimal among all linear projections in the sense of preserving variance and minimising reconstruction error.
The full mathematical machinery, including the Lagrangian derivation, the equivalence of variance maximisation and reconstruction-error minimisation, and the role of SVD in avoiding numerical instability, lives in §2.8. The remainder of this section assumes that material and concentrates on what it takes to use PCA well.
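To make the recap concrete, here is a minimal NumPy sketch of the SVD route; the synthetic data, the choice $k = 3$, and the variable names are ours, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # n = 200 samples, d = 10 features
k = 3

X_tilde = X - X.mean(axis=0)              # centre each column
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

Z = X_tilde @ Vt[:k].T                    # projected data, shape (n, k)
component_variances = s**2 / X.shape[0]   # variance captured by each component
```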
Practical pipeline
In any realistic supervised or unsupervised workflow, PCA is one stage of a longer pipeline rather than a standalone procedure. The minimal sequence is four steps.
- Standardise features to zero mean and unit variance using a StandardScaler or equivalent. Fit the scaler on training data only.
- Fit PCA on the standardised training data, choosing $k$ either by a variance-explained threshold, by cross-validation on a downstream task, or by a fixed budget the next stage demands.
- Transform the training and test data using the same fitted PCA; never refit on the test set.
- Use the projected $\mathbf{Z}$ as the feature representation for whatever comes next: a classifier, a clustering algorithm, a visualisation, or a similarity search index.
In scikit-learn the whole sequence becomes a Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=k)), ("clf", LogisticRegression())]), and cross-validation is then applied to the pipeline as a single object. This is more than aesthetic: wrapping the steps in a pipeline is the only way to make sure the scaler and PCA are refit inside each cross-validation fold, on training data only, with the held-out fold transformed using the fold's own fit. Skipping that detail is the most common source of inflated cross-validation scores in academic work and the easiest mistake to spot in a code review.
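Written out, a minimal version of that leak-free pipeline might look as follows; the dataset and the placeholder $k = 10$ are ours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),        # k = 10 is a placeholder choice
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler and PCA are refit inside every fold, on that fold's
# training portion only, so the score below is leak-free.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```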
The mistake to avoid is refitting PCA on test data. Because PCA depends only on the inputs and not on labels, it can feel harmless to compute the components on the full dataset before splitting. It is not. The components are functions of the test rows once you do that, and any test-set evaluation that follows is contaminated. The same logic applies when PCA is used inside a clustering or anomaly-detection pipeline: the fit must come from the training portion only. Whenever the dataset is so small that this looks wasteful, the right response is to use cross-validation rather than to leak.
A second pipeline question is where to place PCA relative to other transformations. PCA assumes linearity, so non-linear feature engineering (log transforms of skewed columns, one-hot encoding of categories, polynomial expansions) typically goes before PCA. Any feature-selection step that uses labels (recursive feature elimination, L1 regularisation in a model) should also come before PCA, since after projection the components are linear combinations of the originals and label-based selection no longer corresponds to dropping interpretable features. Imputation of missing values must precede PCA, because PCA itself does not handle missing data; PPCA fitted by EM is one principled answer if missingness is widespread.
A third question is whether to apply PCA at all. A cheap diagnostic is to fit a model with and without PCA and compare cross-validated scores. If the gain is marginal, the simpler pipeline wins. PCA earns its place when the feature count is high relative to the sample size, when downstream training is expensive, when features are obviously correlated, or when the goal is visualisation rather than prediction.
Whitening
Whitening goes one step beyond plain projection. After projecting the centred data onto the top $k$ components, each coordinate of $\mathbf{Z}$ has mean zero but variance $\sigma_i^2 / n$, so the components have very different scales. Whitening divides each column of $\mathbf{Z}$ by its standard deviation, producing a representation with zero mean and identity covariance. In matrix form, if $\Sigma_k$ is the diagonal matrix of the top $k$ singular values, the whitened projection is $\mathbf{Z}_{\text{white}} = \tilde{\mathbf{X}} V_{[:,:k]} \Sigma_k^{-1} \sqrt{n}$.
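The formula translates directly into NumPy under this section's $1/n$ variance convention; the function name in this sketch is ours.

```python
import numpy as np

def pca_whiten(X, k):
    """Project onto the top-k components and rescale to identity covariance.

    A direct transcription of the formula above, using the 1/n
    variance convention from this section.
    """
    n = X.shape[0]
    X_tilde = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    Z_white = X_tilde @ Vt[:k].T / s[:k] * np.sqrt(n)
    return Z_white  # columns now have mean 0 and variance 1
```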
The motivation is that many classical algorithms either explicitly or implicitly assume isotropic data. Logistic regression with L2 regularisation penalises every coefficient equally, which only makes sense if the features are on comparable scales. K-means and Gaussian mixture models with spherical covariances suffer when one direction has much larger variance than another, because the algorithm spends its capacity modelling the dominant axis. Independent component analysis (ICA) requires whitening as an explicit preprocessing step. Linear discriminant analysis benefits from it too. In each of these cases, feeding whitened PCA outputs rather than raw PCA outputs gives the downstream method a cleaner starting point.
There is a price. Whitening amplifies the smallest principal components, which are typically dominated by noise. Dividing a near-zero singular value into the projection inflates that direction far beyond its real importance. The standard fix is to apply whitening only to components whose singular values are above some threshold, or to add a small ridge $\epsilon$ inside the inverse: $(\Sigma_k + \epsilon I)^{-1}$. Both scikit-learn's PCA(whiten=True) and IncrementalPCA(whiten=True) perform unregularised whitening; if your data has a long tail of tiny components you will need to clip $k$ before whitening, or to use a custom transformer that adds the ridge.
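One possible shape for such a custom transformer is sketched below; the class name and the default $\epsilon$ are ours, and the class is hypothetical rather than part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RidgeWhitenPCA(BaseEstimator, TransformerMixin):
    """PCA whitening with a ridge added to the singular values.

    A hypothetical transformer: dividing by (sigma_i + eps) instead of
    sigma_i stops near-zero components from being inflated.
    """
    def __init__(self, n_components=2, eps=1e-3):
        self.n_components = n_components
        self.eps = eps

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        _, s, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        self.scale_ = (s[: self.n_components] + self.eps) / np.sqrt(len(X))
        return self

    def transform(self, X):
        # Project onto the stored components and rescale with the
        # ridged standard deviations estimated on the training data.
        return (X - self.mean_) @ self.components_.T / self.scale_
```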
Whitening is far less common in deep learning than in classical pipelines. Modern neural networks have their own normalisation tools, batch norm, layer norm, weight standardisation, and they operate inside the network where they can be re-estimated continuously as training progresses. A whitened input layer would freeze that normalisation at the start of training, which is rarely what you want when the features themselves are being relearned. The exception is contrastive self-supervised learning, where some methods (Barlow Twins, VICReg, the original W-MSE paper) explicitly enforce a whitened representation as part of the loss, treating decorrelation across the batch as a regulariser against representation collapse. This is whitening as a structural constraint on the learned features rather than as a preprocessing step on raw inputs.
A practical rule of thumb: whiten when the next stage is a classical method that benefits from isotropy, do not whiten when the next stage is a neural network with its own normalisation, and clip or ridge the small components in either case.
Choosing k
The choice of $k$ is the single most important hyperparameter in any PCA pipeline. Three families of criteria dominate practice: variance-explained thresholds, scree-plot elbows, and downstream cross-validation.
The variance-explained criterion is the simplest and most common. The proportion of variance explained by the first $k$ components is
$$ \mathrm{VE}(k) = \frac{\sum_{i \le k}\sigma_i^2}{\sum_i\sigma_i^2}. $$
Pick the smallest $k$ such that $\mathrm{VE}(k)$ exceeds a target, typically $0.90$, $0.95$, or $0.99$ depending on how much information loss the application can tolerate.
A worked example makes this concrete. Suppose the singular values of $\tilde{\mathbf{X}}$ yield component variances $[100, 50, 20, 10, 5, 2, 1]$, summing to $188$. The cumulative variance-explained values are $100/188 = 0.532$ at $k = 1$; $150/188 = 0.798$ at $k = 2$; $170/188 = 0.904$ at $k = 3$; $180/188 = 0.957$ at $k = 4$; $185/188 = 0.984$ at $k = 5$; $187/188 = 0.995$ at $k = 6$. To explain at least 95 per cent of the variance we therefore need the top four components, although $0.957$ is only a marginal pass; to explain at least 98 per cent we need the top five, where $0.984$ clears the line comfortably. The target threshold is a domain decision: a vision pipeline preparing inputs for a classifier might happily discard 20 per cent of the variance, whereas a quantitative-finance application that propagates principal components into a risk model will rarely drop below 99 per cent.
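The same arithmetic takes only a few lines of NumPy, reusing the worked example's component variances:

```python
import numpy as np

variances = np.array([100, 50, 20, 10, 5, 2, 1])
cum_ve = np.cumsum(variances) / variances.sum()
print(np.round(cum_ve, 3))
# [0.532 0.798 0.904 0.957 0.984 0.995 1.   ]

# Smallest k whose cumulative variance explained clears a threshold:
k_95 = int(np.argmax(cum_ve >= 0.95)) + 1   # -> 4
k_98 = int(np.argmax(cum_ve >= 0.98)) + 1   # -> 5
```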
The scree plot tells the same story graphically. Plot the singular values $\sigma_i$ (or the variances $\sigma_i^2$) against the index $i$. A typical plot drops sharply for the first few components and then flattens out. The "elbow", the point where the curve transitions from steep to flat, is taken as a natural cut-off. Scree plots work well when the elbow is unambiguous and badly when it is not, which is most of the time. Modern variants smooth the curve or fit broken-stick models, but the underlying ambiguity remains.
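For the worked example's variances the scree plot is a few lines of matplotlib; the elbow arguably sits around the third or fourth component, which itself illustrates the ambiguity.

```python
import numpy as np
import matplotlib.pyplot as plt

variances = np.array([100, 50, 20, 10, 5, 2, 1])
plt.plot(np.arange(1, 8), variances, marker="o")
plt.xlabel("component index $i$")
plt.ylabel("component variance")
plt.title("Scree plot: the cut-off is wherever you see the elbow")
plt.show()
```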
The cross-validation criterion treats $k$ as a hyperparameter of the full pipeline and selects it the same way you select any other hyperparameter: by holding out folds, training the pipeline at each candidate $k$, and picking the value with the best held-out score on the downstream task. This is more expensive than the variance-explained route but considerably more direct, because it measures the quantity you actually care about. When the downstream task is unsupervised, clustering or anomaly detection rather than classification, the cross-validation target has to be a proxy such as silhouette score, reconstruction error on held-out points, or stability across resamples.
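In scikit-learn this amounts to a grid search over the pipeline, reusing the pipe, X, y from the pipeline sketch above; the candidate grid is ours.

```python
from sklearn.model_selection import GridSearchCV

# k becomes an ordinary hyperparameter of the whole pipeline.
search = GridSearchCV(pipe, param_grid={"pca__n_components": [2, 5, 10, 20]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```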
When PCA fails
PCA assumes that the structure you care about lies along the directions of greatest variance and that those directions are linear. Both assumptions break in identifiable situations.
Non-linear manifolds defeat PCA. The classic teaching example is the Swiss roll, a two-dimensional sheet curled into three-dimensional space. The data lies on a smooth two-dimensional surface, but the linear directions of greatest variance run across the rolls rather than along them. Two principal components flatten the roll and superimpose points from different turns of the spiral; no linear projection can unroll a curved surface. Kernel PCA, t-SNE, UMAP, Isomap, and locally linear embedding (LLE) are all designed for this case; we will meet them in the next several sections. The diagnostic is to plot the projected data and ask whether nearby points in the projection correspond to nearby points on the underlying manifold. If they do not, PCA is the wrong tool.
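The failure is easy to reproduce with scikit-learn's built-in Swiss roll generator; colouring the projection by the known manifold coordinate $t$ makes the folding visible.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# t is the true position along the roll, available because the data is synthetic.
X, t = make_swiss_roll(n_samples=1000, random_state=0)
Z = PCA(n_components=2).fit_transform(X)

# Colouring by t shows the sheet folded over itself: nearby points in
# the projection can be far apart along the manifold.
plt.scatter(Z[:, 0], Z[:, 1], c=t, s=5)
plt.show()
```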
Discrete and categorical data also break PCA. Variance and the inner-product geometry that PCA inherits from it presuppose continuous coordinates. One-hot encoding a categorical variable inflates the dimension and produces a covariance matrix whose top components track the marginal frequencies of categories rather than meaningful structure. Non-negative matrix factorisation (NMF) is the standard linear answer for non-negative data; learned embeddings, such as GloVe for words and embedding layers for categorical features, are the modern non-linear answers. For mixed continuous-categorical data, factor analysis of mixed data (FAMD) and multiple correspondence analysis (MCA) are more appropriate.
Heavy-tailed distributions and outliers also break PCA. A single extreme point can shift the empirical covariance enormously, dragging the top component toward the outlier. Robust PCA, in the formulation of Candès et al. (2011), decomposes a data matrix into a low-rank component plus a sparse component, isolating outliers into the sparse part and recovering a clean low-rank structure. Median-based PCA variants and trimmed estimators are simpler but less principled alternatives. If a sanity check shows that the first principal component is being dominated by a handful of rows, robust PCA is the right replacement.
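The sanity check itself is cheap. The helper below is a hypothetical sketch of ours: it reports how much of the first component's variance is carried by its few largest rows.

```python
import numpy as np

def pc1_domination(X, top=5):
    """Share of the first component's variance carried by the `top` rows.

    A quick sanity check: if a handful of rows carry most of the
    variance along PC1, suspect outliers and consider robust PCA.
    """
    X_tilde = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    z1 = X_tilde @ Vt[0]                  # scores on the first component
    contrib = np.sort(z1**2)[::-1]        # per-row squared contributions
    return contrib[:top].sum() / contrib.sum()
```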
PCA is also blind to discriminative structure. Variance and class separability are different objectives, and there are datasets where the discriminative direction is one of the smallest principal components. Linear discriminant analysis (LDA) supplies the supervised counterpart that aligns the projection with class separation rather than total variance.
Modern alternatives
Several non-linear and probabilistic generalisations of PCA have grown into standard tools.
Kernel PCA (the subject of §8.9) performs PCA in a feature space implicitly defined by a positive-definite kernel. The Gaussian and polynomial kernels are the most common choices. Kernel PCA recovers non-linear structure that ordinary PCA misses, at the cost of fitting an $n \times n$ kernel matrix and losing the simple feature-space interpretation.
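As a preview of §8.9, scikit-learn's KernelPCA follows the same fit/transform interface as PCA; the RBF kernel and the gamma value here are illustrative choices of ours.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.05).fit_transform(X)
```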
t-SNE and UMAP are visualisation methods. Both project high-dimensional data onto two or three dimensions in a way that preserves local neighbourhood structure. They are excellent at revealing clusters and continuities at small scales and unreliable at large scales, where they distort distances and even the notion of which clusters are nearby. Use them for visualisation, not as features for downstream modelling.
Autoencoders learn non-linear codes via neural networks. An encoder maps $\mathbf{x}$ to a low-dimensional latent $\mathbf{z}$ and a decoder maps $\mathbf{z}$ back to $\mathbf{x}$; the network is trained to minimise reconstruction error. With linear activations and a squared loss, an autoencoder with a $k$-dimensional bottleneck recovers the same subspace as the top $k$ principal components, though not the components themselves, since the optimum is only determined up to an invertible linear map of the latent space. With non-linear activations it learns a non-linear manifold. Autoencoders are the natural deep-learning generalisation of PCA and we treat them in §8.13.
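A minimal PyTorch sketch of the linear case; the anisotropic synthetic data and the training length are arbitrary choices of ours.

```python
import torch
from torch import nn

d, k = 10, 3
# Anisotropic synthetic data, so the top-k subspace is well defined.
X = torch.randn(500, d) * torch.linspace(3.0, 0.3, d)
X = X - X.mean(dim=0)                               # centre, as PCA does

# Linear encoder/decoder with squared loss: the bottleneck converges
# to the same subspace as the top-k principal components.
model = nn.Sequential(nn.Linear(d, k, bias=False),
                      nn.Linear(k, d, bias=False))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = ((model(X) - X) ** 2).mean()             # reconstruction error
    loss.backward()
    opt.step()
```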
Variational autoencoders (VAEs) add a probabilistic structure on top, treating the latent code as a random variable with a prior and learning an approximate posterior via amortised variational inference. They are the probabilistic non-linear generalisation, mirroring the relationship between PPCA and ordinary PCA. We cover VAEs alongside other deep generative models.
What you should take away
- PCA is a pipeline component, not a standalone tool. Place it after standardisation, fit it on training data only, and wrap the whole sequence in a pipeline so cross-validation is leak-free.
- Whitening makes PCA outputs isotropic and helps classical methods that assume isotropy. Skip it for deep learning, and clip or ridge the small components if you do whiten.
- Choose $k$ by variance-explained threshold, scree-plot elbow, or downstream cross-validation. The 95 per cent threshold and the worked example with variances $[100, 50, 20, 10, 5, 2, 1]$, where the top four components give $180/188 = 0.957$ and the top five give $185/188 = 0.984$, illustrate the basic arithmetic.
- PCA assumes linearity, continuous data, and no outliers. Switch to kernel PCA, NMF, robust PCA, t-SNE, UMAP, or autoencoders when those assumptions fail.
- PCA remains the default first step for unsupervised exploration and the linear baseline against which more sophisticated methods are measured.