8.1 The unsupervised problem
Suppose someone hands you a hard drive of one million customer records and tells you to "find what's interesting." Nobody has labelled which customers are loyal, which are about to churn, or which sit in the same buying segment. There is no target to predict. Yet the data is not structureless: customers with similar incomes, ages and purchase histories cluster together; the dominant variation is along a few axes; a tiny fraction look anomalous and may be fraud.
Unsupervised learning is the family of methods that recover this hidden structure. Formally, given a sample $\mathcal{D} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$ drawn from an unknown distribution $p_{\text{data}}(\mathbf{x})$ over $\mathbb{R}^d$, the objective is to learn properties of the distribution itself rather than of a conditional $p(y\mid\mathbf{x})$. This is harder than supervised learning in one concrete sense: there is no scalar loss we can minimise without first making modelling assumptions about what "structure" means.
Four canonical sub-problems organise the field.
Density estimation. Estimate $\hat p(\mathbf{x}) \approx p_{\text{data}}(\mathbf{x})$ as a function we can evaluate or sample from. Examples: Gaussian fits, mixture models, kernel density estimation, normalising flows, autoregressive models. Density estimation underpins everything else: classification ($p(y\mid\mathbf{x}) \propto p(\mathbf{x}\mid y)p(y)$), anomaly detection ($\hat p(\mathbf{x})$ small implies outlier), and generation ($\mathbf{x}\sim\hat p$).
Clustering. Partition the data into groups whose members are similar to each other and dissimilar to members of other groups. Examples: k-means, Gaussian mixture models, agglomerative clustering, DBSCAN, spectral clustering. Clustering is implicit density estimation: we are inferring a discrete latent variable $z\in\{1,\dots,K\}$ such that $p(\mathbf{x}) = \sum_z p(z)p(\mathbf{x}\mid z)$.
Dimensionality reduction. Find a lower-dimensional representation $\mathbf{z}\in\mathbb{R}^k$ with $k\ll d$ that retains most of the information in $\mathbf{x}$. Examples: principal component analysis, kernel PCA, autoencoders, t-SNE, UMAP, manifold-learning algorithms. Dimensionality reduction is also implicit density estimation: it assumes $p_{\text{data}}$ concentrates near a low-dimensional manifold.
Generative modelling. Train a model from which we can sample new $\mathbf{x}^{\text{new}}\sim\hat p$ that resemble the training data. This sits at the boundary with density estimation and is the topic of Chapter 14; here we restrict attention to the simpler classical generators: GMMs and latent Dirichlet allocation.
These categories overlap. PCA can be cast as a Gaussian density estimator (probabilistic PCA), GMMs are simultaneously density estimators and clustering models, and autoencoders sit between dimensionality reduction and generation. The unifying theme is that we estimate something about $p_{\text{data}}$ without supervision.
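To make the overlap concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the toy blobs and the choice of $K=3$ are purely illustrative) in which a single fitted Gaussian mixture serves simultaneously as a density estimator, a soft clustering model, and a generator.

```python
# One fitted Gaussian mixture acts as a density estimator, a (soft) clustering
# model, and a generator at the same time. Data and K=3 are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: three loose 2-D blobs standing in for "customer records".
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(200, 2))
    for c in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

log_density = gmm.score_samples(X)   # density estimation: log p_hat(x) per point
hard_labels = gmm.predict(X)         # clustering: argmax_z p(z | x)
soft_labels = gmm.predict_proba(X)   # soft clustering: full p(z | x)
X_new, _ = gmm.sample(5)             # generation: x_new ~ p_hat

# Low-density points are anomaly candidates (cf. the density-estimation entry above).
outliers = X[log_density < np.quantile(log_density, 0.01)]
```

The same fitted object answers all three questions; which output matters depends on the task at hand.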
Why bother
Three reasons drive interest in unsupervised methods.
First, labelled data is expensive. Having a radiologist label a single chest X-ray costs seconds of clinician time; collecting tens of millions of such labels costs millions of pounds. Unlabelled images, by contrast, accumulate freely in PACS archives. Methods that learn from unlabelled data convert that abundance into model quality.
Second, labels can be wrong, missing or biased. In economics, "income" is reported with noise and selection bias. In medicine, "diagnosis" depends on the clinician and the protocol. Unsupervised analysis surfaces patterns that are present in the data even when the labels disagree.
Third, we sometimes do not yet know what to ask. Exploratory data analysis, "what's in this data?", precedes hypothesis-driven modelling. Clustering a single-cell RNA dataset reveals new cell types. Dimensionality reduction of voting patterns reveals political dimensions nobody had named. The unsupervised stage shapes the supervised questions.
Three things that go wrong
Without labels, we cannot do held-out test-set evaluation in the supervised sense. Three failure modes recur.
Wrong notion of similarity. All methods in this chapter depend on a metric, kernel, or generative form that defines what "close" means. Use Euclidean distance on raw MNIST pixel intensities, and a one-pixel translation makes a digit look almost as different from itself as it does from any other digit. The choice of metric is a choice of inductive bias and is rarely innocent.
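A small NumPy sketch of the point, using a synthetic 28×28 binary stroke in place of a real MNIST digit (the shapes and sizes are illustrative): under Euclidean distance on raw pixels, a one-pixel translation of an image already moves it a sizeable fraction of the way towards a genuinely different pattern.

```python
# Metric sensitivity: Euclidean distance on raw pixels treats a one-pixel
# translation of the same image as a large change. Synthetic image, illustrative.
import numpy as np

img = np.zeros((28, 28))
img[6:22, 13:16] = 1.0                   # a vertical bar, vaguely digit-like

shifted = np.roll(img, shift=1, axis=1)  # the same bar, translated one pixel right
other = np.zeros((28, 28))
other[13:16, 6:22] = 1.0                 # a genuinely different pattern (horizontal bar)

d_shift = np.linalg.norm(img - shifted)  # distance from the image to its own translate
d_other = np.linalg.norm(img - other)    # distance to the different pattern

print(f"to one-pixel translate: {d_shift:.2f}, to different pattern: {d_other:.2f}")
```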
Confirmation by visualisation. Two-dimensional projections from t-SNE or UMAP can produce well-separated blobs even when the underlying data has no cluster structure, because the algorithms aggressively separate neighbours from non-neighbours. The blobs you see are partly an artefact of the algorithm's bias and of hyperparameters such as t-SNE's perplexity or UMAP's number of neighbours.
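The effect is easy to reproduce. The sketch below (scikit-learn and matplotlib assumed; the sample size, dimensionality, and perplexity values are illustrative) runs t-SNE on isotropic Gaussian noise; depending on the perplexity, the resulting map can show apparently separated blobs even though no clusters exist by construction.

```python
# t-SNE applied to pure Gaussian noise: any apparent blobs in the 2-D map are
# artefacts of the algorithm and its hyperparameters, not structure in the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(500, 50))     # isotropic noise: no clusters by construction

for perplexity in (5, 30):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X_noise)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], s=5)
    plt.title(f"t-SNE of pure noise, perplexity={perplexity}")
plt.show()
```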
Spurious clusters. k-means with $K=10$ will return ten clusters whether or not the data has ten groups; agglomerative clustering will build a dendrogram for a uniform sample on a square; LDA will find topics in noise. Every clustering method requires a model-selection step to ask whether the structure is real.
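A corresponding sketch for k-means (scikit-learn assumed; the uniform sample and $K=10$ are illustrative): the algorithm returns ten clusters on data that has none, and only an explicit model-selection check, here a silhouette score that one would compare against a null baseline, separates this from genuine structure.

```python
# Spurious clusters: k-means returns exactly K partitions regardless of whether
# the data contains any groups. Uniform data on the unit square has none.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_uniform = rng.uniform(size=(1000, 2))   # structureless by construction

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_uniform)

print("clusters returned:", np.unique(km.labels_).size)   # always 10
# A nonzero silhouette on structureless data is not evidence of real groups;
# compare it against a null baseline (or use a gap statistic) before believing it.
print("silhouette:", silhouette_score(X_uniform, km.labels_))
```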
The defensive posture: treat unsupervised results as hypotheses; validate them against held-out data, downstream tasks, or domain knowledge; and publish negative findings (no clusters found) as readily as positive ones.
Unsupervised learning trades labels for assumptions. Every method encodes a particular notion of "structure": spherical clusters, a low-rank linear subspace, a smooth manifold, sparse topics. The chosen assumption is the most consequential modelling decision.
Historical context
Unsupervised methods predate the term "machine learning". The arithmetic mean and the empirical covariance matrix are unsupervised in the modern sense. Karl Pearson introduced principal component analysis in 1901 to study lines of "best fit" through correlated biometric measurements; Harold Hotelling rediscovered and formalised the method in 1933 in the context of psychological testing. The k-means algorithm, although attributed to Stuart Lloyd's 1957 internal Bell Labs report (published 1982), shares its essence with earlier vector quantisation work in signal processing. Hierarchical clustering goes back to Sokal and Sneath's 1963 Principles of Numerical Taxonomy, which sought to make biological classification objective and reproducible. The expectation-maximisation algorithm was given its name and general form by Dempster, Laird and Rubin in 1977, although special cases had existed for decades.
The 1990s and 2000s brought non-parametric and kernel methods into machine learning: kernel density estimation, kernel PCA (Schölkopf, Smola and Müller 1997), and the first wave of manifold-learning algorithms (Tenenbaum's Isomap 2000, Roweis and Saul's locally linear embedding 2000, Belkin and Niyogi's Laplacian eigenmaps 2003). The same period added density-based clustering (DBSCAN 1996, OPTICS 1999, followed by HDBSCAN in 2013), spectral clustering (Shi and Malik 2000; Ng, Jordan and Weiss 2002), and topic models (LDA 2003). The 2010s brought neural autoencoders into the mainstream and established t-SNE (2008) and UMAP (2018) as the standard visualisation tools. The 2020s have seen the unsupervised vocabulary partly absorbed into the broader paradigm of self-supervised learning (§8.17).
This chapter sticks largely with the classical curriculum because these methods retain practical importance, are easier to reason about analytically, and form the conceptual scaffolding on which the modern methods rest.
What this chapter covers
The chapter is organised by the family of structure each method assumes.
§8.2 surveys density estimation, the one family that estimates $p(\mathbf{x})$ directly, covering histograms, kernel methods, and the parametric/non-parametric trade-off. §8.3 introduces k-means, the workhorse clustering algorithm, derives Lloyd's algorithm as coordinate descent, presents k-means++ initialisation, and discusses how to choose the number of clusters. §8.4 covers hierarchical clustering and its dendrograms, which give nested partitions rather than flat groupings. §8.5 generalises k-means to Gaussian mixture models with the EM algorithm, exposing the connection between hard and soft clustering. §8.6 turns to density-based clustering with DBSCAN and HDBSCAN, which find arbitrarily shaped clusters and explicitly label noise. §8.7 closes the clustering material with spectral clustering, which reframes the problem as graph partitioning and solves it through eigenvectors of the Laplacian.
§8.8 reviews principal component analysis, the foundational linear dimensionality reduction technique, derived three ways. §8.9 generalises PCA to non-linear feature spaces via the kernel trick. §8.10 and §8.11 cover the modern visualisation tools t-SNE and UMAP, and §8.12 surveys the early-2000s lineage of manifold learning methods (Isomap, locally linear embedding, Laplacian eigenmaps) from which they descend. §8.13 introduces autoencoders, the neural-network counterpart to PCA, and their denoising, sparse and contractive variants.
§8.14 to §8.16 cover specific tasks that draw on the methods of the preceding sections: topic modelling with latent Dirichlet allocation, anomaly detection across a range of techniques, and the evaluation of unsupervised methods using internal indices, external indices and downstream tasks. §8.17 then moves to self-supervised pretraining, the modern incarnation of unsupervised learning that powers every foundation model since 2018.
Notation and conventions
Throughout this chapter, $\mathcal{D}=\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(n)}\}$ is the data sample, with each $\mathbf{x}^{(i)}\in\mathbb{R}^d$. We write $X\in\mathbb{R}^{n\times d}$ for the data matrix whose rows are the $\mathbf{x}^{(i)}$. Centred quantities use a tilde: $\tilde{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)}-\bar{\mathbf{x}}$, and $\tilde X$ is the centred data matrix. We use lowercase Greek letters for parameters ($\theta,\mu,\sigma,\pi$) and lowercase $z$ for latent discrete labels (cluster, topic). Matrix transpose is $X^{\top}$, the Euclidean norm is $\lVert\cdot\rVert$, and $\lvert\cdot\rvert$ denotes the determinant of a matrix. Kernel matrices are denoted $K\in\mathbb{R}^{n\times n}$ with $K_{ij}=k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$.
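As a quick reference, the conventions translate into code as follows (a NumPy sketch; the sample size, dimensionality, and the RBF kernel with $\sigma=1$ are illustrative choices).

```python
# The chapter's notation in code: rows of X are the x^(i), X_tilde is the
# centred data matrix, K is a kernel matrix. Sizes and kernel are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))        # data matrix, one sample per row

x_bar = X.mean(axis=0)             # sample mean, shape (d,)
X_tilde = X - x_bar                # centred data matrix

# RBF kernel matrix K_ij = exp(-||x^(i) - x^(j)||^2 / (2 * sigma^2)), sigma = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)        # shape (n, n)
```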