Chapter Eight

Unsupervised Learning

Learning Objectives
  1. Describe the k-means algorithm, its alternating Lloyd's-update iterations, and its limitations
  2. Build and interpret hierarchical clustering dendrograms with different linkage criteria
  3. Apply DBSCAN to find density-based clusters and identify noise points
  4. Use principal component analysis (PCA) to reduce dimensionality while preserving variance
  5. Detect anomalies using statistical, distance-based, and model-based techniques

You have a million customer records but no labels. Nobody has marked which customers are similar or which ones are unusual. You cannot use supervised learning because there is nothing to supervise. Yet the data still contains structure — groups of similar customers, dominant patterns, and rare outliers. Unsupervised learning finds that structure without being told what to look for.

This matters because labelled data is expensive. Hiring experts to annotate thousands of examples costs time and money. Unlabelled data, by contrast, is everywhere. Unsupervised methods let you extract value from it.

The algorithms in this chapter fall into three families:

  • Clustering (Sections 8.1–8.3): group similar data points together.
  • Dimensionality reduction (Section 8.4): compress high-dimensional data into fewer dimensions while keeping the important variation.
  • Anomaly detection (Section 8.5): find data points that look different from the rest — fraud, defects, intrusions.

All unsupervised methods depend on a good notion of similarity or distance. If the metric does not match the structure you care about, the results will be misleading.

8.1   K-Means Clustering

K-means is the most widely used clustering algorithm. It is fast, simple, and scales to large datasets.

How It Works

You choose a number of clusters K. The algorithm then alternates between two steps until nothing changes:

  1. Assignment step: assign each point to the nearest centroid (cluster centre), using Euclidean distance.
  2. Update step: recompute each centroid as the mean of its assigned points.

Each iteration decreases (or holds constant) the within-cluster sum of squares (WCSS): Σk=1^K^ Σx∈Ck ‖x − μk‖^2^. Convergence is guaranteed, but only to a local minimum — not necessarily the best solution.
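
The two alternating steps can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: it uses naive random initialisation and omits empty-cluster handling.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps.

    Minimal sketch; real libraries add k-means++ seeding, multiple
    restarts, and empty-cluster handling.
    """
    rng = np.random.default_rng(seed)
    # Naive initialisation: K distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        new_centroids = np.array(
            [X[labels == k].mean(axis=0) for k in range(K)]
        )
        if np.allclose(new_centroids, centroids):
            break  # nothing changed: converged (to a local minimum)
        centroids = new_centroids
    return centroids, labels
```

Because each step can only lower the WCSS, the loop always terminates; running it several times from different seeds and keeping the best result is the standard defence against bad local minima.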

Initialisation Matters

Random starting centroids can produce poor results. The k-means++ strategy (Arthur, 2007) fixes this by spreading out the initial centroids. The first is chosen uniformly at random. Each subsequent one is chosen with probability proportional to its squared distance from the nearest existing centroid. In expectation, the resulting solution is within an O(log K) factor of the optimal WCSS, and the method works much better than random initialisation in practice.
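
The seeding rule itself is short. A numpy sketch of just the selection step (to be paired with any Lloyd's-style loop):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """k-means++ seeding: spread the initial centroids apart."""
    if rng is None:
        rng = np.random.default_rng()
    # First centroid: uniform over the data.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Points far from every existing centroid get large weights, so the seeds naturally land in different regions of the data.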

Choosing K

The true number of clusters is rarely known. Three common approaches:

  • Elbow method: plot WCSS against K and look for a kink where the improvement slows sharply. Often ambiguous.
  • Silhouette score: for each point, compare the mean distance to its own cluster with the mean distance to the nearest other cluster. Values range from −1 to 1; higher is better.
  • Gap statistic: compare observed WCSS to what you would expect under a uniform (no-cluster) distribution. Pick the smallest K where the gap is significant.

Domain knowledge often helps more than any metric.
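
The silhouette approach is easy to automate. A sketch assuming scikit-learn is available (its KMeans uses k-means++ seeding by default):

```python
# Scan candidate values of K and pick the silhouette-maximising one.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data with three well-separated blobs.
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 5, 10)])

scores = {}
for K in range(2, 7):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    scores[K] = silhouette_score(X, labels)  # mean over points, in [-1, 1]

best_K = max(scores, key=scores.get)
```

On real data the curve is rarely this clean; treat the silhouette peak as a suggestion to check against domain knowledge, not a verdict.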

Limitations

K-means assumes clusters are convex, roughly spherical, and similar in size. When these assumptions fail, it struggles. Elongated clusters may be split. Clusters of very different densities may be merged.

K-medoids (PAM) uses actual data points as centres and works with any distance metric, at higher cost. Mini-batch k-means processes random subsets per step, trading a small accuracy loss for dramatic speed gains on large datasets.

Why K-Means Endures

Despite its limitations, k-means is indispensable:

  • It initialises more complex models like Gaussian mixture models.
  • Cluster labels become features for downstream models.
  • It powers vector quantisation in image compression (Oord, 2017).
  • Its O(nKd) per-iteration cost and parallel structure handle millions of points.

It is the baseline against which all other clustering methods are measured.

8.2   Hierarchical Clustering

Hierarchical clustering builds a tree of nested groups called a dendrogram. Unlike k-means, it does not force you to choose K upfront. You can cut the tree at any height to get a different number of clusters.

Two Strategies

Agglomerative (bottom-up): start with each point as its own cluster. Merge the two closest clusters at each step until one cluster remains. This is by far the more common approach.

Divisive (top-down): start with all points in one cluster. Recursively split until each point is alone. Less common due to higher cost.

Linkage Criteria

The linkage criterion defines "distance between clusters." The choice shapes the result:

  • Single linkage: minimum distance between any pair of points. Finds elongated clusters but is vulnerable to the "chaining effect," where noise bridges distinct groups.
  • Complete linkage: maximum pairwise distance. Produces compact clusters but is sensitive to outliers.
  • Average linkage (UPGMA): mean pairwise distance. A reasonable compromise.
  • Ward's method: merges the pair that increases total within-cluster variance the least. Often the best default.
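
In practice the merge tree is usually built with a library. A sketch assuming SciPy: build the tree with Ward's method, then cut it to obtain flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])

Z = linkage(X, method="ward")  # (n-1) x 4 merge table: who merged, at what height
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
```

Swapping `method="ward"` for `"single"`, `"complete"`, or `"average"` applies the other linkage criteria to the same data, which is a quick way to see how much the choice shapes the result.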

Computational Cost

Agglomerative clustering typically costs O(n^2^ log n) time and O(n^2^) space for the distance matrix. This limits it to at most tens of thousands of points. BIRCH compresses data into subclusters first, then applies agglomerative clustering to the compressed version.

Reading the Dendrogram

The height at which two clusters merge reflects their dissimilarity. Long vertical lines mean well-separated groups. Short lines mean ambiguous boundaries. The cophenetic correlation measures how faithfully the tree preserves the original distances.

Where It Works Best

Hierarchical clustering shines when data has natural nested structure. Phylogenetic trees in biology. Topic hierarchies in document analysis. Community structure in social networks. Its deterministic output (given a linkage criterion) is an advantage over k-means, but early merge errors propagate through the entire tree — so careful preprocessing matters.

8.3   DBSCAN

DBSCAN (Ester, 1996) defines clusters as dense regions separated by sparse ones. It does not need you to specify the number of clusters. It finds clusters of any shape. And it automatically labels outliers as noise.

How It Works

Two parameters control the algorithm:

  • ε (epsilon): the radius of a point's neighbourhood.
  • minPts: the minimum number of points required for a region to count as dense.

Points are classified into three types:

  • Core point: has at least minPts neighbours within distance ε.
  • Border point: falls within ε of a core point but does not itself meet the density threshold.
  • Noise point: neither core nor border.

Clusters form by linking core points whose neighbourhoods overlap. The algorithm iterates through unvisited points, starting a new cluster whenever it finds an unvisited core point and expanding it by recursively including all density-reachable points.
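
A sketch assuming scikit-learn: two dense blobs plus one isolated point that should come back labelled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.1, (30, 2)),   # dense blob A
    rng.normal(3, 0.1, (30, 2)),   # dense blob B
    [[10.0, 10.0]],                # isolated point, far from both blobs
])

# eps is the neighbourhood radius; min_samples is minPts.
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
# Core/border points get cluster ids 0, 1, ...; noise points get -1.
```

Note that K was never specified: the two clusters and the noise label emerge from the density structure alone.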

Complexity

With a spatial index (KD-tree or R-tree), DBSCAN runs in O(n log n). Without one, it is O(n^2^). It handles millions of points in low-to-moderate dimensions.

Choosing Parameters

The k-distance graph helps pick ε. Plot each point's distance to its k-th nearest neighbour, sorted in decreasing order. A sharp elbow suggests a natural threshold. A common rule sets minPts ≥ d + 1, where d is the dimensionality. For 2D data, minPts = 4 is standard.
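
Computing the k-distance graph takes only a few lines. A sketch assuming scikit-learn's NearestNeighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

k = 4  # minPts for 2D data
# Ask for k+1 neighbours because each point's nearest neighbour is itself.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, k])[::-1]  # k-th neighbour distance, descending
# Plotting k_dist reveals the elbow; its height is a candidate for eps.
```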

The main weakness: a single global ε cannot handle clusters of very different densities.

Extensions

OPTICS computes a reachability distance for each point, producing a plot from which clusters at multiple density levels can be extracted. HDBSCAN builds a full cluster hierarchy and selects the most stable clusters, effectively automating the choice of ε. These successors retain DBSCAN's strengths while reducing sensitivity to parameters.

8.4   Principal Component Analysis

PCA is the most fundamental dimensionality reduction technique. It finds a set of orthogonal directions — the principal components — along which the data varies the most.

The Idea

The first principal component captures the direction of greatest variance. The second captures the greatest variance orthogonal to the first. And so on. By keeping only the first k < d components, you get a lower-dimensional representation that preserves as much variation as possible while discarding noise.

The Maths

PCA computes the eigendecomposition of the data's covariance matrix C = (1/n) X^T^X, where X is mean-centred. The eigenvectors are the principal component directions. The eigenvalues λ1 ≥ λ2 ≥ … ≥ λd measure the variance each component captures.

The proportion of total variance explained by the first k components is (λ1 + … + λk) / (λ1 + … + λd). This tells you how much information your reduced representation retains.

Equivalently, PCA is the linear projection that minimises reconstruction error. Variance maximisation and error minimisation are two sides of the same coin.

Computation

In practice, you use the singular value decomposition (SVD): X = UΣV^T^. The columns of V are the principal components. The singular values relate to eigenvalues by λi = σi^2^/n. SVD is more numerically stable than eigendecomposition. Iterative methods extract just the top k components without computing the full decomposition. Randomised SVD achieves near-optimal accuracy in O(ndk) time for very large matrices.
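The SVD route can be written directly in numpy. A minimal sketch: centre the data, decompose, and project onto the top k directions.

```python
import numpy as np

def pca(X, k):
    """Return the projected data, top-k components, and their variances."""
    Xc = X - X.mean(axis=0)                # mean-centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                    # rows of V^T = PC directions
    explained_var = S[:k] ** 2 / len(X)    # lambda_i = sigma_i^2 / n
    return Xc @ components.T, components, explained_var
```

Dividing `explained_var` by the total variance across all components gives the proportion-of-variance-explained quantity from the previous subsection.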

Beyond Linear

PCA only finds linear structure. When data lies on a curved surface, linear projection may fail.

  • Kernel PCA: applies the kernel trick to perform PCA in a higher-dimensional space implicitly.
  • t-SNE (Maaten, 2008): preserves local structure. Excellent for 2D/3D visualisation.
  • UMAP (McInnes, 2018): similar quality to t-SNE but better global structure and faster.

Applications

PCA is used everywhere:

  • Eigenfaces: represent faces as combinations of principal components for recognition and compression.
  • Genomics: reveal population structure by projecting thousands of genetic markers onto a few axes.
  • Finance: identify dominant risk factors in portfolios.
  • Preprocessing: reduce feature dimensionality before running a classifier, alleviating the curse of dimensionality.

Scaling Warning

PCA is sensitive to feature scale. A variable measured in large units (e.g., income in pounds) will dominate the variance and the components. Standardise your data first — subtract the mean, divide by the standard deviation — unless all features share the same units and their variances are meaningful.

8.5   Anomaly Detection

Anomaly detection finds data points that differ significantly from the rest. These outliers might be fraud, equipment failures, network intrusions, or measurement errors. They might also be scientific discoveries — new particles or unknown astronomical objects.

The challenge is that anomalies are rare and diverse. You cannot collect labelled examples of every kind of anomaly. So most anomaly detection methods learn what "normal" looks like and flag anything that deviates.

Statistical Methods

The simplest approach models normal data as a Gaussian and flags points far from the mean. The Mahalanobis distance generalises the z-score to multiple dimensions, accounting for correlations:

D(x) = √[(x − μ)^T^ Σ^−1^ (x − μ)]

For more complex shapes, Gaussian mixture models (GMMs) estimate density with multiple Gaussians. Kernel density estimation (KDE) is a non-parametric alternative that places a smooth kernel around each training point.
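
The Gaussian approach fits in a few lines of numpy. A minimal sketch: estimate the mean and covariance from presumed-normal data, then flag points whose Mahalanobis distance exceeds a threshold.

```python
import numpy as np

def mahalanobis(X, mu, cov):
    """Mahalanobis distance of each row of X from a fitted Gaussian."""
    diff = X - mu
    inv = np.linalg.inv(cov)
    # sqrt of (x - mu)^T Sigma^-1 (x - mu), computed for every row at once
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (500, 2))                # "normal" training data
mu, cov = normal.mean(axis=0), np.cov(normal, rowvar=False)

test_points = np.array([[0.0, 0.0], [8.0, 8.0]])
d = mahalanobis(test_points, mu, cov)              # large d = likely anomaly
```

A threshold on d (e.g. the 99th percentile of distances on the training data) turns the score into a flag.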

Isolation Forest

The isolation forest (Liu, 2008) takes a different approach. Instead of modelling normal data, it directly isolates anomalies. The logic: anomalous points are few and different, so they are easier to separate.

The algorithm builds an ensemble of random binary trees. Each tree picks random features and split values until every point sits in its own leaf. Anomalies, being in sparse regions, reach their leaf in fewer splits. The anomaly score is based on the average path length from root to leaf.

Isolation forests are efficient — O(n log n) per tree — handle high dimensions well, and need no distributional assumptions.
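
A usage sketch assuming scikit-learn: fit on mostly-normal data, then score new points. Its `predict` returns −1 for anomalies and +1 for inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 2))   # normal operating data

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
# A central point and a far-away point: only the latter should be flagged.
preds = forest.predict(np.array([[0.0, 0.0], [9.0, 9.0]]))
```

For a continuous anomaly score rather than a hard label, `score_samples` exposes the path-length-based score directly.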

Local Outlier Factor (LOF)

LOF (Breunig, 2000) compares the local density around a point to the density around its neighbours. Density is estimated from the average reachability distance to the k-nearest neighbours.

A LOF value well above 1 means the point sits in a sparser region than its neighbours. LOF catches local anomalies — points that are outliers relative to their neighbourhood, even if they would not stand out globally.
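
A sketch assuming scikit-learn: a point that sits only moderately far from a dense cluster, yet stands out sharply relative to its neighbourhood.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.1, (50, 2)),   # a dense cluster
    [[1.0, 1.0]],                  # nearby, but in a much sparser region
])

lof = LocalOutlierFactor(n_neighbors=10)
preds = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # LOF values; well above 1 = outlier
```

Globally, (1, 1) is only one unit from the cluster; locally, its density is far below that of its neighbours, which is exactly the situation LOF is designed to catch.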

Other Methods

  • One-class SVM: learns a boundary enclosing normal data in kernel space. New points outside the boundary are flagged.
  • Autoencoders: neural networks trained to reconstruct normal inputs. Anomalies produce high reconstruction error because they lie outside the learned manifold.

Evaluation

Standard accuracy is misleading when anomalies are less than 1% of the data — a classifier that labels everything "normal" gets 99%+ accuracy. Use AUROC and AUPRC for threshold-independent evaluation. In practice, the false-positive rate is often the binding constraint. Too many false alarms erode trust and overwhelm investigators. Choose your method based on the real cost of missed detections versus false alarms, the availability of labelled data, and the need for explainable flags.