8.4 Hierarchical clustering
Hierarchical clustering builds a tree of nested partitions called a dendrogram, rather than a single flat assignment of points to clusters. At the bottom of the tree, every observation sits in its own singleton cluster. At the top, all observations are bundled into a single group containing the entire dataset. Every horizontal cut through this tree corresponds to one possible flat clustering: cut close to the leaves and you obtain many small groups, cut close to the root and you obtain a few large ones. The analyst chooses the cut that best matches the question being asked, and can change their mind without re-running the algorithm.
This is a markedly different bargain from the one we struck in §8.3. K-means demanded that we commit to a value of $K$ before we saw any structure, then optimised within that constraint. Hierarchical clustering refuses to commit. Instead of a partition, it returns a complete account of the order in which points and groups draw together as the resolution coarsens. That account is itself a piece of evidence about the data: a long, lonely branch tells us that some group is genuinely distinct, while a flurry of merges at similar heights tells us that the boundary between clusters is fuzzy.
Two flavours exist. Agglomerative clustering is bottom-up: start with singletons and successively merge the closest pair until one cluster remains. Divisive clustering is top-down: start with the whole dataset and recursively split. Agglomerative is overwhelmingly the standard, because divisive requires a non-trivial sub-clustering routine at every level. The rest of this section concentrates on the agglomerative case.
8.4.1 Agglomerative algorithm
The recipe is short enough to state in three lines and long enough in execution to define an entire family of methods.
- Initialise: place each of the $n$ observations in its own singleton cluster, so we begin with $n$ clusters.
- Repeat: locate the pair of clusters $A,B$ with the smallest between-cluster distance $D(A,B)$ under the chosen linkage, merge them into $A\cup B$, and record the height of the merge.
- Stop: when only a single cluster remains, or when a stopping rule (a target number of clusters, a maximum merge height) is met.
The result is a sequence of $n-1$ merges, each tagged with the height at which it occurred. Plotting the merges as a binary tree, with merge height on the vertical axis, gives the dendrogram.
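To make the merge loop concrete, here is a deliberately naive sketch in Python. The function name `naive_single_linkage` and the toy points are illustrative only; it fixes single linkage for simplicity and rescans the full distance matrix at every step, which is exactly the brute-force strategy discussed next.

```python
import numpy as np

def naive_single_linkage(X):
    """Agglomerative clustering by brute force, single linkage.

    Returns one (members_a, members_b, height) record per merge,
    in the order the merges occur.
    """
    # Initialise: every observation starts in its own singleton cluster.
    clusters = [[i] for i in range(len(X))]
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    merges = []

    while len(clusters) > 1:
        # Locate the pair of clusters with the smallest linkage distance.
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: minimum pairwise distance between members.
                d = dist[np.ix_(clusters[i], clusters[j])].min()
                if d < best[0]:
                    best = (d, i, j)
        h, i, j = best
        # Merge the closest pair and record the height of the merge.
        merges.append((list(clusters[i]), list(clusters[j]), h))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# The five one-dimensional points used in the worked example of §8.4.3.
X = np.array([[0.0], [1.0], [2.5], [4.0], [5.0]])
for a, b, h in naive_single_linkage(X):
    print(a, "+", b, "at height", h)
```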
A naive implementation maintains the full pairwise distance matrix and rescans it at every step, costing $O(n^3)$ time. Sophisticated implementations exploit the fact that most pairwise distances do not change after a merge: the nearest-neighbour chain algorithm, analysed and implemented by Müllner in the fastcluster library, brings the cost to $O(n^2)$ time, and the SLINK algorithm achieves the same for single linkage with only $O(n)$ memory. In practice, $O(n^2)$ memory is the binding constraint: storing a full pairwise distance matrix for $n=10^5$ requires roughly $40$ GB at single precision, which is why agglomerative clustering is rarely applied to more than a few tens of thousands of observations without first compressing the data through a routine such as BIRCH.
The merge order is deterministic given the data, the metric, and the linkage criterion. Unlike k-means, there is no random initialisation to worry about and no need to run the algorithm multiple times. What there is to worry about is the choice of linkage, which can change the dendrogram beyond recognition.
8.4.2 Linkage criteria
A linkage criterion turns a point-to-point distance $d(\mathbf{x},\mathbf{y})$ into a cluster-to-cluster distance $D(A,B)$. Four definitions dominate practice.
Single linkage uses the minimum pairwise distance:
$$ D_{\text{single}}(A,B) = \min_{\mathbf{a}\in A,\,\mathbf{b}\in B}\,d(\mathbf{a},\mathbf{b}). $$
Two clusters are considered close as soon as any single pair of their members is close. This produces long, snaking clusters that follow chains of nearby points and is the natural choice when the underlying geometry really is chain-like (a winding river, a road network, a one-dimensional manifold embedded in higher dimensions). Its weakness is the chaining effect: a thin bridge of noise between two otherwise distinct dense regions is enough to merge them, because the algorithm only requires one short edge.
Complete linkage uses the maximum pairwise distance:
$$ D_{\text{complete}}(A,B) = \max_{\mathbf{a}\in A,\,\mathbf{b}\in B}\,d(\mathbf{a},\mathbf{b}). $$
Two clusters are considered close only if every pair of their members is close. This produces compact, roughly ball-shaped clusters and resists chaining. Its weakness is sensitivity to outliers: a single distant member of $A$ inflates $D(A,B)$ for every potential partner $B$, so outliers can either delay sensible merges or produce odd partitions.
Average linkage, also known as UPGMA (unweighted pair-group method with arithmetic mean), takes the mean of all pairwise distances:
$$ D_{\text{average}}(A,B) = \frac{1}{\lvert A\rvert\,\lvert B\rvert}\sum_{\mathbf{a}\in A}\sum_{\mathbf{b}\in B}\,d(\mathbf{a},\mathbf{b}). $$
This is a compromise between single and complete linkage. It is less prone to chaining than single linkage and less sensitive to outliers than complete linkage. Average linkage is the historical workhorse of numerical taxonomy and remains a sensible default when no specific cluster shape is expected.
Ward's method is the natural hierarchical partner of k-means. At each step it merges the pair $A,B$ that produces the smallest increase in the total within-cluster sum of squared deviations from the cluster means:
$$ D_{\text{Ward}}(A,B) = \frac{\lvert A\rvert\,\lvert B\rvert}{\lvert A\rvert+\lvert B\rvert}\,\lVert\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B\rVert_2^2. $$
The factor $\lvert A\rvert\,\lvert B\rvert/(\lvert A\rvert+\lvert B\rvert)$ penalises merges between large clusters more heavily than merges between small clusters. The quantity being minimised is the same within-cluster sum of squares that k-means optimises globally, which makes Ward's the natural choice whenever the user wants a hierarchy organised around the k-means objective; the greedy merges do not guarantee agreement with a k-means solution at every cut, but they chase the same criterion. Ward's produces compact, roughly equally sized clusters and is the default linkage in sklearn.cluster.AgglomerativeClustering (scipy.cluster.hierarchy.linkage defaults to single linkage, so Ward's must be requested explicitly there).
A practical rule: try Ward's first; switch to single linkage if you suspect the data has chain-like structure; reach for complete or average linkage if Ward's equalises cluster sizes too aggressively.
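A quick way to feel the differences is to run all four linkages on the same data and compare the flat clusterings obtained from a two-cluster cut. The sketch below uses scipy.cluster.hierarchy.linkage and fcluster; the synthetic elongated blobs and the random seed are illustrative placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two horizontally elongated blobs, one above the other.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[3.0, 0.3], size=(100, 2)),
    rng.normal(loc=[0.0, 4.0], scale=[3.0, 0.3], size=(100, 2)),
])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # (n-1) x 4 merge table
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
    sizes = np.bincount(labels)[1:]                   # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes {sizes.tolist()}")
```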
8.4.3 A worked example
Take five points on the real line: $A=0,\,B=1,\,C=2.5,\,D=4,\,E=5$. The pairwise distances are
| | $A$ | $B$ | $C$ | $D$ | $E$ |
|---|---|---|---|---|---|
| $A$ | 0 | 1 | 2.5 | 4 | 5 |
| $B$ | 1 | 0 | 1.5 | 3 | 4 |
| $C$ | 2.5 | 1.5 | 0 | 1.5 | 2.5 |
| $D$ | 4 | 3 | 1.5 | 0 | 1 |
| $E$ | 5 | 4 | 2.5 | 1 | 0 |
Use single linkage and walk through the merges.
Step 1. The smallest off-diagonal entry is $1$, attained by both $\{A,B\}$ and $\{D,E\}$. Ties in real datasets are broken arbitrarily; here we take $\{A,B\}$ first. Merge: $\{A,B\}$ at height $1$.
Step 2. Update the distance matrix. Under single linkage, $D(\{A,B\},X) = \min\{d(A,X),d(B,X)\}$, so the new distances are $D(\{A,B\},C)=1.5$, $D(\{A,B\},D)=3$, $D(\{A,B\},E)=4$. The smallest entry is now $1$, between $D$ and $E$. Merge: $\{D,E\}$ at height $1$.
Step 3. Updated distances: $D(\{A,B\},\{D,E\}) = \min\{3,4,4,5\}=3$ and $D(\{A,B\},C)=1.5$, $D(\{D,E\},C)=\min\{1.5,2.5\}=1.5$. Both $\{A,B\}$–$C$ and $\{D,E\}$–$C$ tie at $1.5$. Take $\{A,B,C\}$ first. Merge: $\{A,B,C\}$ at height $1.5$.
Step 4. Distances: $D(\{A,B,C\},\{D,E\})=\min\{4,5,3,4,1.5,2.5\}=1.5$. Merge: the whole dataset at height $1.5$.
The dendrogram has four merges at heights $\{1, 1, 1.5, 1.5\}$. Cutting at $h=0.5$ gives five clusters (the singletons). Cutting at $h=1.2$ produces three clusters: $\{A,B\}, \{C\}, \{D,E\}$. Cutting at $h=2$ produces a single cluster.
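The same merges can be reproduced with scipy. In the sketch below, Z[:, 2] holds the merge heights and fcluster cuts the tree at a chosen height; the exact cluster label numbers it returns are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The five one-dimensional points A, B, C, D, E from the worked example.
X = np.array([[0.0], [1.0], [2.5], [4.0], [5.0]])

Z = linkage(X, method="single")
print(Z[:, 2])                                    # merge heights: [1.  1.  1.5 1.5]
print(fcluster(Z, t=1.2, criterion="distance"))   # three clusters: {A,B}, {C}, {D,E}
```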
The only gap between successive merge heights is the short step from $1$ to $1.5$, which faithfully reflects the fact that this tiny one-dimensional dataset has no dramatic cluster structure: no cut stands out as markedly more stable than another. On a real dataset, a long vertical edge in the dendrogram, well separated from neighbouring merges, is the visual signature of a stable cut.
8.4.4 Reading the dendrogram
Two technical concepts make dendrogram reading more disciplined. The cophenetic distance between two leaves is the height of their lowest common ancestor in the tree. The cophenetic correlation coefficient is the Pearson correlation between the cophenetic distances and the original pairwise distances. Values above $0.7$ indicate that the dendrogram is a faithful summary of the metric structure of the data; values below $0.5$ are a warning that the chosen linkage has badly distorted the geometry.
Tools such as scipy.cluster.hierarchy.cophenet compute this in one line. Reporting it alongside any dendrogram is a small piece of methodological hygiene that catches a surprising number of silent failures.
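A minimal sketch of that computation, with purely illustrative data and sample size:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))       # illustrative data with no planted clusters

d = pdist(X)                        # original pairwise distances (condensed form)
Z = linkage(d, method="average")    # the dendrogram as a merge table
c, coph_d = cophenet(Z, d)          # correlation plus the cophenetic distances
print(f"cophenetic correlation: {c:.2f}")
```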
8.4.5 Computational cost
The asymptotic story is simple. Naive agglomerative clustering is $O(n^3)$ in time and $O(n^2)$ in memory, because the distance matrix has $\binom{n}{2}$ entries and the algorithm performs $n-1$ merges, each of which scans the matrix. The nearest-neighbour chain algorithm, as implemented in Müllner's fastcluster, reduces the time cost to $O(n^2)$ for any of the four linkages above, and SLINK achieves $O(n^2)$ time with $O(n)$ memory for single linkage specifically. Memory is the binding constraint: the full pairwise distance matrix at $32$-bit precision occupies $4n^2$ bytes, which is $400$ MB at $n=10^4$, $40$ GB at $n=10^5$, and $4$ TB at $n=10^6$. Practical deployments cap out around $n=5\times 10^4$ on a workstation. BIRCH is the standard escape hatch: it builds a CF-tree of compact micro-clusters in a single pass, then runs agglomerative clustering on the compressed summary.
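As a sketch of that escape hatch, scikit-learn's Birch accepts an AgglomerativeClustering instance as its global clustering step, so the hierarchy is built on micro-cluster centroids rather than on raw points. The threshold, branching factor, and blob parameters below are illustrative placeholders that would need tuning per dataset.

```python
from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative dataset far too large for a direct O(n^2) distance matrix.
X, _ = make_blobs(n_samples=200_000, centers=6, random_state=0)

# BIRCH compresses the data into micro-clusters in one pass, then the
# agglomerative step clusters the micro-cluster centroids instead of raw points.
model = Birch(
    threshold=0.5,              # illustrative; controls micro-cluster radius
    branching_factor=50,
    n_clusters=AgglomerativeClustering(n_clusters=6, linkage="ward"),
)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "micro-clusters summarise", X.shape[0], "points")
```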
8.4.6 When hierarchical clustering is the right tool
Three properties point towards hierarchical clustering rather than k-means or DBSCAN.
The data has natural hierarchical structure. Phylogenetic trees in biology, taxonomies in librarianship, document hierarchies in topic analysis, organisational charts in social network analysis, and customer segmentation with sub-segments all fit the dendrogram view of the world cleanly. A flat partition discards the tree information that the domain itself supplies.
The choice of $K$ is exploratory rather than known. If the analyst genuinely does not know how many clusters there should be, it is better to inspect a dendrogram and decide afterwards than to bake a guess into the algorithm. The dendrogram lets the data argue for a particular number of clusters by showing where long vertical edges sit.
The dataset is small enough. For $n$ below roughly $10^4$ on a laptop, hierarchical clustering is comfortable; below $5\times 10^4$ it is feasible with care; above that, a compression step such as BIRCH is needed first. K-means scales to millions of points; hierarchical clustering does not.
Where hierarchical clustering is not the right tool: on very large datasets, on streaming data, or when the question is genuinely "give me $K$ clusters" with $K$ known in advance and clusters expected to be roughly spherical. K-means is faster, simpler, and equally informative in that regime.
8.4.7 What you should take away
- Hierarchical clustering returns a dendrogram, a complete tree of nested partitions, rather than a single flat assignment. A horizontal cut at any height produces a flat clustering at that resolution.
- Agglomerative is the standard flavour: start with singletons, repeatedly merge the closest pair under a linkage criterion, stop at the root. Divisive (top-down) clustering is rarer because it requires a sub-clustering routine at every level.
- Four linkages dominate. Single (minimum) chains; complete (maximum) is compact but outlier-sensitive; average (mean) is a balanced default; Ward's (minimum increase in within-cluster sum of squares) is the hierarchical partner of k-means and the most common default.
- Computational cost scales as $O(n^2)$ time and $O(n^2)$ memory with efficient implementations, which limits deployment to roughly $n\le 5\times 10^4$ on a workstation. BIRCH is the standard preprocessing step for larger datasets.
- Reach for hierarchical clustering when the domain has genuine nested structure, when $K$ is exploratory, and when the dataset is small enough; reach for k-means or DBSCAN otherwise.