Start with each point as its own cluster, merge the closest pair, repeat.
From Chapter 8: Unsupervised Learning
Glossary: hierarchical clustering, dendrogram
Transcript
Agglomerative hierarchical clustering. Start with every data point as its own cluster.
Find the pair of closest clusters. Merge them. Now we have one fewer cluster.
Find the next closest pair. Merge. Repeat until only one cluster, containing all points, remains.
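As a rough sketch, that loop might look like this in Python, assuming Euclidean distance and single linkage; the agglomerative function and the toy points are illustrative, not from the lecture:

```python
import numpy as np

def agglomerative(points):
    """Naive agglomerative clustering: returns the sequence of merges.

    Each merge is (cluster_a, cluster_b, distance), where clusters are
    frozensets of point indices. Single linkage assumed for brevity.
    """
    points = np.asarray(points, dtype=float)
    # Start with every point as its own cluster.
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of closest clusters (single linkage:
        # distance between the closest pair of member points).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[a] - points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge them: one fewer cluster remains.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

for a, b, d in agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]]):
    print(sorted(a), "+", sorted(b), f"at distance {d:.2f}")
```

The doubly nested search over cluster pairs is exactly where the cubic cost mentioned at the end of the lecture comes from.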
The merges form a tree. The dendrogram. Vertical position is the distance at which two clusters merged. Long branches represent well-separated clusters; short ones represent points that joined together early.
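In SciPy, for example, scipy.cluster.hierarchy.linkage builds the merge tree and dendrogram draws it; the two-blob data here is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Two well-separated blobs of synthetic 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Z encodes the full merge tree: each row is
# (cluster_i, cluster_j, merge_distance, size_of_new_cluster).
Z = linkage(X, method="ward")

dendrogram(Z)
plt.ylabel("merge distance")  # vertical position = height of the merge
plt.show()
```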
Cut the dendrogram at any height. The subtrees hanging below the cut become the final clusters. Cut high, get few large clusters. Cut low, get many small clusters.
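Continuing with Z from the sketch above, fcluster with criterion="distance" performs the cut; the heights 10.0 and 1.0 are illustrative choices for this toy data:

```python
from scipy.cluster.hierarchy import fcluster

# Cut high: only subtrees merged below height 10 survive -> few clusters.
labels_high = fcluster(Z, t=10.0, criterion="distance")

# Cut low: only very close merges survive -> many small clusters.
labels_low = fcluster(Z, t=1.0, criterion="distance")

print(len(set(labels_high)), "clusters at the high cut")
print(len(set(labels_low)), "clusters at the low cut")
```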
Different linkage criteria produce different dendrograms. Single linkage uses the closest pair of points between clusters; it tends to produce stringy, chained clusters. Complete linkage uses the farthest pair; it tends to produce compact, roughly spherical clusters. Average linkage averages over all pairs; a balanced compromise.
Ward's method merges the pair whose merge increases the total within-cluster sum of squares the least. Often the default choice.
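A small sketch of what the first three rules compute, using a hypothetical cluster_distance helper; Ward is different in kind, scoring the increase in within-cluster sum of squares rather than any single pairwise distance:

```python
import numpy as np

def cluster_distance(A, B, linkage="single"):
    """Distance between point sets A and B under a given linkage rule."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairs
    if linkage == "single":     # closest pair
        return d.min()
    if linkage == "complete":   # farthest pair
        return d.max()
    if linkage == "average":    # mean over all pairs
        return d.mean()
    raise ValueError(linkage)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(cluster_distance(A, B, "single"))    # 3.0
print(cluster_distance(A, B, "complete"))  # 6.0
print(cluster_distance(A, B, "average"))   # 4.5
```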
Unlike k-means, hierarchical clustering does not require choosing k upfront. The full hierarchy is computed once, and you can pick any cut afterwards.
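A sketch of that workflow with SciPy: linkage runs once, and fcluster with criterion="maxclust" extracts a clustering for any k afterwards (random toy data again):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

Z = linkage(X, method="ward")  # hierarchy computed once

# Pick any k after the fact; no refitting, unlike k-means.
for k in [2, 3, 5, 8]:
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"k={k}: cluster sizes {np.bincount(labels)[1:]}")
```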
Cost: O(n²) memory and O(n³) time for naive implementations. Practical for thousands of points, not for millions.