Also known as: k-means
K-means is the most widely used clustering algorithm, prized for its simplicity, speed, and scalability. It partitions $n$ data points into $K$ clusters by alternating two steps (Lloyd's algorithm) until convergence. In the assignment step, each point is assigned to the cluster whose centroid is nearest by Euclidean distance. In the update step, each centroid is recomputed as the mean of the points currently assigned to it. This procedure monotonically decreases the within-cluster sum of squares (WCSS), also called inertia, $\sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, until it reaches a local minimum.
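The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (random initialisation, fixed iteration cap); the function and variable names are my own, not from a particular library:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update until convergence."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: a local minimum of the WCSS
        centroids = new_centroids
    inertia = dists[np.arange(len(X)), labels].sum()  # WCSS at convergence
    return labels, centroids, inertia
```

Each iteration costs $O(nKd)$ for $d$-dimensional data, which is why the method scales to large datasets.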
K-means is guaranteed to converge, but only to a local minimum, not necessarily the global one, and the quality of the result depends heavily on initialisation. K-means++ selects initial centroids that are spread out across the data: the first is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid. The expected WCSS of this seeding is provably within an $O(\log K)$ factor of the optimum, and in practice it improves both the speed and the quality of convergence.
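The seeding rule above can be sketched directly (assuming the data is a NumPy array; the helper name is illustrative, not a library API):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: spread initial centroids across the data."""
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Next centroid: sampled with probability proportional to d^2,
        # so far-away regions are likely to receive a seed
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Points already chosen have zero distance to themselves, so they can never be selected twice; distant, unserved regions dominate the sampling weights.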
Choosing $K$ is a persistent challenge. The elbow method plots WCSS versus $K$ and looks for a kink where further clusters stop paying for themselves. The silhouette score compares intra-cluster to nearest-cluster distances. The gap statistic compares observed WCSS to that expected under a uniform null distribution. K-means implicitly assumes convex, roughly spherical, similarly sized clusters; violations (e.g. elongated, nested, or very unevenly sized clusters) produce unintuitive results. Despite its limitations, k-means remains indispensable as a preprocessing step, feature engineering tool, and baseline against which more sophisticated methods are measured.
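The elbow and silhouette heuristics are easy to try with scikit-learn, assuming it is available; the two-blob dataset below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: two well-separated blobs, so the "natural" answer is K = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(8, 0.3, (50, 2))])

wcss, sil = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                     # WCSS always drops as K grows
    sil[k] = silhouette_score(X, km.labels_)  # peaks near the natural K

print({k: round(v, 1) for k, v in wcss.items()})
print(max(sil, key=sil.get))  # K with the best silhouette
```

WCSS decreases monotonically in $K$, so the elbow looks for where the decrease flattens, while the silhouette score gives a single number to maximise.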
Related terms: Hierarchical Clustering, DBSCAN
Discussed in:
- Chapter 8: Unsupervised Learning — K-Means Clustering
Also defined in: Textbook of AI