8.3 K-means clustering
Where density estimation asks "what does the data distribution look like as a smooth function?", clustering asks "into how many groups does the data fall, and which point belongs to which?" K-means takes the most direct answer to that question and turns it into an algorithm.
The price of that simplicity is a set of assumptions that, when violated, produce silently misleading output. K-means presumes that clusters are roughly spherical, similarly sized, and well separated under the Euclidean metric. It demands that you supply the number of clusters in advance, and it is sensitive to the random starting positions of its centres. None of these limitations is fatal, and workarounds exist for each, but ignoring them is a recipe for spurious findings. The discipline of using k-means well is the discipline of knowing when its assumptions hold and what to do when they do not.
The algorithm (Lloyd's algorithm, 1957)
The algorithm at the heart of k-means was first described in a 1957 internal Bell Labs report by Stuart Lloyd, in the context of pulse-code modulation: how should one place a finite set of reproduction levels so that an analogue signal, quantised to the nearest level, suffers the smallest possible distortion? The same problem, transposed from one dimension to many and from "signal levels" to "cluster centres", is the clustering problem. Lloyd's report circulated informally for two and a half decades before its formal publication in 1982; in parallel, Edward Forgy described an essentially identical procedure in a 1965 abstract, and James MacQueen coined the name "k-means" in his 1967 paper. The signal-processing community knows it as the generalised Lloyd algorithm or vector quantisation; the statistics community knows it as k-means. The recipe is the same.
The procedure is iterative.
Initialise $K$ cluster centres $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K \in \mathbb{R}^d$. The simplest choice is to pick $K$ points uniformly at random from the data; better choices, especially k-means++, are discussed below.
Repeat until assignments stop changing:
Assignment step. For every point $\mathbf{x}_i$, compute $$ c_i = \arg\min_{k \in \{1, \ldots, K\}} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_2^2. $$ Each point joins the cluster of its nearest centre.
Update step. For every cluster $k$ with at least one assigned point, set $$ \boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{i \in C_k} \mathbf{x}_i. $$ Each centre is replaced by the arithmetic mean of its assigned points.
Terminate when no point's assignment changes during the assignment step (equivalently, when the centres stop moving).
A small subtlety: if a cluster ends up empty after an assignment step, the corresponding update is undefined. The standard remedy is to reinitialise the empty cluster to the data point furthest from any current centre, which guarantees forward progress.
The crucial property of this procedure is that each iteration monotonically decreases the within-cluster sum of squares, $$ J(\mathbf{c}, \boldsymbol{\mu}) = \sum_{i=1}^{n} \|\mathbf{x}_i - \boldsymbol{\mu}_{c_i}\|_2^2 = \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_2^2. $$ The argument is a textbook example of coordinate descent. Holding the centres fixed and minimising over assignments is exactly the assignment step: the optimal $c_i$ is the index of the nearest centre, by definition. Holding the assignments fixed and minimising over centres is exactly the update step: the sample mean is the unique minimiser of the sum of squared deviations from a single point. Each step is the exact minimiser of $J$ in its block of variables, so $J$ cannot increase. Because $J \ge 0$ and there are only finitely many possible assignments ($K^n$), the sequence of $J$ values is bounded below and strictly decreasing whenever an assignment changes; therefore the algorithm terminates in finitely many iterations. It is guaranteed to find a local minimum, but not the global one, a point we shall return to.
Two further remarks. First, the cost per iteration is $O(nKd)$: every point compares itself to every centre, and each comparison computes a $d$-dimensional squared distance. Modern implementations exploit the triangle inequality (Elkan's algorithm) to skip many of those comparisons, often achieving order-of-magnitude speedups when $K$ is large. Second, the number of iterations to convergence is typically tens, not thousands; in practice k-means is bottlenecked by the per-iteration cost rather than by iteration count.
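The claim, made again later in this chapter, that the implementation fits in twenty lines of NumPy is easy to make good on. The sketch below is a minimal, unoptimised rendering of the two steps above, including the empty-cluster reseeding; the function name `lloyd` and its signature are ours, not any library's.

```python
import numpy as np

def lloyd(X, centres, max_iter=100):
    """Minimal Lloyd's algorithm. X: (n, d) data; centres: (K, d) initial centres.
    Returns (centres, assignments, WCSS)."""
    centres = centres.copy()
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        assign = d2.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points.
        new_centres = centres.copy()
        for k in range(len(centres)):
            members = X[assign == k]
            if len(members):
                new_centres[k] = members.mean(axis=0)
            else:
                # Empty cluster: reseed at the point furthest from any centre.
                new_centres[k] = X[d2.min(axis=1).argmax()]
        if np.allclose(new_centres, centres):   # centres stopped moving
            break
        centres = new_centres
    wcss = ((X - centres[assign]) ** 2).sum()
    return centres, assign, wcss
```

The $(n, K, d)$ broadcast trades memory for clarity; a production implementation would compute distances blockwise or use Elkan's bounds to skip comparisons.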
Worked example
Concrete arithmetic clarifies the bookkeeping. Take six points in $\mathbb{R}^2$: $$ \mathbf{x}_1 = (1, 1),\; \mathbf{x}_2 = (1, 2),\; \mathbf{x}_3 = (2, 1),\; \mathbf{x}_4 = (10, 10),\; \mathbf{x}_5 = (10, 11),\; \mathbf{x}_6 = (11, 10). $$ Visually, three points sit near the origin and three sit near $(10, 10)$. Set $K = 2$ and initialise the centres at the two extreme points: $\boldsymbol{\mu}_1^{(0)} = (1, 1)$ and $\boldsymbol{\mu}_2^{(0)} = (10, 10)$.
Iteration 1, assignment step. Compute the squared Euclidean distance from each point to each centre.
| Point | $\|\mathbf{x}_i - \boldsymbol{\mu}_1^{(0)}\|^2$ | $\|\mathbf{x}_i - \boldsymbol{\mu}_2^{(0)}\|^2$ | Nearest |
|---|---|---|---|
| $(1, 1)$ | $0$ | $162$ | $\boldsymbol{\mu}_1$ |
| $(1, 2)$ | $1$ | $145$ | $\boldsymbol{\mu}_1$ |
| $(2, 1)$ | $1$ | $145$ | $\boldsymbol{\mu}_1$ |
| $(10, 10)$ | $162$ | $0$ | $\boldsymbol{\mu}_2$ |
| $(10, 11)$ | $181$ | $1$ | $\boldsymbol{\mu}_2$ |
| $(11, 10)$ | $181$ | $1$ | $\boldsymbol{\mu}_2$ |
So $C_1 = \{1, 2, 3\}$ and $C_2 = \{4, 5, 6\}$, as expected.
Iteration 1, update step. Recompute the centres as the means of the assigned points. $$ \boldsymbol{\mu}_1^{(1)} = \tfrac{1}{3}\bigl((1,1) + (1,2) + (2,1)\bigr) = \tfrac{1}{3}(4, 4) \approx (1.33, 1.33). $$ $$ \boldsymbol{\mu}_2^{(1)} = \tfrac{1}{3}\bigl((10,10) + (10,11) + (11,10)\bigr) = \tfrac{1}{3}(31, 31) \approx (10.33, 10.33). $$
Iteration 2, assignment step. Each point in $C_1$ remains nearer to $\boldsymbol{\mu}_1^{(1)} \approx (1.33, 1.33)$ than to $\boldsymbol{\mu}_2^{(1)} \approx (10.33, 10.33)$, and vice versa for $C_2$. No assignment changes, so the algorithm has converged.
The within-cluster sum of squares at convergence is $$ J = \underbrace{\bigl(\tfrac{2}{9} + \tfrac{5}{9} + \tfrac{5}{9}\bigr)}_{\text{cluster 1}} + \underbrace{\bigl(\tfrac{2}{9} + \tfrac{5}{9} + \tfrac{5}{9}\bigr)}_{\text{cluster 2}} \approx 1.33 + 1.33 = 2.67, $$ a reassuringly small number relative to the inter-cluster distance of roughly 12.7. Even for this tiny example, two facts that recur on real data are visible. First, the procedure converged in one iteration: when the data has a clear cluster structure and the initialisation is reasonable, k-means is fast. Second, the final centres are not data points; they are averages of data points. This will become important when we discuss k-medoids, which restricts cluster centres to be actual observations and is therefore more robust to outliers.
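Running the `lloyd` sketch from the previous subsection on these six points reproduces every number above; treat it as a check on the bookkeeping rather than as library code.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]], dtype=float)
init = np.array([[1, 1], [10, 10]], dtype=float)   # the two extreme points

centres, assign, wcss = lloyd(X, init)   # lloyd() as sketched above
print(centres)   # [[ 1.333  1.333], [10.333 10.333]]
print(assign)    # [0 0 0 1 1 1]
print(wcss)      # approximately 2.667 (= 8/3)
```

Swapping `init` for the pair $(1,1), (1,2)$ runs the exercise in the next paragraph.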
A useful exercise: rerun the example with the initialisation $\boldsymbol{\mu}_1^{(0)} = (1, 1)$ and $\boldsymbol{\mu}_2^{(0)} = (1, 2)$. Both initial centres lie inside the lower-left cluster, and yet Lloyd's algorithm still finds the natural answer: in iteration 1 the lower-left points split between the two centres (each joins whichever centre it happens to start nearest) while the upper-right points all attach to $\boldsymbol{\mu}_2$, dragging $\boldsymbol{\mu}_2$ towards $(8, 8.25)$; iteration 2 then rebalances the lower-left points back into a single cluster around $(4/3, 4/3)$. Bad seeding does not bite when clusters are this widely separated: Lloyd's algorithm recovers from almost any reasonable seed in one or two passes. Initialisation matters most when clusters are closer together, when there are more than two of them, or when high-dimensional structure makes "closest centre" a noisy signal. The next subsection covers k-means++, which mitigates these failure modes by spreading the seeds out by construction.
Initialisation: k-means++
Lloyd's algorithm finds a local minimum of the WCSS; whether that local minimum is also a good global answer depends almost entirely on where the centres start. The naive strategy of picking $K$ points uniformly at random can place two centres inside the same true cluster, leaving another true cluster with no centre nearby. Lloyd's iterations often cannot recover from this kind of mistake: once clusters are even moderately close together, a centre that has wandered into the wrong one tends to be locked there by the assignment step.
k-means++, introduced by David Arthur and Sergei Vassilvitskii in 2007, repairs the failure mode by spreading the initial centres out. Pick the first centre $\boldsymbol{\mu}_1$ uniformly at random from the data. For each subsequent centre $\boldsymbol{\mu}_k$ (for $k = 2, \ldots, K$), compute, for every data point $\mathbf{x}_i$, the squared distance $D(\mathbf{x}_i)^2$ to the nearest already-chosen centre. Sample the next centre from the data with probability proportional to $D(\mathbf{x}_i)^2$. Points far from all current centres are far more likely to be chosen than nearby points, so the seeded centres tend to spread across the data.
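A minimal sketch of the seeding, with the same caveats as before (the name `kmeans_pp` is ours; a NumPy `Generator` supplies the randomness):

```python
import numpy as np

def kmeans_pp(X, K, rng):
    """k-means++ seeding: each new centre is sampled with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [X[rng.integers(len(X))]]          # first centre: uniform at random
    for _ in range(1, K):
        diffs = X[:, None, :] - np.array(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)     # D(x)^2 for every point
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)
```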
Three properties make k-means++ the standard. First, it is cheap: choosing each centre takes one $O(nd)$ pass to update distances, so the total seeding cost is $O(nKd)$, comparable to a single Lloyd iteration. Second, it has a provable guarantee. Arthur and Vassilvitskii showed that the expected WCSS of the seeding alone, before any Lloyd iterations, satisfies $\mathbb{E}[J] \le 8(\ln K + 2) J^*$, where $J^*$ is the global optimum. The bound is loose in absolute terms but tight in its dependence on $K$, and it holds in expectation over the algorithm's randomness regardless of the input. Third, in practice the improvement over uniform random seeding is so reliable that mainstream implementations, scikit-learn and MATLAB's kmeans among them, use k-means++ by default.
A common pattern in production is to combine k-means++ with multiple restarts: run the entire procedure (seed plus Lloyd iterations) ten times with different random seeds and keep the run with the smallest final WCSS. The cost is linear in the number of restarts, the variance reduction is substantial, and the worst-case behaviour, a single unlucky seed, is eliminated.
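With the two sketches above, the restart loop is a few lines; the only decision is how to compare runs, and the final WCSS is the natural criterion. Here `X` stands in for your data.

```python
import numpy as np

rng = np.random.default_rng(0)
runs = [lloyd(X, kmeans_pp(X, K=2, rng=rng)) for _ in range(10)]
centres, assign, wcss = min(runs, key=lambda run: run[2])   # smallest final WCSS
```

scikit-learn's KMeans does the same thing internally via its `n_init` parameter.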
How to choose K
The single hyperparameter of k-means is the number of clusters $K$. Choosing it well is part art, part science.
Elbow method. Run k-means for a range of values of $K$ (say, 1 through 15) and plot the resulting WCSS against $K$. Because adding a cluster can only decrease $J$, the curve is monotonically non-increasing. If the data has $K^*$ natural clusters, $J$ tends to fall steeply for $K < K^*$ as new centres capture genuine structure, and to flatten for $K > K^*$ as further centres only carve up existing clusters. The "elbow" of the curve, the kink between the steep and the flat regimes, is the heuristic choice. The method is intuitive and often works, but the elbow can be ambiguous, especially when clusters have different scales.
Silhouette score. For each point $i$, compute $a(i)$, the mean distance from $i$ to other points in its own cluster, and $b(i)$, the mean distance from $i$ to points in the nearest other cluster. The silhouette of the point is $s(i) = (b(i) - a(i)) / \max(a(i), b(i)) \in [-1, 1]$. Values near 1 mean the point is much closer to its own cluster than to any other; values near 0 mean the point sits on a boundary; negative values mean the point is closer to a neighbouring cluster than to its own. Average $s(i)$ over the data and pick the $K$ that maximises the average. Silhouette is more principled than the elbow but expensive ($O(n^2)$ for naive distance computation) and biased towards spherical clusters.
Information criteria. If we treat k-means as a vanishing-variance Gaussian mixture, we can compute the Bayesian Information Criterion (BIC) for each $K$ and pick the smallest. BIC penalises model complexity ($K$ centres mean $Kd$ free parameters) against likelihood, and tends to give a more conservative answer than the elbow.
Gap statistic. Tibshirani, Walther and Hastie (2001) compare $\log J(K)$ for the data to the same quantity computed on a uniform reference distribution over the data's bounding box; the optimal $K$ is the smallest one for which the gap exceeds the next gap minus a Monte Carlo standard error.
Domain knowledge. Often the best heuristic. If you are clustering customers and the marketing team has four campaign segments, $K = 4$ is the answer. If you are clustering chest X-rays into "normal", "consolidation", "effusion" and "pneumothorax", $K = 4$. Numerical heuristics dominate when the problem is exploratory, but most clustering in industry is constrained by what the cluster labels will be used for, and that constrains $K$ before any algorithm runs.
In practice, run k-means for a sensible range of $K$, look at the elbow plot and silhouette together, and overlay any domain constraints. If two or three values of $K$ all look defensible, run the downstream task (segmentation report, forecasting model, alert rule) on each and let downstream performance break the tie.
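A sketch of that workflow with scikit-learn, assuming `X` is an $(n, d)$ array with enough points to support the scan; the range of $K$ is a placeholder.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k(X, k_range=range(2, 11), seed=0):
    """Fit k-means across a range of K; return (K, WCSS, mean silhouette) rows."""
    rows = []
    for k in k_range:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=seed).fit(X)
        rows.append((k, km.inertia_, silhouette_score(X, km.labels_)))
    return rows

for k, wcss, sil in scan_k(X):
    print(f"K={k:2d}  WCSS={wcss:10.2f}  silhouette={sil:.3f}")
```

Plot the WCSS column for the elbow and the silhouette column for its maximum; where they disagree, defer to the domain constraints.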
Strengths and weaknesses
K-means' strengths follow from its simplicity. Its cost is $O(nKd)$ per iteration, hence $O(nKid)$ over $i$ iterations, linear in every quantity. It scales trivially to millions of points and thousands of dimensions, parallelises by sharding the data across cores or machines, and admits streaming variants for data that does not fit in memory. The implementation is twenty lines of NumPy, as sketched earlier. The output, a centroid per cluster and a label per point, is easy to consume by downstream code. As a baseline, k-means is fast to fit, fast to evaluate, and easy to explain to a non-technical stakeholder. These are not glamorous virtues, but they win deployment battles.
The weaknesses are equally familiar.
Assumes spherical clusters. The Voronoi partition induced by Euclidean centroids is a tiling of $\mathbb{R}^d$ by convex polyhedra. Elongated clusters (think: a long, thin cigar-shaped distribution) will be cut across their long axis. Concentric rings, two crescent moons, or any non-convex cluster shape lies outside the model class.
Assumes equal-size clusters. A small cluster of 50 points adjacent to a large cluster of 5,000 will often be absorbed: the centroid of the small cluster drifts towards the larger mass because that move reduces the total WCSS, even if it merges two semantically distinct groups.
Sensitive to scale. Euclidean distance treats all dimensions equally. A feature in metres dominates a feature in millimetres unless you standardise first (a fix is sketched after this list).
Sensitive to outliers. A single far-away point pulls a centroid towards itself, distorting the cluster.
Suffers in high dimensions. The curse of dimensionality flattens distances: in $d \gg 30$, every pair of points has roughly the same Euclidean distance, and the assignment step becomes nearly random.
Requires $K$ in advance. And is sensitive to the choice (see above).
Sensitive to initialisation. k-means++ mitigates, but does not eliminate, the dependence on the random seed.
Hard assignments. Every point belongs to exactly one cluster, with no notion of uncertainty. Points on the boundary between two clusters get a binary label that hides their ambiguity.
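Of these weaknesses, scale sensitivity has the cheapest fix: standardise before clustering. A minimal sketch (scikit-learn's StandardScaler does the same):

```python
import numpy as np

# Z-score each feature so that no unit choice dominates the Euclidean metric.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```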
The right mental model is that k-means is the correct algorithm for one very specific generative scenario, $K$ isotropic Gaussians of equal variance and equal weight in the vanishing-variance limit, and a reasonable approximation when the data looks roughly like that. When the data does not look like that, you need a different tool.
Variants
A small ecosystem of variants addresses these limitations.
Mini-batch k-means (Sculley 2010) processes random subsets of the data per iteration, updating centres with a learning-rate-style decay. Roughly 100$\times$ faster than full-batch k-means on datasets larger than memory, with negligible loss of clustering quality.
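A minimal usage sketch with scikit-learn's implementation; `n_clusters` and `batch_size` here are placeholders.

```python
from sklearn.cluster import MiniBatchKMeans

# Each step draws a random batch and nudges the centres towards its means.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0).fit(X)
labels, centres = mbk.labels_, mbk.cluster_centers_
```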
Spherical k-means uses cosine distance instead of Euclidean, which normalises away the magnitude of each vector. Standard for document clustering, where the orientation of the term-frequency vector matters more than its length.
k-medoids (PAM, Partitioning Around Medoids) constrains each cluster centre to be an actual data point. Robust to outliers because medoids cannot be dragged towards a far-away point the way means can. Costs $O(n^2)$ per iteration in the original PAM formulation; modern variants like CLARANS and CLARA reduce this.
Bisecting k-means is a hierarchical variant: split the data into two clusters with k-means, recursively split the largest (or highest-WCSS) cluster, and continue until $K$ leaves are produced. Faster than full hierarchical clustering, and the resulting tree gives a multi-resolution view of the data.
Fuzzy c-means assigns soft membership weights $u_{ik} \in [0, 1]$ with $\sum_k u_{ik} = 1$, useful when boundary points should not be forced to commit.
DBSCAN, mean-shift, OPTICS, HDBSCAN abandon the centroid-based formulation in favour of density-based notions of a cluster. They handle non-spherical shapes and discover the number of clusters automatically. We meet DBSCAN in §8.6.
Gaussian mixture models (§8.5) are the probabilistic generalisation: each cluster is a full Gaussian with its own mean and covariance, fit by the EM algorithm, with soft posterior responsibilities $\gamma_{ik}$ replacing hard assignments. K-means is the limit of a GMM with isotropic covariances $\sigma^2 \mathbf{I}$ as $\sigma^2 \to 0$.
Where k-means is used in 2026
K-means' role has narrowed since its 1960s heyday, but the niches it occupies are central.
Image and audio compression (vector quantisation). Run k-means on the signal's frames or patches and replace each vector with its nearest centroid; the centroid set is the codebook. The discrete codebook in VQ-VAE, the workhorse of modern audio language models and generative image tokenisers, is maintained by an online k-means-style update on encoder outputs: the EMA codebook rule is online k-means in all but name. Modern neural audio codecs such as EnCodec and SoundStream chain several codebooks (residual vector quantisation), each one trained on the residual left by the previous codebook.
Customer segmentation and marketing analytics. The single most common business use. Cluster customers on transaction histories, RFM features, or learned embeddings to define personas. K-means' speed and the interpretability of "the centre of cluster 3 is a 37-year-old shopper who buys fortnightly" make it a natural fit.
Approximate nearest-neighbour search. FAISS-IVF and the broader inverted file index family use k-means to partition a billion-vector corpus into roughly $\sqrt{n}$ Voronoi cells. At query time, only the nearest few cells are searched, giving a 100--1000$\times$ speedup over brute force at modest accuracy cost. Product quantisation, introduced by Jégou, Douze and Schmid (2011), runs separate k-means codebooks on subvectors and underpins many modern vector databases under heavy load.
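A sketch of the IVF pattern in FAISS; the corpus here is random placeholder data and the sizes are illustrative only.

```python
import numpy as np
import faiss   # assumes the faiss-cpu (or faiss-gpu) package is installed

d, nlist = 128, 1024                       # vector dimension, number of cells
xb = np.random.rand(100_000, d).astype("float32")   # placeholder corpus

quantizer = faiss.IndexFlatL2(d)           # assigns vectors to Voronoi cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # k-means places the nlist centroids
index.add(xb)
index.nprobe = 8                           # search only the 8 nearest cells
D, I = index.search(xb[:5], 10)            # top-10 neighbours for 5 queries
```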
Initialisation and pre-training in vision. Self-supervised methods in the DeepCluster line alternate between k-means clustering of feature embeddings and training the network against the resulting cluster pseudo-labels; successors such as SwAV replace the explicit k-means step with an online clustering objective in the same spirit. Clustered embeddings also serve as discrete pseudo-labels or conditioning signals in several modern generative pipelines.
Initialisation for richer methods. The standard initial centres for a Gaussian mixture model are the k-means centres. Many semi-supervised and active-learning pipelines begin with a k-means pass.
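In scikit-learn this is the default behaviour rather than an option you must remember, as the sketch below shows; `n_components` is a placeholder.

```python
from sklearn.mixture import GaussianMixture

# init_params="kmeans" is scikit-learn's default: a k-means pass seeds the means.
gmm = GaussianMixture(n_components=4, init_params="kmeans").fit(X)
resp = gmm.predict_proba(X)   # soft responsibilities, cf. §8.5
```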
Exploratory data analysis. Whenever a dataset arrives without obvious structure, k-means at $K = 2, 3, 5, 10$ is the first thing most analysts try, before reaching for anything more sophisticated. The plots that come out of those runs frame every subsequent question.
What you should take away
K-means alternates an assignment step (nearest centre) with an update step (mean of assigned points), monotonically decreasing the within-cluster sum of squares and reaching a local minimum in finitely many steps.
The algorithm assumes spherical, equally sized, well-separated clusters under Euclidean distance, and is sensitive to scale, outliers, the choice of $K$, and the random initialisation.
k-means++ seeding picks initial centres with probability proportional to squared distance from already-chosen centres, gives an $O(\log K)$-competitive guarantee in expectation, and is the universal default. Pair it with multiple restarts for extra robustness.
Choose $K$ by the elbow method, silhouette coefficient, BIC or gap statistic, but treat domain knowledge as the strongest signal, and report results across a range of $K$ when the data does not commit to a single answer.
K-means is rarely the best clustering, but it is almost always the first. Its modern niches, vector quantisation in VQ-VAE and audio codecs, FAISS-IVF and product quantisation in billion-scale retrieval, customer segmentation, and pre-training pipelines for vision, depend on its speed and simplicity. Any clustering claim, including yours, should be benchmarked against k-means before being believed.