K-means partitions $N$ data points $\{x_n\}_{n=1}^N$ into $K$ clusters by minimising the within-cluster sum of squares (WCSS):
$$\min_{\{\mu_k\}, \{r_{nk}\}} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$
where $r_{nk} \in \{0, 1\}$ is a hard cluster assignment ($\sum_k r_{nk} = 1$) and $\mu_k$ is the centroid of cluster $k$.
Lloyd's algorithm (1957, the standard k-means procedure):
Initialise $K$ centroids $\mu_k$ (randomly from the data, or by k-means++).
Assignment step: assign each point to its nearest centroid:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
Update step: recompute each centroid as the mean of its assigned points:
$$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
Repeat 2–3 until assignments stop changing.
Convergence: each step is guaranteed to monotonically decrease the WCSS objective, and there are finitely many possible assignments, so the algorithm converges in a finite number of steps. The result is a local minimum, k-means is sensitive to initialisation and typically requires multiple random restarts.
K-means++ initialisation (Arthur & Vassilvitskii 2007): choose the first centroid uniformly at random; choose each subsequent centroid with probability proportional to $D(x)^2$, where $D(x)$ is the distance from $x$ to its nearest already-chosen centroid. Provably gives an $O(\log K)$-approximation to the optimal WCSS in expectation, and dramatically better empirical results than uniform initialisation.
Choosing K:
- Elbow method: plot WCSS vs $K$; the "elbow" of the curve heuristically suggests a good $K$.
- Silhouette score: average $(b - a)/\max(a, b)$ where $a$ is mean intra-cluster distance, $b$ mean nearest-cluster distance.
- Bayesian Information Criterion, Gap statistic.
Time complexity: $O(N K d)$ per iteration ($d$ feature dimensions). Mini-batch k-means processes batches at a time, scaling to enormous datasets.
Relationship to GMM: k-means is a special case of fitting a Gaussian mixture model with isotropic covariance $\Sigma_k = \sigma^2 I$ in the limit $\sigma \to 0$, with hard assignments replacing soft.
Modern uses:
- Vector quantisation: quantise embeddings to a discrete codebook (used in VQ-VAE, audio coding).
- Image segmentation: cluster pixel features.
- Initialisation for other methods: warm-start GMM, spectral methods.
- Approximate nearest neighbour search: hierarchical k-means tree (FAISS, ScaNN).
K-means is one of the simplest and most widely-deployed unsupervised learning algorithms, fast, effective in many settings, and a baseline against which more sophisticated methods are measured.
Interactive
Video
Related terms: Gaussian Mixture Model, Expectation–Maximisation
Discussed in:
- Chapter 8: Unsupervised Learning, Unsupervised Learning