K-Means, Glossary, Textbook of AI

K-means partitions $N$ data points $\{x_n\}_{n=1}^N$ into $K$ clusters by minimising the within-cluster sum of squares (WCSS):

$$\min_{\{\mu_k\}, \{r_{nk}\}} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|^2$$

where $r_{nk} \in \{0, 1\}$ is a hard cluster assignment ($\sum_k r_{nk} = 1$) and $\mu_k$ is the centroid of cluster $k$.

Lloyd's algorithm (1957, the standard k-means procedure):

Initialise $K$ centroids $\mu_k$ (randomly from the data, or by k-means++).
Assignment step: assign each point to its nearest centroid:

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
Update step: recompute each centroid as the mean of its assigned points:

$$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
Repeat 2–3 until assignments stop changing.

Convergence: each step is guaranteed to monotonically decrease the WCSS objective, and there are finitely many possible assignments, so the algorithm converges in a finite number of steps. The result is a local minimum, k-means is sensitive to initialisation and typically requires multiple random restarts.

K-means++ initialisation (Arthur & Vassilvitskii 2007): choose the first centroid uniformly at random; choose each subsequent centroid with probability proportional to $D(x)^2$, where $D(x)$ is the distance from $x$ to its nearest already-chosen centroid. Provably gives an $O(\log K)$-approximation to the optimal WCSS in expectation, and dramatically better empirical results than uniform initialisation.

Choosing K:

Elbow method: plot WCSS vs $K$; the "elbow" of the curve heuristically suggests a good $K$.
Silhouette score: average $(b - a)/\max(a, b)$ where $a$ is mean intra-cluster distance, $b$ mean nearest-cluster distance.
Bayesian Information Criterion, Gap statistic.

Time complexity: $O(N K d)$ per iteration ($d$ feature dimensions). Mini-batch k-means processes batches at a time, scaling to enormous datasets.

Relationship to GMM: k-means is a special case of fitting a Gaussian mixture model with isotropic covariance $\Sigma_k = \sigma^2 I$ in the limit $\sigma \to 0$, with hard assignments replacing soft.

Modern uses:

Vector quantisation: quantise embeddings to a discrete codebook (used in VQ-VAE, audio coding).
Image segmentation: cluster pixel features.
Initialisation for other methods: warm-start GMM, spectral methods.
Approximate nearest neighbour search: hierarchical k-means tree (FAISS, ScaNN).

K-means is one of the simplest and most widely-deployed unsupervised learning algorithms, fast, effective in many settings, and a baseline against which more sophisticated methods are measured.

Interactive

k-means clustering, iteration by iteration. The alternating assign-and-update loop converges three centres into clusters.

Video

Related terms: Gaussian Mixture Model, Expectation–Maximisation

Discussed in:

Chapter 8: Unsupervised Learning, Unsupervised Learning

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.