8.7 Spectral clustering
Spectral clustering reframes clustering as a graph-partitioning problem and solves it with eigenvectors of the graph Laplacian. It excels on data whose clusters are connected but not convex, such as the canonical "two moons" and "concentric rings" examples that defeat k-means.
8.7.1 Building the similarity graph
Given $\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^{n}$, construct an undirected graph $G=(V,E,W)$ with $V=\{1,\dots,n\}$, edge weights $W_{ij}\ge 0$ encoding similarity. Common constructions:
- $\varepsilon$-graph: $W_{ij}=1$ if $\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert\le\varepsilon$, else 0.
- k-nearest-neighbour graph: $W_{ij}>0$ if $j$ is among the $k$ nearest neighbours of $i$ or vice versa.
- Gaussian (RBF) graph: $W_{ij}=\exp(-\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert^2/(2\sigma^2))$.
The choice of graph and bandwidth $\sigma$ shapes the clustering as much as the algorithm itself.
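As a concrete sketch of the Gaussian and $k$-NN constructions (the function names are illustrative, and Euclidean distances are assumed):

```python
import numpy as np

def rbf_affinity(X, sigma=1.0):
    """Dense Gaussian affinity: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                        # guard against negative round-off
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)                        # no self-loops
    return W

def knn_affinity(X, k=10):
    """Symmetric k-NN graph: W_ij = 1 if j is among i's k nearest neighbours or vice versa."""
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    np.fill_diagonal(d2, np.inf)                    # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]             # k nearest neighbours of each point
    W = np.zeros_like(d2)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                       # symmetrise: the "or vice versa" rule
```

Both return symmetric matrices with zero diagonal; the $\varepsilon$-graph is the same pattern with a distance threshold in place of the neighbour count.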
8.7.2 The graph Laplacian
Let $D$ be the diagonal degree matrix $D_{ii}=\sum_j W_{ij}$. The unnormalised Laplacian is $L = D - W$. The symmetric normalised Laplacian is $L_{\text{sym}} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}WD^{-1/2}$. The random-walk Laplacian is $L_{\text{rw}} = D^{-1}L = I - D^{-1}W$.
Key properties:
- $L$ is symmetric positive semi-definite; eigenvalues $0=\lambda_1\le\lambda_2\le\dots\le\lambda_n$.
- The multiplicity of the eigenvalue 0 equals the number of connected components of $G$, and its eigenspace is spanned by the indicator vectors of those components.
- For a "nearly disconnected" graph (well-separated clusters), the bottom $K$ eigenvalues are small and the corresponding eigenvectors approximate the cluster indicators.
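A minimal sketch of the three Laplacians, assuming a symmetric affinity matrix with no isolated vertices (so that $D$ is invertible):

```python
import numpy as np

def laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric affinity matrix W."""
    d = W.sum(axis=1)                   # degrees D_ii
    L = np.diag(d) - W                  # unnormalised Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]  # D^{-1/2} L D^{-1/2}
    L_rw = L / d[:, None]               # D^{-1} L
    return L, L_sym, L_rw
```

On a block-diagonal $W$ the zero eigenvalue of $L$ appears once per block, matching the multiplicity property above.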
8.7.3 Normalised cut and the relaxation
The normalised cut of a partition $V = A\cup\bar A$ is
$$ \mathrm{Ncut}(A,\bar A) = \frac{\mathrm{cut}(A,\bar A)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A,\bar A)}{\mathrm{vol}(\bar A)},\quad \mathrm{cut}(A,\bar A) = \sum_{i\in A,j\in\bar A}W_{ij},\quad \mathrm{vol}(A) = \sum_{i\in A}D_{ii}. $$
Minimising $\mathrm{Ncut}$ over discrete partitions is NP-hard. The continuous relaxation, due to Shi and Malik, is to minimise the Rayleigh quotient
$$ \min_{\mathbf{f}\in\mathbb{R}^n} \frac{\mathbf{f}^{\top}L\mathbf{f}}{\mathbf{f}^{\top}D\mathbf{f}} $$
over $\mathbf{f}\perp D\mathbf{1}$. The minimiser is the eigenvector corresponding to the second-smallest eigenvalue of the generalised problem $L\mathbf{f} = \lambda D\mathbf{f}$, equivalently the second eigenvector of $L_{\text{rw}}$.
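The relaxed solution can be computed with an ordinary symmetric eigensolver via $L_{\text{sym}}$: if $L_{\text{sym}}\mathbf{u}=\lambda\mathbf{u}$, then $\mathbf{f}=D^{-1/2}\mathbf{u}$ solves $L\mathbf{f}=\lambda D\mathbf{f}$. A sketch (the function name is illustrative):

```python
import numpy as np

def fiedler_vector(W):
    """Minimiser of the relaxed Ncut: generalised eigenvector of
    L f = lambda D f with second-smallest eigenvalue, via L_sym."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return d_inv_sqrt * vecs[:, 1]       # f = D^{-1/2} u solves L f = lambda D f
```

Thresholding the sign of $\mathbf{f}$ recovers the two-way partition.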
8.7.4 The algorithm (Ng-Jordan-Weiss)
- Compute $L_{\text{sym}}$ from $W$.
- Compute the bottom $K$ eigenvectors $\mathbf{u}_1,\dots,\mathbf{u}_K$ of $L_{\text{sym}}$. Stack them into $U\in\mathbb{R}^{n\times K}$.
- Normalise rows of $U$ to unit length: $T_{ij}=U_{ij}/\sqrt{\sum_kU_{ik}^2}$.
- Cluster the rows of $T$ with k-means into $K$ groups.
- Assign data point $i$ to the cluster of row $i$.
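The steps above can be sketched end to end on an affinity matrix; a few plain Lloyd iterations stand in for a full k-means implementation:

```python
import numpy as np

def spectral_clustering_njw(W, K, n_iter=50):
    """Sketch of Ng-Jordan-Weiss spectral clustering on an affinity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                   # ascending eigenvalues
    U = vecs[:, :K]                                   # bottom-K eigenvectors
    T = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalise to unit length
    # Lloyd's k-means on the rows of T, initialised with K distinct rows
    centers = [T[0]]
    for t in T[1:]:
        if len(centers) == K:
            break
        if not any(np.allclose(t, c) for c in centers):
            centers.append(t)
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((T[:, None, :] - centers[None, :, :])**2).sum(-1), axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = T[labels == k].mean(axis=0)
    return labels                                     # cluster of data point i = label of row i
```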
8.7.5 The eigengap heuristic
Choose $K$ such that $\lambda_1,\dots,\lambda_K$ are small and $\lambda_{K+1}$ is much larger; that is, such that the eigengap $\lambda_{K+1}-\lambda_K$ is large. This is the spectral analogue of the elbow plot.
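A sketch of the heuristic on an affinity matrix (the function name is illustrative, and $W$ is assumed symmetric with no isolated vertices):

```python
import numpy as np

def choose_K_by_eigengap(W, K_max=10):
    """Return the K with the largest gap lambda_{K+1} - lambda_K
    among the smallest eigenvalues of L_sym."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals = np.linalg.eigvalsh(L_sym)[:K_max + 1]  # ascending eigenvalues
    gaps = np.diff(vals)                          # gaps[k-1] = lambda_{k+1} - lambda_k
    return int(np.argmax(gaps)) + 1
```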
from sklearn.cluster import SpectralClustering

X = ...  # data matrix of shape (n_samples, n_features)
# k-NN affinity with 10 neighbours; scikit-learn symmetrises the graph internally
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10).fit(X)
labels = sc.labels_  # cluster index for each row of X
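For a runnable end-to-end illustration, one can substitute a synthetic dataset; the "two moons" generator and the clustering score below are for demonstration only:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two interleaved half-circles: connected, non-convex clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit(X)
score = adjusted_rand_score(y, sc.labels_)  # agreement with the true moons
```

On data this clean the k-NN graph keeps each moon connected, so the recovered labels agree closely with the generating labels.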
A dense eigendecomposition of $L$ costs $O(n^3)$; sparse Lanczos solvers bring the cost down to roughly linear in $n$ for $k$-nearest-neighbour graphs, since each matrix-vector product is $O(nk)$. Beyond a few tens of thousands of points, approximations (Nyström, randomised eigensolvers) become necessary.