8.16 Evaluation of unsupervised methods
Without labels, evaluation is harder than in supervised learning, but several principled approaches exist.
8.16.1 Internal indices for clustering
Internal indices score a clustering using only the data and the partition itself, with no reference labels.
- Silhouette coefficient (§8.3.4): mean over points of $(b(i)-a(i))/\max(a(i),b(i))$.
- Davies-Bouldin index: $DB = \frac{1}{K}\sum_k\max_{j\neq k}\frac{s_k+s_j}{d_{kj}}$, where $s_k$ is intra-cluster scatter and $d_{kj}$ is inter-centroid distance. Lower is better.
- Calinski-Harabasz (variance ratio) index: $CH = \frac{\mathrm{tr}(B_K)/(K-1)}{\mathrm{tr}(W_K)/(n-K)}$, where $B_K$ is between-cluster dispersion and $W_K$ is within-cluster dispersion. Higher is better.
A caveat: all three indices reward compact, "ball-shaped" clusters, so they will favour k-means-like solutions even on data where a density-based method such as DBSCAN recovers the true structure.
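The three indices above are available directly in scikit-learn. The following sketch computes them for a k-means partition of synthetic blob data (the dataset and cluster count are illustrative assumptions, not prescribed here):

```python
# Compute the three internal indices for a k-means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better
print(f"silhouette={sil:.3f}  DB={db:.3f}  CH={ch:.1f}")
```

A common use is to sweep the number of clusters $K$ and pick the value that optimises one of these scores, keeping the caveat about cluster shape in mind.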
8.16.2 External indices
When some ground-truth labels exist on a held-out subset, use:
- Adjusted Rand index (ARI): chance-corrected agreement between two partitions.
- Normalised mutual information (NMI): $\mathrm{NMI}(U,V)=2I(U;V)/(H(U)+H(V))$.
- Adjusted mutual information (AMI): chance-corrected NMI.
- Homogeneity, completeness, V-measure.
All of these compare a learned partition to a reference partition without requiring that cluster identifiers match the label identifiers.
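A small sketch with scikit-learn makes the label-invariance concrete: the toy partitions below use permuted cluster IDs with a single misassigned point, so the indices score high but below 1 (the data is purely illustrative):

```python
# External indices: compare a predicted partition to reference labels.
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             homogeneity_completeness_v_measure)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # IDs permuted, one point misassigned

ari = adjusted_rand_score(truth, pred)
nmi = normalized_mutual_info_score(truth, pred)
ami = adjusted_mutual_info_score(truth, pred)
h, c, v = homogeneity_completeness_v_measure(truth, pred)
```

With `pred` equal to `truth` up to any relabelling, all of these would return exactly 1; the misassigned point pulls each below 1.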
8.16.3 Reconstruction error and held-out likelihood
For PCA, autoencoders, density estimators: reconstruction error or held-out log-likelihood on a test set. This is the unsupervised analogue of test loss.
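For PCA this is simple to carry out: fit on training data only, then measure reconstruction error on a held-out split. A minimal sketch, assuming synthetic correlated data:

```python
# Held-out reconstruction error for PCA: the unsupervised analogue of test loss.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated features
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

pca = PCA(n_components=5).fit(X_train)            # fit on training data only
X_hat = pca.inverse_transform(pca.transform(X_test))
test_mse = float(np.mean((X_test - X_hat) ** 2))  # held-out reconstruction error
```

Sweeping `n_components` and plotting `test_mse` gives an error curve analogous to a supervised validation curve; for probabilistic models one would instead track held-out log-likelihood (e.g. `score` in scikit-learn's `PCA` or `GaussianMixture`).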
8.16.4 Downstream-task evaluation
The most pragmatic evaluation: feed the learned representations or clusters into a downstream supervised task and measure performance there. If your PCA, autoencoder or LDA representation improves a classifier on a held-out test set, the unsupervised method is doing useful work; if not, it is not. This is the standard for self-supervised learning (next section).
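As a concrete sketch of this protocol, one can compare a classifier trained on raw features against the same classifier trained on an unsupervised representation, scoring both on the same held-out test set. The dataset, classifier, and dimensionality below are illustrative assumptions:

```python
# Downstream-task evaluation: does a PCA representation help a classifier?
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=30),
                         LogisticRegression(max_iter=5000))

acc_raw = baseline.fit(X_tr, y_tr).score(X_te, y_te)  # raw features
acc_pca = with_pca.fit(X_tr, y_tr).score(X_te, y_te)  # learned representation
```

The key point is that the comparison is made on held-out data: the representation is judged by the test accuracy it enables, not by any internal criterion.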
8.16.5 Stability and reproducibility
Run the algorithm with different random seeds, subsamples and initialisations. A "real" structure should be reproducible; one that vanishes under bootstrap is suspect. The prediction strength (Tibshirani & Walther 2005) and the Jaccard index across bootstrap samples (Hennig 2007) formalise this.
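A minimal stability check in this spirit: recluster bootstrap resamples, label the full dataset with each fitted model, and measure pairwise agreement between runs. This sketch uses ARI as the agreement measure rather than Hennig's per-cluster Jaccard index, purely for brevity:

```python
# Bootstrap stability: pairwise agreement between clusterings of resamples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
rng = np.random.default_rng(1)

def bootstrap_labels(X, k, rng):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
    km = KMeans(n_clusters=k, n_init=10,
                random_state=int(rng.integers(1_000_000))).fit(X[idx])
    return km.predict(X)                               # label the full dataset

runs = [bootstrap_labels(X, 3, rng) for _ in range(5)]
scores = [adjusted_rand_score(runs[i], runs[j])
          for i in range(5) for j in range(i + 1, 5)]
stability = float(np.mean(scores))   # near 1 => reproducible structure
```

On well-separated data the agreement scores sit near 1; a "structure" whose agreement collapses toward 0 under resampling was likely an artefact of one run.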