8.16 Evaluation of unsupervised methods
Without labels, evaluation is harder than in supervised learning, but several principled approaches exist.
8.16.1 Internal indices for clustering
Internal indices score a clustering using only the data and the partition itself, with no reference labels.
- Silhouette coefficient (§8.3.4): mean over points of $(b(i)-a(i))/\max(a(i),b(i))$.
- Davies-Bouldin index: $DB = \frac{1}{K}\sum_k\max_{j\neq k}\frac{s_k+s_j}{d_{kj}}$, where $s_k$ is intra-cluster scatter and $d_{kj}$ is inter-centroid distance. Lower is better.
- Calinski-Harabasz (variance ratio) index: $CH = \frac{\mathrm{tr}(B_K)/(K-1)}{\mathrm{tr}(W_K)/(n-K)}$, where $B_K$ is between-cluster dispersion and $W_K$ is within-cluster dispersion. Higher is better.
A caveat: all three indices reward compact, "ball-shaped" clusters, so they will favour k-means-like solutions even on data where a density-based method such as DBSCAN recovers the true structure.
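The three indices above are available directly in scikit-learn. The following sketch computes them for a k-means partition of synthetic blob data (the dataset and cluster count are illustrative assumptions, not prescribed here):

```python
# Compute the three internal indices for a k-means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better
print(f"silhouette={sil:.3f}  DB={db:.3f}  CH={ch:.1f}")
```

A common use is to sweep the number of clusters $K$ and pick the value that optimises one of these scores, keeping the caveat about cluster shape in mind.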
8.16.2 External indices
When some ground-truth labels exist on a held-out subset, use:
- Adjusted Rand index (ARI): chance-corrected agreement between two partitions.
- Normalised mutual information (NMI): $\mathrm{NMI}(U,V)=2I(U;V)/(H(U)+H(V))$.
- Adjusted mutual information (AMI): chance-corrected NMI.
- Homogeneity, completeness, V-measure.
All of these compare a learned partition to a reference partition without requiring that cluster identifiers match the label identifiers.
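A small sketch with scikit-learn makes the label-invariance concrete: the toy partitions below use permuted cluster IDs with a single misassigned point, so the indices score high but below 1 (the data is purely illustrative):

```python
# External indices: compare a predicted partition to reference labels.
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score,
                             homogeneity_completeness_v_measure)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # IDs permuted, one point misassigned

ari = adjusted_rand_score(truth, pred)
nmi = normalized_mutual_info_score(truth, pred)
ami = adjusted_mutual_info_score(truth, pred)
h, c, v = homogeneity_completeness_v_measure(truth, pred)
```

With `pred` equal to `truth` up to any relabelling, all of these would return exactly 1; the misassigned point pulls each below 1.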
8.16.3 Reconstruction error and held-out likelihood
For PCA, autoencoders, density estimators: reconstruction error or held-out log-likelihood on a test set. This is the unsupervised analogue of test loss.
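For PCA this is simple to carry out: fit on training data only, then measure reconstruction error on a held-out split. A minimal sketch, assuming synthetic correlated data:

```python
# Held-out reconstruction error for PCA: the unsupervised analogue of test loss.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated features
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

pca = PCA(n_components=5).fit(X_train)            # fit on training data only
X_hat = pca.inverse_transform(pca.transform(X_test))
test_mse = float(np.mean((X_test - X_hat) ** 2))  # held-out reconstruction error
```

Sweeping `n_components` and plotting `test_mse` gives an error curve analogous to a supervised validation curve; for probabilistic models one would instead track held-out log-likelihood (e.g. `score` in scikit-learn's `PCA` or `GaussianMixture`).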
8.16.4 Downstream-task evaluation
The most pragmatic evaluation: feed the learned representations or clusters into a downstream supervised task and measure performance there. If your PCA, autoencoder or LDA representation improves a classifier on a held-out test set, the unsupervised method is doing useful work; if not, it is not. This is the standard for self-supervised learning (next section).
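As a concrete sketch of this protocol, one can compare a classifier trained on raw features against the same classifier trained on an unsupervised representation, scoring both on the same held-out test set. The dataset, classifier, and dimensionality below are illustrative assumptions:

```python
# Downstream-task evaluation: does a PCA representation help a classifier?
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=30),
                         LogisticRegression(max_iter=5000))

acc_raw = baseline.fit(X_tr, y_tr).score(X_te, y_te)  # raw features
acc_pca = with_pca.fit(X_tr, y_tr).score(X_te, y_te)  # learned representation
```

The key point is that the comparison is made on held-out data: the representation is judged by the test accuracy it enables, not by any internal criterion.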
8.16.5 Stability and reproducibility
Run the algorithm with different random seeds, subsamples and initialisations. A "real" structure should be reproducible; one that vanishes under bootstrap is suspect. The prediction strength (Tibshirani & Walther 2005) and the Jaccard index across bootstrap samples (Hennig 2007) formalise this.
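A minimal stability check in this spirit: recluster bootstrap resamples, label the full dataset with each fitted model, and measure pairwise agreement between runs. This sketch uses ARI as the agreement measure rather than Hennig's per-cluster Jaccard index, purely for brevity:

```python
# Bootstrap stability: pairwise agreement between clusterings of resamples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
rng = np.random.default_rng(1)

def bootstrap_labels(X, k, rng):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
    km = KMeans(n_clusters=k, n_init=10,
                random_state=int(rng.integers(1_000_000))).fit(X[idx])
    return km.predict(X)                               # label the full dataset

runs = [bootstrap_labels(X, 3, rng) for _ in range(5)]
scores = [adjusted_rand_score(runs[i], runs[j])
          for i in range(5) for j in range(i + 1, 5)]
stability = float(np.mean(scores))   # near 1 => reproducible structure
```

On well-separated data the agreement scores sit near 1; a "structure" whose agreement collapses toward 0 under resampling was likely an artefact of one run.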