8.16 Evaluation of unsupervised methods

Without labels, evaluation is harder than in supervised learning, but several principled approaches exist.

8.16.1 Internal indices for clustering

Internal indices score a clustering using only the data and the partition itself, with no reference labels.

  • Silhouette coefficient (§8.3.4): mean over points of $(b(i)-a(i))/\max(a(i),b(i))$.
  • Davies-Bouldin index: $DB = \frac{1}{K}\sum_k\max_{j\neq k}\frac{s_k+s_j}{d_{kj}}$, where $s_k$ is intra-cluster scatter and $d_{kj}$ is inter-centroid distance. Lower is better.
  • Calinski-Harabasz (variance ratio) index: $CH = \frac{\mathrm{tr}(B_K)/(K-1)}{\mathrm{tr}(W_K)/(n-K)}$, where $B_K$ is between-cluster dispersion and $W_K$ is within-cluster dispersion. Higher is better.

Internal indices reward "ball-shaped" clusters; they will favour k-means-like solutions even on data where DBSCAN finds the true structure.
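The two closed-form indices above are straightforward to compute directly from their definitions. A minimal numpy sketch (function names are ours, not from a library; `s_k` is taken as the mean distance of a cluster's points to its centroid):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Variance-ratio criterion: tr(B_K)/(K-1) over tr(W_K)/(n-K)."""
    n = len(X)
    ks = np.unique(labels)
    K = len(ks)
    mean = X.mean(axis=0)
    B = 0.0  # between-cluster dispersion tr(B_K)
    W = 0.0  # within-cluster dispersion tr(W_K)
    for k in ks:
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.sum((mk - mean) ** 2)
        W += np.sum((Xk - mk) ** 2)
    return (B / (K - 1)) / (W / (n - K))

def davies_bouldin(X, labels):
    """Mean over clusters of the worst ratio (s_k + s_j) / d_kj."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # s_k: mean distance of cluster k's points to its centroid
    s = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
                  for i, k in enumerate(ks)])
    K = len(ks)
    total = 0.0
    for i in range(K):
        ratios = [(s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(K) if j != i]
        total += max(ratios)  # most-confusable neighbouring cluster
    return total / K
```

On two well-separated blobs, the correct partition scores higher on CH and lower on DB than a shuffled one, as the definitions predict.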

8.16.2 External indices

When some ground-truth labels exist on a held-out subset, use:

  • Adjusted Rand index (ARI): chance-corrected agreement between two partitions.
  • Normalised mutual information (NMI): $\mathrm{NMI}(U,V)=2I(U;V)/(H(U)+H(V))$.
  • Adjusted mutual information (AMI): chance-corrected NMI.
  • Homogeneity, completeness, V-measure.

These compare a learned partition to a reference partition without requiring a one-to-one correspondence between label values: cluster 3 in one partition may correspond to cluster 0 in the other.
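The ARI illustrates the label-matching invariance concretely: it counts pairs of points placed together (or apart) in both partitions, then corrects for chance agreement. A self-contained numpy sketch (the function name is ours):

```python
import numpy as np

def adjusted_rand_index(u, v):
    """Chance-corrected pair-counting agreement between two partitions."""
    u, v = np.asarray(u), np.asarray(v)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    # contingency table: C[i, j] = points in cluster i of u and cluster j of v
    C = np.zeros((ui.max() + 1, vi.max() + 1), dtype=np.int64)
    np.add.at(C, (ui, vi), 1)
    comb2 = lambda x: x * (x - 1) / 2.0          # pairs within a count
    sum_ij = comb2(C).sum()                       # pairs together in both
    sum_a = comb2(C.sum(axis=1)).sum()            # pairs together in u
    sum_b = comb2(C.sum(axis=0)).sum()            # pairs together in v
    expected = sum_a * sum_b / comb2(len(u))      # chance agreement
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Relabelling a partition leaves the score unchanged: `[0,0,1,1]` versus `[1,1,0,0]` gives ARI 1, while a crossed pairing such as `[0,0,1,1]` versus `[0,1,0,1]` scores at or below zero.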

8.16.3 Reconstruction error and held-out likelihood

For PCA, autoencoders and density estimators, report reconstruction error or held-out log-likelihood on a test set. This is the unsupervised analogue of test loss.
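For PCA this is cheap to compute: fit the principal directions on training data, then measure mean squared reconstruction error on held-out data as a function of the number of components. A minimal sketch with synthetic data (the data and function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: nearly all variance lives in the first 2 of 5 dimensions
scales = np.array([5.0, 3.0, 0.1, 0.1, 0.1])
X_train = rng.normal(size=(200, 5)) * scales
X_test = rng.normal(size=(50, 5)) * scales

# fit PCA on the training set only
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)

def recon_error(X, k):
    """Mean squared error of reconstructing X from its top-k projection."""
    V = Vt[:k].T                 # (d, k) loadings learned on training data
    Z = (X - mu) @ V             # project onto the top-k subspace
    X_hat = Z @ V.T + mu         # map back to the original space
    return np.mean((X - X_hat) ** 2)

# held-out error drops sharply once k covers the true signal dimensions
errs = [recon_error(X_test, k) for k in (1, 2, 5)]
```

The elbow in this held-out error curve is a more honest guide to the number of useful components than the training-set error, which always decreases with `k`.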

8.16.4 Downstream-task evaluation

The most pragmatic evaluation: feed the learned representations or clusters into a downstream supervised task and measure performance there. If your PCA, autoencoder or LDA representation improves a classifier on a held-out test set, the unsupervised method is doing useful work; if not, it is not. This is the standard for self-supervised learning (next section).
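As a toy illustration of the protocol, the sketch below (all names and the nearest-centroid classifier are our illustrative choices, not a prescribed pipeline) fits PCA on training rows only, then compares downstream accuracy on held-out rows using raw features versus the one-dimensional PCA representation:

```python
import numpy as np

rng = np.random.default_rng(1)
# two classes separated along one noisy direction, plus 9 pure-noise features
n = 100
y = np.repeat([0, 1], n)
signal = np.where(y == 0, -2.0, 2.0)[:, None]
X = np.hstack([signal + rng.normal(size=(2 * n, 1)),
               rng.normal(size=(2 * n, 9))])
train = rng.permutation(2 * n)[:120]
test = np.setdiff1d(np.arange(2 * n), train)

def nearest_centroid_acc(F, y, train, test):
    """Fit class centroids on training rows of features F; score test rows."""
    cents = np.array([F[train][y[train] == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(F[test][:, None, :] - cents[None], axis=2)
    return np.mean(d.argmin(axis=1) == y[test])

# learn the representation on training rows only, then apply it everywhere
mu = X[train].mean(axis=0)
_, _, Vt = np.linalg.svd(X[train] - mu, full_matrices=False)
Z = (X - mu) @ Vt[:1].T          # 1-D PCA representation

acc_raw = nearest_centroid_acc(X, y, train, test)
acc_pca = nearest_centroid_acc(Z, y, train, test)
```

The key discipline is the same as in supervised evaluation: the representation must be learned on the training split only, and the comparison made on the held-out split.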

8.16.5 Stability and reproducibility

Run the algorithm with different random seeds, subsamples and initialisations. A "real" structure should be reproducible; one that vanishes under bootstrap is suspect. The prediction strength (Tibshirani & Walther 2005) and the Jaccard index across bootstrap samples (Hennig 2007) formalise this.
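One simple label-invariant ingredient of such checks is the Jaccard overlap between the sets of point pairs that two runs place in the same cluster. This is a simplified sketch of the idea, not Hennig's full per-cluster bootstrap procedure; the function name is ours:

```python
import numpy as np

def comembership_jaccard(u, v):
    """Jaccard overlap of the point pairs co-clustered by u and by v."""
    u, v = np.asarray(u), np.asarray(v)
    same_u = u[:, None] == u[None, :]
    same_v = v[:, None] == v[None, :]
    iu = np.triu_indices(len(u), k=1)   # each unordered pair counted once
    a, b = same_u[iu], same_v[iu]
    return np.sum(a & b) / np.sum(a | b)

u = [0, 0, 0, 1, 1, 1]
# the same partition with renamed labels scores exactly 1
stable = comembership_jaccard(u, [1, 1, 1, 0, 0, 0])
# a partition that disagrees on several points scores much lower
perturbed = comembership_jaccard(u, [0, 0, 1, 1, 1, 0])
```

A clustering whose co-membership structure survives reseeding and resampling scores near 1 on every pair of runs; one that does not is an artefact of the algorithm rather than a property of the data.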
