8.18 Exercises

Conceptual

8.1 Why is unsupervised learning harder to evaluate than supervised learning? Give three concrete failure modes.

8.2 Show that for a single Gaussian cluster in $\mathbb{R}^d$, the maximum-likelihood mean estimator is the empirical mean. State the form of the covariance estimator and indicate where the factor of $1/n$ versus $1/(n-1)$ comes from.

8.3 Prove that Lloyd's algorithm decreases the WCSS at every iteration in which the cluster assignments change.

8.4 Show that hard EM with isotropic Gaussian components and $\sigma^2\to 0$ recovers k-means.

8.5 Explain why k-means struggles with elongated clusters. Sketch a 2D dataset with two elongated clusters that would be split by k-means and merged by Ward's hierarchical clustering.

8.6 Describe a dataset for which single-linkage agglomerative clustering chains noise points into one large cluster but Ward's method does not.

8.7 What is the relationship between PCA and probabilistic PCA? In what limit do they coincide?

8.8 Show that, for centred data, the principal components from the SVD $X=U\Sigma V^{\top}$ are the columns of $V$. Express the explained variance in terms of the singular values.

8.9 Why does kernel PCA require centring the kernel matrix rather than the data? Give the formula for the centred Gram matrix.

8.10 State two reasons why a low-perplexity t-SNE plot can mislead an analyst into believing in clusters that are not present in the data.

8.11 Explain what the perplexity hyperparameter in t-SNE controls, in terms of the effective neighbourhood size around each point.

8.12 Describe how UMAP's loss function differs from t-SNE's KL loss, and why this contributes to UMAP preserving global structure better.

8.13 What is the manifold hypothesis? Give an example from medical imaging where it is plausible.

8.14 Compare DBSCAN, HDBSCAN and OPTICS on the criterion of "robustness to varying cluster densities."

8.15 Suppose your data has two well-separated clusters of size 50 and 5000. Which clustering method (k-means, Ward, DBSCAN, GMM, spectral) is most likely to merge them? Justify.

8.16 Why is the silhouette score biased towards spherical clusters? Give an example dataset where silhouette and visual inspection disagree.

8.17 What does the eigengap heuristic say about choosing $K$ in spectral clustering? In what situation does it fail?

8.18 Derive the EM updates for a mixture of Bernoulli distributions over binary vectors.

8.19 Why is identifiability a problem for mixture models? How does it affect interpretation of the components?

8.20 Sketch the LDA generative model as a plate diagram. Mark which variables are observed, which are latent, and which are hyperparameters.

Mathematical

8.21 Prove that the mean silhouette of a clustering is bounded by the silhouette of the optimal $K$-clustering of the same data.

8.22 Derive the M-step updates for a GMM with shared (tied) covariance $\boldsymbol{\Sigma}$ across components.

8.23 Show that maximising variance and minimising reconstruction error in PCA are equivalent.

8.24 Derive the maximum-likelihood estimator for $W$ and $\sigma^2$ in probabilistic PCA. Show how PPCA collapses to PCA as $\sigma^2\to 0$.

8.25 Compute the gradient of the t-SNE KL objective with respect to $\mathbf{y}^{(i)}$.

8.26 Show that the unnormalised graph Laplacian $L = D - W$ is positive semi-definite.

8.27 Derive the collapsed Gibbs sampler conditional for LDA, starting from the joint distribution.

8.28 For LOF, prove that for points drawn from the same locally uniform density, $\mathrm{LOF}\to 1$ as the sample size grows.

Computational

8.29 Implement k-means with k-means++ initialisation in NumPy. Reproduce the WCSS-vs-K elbow plot for a synthetic three-Gaussian dataset.

8.30 Implement EM for a 2-component diagonal-covariance GMM in 2D. Animate the iterations.

8.31 Use scikit-learn to cluster the iris dataset with k-means, GMM, hierarchical (Ward), DBSCAN and spectral. Report ARI against the true species labels.

8.32 Run PCA on the MNIST digits. Plot the cumulative variance explained as a function of $k$. How many components are needed for 95% variance?

8.33 Compare t-SNE and UMAP on MNIST 2D embeddings at three perplexities (5, 30, 100) and three n_neighbors values (5, 15, 50). Comment on stability.

8.34 Train a denoising autoencoder on Fashion-MNIST. Compare reconstruction error of clean and noisy test inputs.

8.35 Run LDA with $K=20$ on a corpus of news articles. Report the top-10 words per topic and write one-sentence interpretations of five topics.

8.36 Implement isolation forest from scratch. Apply to a synthetic 2D dataset with 1% planted anomalies and compute AUROC.

8.37 Build a k-distance plot for the Mall Customers dataset and choose an $\varepsilon$ for DBSCAN. Compare the resulting clustering to k-means at $K=5$.

Solution sketches

Solution to 8.3. WCSS at iteration $t$ is $J_t = \sum_i\lVert\mathbf{x}^{(i)}-\boldsymbol{\mu}_{z^{(i)}_t}^{(t)}\rVert^2$. The assignment step at iteration $t+1$ chooses $z^{(i)}_{t+1}=\arg\min_k\lVert\mathbf{x}^{(i)}-\boldsymbol{\mu}_k^{(t)}\rVert^2$, so $\sum_i\lVert\mathbf{x}^{(i)}-\boldsymbol{\mu}_{z^{(i)}_{t+1}}^{(t)}\rVert^2 \le J_t$. The update step replaces each centroid by the mean of its assigned points; for any cluster $C_k$, $\sum_{i\in C_k}\lVert\mathbf{x}^{(i)}-\mathbf{m}\rVert^2$ is minimised by $\mathbf{m} = \bar{\mathbf{x}}_{C_k}$, so the centroid update further weakly decreases the sum. Composing the two steps gives $J_{t+1}\le J_t$, with strict decrease whenever an assignment changes, provided ties are broken in favour of the current assignment (so an assignment only changes when it strictly reduces that point's distance).
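
A quick numerical check of the argument (a minimal NumPy sketch; the three-Gaussian data, K = 3 and the random initialisation are arbitrary choices): run the two steps repeatedly and assert that the WCSS never increases.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ([0, 0], [4, 0], [2, 3])])
K = 3
mu = X[rng.choice(len(X), K, replace=False)]      # random initial centroids

def wcss(X, mu, z):
    return ((X - mu[z]) ** 2).sum()

prev = np.inf
for _ in range(20):
    z = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(1)   # assignment step
    mu = np.vstack([X[z == k].mean(0) if (z == k).any() else mu[k] for k in range(K)])  # update step
    cur = wcss(X, mu, z)
    assert cur <= prev + 1e-9                     # J_{t+1} <= J_t
    prev = cur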

Solution to 8.4. With $\boldsymbol{\Sigma}_k=\sigma^2\mathbf{I}$, the responsibility $\gamma_{ik}\propto \pi_k\exp(-\lVert\mathbf{x}^{(i)}-\boldsymbol{\mu}_k\rVert^2/(2\sigma^2))$. As $\sigma^2\to 0$, $\gamma_{ik}\to 1$ for $k=\arg\min_j\lVert\mathbf{x}^{(i)}-\boldsymbol{\mu}_j\rVert^2$ and 0 otherwise (assuming a unique minimiser). The M-step becomes $\boldsymbol{\mu}_k = \bar{\mathbf{x}}_{C_k}$, the k-means centroid update. The mixing weights $\pi_k=N_k/n$ become irrelevant to the assignment because the exponential factor dominates.
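
To see the collapse numerically (a small sketch with made-up numbers; the point, the two means and the mixing weights are arbitrary): compute the responsibilities at decreasing $\sigma^2$ and watch them harden to 0/1, ignoring $\pi$.

import numpy as np

x = np.array([1.0, 0.0])
mus = np.array([[0.0, 0.0], [3.0, 0.0]])          # component means
pis = np.array([0.3, 0.7])                        # mixing weights

for var in [4.0, 1.0, 0.25, 0.01]:
    logits = np.log(pis) - ((x - mus) ** 2).sum(1) / (2 * var)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(var, gamma)                             # gamma -> [1, 0]: the nearer mean wins despite its smaller pi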

Solution to 8.7. PPCA is a Gaussian latent-variable model with $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_k)$ and $\mathbf{x}\mid\mathbf{z}\sim\mathcal{N}(W\mathbf{z}+\boldsymbol{\mu},\sigma^2\mathbf{I}_d)$. Marginalising $\mathbf{z}$, $\mathbf{x}\sim\mathcal{N}(\boldsymbol{\mu},WW^{\top}+\sigma^2\mathbf{I})$. The MLE for $W$ aligns its columns with the top-$k$ eigenvectors of the empirical covariance, scaled by $\sqrt{\lambda_i-\sigma^2}$. As $\sigma^2\to 0$, the scaling $\sqrt{\lambda_i}$ recovers PCA's principal directions exactly. PPCA additionally provides a likelihood (allowing model selection by BIC) and a generative process (sample $\mathbf{x}$ via $\mathbf{z}$).
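
A small numerical check of the closed form (sketch; synthetic correlated data and $k=2$ are arbitrary; assumes the standard MLE $\hat\sigma^2$ = mean of the discarded eigenvalues and $W_{\mathrm{ML}} = V_k(\Lambda_k-\hat\sigma^2\mathbf{I})^{1/2}$, which is determined only up to rotation): the columns of $W_{\mathrm{ML}}$ span the leading principal subspace.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(0)
S = Xc.T @ Xc / len(Xc)
lam, V = np.linalg.eigh(S)
lam, V = lam[::-1], V[:, ::-1]                            # eigenvalues in descending order

k = 2
sigma2 = lam[k:].mean()                                   # MLE of the noise variance
W = V[:, :k] * np.sqrt(lam[:k] - sigma2)                  # MLE of W (up to rotation)

Q, _ = np.linalg.qr(W)
print(np.allclose(V[:, :k] @ V[:, :k].T, Q @ Q.T))        # same projector, hence same subspace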

Solution to 8.8. $X = U\Sigma V^{\top}$ with $X$ centred. The empirical covariance is $S = X^{\top}X/n = V(\Sigma^2/n)V^{\top}$, which is $S$'s eigendecomposition with eigenvectors $V$ and eigenvalues $\sigma_i^2/n$. The variance along the $i$-th principal direction is $\sigma_i^2/n$. The total variance is $\mathrm{tr}(S)=\sum_i\sigma_i^2/n$, so the proportion explained by the first $k$ components is $(\sum_{i=1}^{k}\sigma_i^2)/(\sum_{i=1}^{d}\sigma_i^2)$.
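
A direct check (sketch using scikit-learn; PCA centres internally, its component signs may be flipped, and it divides variances by $n-1$ rather than $n$, so compare absolute values and variance ratios):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = PCA().fit(X)

print(np.allclose(np.abs(Vt), np.abs(pca.components_)))                 # rows of V^T are the PCs
print(np.allclose(s**2 / (s**2).sum(), pca.explained_variance_ratio_))  # ratios are denominator-free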

Solution to 8.9. Centring $\phi(\mathbf{x}^{(i)})$ to have zero mean in feature space corresponds to subtracting $\bar\phi=\frac{1}{n}\sum_j\phi(\mathbf{x}^{(j)})$ from every $\phi(\mathbf{x}^{(i)})$. The centred kernel entries are $\tilde K_{ij} = (\phi(\mathbf{x}^{(i)})-\bar\phi)^{\top}(\phi(\mathbf{x}^{(j)})-\bar\phi) = K_{ij} - \frac{1}{n}\sum_k K_{ik} - \frac{1}{n}\sum_k K_{kj} + \frac{1}{n^2}\sum_{k\ell}K_{k\ell}$. In matrix form, $\tilde K = (I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top})K(I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top})$.
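
A numerical sketch (the RBF kernel and $\gamma=0.5$ are arbitrary): build $H = I-\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ and confirm the doubly centred Gram matrix has zero row and column means.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, gamma=0.5)

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n      # centring matrix
K_tilde = H @ K @ H                      # double centring

print(np.allclose(K_tilde.mean(0), 0), np.allclose(K_tilde.mean(1), 0))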

Solution to 8.18. Mixture of $K$ Bernoullis: $p(\mathbf{x};\boldsymbol{\theta})=\sum_k\pi_k\prod_{j=1}^{d}\mu_{kj}^{x_j}(1-\mu_{kj})^{1-x_j}$. E-step: $\gamma_{ik}=\pi_k\prod_j\mu_{kj}^{x_j^{(i)}}(1-\mu_{kj})^{1-x_j^{(i)}}/Z_i$. M-step: $\pi_k=\sum_i\gamma_{ik}/n$, $\mu_{kj}=\sum_i\gamma_{ik}x_j^{(i)}/\sum_i\gamma_{ik}$. The M-step is just weighted maximum-likelihood for a Bernoulli per dimension and component.
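
The updates in code (a minimal NumPy sketch; K = 2, the synthetic binary data and the initialisation are arbitrary; responsibilities are computed in log space to avoid underflow):

import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((500, 10)) < np.where(rng.integers(2, size=(500, 1)), 0.8, 0.2)).astype(float)

n, d = X.shape
K = 2
pi = np.full(K, 1 / K)
mu = rng.uniform(0.25, 0.75, size=(K, d))

for _ in range(50):
    # E-step: responsibilities from per-component Bernoulli log-likelihoods
    log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
    gamma = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    # M-step: weighted Bernoulli maximum likelihood
    Nk = gamma.sum(0)
    pi = Nk / n
    mu = np.clip(gamma.T @ X / Nk[:, None], 1e-6, 1 - 1e-6)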

Solution to 8.22. With shared $\boldsymbol{\Sigma}$, the M-step over $\boldsymbol{\mu}_k$ and $\pi_k$ is unchanged. For $\boldsymbol{\Sigma}$, take the derivative of the expected complete-data log-likelihood:

$$ Q(\boldsymbol{\Sigma}) = -\tfrac{1}{2}\sum_{i,k}\gamma_{ik}\bigl[\log\lvert\boldsymbol{\Sigma}\rvert + (\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)\bigr] + \text{const}. $$

Setting $\partial Q/\partial\boldsymbol{\Sigma}^{-1}=0$ and using $\sum_{ik}\gamma_{ik}=n$, the maximiser is

$$ \boldsymbol{\Sigma} = \frac{1}{n}\sum_{i,k}\gamma_{ik}(\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)(\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)^{\top}. $$

This is a soft within-cluster scatter matrix.
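
The update as code (a sketch; X, gamma and mus stand for the data, the E-step responsibilities and the current component means):

import numpy as np

def tied_covariance(X, gamma, mus):
    # X: (n, d), gamma: (n, K), mus: (K, d); returns the shared Sigma of shape (d, d)
    n, d = X.shape
    Sigma = np.zeros((d, d))
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                              # (n, d)
        Sigma += (gamma[:, k, None] * diff).T @ diff   # soft scatter of component k
    return Sigma / n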

Solution to 8.23. Let $W\in\mathbb{R}^{d\times k}$ have orthonormal columns. The reconstruction is $WW^{\top}\tilde{\mathbf{x}}^{(i)}$ and the squared error is $\lVert\tilde{\mathbf{x}}^{(i)}\rVert^2 - \lVert W^{\top}\tilde{\mathbf{x}}^{(i)}\rVert^2$. Summing and dividing by $n$, the average squared error is $\mathrm{tr}(S)-\mathrm{tr}(W^{\top}SW)$. Maximising the second term (variance of projection) is equivalent to minimising the first (reconstruction error), with the same maximiser: the top-$k$ eigenvectors of $S$.
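
A numerical check of the trace identity (sketch; a random orthonormal $W$ obtained by QR stands in for an arbitrary candidate projection):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
Xc = X - X.mean(0)
S = Xc.T @ Xc / len(Xc)

W, _ = np.linalg.qr(rng.normal(size=(6, 3)))            # random orthonormal d x k
recon_err = ((Xc - Xc @ W @ W.T) ** 2).sum(1).mean()    # average squared reconstruction error
print(np.isclose(recon_err, np.trace(S) - np.trace(W.T @ S @ W)))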

Solution to 8.25. Let $f_{ij} = (1+\lVert\mathbf{y}^{(i)}-\mathbf{y}^{(j)}\rVert^2)^{-1}$, $Z=\sum_{k\neq\ell}f_{k\ell}$, $q_{ij}=f_{ij}/Z$. Differentiating $\mathrm{KL}(P\|Q)=\sum_{ij}p_{ij}\log p_{ij}-\sum_{ij}p_{ij}\log q_{ij}$ gives

$$ \frac{\partial\mathrm{KL}}{\partial\mathbf{y}^{(i)}} = 4\sum_j(p_{ij}-q_{ij})f_{ij}(\mathbf{y}^{(i)}-\mathbf{y}^{(j)}). $$

This is the t-SNE update: pull towards $j$ when $p_{ij}>q_{ij}$ (more attraction needed), push away when $p_{ij}<q_{ij}$.
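
A finite-difference check of this expression (sketch; the small symmetric $P$ below is made up purely for the test, and only one coordinate of one point is perturbed):

import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0)
P /= P.sum()                                        # symmetric, zero diagonal, sums to 1
Y = rng.normal(size=(n, 2))

def kl_and_grad(Y):
    D2 = ((Y[:, None] - Y[None]) ** 2).sum(-1)
    F = 1.0 / (1.0 + D2)
    np.fill_diagonal(F, 0)
    Q = F / F.sum()
    kl = np.sum(P[P > 0] * np.log(P[P > 0] / Q[P > 0]))
    grad = 4 * (((P - Q) * F)[:, :, None] * (Y[:, None] - Y[None])).sum(1)
    return kl, grad

_, grad = kl_and_grad(Y)
i, a, eps = 2, 0, 1e-6
Yp, Ym = Y.copy(), Y.copy()
Yp[i, a] += eps
Ym[i, a] -= eps
print(np.isclose((kl_and_grad(Yp)[0] - kl_and_grad(Ym)[0]) / (2 * eps), grad[i, a]))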

Solution to 8.26. For any $\mathbf{f}\in\mathbb{R}^n$,

$$ \mathbf{f}^{\top}L\mathbf{f} = \mathbf{f}^{\top}D\mathbf{f}-\mathbf{f}^{\top}W\mathbf{f} = \sum_iD_{ii}f_i^2 - \sum_{ij}W_{ij}f_if_j = \tfrac{1}{2}\sum_{ij}W_{ij}(f_i-f_j)^2 \ge 0, $$

so $L\succeq 0$. Equality at $\mathbf{f}=\mathbf{1}$ (and all constants on connected components).
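
A quick numerical confirmation (sketch; W is a random symmetric non-negative weight matrix with zero diagonal):

import numpy as np

rng = np.random.default_rng(0)
W = rng.random((8, 8))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W
print(np.linalg.eigvalsh(L).min() >= -1e-10)        # smallest eigenvalue is (numerically) zero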

Solution to 8.27. Joint of LDA (omitting hyperparameters):

$$ p(\mathbf{w},\mathbf{z},\boldsymbol{\theta},\boldsymbol{\beta}) \propto \prod_d\Bigl[\prod_k\theta_{dk}^{\alpha-1}\Bigr]\prod_k\Bigl[\prod_w\beta_{kw}^{\eta-1}\Bigr]\prod_{d,n}\theta_{d,z_{dn}}\beta_{z_{dn},w_{dn}}. $$

Integrate out $\boldsymbol{\theta}_d$ via Dirichlet-multinomial conjugacy:

$$ \int\prod_k\theta_{dk}^{\alpha-1+n_{dk}}\,d\boldsymbol{\theta}_d \propto \frac{\prod_k\Gamma(\alpha+n_{dk})}{\Gamma(K\alpha+N_d)}, $$

and similarly for $\boldsymbol{\beta}_k$. The conditional $p(z_i=k\mid\mathbf{z}^{(-i)},\mathbf{w})$ is the ratio of the integrated joint with $z_i=k$ to its sum over $k$. Using $\Gamma(x+1)/\Gamma(x)=x$, the result simplifies to the product of two count fractions stated in §8.14.2.
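
The resulting conditional as code (a sketch; ndk, nkw and nk are assumed to be the document-topic, topic-word and topic count arrays with the current token already removed, matching the derivation; V is the vocabulary size):

import numpy as np

def gibbs_conditional(d, w, ndk, nkw, nk, alpha, eta, V):
    # p(z_i = k | z^{-i}, w) is proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)
    p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
    return p / p.sum()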

Solution to 8.28. For points sampled from a uniform density on a small ball, the local reachability density $\mathrm{lrd}_k$ converges (as $n\to\infty$) to a constant proportional to that density. Hence the ratio $\mathrm{lrd}_k(\mathbf{y})/\mathrm{lrd}_k(\mathbf{x})$ converges to 1 in probability for $\mathbf{y}$ a neighbour of $\mathbf{x}$, and the average over neighbours is also 1. Therefore $\mathrm{LOF}_k(\mathbf{x})\to 1$.
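
A quick simulation of the claim (sketch; uniform points on the unit square, with the neighbourhood size grown as $\sqrt{n}$ so the local density estimates concentrate; scikit-learn reports the negated LOF):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
for n in [200, 2000, 20000]:
    X = rng.random((n, 2))                                  # locally uniform density
    lof = LocalOutlierFactor(n_neighbors=int(np.sqrt(n))).fit(X)
    scores = -lof.negative_outlier_factor_
    print(n, round(scores.mean(), 3))                       # mean LOF settles near 1 (boundary points nudge it up slightly)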

Solution to 8.32. Sketch: load MNIST, flatten to $784$ features, fit PCA(n_components=784), plot np.cumsum(pca.explained_variance_ratio_). The 95% threshold is reached at approximately 154 components; the 90% threshold at 87; the 99% threshold at 331. The first two components explain ~17% of variance; visualising the 2D projection shows partial digit clustering (0s and 1s separable; 4/7/9 confused).
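
As code (sketch; fetch_openml("mnist_784") downloads the dataset on first use, and the full-rank fit takes a minute or two):

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
pca = PCA(n_components=784).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
for thresh in (0.90, 0.95, 0.99):
    print(thresh, np.searchsorted(cum, thresh) + 1)         # components needed to reach each threshold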

Solution to 8.36. A runnable NumPy sketch:

import numpy as np
from math import ceil, log2
from collections import namedtuple

rng = np.random.default_rng(0)
Leaf = namedtuple("Leaf", "size")
Node = namedtuple("Node", "j v left right")

def c(n):  # harmonic-number approximation to the average unsuccessful-BST-search path length
    return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n if n > 1 else 0.0

def isolation_forest(X, n_trees=100, sample=256):
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), min(sample, len(X)), replace=False)
        trees.append(build_itree(X[idx], depth=0, max_depth=ceil(log2(sample))))
    return trees

def build_itree(X, depth, max_depth):
    if depth >= max_depth or len(X) <= 1:
        return Leaf(len(X))
    j = rng.integers(X.shape[1])                       # random split feature
    v = rng.uniform(X[:, j].min(), X[:, j].max())      # random split value
    L, R = X[X[:, j] < v], X[X[:, j] >= v]
    return Node(j, v, build_itree(L, depth + 1, max_depth), build_itree(R, depth + 1, max_depth))

def path_length(x, tree, depth=0):
    if isinstance(tree, Leaf):
        return depth + c(tree.size)
    return path_length(x, tree.left if x[tree.j] < tree.v else tree.right, depth + 1)

The score $s(\mathbf{x},n)=2^{-E[h(\mathbf{x})]/c(n)}$ is computed by averaging path lengths across trees. AUROC on planted anomalies typically exceeds 0.95 even at 1% contamination.
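
Scoring and evaluation, continuing the sketch above (path_length and c are the functions defined there; y is assumed to hold the 0/1 planted-anomaly labels; roc_auc_score is scikit-learn's):

import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_scores(X, trees, sample=256):
    # s(x, n) = 2 ** (-E[h(x)] / c(n)); higher means more anomalous
    depths = np.array([[path_length(x, t) for t in trees] for x in X])
    return 2.0 ** (-depths.mean(1) / c(sample))

# scores = anomaly_scores(X, isolation_forest(X))
# print(roc_auc_score(y, scores))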
