8.2 Density estimation
The most direct version of the unsupervised question is to estimate the data-generating density itself: given a sample $\mathcal{D}=\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(n)}\}$ drawn from an unknown $p_{\text{data}}(\mathbf{x})$ on $\mathbb{R}^d$, produce a function $\hat p(\mathbf{x})$ that we can evaluate at any new point and, ideally, sample from. Once we have $\hat p$, almost every other unsupervised question (clustering, anomaly detection, generation, even the Bayes-optimal classifier $p(y\mid\mathbf{x})\propto p(\mathbf{x}\mid y)\,p(y)$) follows by post-processing.
Parametric versus non-parametric estimators
The two ends of the spectrum trade flexibility against statistical efficiency. A parametric estimator commits in advance to a family $\{p_\theta : \theta\in\Theta\}$ and fits the parameters by maximum likelihood. The simplest example is a single multivariate Gaussian: estimate $\hat\mu = \frac{1}{n}\sum_i\mathbf{x}^{(i)}$ and $\hat\Sigma = \frac{1}{n}\sum_i(\mathbf{x}^{(i)}-\hat\mu)(\mathbf{x}^{(i)}-\hat\mu)^\top$. With $n$ in the thousands and $d$ in the dozens, the parameter estimates are excellent, but they are only as good as the assumption that the data really is Gaussian. Real distributions are skewed, heavy-tailed, multimodal, supported on manifolds, or all of the above, and a misspecified parametric model fails confidently.
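As a concrete baseline, here is a minimal sketch of the parametric route in NumPy; `fit_gaussian_mle` and `gaussian_log_density` are illustrative names, not a library API.

```python
import numpy as np

def fit_gaussian_mle(X):
    """Maximum-likelihood fit of a multivariate Gaussian to the rows of X."""
    mu = X.mean(axis=0)
    centred = X - mu
    Sigma = centred.T @ centred / len(X)   # 1/n (the MLE), not 1/(n-1)
    return mu, Sigma

def gaussian_log_density(X, mu, Sigma):
    """log N(x; mu, Sigma) for each row of X, via a Cholesky factorisation."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)                 # Sigma = L @ L.T
    z = np.linalg.solve(L, (X - mu).T)            # whitened residuals
    log_det = 2.0 * np.log(np.diag(L)).sum()      # log |Sigma|
    quad = (z ** 2).sum(axis=0)                   # squared Mahalanobis distances
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + quad)
```

Misspecification shows up directly in this framework: evaluate the average log-density on held-out data and compare it against a more flexible estimator.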
A non-parametric estimator makes no such commitment. The two classical recipes are:
- Histograms. Partition $\mathbb{R}^d$ into bins of width $h$ and estimate $\hat p(\mathbf{x}) = (n h^d)^{-1}\,\#\{i : \mathbf{x}^{(i)}\text{ in same bin as }\mathbf{x}\}$. Simple, interpretable, but the discontinuities at bin boundaries are an artefact of the binning, and the curse of dimensionality bites hard: for $d=10$ and ten bins per dimension, the number of cells is $10^{10}$ and almost all of them are empty.
- Kernel density estimation (Rosenblatt, 1956; Parzen, 1962). Smooth the histogram by placing a small kernel $K_h$ at every data point and averaging:
$$\hat p(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n K_h(\mathbf{x} - \mathbf{x}^{(i)}),\qquad K_h(\mathbf{u}) = h^{-d}K(\mathbf{u}/h).$$
A standard Gaussian kernel $K(\mathbf{u}) = (2\pi)^{-d/2}\exp(-\tfrac{1}{2}\|\mathbf{u}\|^2)$ produces a smooth, continuous estimate. The bandwidth $h$ controls the bias–variance trade-off: small $h$ gives a spiky estimate that overfits the sample, large $h$ smooths real structure away. Silverman's rule of thumb, $h \approx 1.06\,\hat\sigma\, n^{-1/5}$ for univariate Gaussian-shaped data, is a defensible starting point; cross-validated maximum likelihood is the principled choice when the sample is large enough to afford it.
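As a concrete illustration, a minimal one-dimensional Gaussian-kernel KDE with the Silverman bandwidth might look as follows; `kde_gaussian` and `silverman_bandwidth` are illustrative names, and this is a sketch rather than a production estimator.

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth for 1-d data: h = 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * x.std(ddof=1) * len(x) ** (-0.2)

def kde_gaussian(query, sample, h):
    """Evaluate p_hat(x) = (1/n) sum_i K_h(x - x_i) at each query point (1-d)."""
    u = (query[:, None] - sample[None, :]) / h        # (n_query, n_sample)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)    # standard Gaussian kernel
    return K.mean(axis=1) / h                         # K_h(u) = K(u/h) / h

# Usage: a bimodal sample that a single Gaussian would misrepresent.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
grid = np.linspace(-5, 5, 200)
p_hat = kde_gaussian(grid, sample, silverman_bandwidth(sample))
```

The double broadcast builds an $n_{\text{query}}\times n$ matrix, so evaluation costs $O(n_{\text{query}}\,n)$ time and memory; tree-based or binned FFT evaluations are the standard remedies at scale.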
The curse of dimensionality
KDE works well in one or two dimensions and degrades rapidly above five or six. The reason is geometric: in high dimensions, almost every pair of sample points is roughly the same Euclidean distance apart, so the kernel around a query point either covers essentially every sample point (the estimate flattens toward uniform) or none of them (the estimate is a sum of isolated spikes). Quantitatively, the optimal mean integrated squared error of KDE scales as $n^{-4/(4+d)}$; for $d=10$, matching the accuracy obtained from $n$ samples in one dimension requires $m$ samples with $m^{-4/14} = n^{-4/5}$, i.e. $m = n^{14/5}$, which is $n^{9/5}$ times as many. This is why density estimation in high dimensions almost always proceeds through structural assumptions: mixtures of Gaussians (§8.5), low-dimensional manifolds (§§8.10–8.13), or autoregressive and flow-based neural densities (Chapter 14).
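The concentration of pairwise distances is easy to verify numerically. The following sketch (assuming SciPy is available) draws standard-Gaussian samples and prints the relative spread of pairwise distances, which shrinks as $d$ grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (1, 2, 10, 100, 1000):
    X = rng.standard_normal((500, d))   # 500 samples from a standard Gaussian
    dist = pdist(X)                     # all pairwise Euclidean distances
    print(f"d={d:5d}  relative spread = {dist.std() / dist.mean():.3f}")
```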
Density estimation as an organising principle
Density estimation is the spine of the chapter even when it is not the named goal. Clustering with a Gaussian mixture is density estimation, with the latent component label $z$ providing the cluster assignment as a by-product. Probabilistic PCA is a Gaussian density model whose covariance is low-rank plus isotropic noise. LDA is a generative density on bag-of-words documents. Anomaly detection scores points by $-\log\hat p(\mathbf{x})$, as sketched below. Generative modelling, the topic of Chapter 14, simply takes the density-estimation problem and demands that the model also be cheap to sample from. The recurring theme is that almost every method in this chapter can be read either as a structural restriction on a density estimator or as a discriminative procedure that happens to share its assumptions with one.
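For instance, an anomaly detector obtained by post-processing the hypothetical KDE helpers sketched above might look like this; the names and the threshold are illustrative.

```python
import numpy as np

# Reuses the hypothetical kde_gaussian / silverman_bandwidth helpers from above.
def anomaly_scores(query, sample, h):
    """Score points by -log p_hat(x): rare regions score high."""
    p_hat = kde_gaussian(query, sample, h)
    return -np.log(p_hat + 1e-300)   # guard against log(0) far from the data

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, 1000)       # "normal" behaviour
queries = np.array([0.1, -0.4, 6.0])      # 6.0 lies far outside the data
scores = anomaly_scores(queries, sample, silverman_bandwidth(sample))
# scores[2] dwarfs the others: the KDE assigns 6.0 a vanishingly small density.
```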