Also known as: GMM
A Gaussian mixture model (GMM) represents a distribution as a weighted sum of $K$ multivariate Gaussian components:
$$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)$$
with mixing weights $\pi_k \geq 0$, $\sum_k \pi_k = 1$, component means $\mu_k \in \mathbb{R}^d$, and covariances $\Sigma_k \in \mathbb{R}^{d \times d}$ (positive definite).
Generative interpretation: each data point is generated by first sampling a component $z \sim \mathrm{Categorical}(\pi)$, then $x \sim \mathcal{N}(\mu_z, \Sigma_z)$. The latent $z$ is the cluster assignment.
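A minimal NumPy sketch of this two-stage sampling process (the component parameters below are illustrative, not taken from any dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component, 2-D mixture
pi = np.array([0.6, 0.4])                      # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])        # component means
Sigma = np.array([[[1.0, 0.2], [0.2, 1.0]],    # component covariances
                  [[0.5, 0.0], [0.0, 0.5]]])

N = 500
z = rng.choice(len(pi), size=N, p=pi)          # z ~ Categorical(pi)
x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # x ~ N(mu_z, Sigma_z)
```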
Maximum likelihood via EM:
E-step: compute posterior responsibilities
$$\gamma_{nk} = P(z_n = k | x_n) = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}$$
M-step: weighted maximum likelihood
$$N_k = \sum_n \gamma_{nk}$$ $$\pi_k = N_k / N$$ $$\mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n$$ $$\Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$$
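A compact NumPy/SciPy sketch of both steps, assuming full covariances and a small ridge `reg` added to each $\Sigma_k$ so it stays positive definite (function name and defaults are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K, n_iter=100, reg=1e-6, seed=0):
    """Minimal EM for a full-covariance GMM (a sketch, not production code)."""
    rng = np.random.default_rng(seed)
    N, d = x.shape
    mu = x[rng.choice(N, size=K, replace=False)]        # random data points as initial means
    Sigma = np.tile(np.cov(x, rowvar=False) + reg * np.eye(d), (K, 1, 1))
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = P(z_n = k | x_n)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(x, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ x) / Nk[:, None]
        for k in range(K):
            diff = x - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pi, mu, Sigma, gamma
```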
Practical considerations:
- Initialisation: EM only finds a local optimum, so run it from several random starts or k-means++ seeds and keep the fit with the best log-likelihood.
- Singularity: if a component's covariance collapses onto a single data point ($\det \Sigma_k \to 0$), the likelihood diverges to infinity. Mitigated by regularising the covariance ($\Sigma_k + \lambda I$; see the sketch after this list) or using a Bayesian prior.
- Choosing $K$: BIC, AIC, cross-validated log-likelihood, or Dirichlet-process variants (DPGMM) that adapt the number of components.
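One way to put the last two points into practice is scikit-learn's `GaussianMixture`, where `reg_covar` adds the $\lambda I$ ridge and BIC selects $K$; a sketch with an illustrative helper name and defaults:

```python
from sklearn.mixture import GaussianMixture

def select_k_by_bic(x, k_max=10, reg_covar=1e-6, seed=0):
    """Fit GMMs for K = 1..k_max and return the K with the lowest BIC."""
    scores = {}
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              reg_covar=reg_covar, n_init=5,
                              random_state=seed).fit(x)
        scores[k] = gmm.bic(x)
    best_k = min(scores, key=scores.get)
    return best_k, scores
```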
Variants:
- Diagonal-covariance GMM: $\Sigma_k = \mathrm{diag}(\sigma_k^2)$. Faster, fewer parameters; assumes feature independence within each cluster.
- Tied covariance: shared $\Sigma$ across components.
- Spherical covariance $\Sigma_k = \sigma_k^2 I$: with a shared, fixed variance $\sigma^2 \to 0$, the soft assignments harden and EM recovers k-means.
- Variational GMM: full Bayesian treatment with Dirichlet prior on $\pi$ and Normal-Wishart prior on $(\mu, \Sigma)$, fit by variational inference.
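In scikit-learn, the first three variants correspond to the `covariance_type` argument and the variational/DP version to `BayesianGaussianMixture`; a brief sketch with placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

x = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

# Covariance structure trades flexibility against parameter count:
# "full" > "tied" > "diag" > "spherical".
diag_gmm = GaussianMixture(n_components=5, covariance_type="diag").fit(x)

# Variational GMM with a Dirichlet-process prior: superfluous components
# receive near-zero weight, so n_components acts only as an upper bound.
dpgmm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process"
).fit(x)
```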
Modern uses:
- Speech recognition (legacy HMM-GMM systems, displaced by deep neural networks ~2012).
- Anomaly detection: flag points with low GMM likelihood (see the sketch after this list).
- Voice/speaker modelling, image segmentation, biological gene-expression clustering.
- Variational latent space modelling: GMM priors in VAEs encourage clustering structure in the latent space.
- Deep Mahalanobis OOD detection: GMM in the feature space of a trained network distinguishes in-distribution from out-of-distribution inputs.
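For the anomaly-detection use above, a sketch assuming scikit-learn: fit on data taken to be normal, then flag test points whose log-likelihood falls below a low percentile of the training scores (data and threshold are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, 2))                      # placeholder "normal" data
x_test = np.vstack([rng.normal(size=(50, 2)),             # in-distribution points
                    rng.normal(loc=8.0, size=(5, 2))])    # obvious outliers

gmm = GaussianMixture(n_components=5, covariance_type="full").fit(x_train)
threshold = np.percentile(gmm.score_samples(x_train), 1)  # 1st-percentile cutoff
is_anomaly = gmm.score_samples(x_test) < threshold        # True for low-likelihood points
```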
Related terms: Expectation–Maximisation, K-Means
Discussed in:
- Chapter 8: Unsupervised Learning