8.5 Gaussian mixture models and EM
A Gaussian mixture model (GMM) takes seriously the idea that a dataset may be generated by several distinct populations rather than one. We picture each data point as drawn from one of $K$ underlying Gaussian distributions, but we never get to see which Gaussian generated which point. The Gaussians may differ in their centres, in their spreads, and in how often they fire. The job of fitting the model is to recover all of those parameters at once and, as a by-product, to attach to every point a probability of belonging to each cluster. That is what we mean by soft probabilistic clustering: instead of declaring that point $i$ is in cluster $k$ and nowhere else, we declare that it is in cluster $k$ with probability $\gamma_{ik}$, and these probabilities sum to one across clusters.
The fitting procedure is the Expectation–Maximisation (EM) algorithm. EM does not maximise the likelihood directly; it alternates between guessing cluster memberships from the current parameters (the E-step) and re-estimating parameters using those soft memberships as weights (the M-step). The two steps interlock and each iteration provably weakly increases the log-likelihood. EM is a general schema for problems with hidden variables, and the GMM is its most pedagogical instance.
Where hierarchical clustering of §8.4 built a tree by greedy merges with no probabilistic model and no notion of cluster shape, this section gives a fully generative story: the data comes from a mixture, the latent labels are unobserved indicators, and the parameters are inferred by maximum likelihood. K-means turns out to be a degenerate special case: what k-means does with hard, equal-sized circles, GMMs do with soft, unequal-sized ellipses.
This generative framing carries three concrete advantages over the methods of §8.3 and §8.4. First, because the model assigns a probability density to every point in the input space, it can serve double duty as a density estimator: we can ask "how likely is this new observation under the current model?", a question hierarchical clustering and k-means cannot answer. Second, because each component has its own covariance matrix, the model can capture anisotropic, rotated cluster shapes: long thin populations, populations stretched along an oblique axis, populations of very different scales. Third, because EM is a special case of variational inference, the GMM is a natural stepping-stone to the wider machinery of probabilistic latent-variable models. Variational autoencoders, latent Dirichlet allocation and hidden Markov models all use the same alternation between an inference step over latents and an update step over parameters.
The model
A Gaussian mixture says that each observation $\mathbf{x}\in\mathbb{R}^d$ has the marginal density
$$ p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k), $$
where the mixture weights satisfy $\pi_k\ge 0$ and $\sum_{k=1}^{K}\pi_k=1$. The weights are themselves a categorical distribution over which Gaussian fired. Each mixture component is parameterised by a mean vector $\boldsymbol{\mu}_k\in\mathbb{R}^d$ and a positive-definite covariance matrix $\boldsymbol{\Sigma}_k\in\mathbb{R}^{d\times d}$. The full parameter set is $\boldsymbol{\theta}=\{\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\}_{k=1}^{K}$.
It is helpful to think of the model as a two-stage generative procedure. First, draw a latent indicator $z\sim\mathrm{Categorical}(\boldsymbol{\pi})$, choosing component $k$ with probability $\pi_k$. Second, given $z=k$, draw $\mathbf{x}\sim\mathcal{N}(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$. The latent variable $z$ is the "true" cluster label. We never observe $z$; we observe only $\mathbf{x}$. Marginalising out $z$ recovers the mixture density above.
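The two-stage procedure is short enough to write down directly. The sketch below draws points from a small two-dimensional mixture; the particular weights, means and covariances are illustrative values chosen for the example, not parameters taken from any dataset.

```python
# A minimal sketch of the generative procedure: draw z, then x given z.
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.3, 0.2])                         # mixture weights, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # component means
Sigmas = np.array([np.eye(2),                          # component covariances
                   0.5 * np.eye(2),
                   np.diag([2.0, 0.5])])

def sample_gmm(n):
    """Draw n points: first the latent label z, then x | z."""
    z = rng.choice(len(pi), size=n, p=pi)              # z ~ Categorical(pi)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)    # in practice only X is observed; Z stays hidden
```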
The marginal density is a weighted sum of bell curves. In one dimension it can take many shapes (unimodal, bimodal, skewed, heavy-tailed) depending on how the means, variances and weights are arranged. In higher dimensions each component is an ellipsoid whose orientation, axis lengths and centre are encoded in $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$. This is why a GMM can fit datasets that k-means, with its implicit isotropic spheres, struggles with: ellipsoids can rotate and stretch.
The likelihood of the observed dataset $\mathcal{D}=\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(n)}\}$ under this model is
$$ \mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k), $$
and the log-likelihood is
$$ \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k). $$
The presence of the inner sum inside the logarithm, the log-of-sum, is the technical obstacle. In a single Gaussian, taking the log puts everything into a clean quadratic form and the maximum-likelihood estimates fall out by inspection. For a mixture, the log cannot be pushed past the sum. There is no closed-form solution, and direct gradient ascent on $\ell$ is brittle because of the $K!$ symmetric local optima created by relabelling components. The EM algorithm sidesteps this difficulty by introducing the latent variables explicitly and optimising over $\boldsymbol{\theta}$ and a distribution over $z$ jointly.
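Although no closed-form maximiser exists, evaluating $\ell$ itself is routine provided the inner sum is taken in log space to avoid underflow. A minimal sketch, assuming NumPy and SciPy, with the parameter arrays laid out as in the sampling sketch above:

```python
# Log-likelihood of a GMM via the log-sum-exp trick.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    # log_joint[i, k] = log pi_k + log N(x_i; mu_k, Sigma_k)
    log_joint = np.stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))
    ], axis=1)
    # log p(x_i) = logsumexp over k; summing over i gives ell(theta)
    return logsumexp(log_joint, axis=1).sum()
```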
A few useful identifiability remarks. The mixture is invariant under permutations of the component indices, so the likelihood landscape always has at least $K!$ equivalent global optima. For generic problems we therefore post-process to a canonical ordering, e.g. by sorting components by $\pi_k$ or by $\|\boldsymbol{\mu}_k\|$. Mixtures are also non-identifiable in pathological cases (for instance, if $\pi_k=0$ for some $k$, the corresponding mean and covariance are unconstrained), but for any sensible initialisation these edge cases do not bite.
EM algorithm
The EM algorithm fits a GMM by alternating two steps until the log-likelihood converges. Both steps have closed-form formulas, which is what makes EM such a satisfying procedure to implement and to teach.
E-step. Holding the current parameters $\boldsymbol{\theta}^{(t)}$ fixed, compute, for every data point $i$ and every component $k$, the responsibility
$$ \gamma_{ik} = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}. $$
The numerator is the joint probability that the latent label is $k$ and that we then drew $\mathbf{x}_i$. The denominator is the marginal probability of $\mathbf{x}_i$. Their ratio is the posterior probability $p(z_i=k\mid\mathbf{x}_i;\boldsymbol{\theta}^{(t)})$, the model's current guess at how much "credit" cluster $k$ deserves for having generated point $i$. By construction $\sum_{k=1}^{K}\gamma_{ik}=1$ for every $i$.
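In code the E-step is a row-wise normalisation of the joint probabilities. A minimal sketch, reusing the array layout above (for badly scaled problems one would work in log space, as in the likelihood sketch):

```python
# E-step: responsibilities gamma[i, k] = p(z_i = k | x_i) under current parameters.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    n, K = X.shape[0], len(pi)
    joint = np.empty((n, K))
    for k in range(K):
        joint[:, k] = pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
    gamma = joint / joint.sum(axis=1, keepdims=True)   # each row sums to 1
    return gamma
```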
M-step. Treating the responsibilities as soft labels (fractional indicators of class membership), re-estimate the parameters of each Gaussian as a weighted maximum-likelihood fit. Define the effective count
$$ N_k = \sum_{i=1}^{n}\gamma_{ik}. $$
Then update
$$ \boldsymbol{\mu}_k \;=\; \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}\,\mathbf{x}_i, $$ $$ \boldsymbol{\Sigma}_k \;=\; \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^{\top}, $$ $$ \pi_k \;=\; \frac{N_k}{n}. $$
Each formula is the weighted analogue of an unweighted maximum-likelihood estimate: $\boldsymbol{\mu}_k$ is a weighted average of the data, $\boldsymbol{\Sigma}_k$ is a weighted sample covariance about that mean, and $\pi_k$ is the fraction of total responsibility assigned to component $k$.
Repeat the two steps until $\ell(\boldsymbol{\theta}^{(t+1)}) - \ell(\boldsymbol{\theta}^{(t)})$ falls below a tolerance such as $10^{-6}$, or until a maximum number of iterations is reached.
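The M-step and the outer loop complete the algorithm. The sketch below reuses the e_step and gmm_log_likelihood helpers from the earlier sketches and adds a small ridge to each covariance, a safeguard discussed at the end of this subsection:

```python
# M-step and EM loop for a GMM (sketch; assumes e_step and gmm_log_likelihood above).
import numpy as np

def m_step(X, gamma, ridge=1e-6):
    n, d = X.shape
    N_k = gamma.sum(axis=0)                            # effective counts
    mus = (gamma.T @ X) / N_k[:, None]                 # weighted means
    Sigmas = np.empty((len(N_k), d, d))
    for k in range(len(N_k)):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
        Sigmas[k] += ridge * np.eye(d)                 # guards against collapse
    pi = N_k / n                                       # weights = responsibility share
    return pi, mus, Sigmas

def fit_gmm(X, pi, mus, Sigmas, tol=1e-6, max_iter=200):
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, pi, mus, Sigmas)             # E-step
        pi, mus, Sigmas = m_step(X, gamma)             # M-step
        ll = gmm_log_likelihood(X, pi, mus, Sigmas)
        if ll - prev_ll < tol:                         # monotone increase, so safe
            break
        prev_ll = ll
    return pi, mus, Sigmas, gamma
```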
Why does it work? Introduce, for each $i$, an arbitrary distribution $q_i(k)$ over cluster labels. Jensen's inequality applied to the concave logarithm gives
$$ \log p(\mathbf{x}_i) = \log\sum_{k}q_i(k)\,\frac{\pi_k\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{q_i(k)} \;\ge\; \sum_{k}q_i(k)\log\frac{\pi_k\mathcal{N}(\mathbf{x}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{q_i(k)}. $$
The right-hand side, summed over $i$, is the evidence lower bound, which we write $\mathcal{F}(q,\boldsymbol{\theta})$ to distinguish it from the likelihood $\mathcal{L}(\boldsymbol{\theta})$ defined earlier. A short calculation shows
$$ \ell(\boldsymbol{\theta}) = \mathcal{F}(q,\boldsymbol{\theta}) + \sum_{i}\mathrm{KL}\!\left(q_i \,\Big\|\, p(z_i\mid\mathbf{x}_i;\boldsymbol{\theta})\right), $$
so the lower bound becomes tight precisely when $q_i$ is set to the posterior, exactly the responsibility formula. That is what the E-step does. The M-step then maximises the (now tight) lower bound over $\boldsymbol{\theta}$ with $q$ held fixed; because each Gaussian's contribution decouples, the maximisation is closed-form.
The combined effect is coordinate ascent on the ELBO, and because $\mathcal{F}\le\ell$, every iteration of EM weakly increases the log-likelihood:
$$ \ell(\boldsymbol{\theta}^{(t+1)}) \;\ge\; \mathcal{F}(q^{(t+1)},\boldsymbol{\theta}^{(t+1)}) \;\ge\; \mathcal{F}(q^{(t+1)},\boldsymbol{\theta}^{(t)}) \;=\; \ell(\boldsymbol{\theta}^{(t)}). $$
Convergence is to a stationary point, typically a local optimum, and is monotonic. EM does not guarantee the global optimum, so multiple restarts from different initialisations are standard practice. Initialisation by k-means++ centres, with covariances seeded as the global sample covariance and weights set to $1/K$, is a reliable default.
A few practical cautions follow from the formulas. The covariance update can collapse: if a single point $\mathbf{x}_i$ ends up with $\gamma_{ik}\approx 1$ and all other points have $\gamma_{jk}\approx 0$, then $\boldsymbol{\Sigma}_k\to\mathbf{0}$ and the likelihood diverges to $+\infty$. The standard cure is to add a small ridge $\lambda\mathbf{I}$ to every covariance after each update, which acts like a weak inverse-Wishart prior on each covariance. For high-dimensional problems one further restricts the covariance form (diagonal, spherical, or tied, meaning shared across components) to keep the parameter count manageable.
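Library implementations expose the same safeguards as options. The snippet below uses scikit-learn's GaussianMixture; the parameter names belong to that API, and the particular values are illustrative rather than recommendations:

```python
# Restarts, covariance regularisation and restricted covariance forms in scikit-learn.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=5,
    covariance_type="diag",    # "full", "tied", "diag" or "spherical"
    reg_covar=1e-6,            # ridge added to every covariance diagonal
    n_init=10,                 # multiple restarts; the best fit is kept
    init_params="kmeans",      # seed the means with k-means centres
    random_state=0,
)
# gmm.fit(X) then exposes gmm.means_, gmm.covariances_, gmm.weights_,
# and gmm.predict_proba(X) returns the responsibilities.
```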
Worked example
Consider three points in one dimension: $x_1=-2$, $x_2=0$, $x_3=2$. We fit a GMM with $K=2$ components, starting from $\pi_1=\pi_2=0.5$, $\mu_1=-1$, $\mu_2=+1$, and $\sigma_1=\sigma_2=1$.
E-step (iteration 1). The Gaussian density at $x$ with mean $\mu$ and unit variance is $\phi(x-\mu)=(2\pi)^{-1/2}\exp\!\bigl(-(x-\mu)^2/2\bigr)$. Because both weights and variances are equal, the responsibility for component 1 simplifies to
$$ \gamma_{i1} = \frac{\exp\bigl(-(x_i-\mu_1)^2/2\bigr)}{\exp\bigl(-(x_i-\mu_1)^2/2\bigr)+\exp\bigl(-(x_i-\mu_2)^2/2\bigr)}. $$
For $x_1=-2$: the squared distances are $(-2-(-1))^2=1$ and $(-2-1)^2=9$, so $\gamma_{11}=e^{-0.5}/(e^{-0.5}+e^{-4.5})=0.6065/(0.6065+0.0111)\approx 0.9820$ and $\gamma_{12}\approx 0.0180$.
For $x_2=0$: the squared distances are $1$ and $1$, so $\gamma_{21}=\gamma_{22}=0.5$ exactly.
For $x_3=+2$: by symmetry, $\gamma_{31}\approx 0.0180$ and $\gamma_{32}\approx 0.9820$.
M-step (iteration 1). Compute the effective counts:
$N_1 = 0.9820 + 0.5 + 0.0180 = 1.500$, $N_2 = 0.0180 + 0.5 + 0.9820 = 1.500$.
Update the means as weighted averages:
$\mu_1 = \bigl(0.9820\cdot(-2) + 0.5\cdot 0 + 0.0180\cdot 2\bigr)/1.500 = (-1.964 + 0 + 0.036)/1.500 = -1.928/1.500 \approx -1.285$.
$\mu_2 = \bigl(0.0180\cdot(-2) + 0.5\cdot 0 + 0.9820\cdot 2\bigr)/1.500 = (-0.036 + 0 + 1.964)/1.500 \approx +1.285$.
The mixture weights remain $\pi_1=\pi_2=N_k/n=1.5/3=0.5$. The variances also update; for brevity we hold them at $\sigma=1$, which is qualitatively close to the result.
Subsequent iterations. The means have moved outwards. On the next E-step, $x_1=-2$ is now relatively even closer to $\mu_1$, sharpening its responsibility further; the same on the right. The middle point $x_2=0$ remains exactly equidistant, so its responsibility stays at $0.5$ for both components by symmetry. That symmetry also caps how far the means can travel: each mean is the weighted average $\mu_1=\bigl(\gamma_{11}\cdot(-2)+0.5\cdot 0+\gamma_{31}\cdot 2\bigr)/1.5$, and the half-unit of responsibility that $x_2$ always contributes pulls it towards the origin, so $|\mu_1|$ can never exceed $4/3$. Iterating with $\sigma=1$, the means move from $\pm 1.285$ to about $\pm 1.318$ and settle at roughly $\pm 1.32$ within a handful of iterations, with the middle point split equally between the clusters.
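The fixed point is easy to confirm numerically. The short script below replays the example with the variances held at 1, as in the text, and prints the converged means and the middle point's responsibilities:

```python
# Worked example, re-run numerically (variances fixed at 1 for the illustration).
import numpy as np

x = np.array([-2.0, 0.0, 2.0])
mu = np.array([-1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(15):
    # E-step: responsibilities under unit-variance Gaussians
    joint = pi * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    gamma = joint / joint.sum(axis=1, keepdims=True)
    # M-step: weighted means and weights (sigma held at 1)
    N = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / N
    pi = N / len(x)

print(mu)        # approximately [-1.32, +1.32]
print(gamma[1])  # the middle point keeps [0.5, 0.5]
```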
This is a useful illustration of two GMM features. First, the responsibility for $x_2$ never collapses to a hard label; it cannot, because $x_2$ is genuinely equidistant from the two cluster centres and the model retains that uncertainty rather than discarding it. K-means, by contrast, would assign $x_2$ arbitrarily to one cluster on every iteration depending on a tiebreak. Second, EM converges quickly here because the data is well-separated and the geometry is simple; on real data with overlapping clusters and high dimensionality, dozens of iterations may be needed.
It is also instructive to track the log-likelihood across iterations. At initialisation the joint density assigns moderate probability to every point through both components; after one EM step the means have separated, and the per-point likelihoods $p(x_i)$ rise (because $x_1$ now sits much closer to the improved $\mu_1$, and similarly for $x_3$). The log-likelihood therefore increases monotonically, as the theory promises, until the means stabilise and the gain per iteration drops below the convergence tolerance. If we had initialised at $\mu_1=\mu_2=0$, EM would never escape: by symmetry, every responsibility would stay at $0.5$ and the means would never split. This is a manifestation of the local-optima problem in miniature, and it is why initialisation by k-means++ rather than at a single point is the practical default.
Connection to k-means
K-means is GMM in a particular limit. Suppose every component shares the same isotropic covariance $\boldsymbol{\Sigma}_k=\sigma^2\mathbf{I}$ and the weights are equal $\pi_k=1/K$. The responsibility formula becomes
$$ \gamma_{ik} = \frac{\exp\bigl(-\|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2/(2\sigma^2)\bigr)}{\sum_{j}\exp\bigl(-\|\mathbf{x}_i-\boldsymbol{\mu}_j\|^2/(2\sigma^2)\bigr)} \;=\; \mathrm{softmax}_{k}\!\bigl(-\tfrac{1}{2\sigma^2}\|\mathbf{x}_i-\boldsymbol{\mu}_k\|^2\bigr). $$
As $\sigma^2\to 0$, the softmax temperature goes to zero and the responsibility becomes a one-hot indicator: $\gamma_{ik}=1$ if $k=\arg\min_{j}\|\mathbf{x}_i-\boldsymbol{\mu}_j\|^2$ and zero otherwise. The E-step becomes the k-means assignment step. The M-step's weighted mean reduces to the unweighted mean over points assigned to cluster $k$, which is the k-means centroid update. K-means is therefore hard EM for an isotropic Gaussian mixture with vanishing variance and equal weights.
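The limit is easy to see numerically: as the shared variance shrinks, the softmax responsibilities of a point nearer to one centre harden into a one-hot assignment. A small illustration with made-up numbers:

```python
# Responsibilities under equal weights and shared isotropic variance sigma^2.
import numpy as np
from scipy.special import softmax

x = np.array([1.0, 0.0])
mus = np.array([[0.0, 0.0], [2.5, 0.0]])   # x is nearer the first centre
sq_dists = ((x - mus) ** 2).sum(axis=1)    # [1.0, 2.25]

for var in [10.0, 1.0, 0.1, 0.01]:
    gamma = softmax(-sq_dists / (2 * var))
    print(var, gamma)   # tends to [1, 0] as var -> 0: the k-means assignment
```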
This view explains several of the empirical differences. K-means partitions space into Voronoi cells with flat boundaries, implicitly treating every cluster as an isotropic sphere, because it cannot represent anisotropy or rotation; the GMM can. K-means makes hard assignments and so cannot express genuine ambiguity at cluster boundaries; the GMM can. K-means tends to balance cluster sizes (because cells partition space) whereas the GMM, with its $\pi_k$ parameter, is happy with very unequal sizes. K-means optimises a sum of squared distances; the GMM optimises log-likelihood under a probabilistic model. Each property follows from the limit.
There is a practical consequence too. Initialising EM for a GMM by running k-means first and using its centroids as starting means almost always converges faster and more reliably than initialising at random. The two algorithms live on a continuum, and the cheap one is a good warm-start for the expensive one.
Choosing K
Choosing the number of components is a model-selection problem, and one of the more delicate ones in unsupervised learning. The training log-likelihood alone cannot tell us: adding components always increases $\ell$, ultimately to the singularity of one Gaussian per point with $\sigma\to 0$ and $\ell\to+\infty$. We need a criterion that trades off fit against complexity.
The standard penalised-likelihood criterion is the Bayesian information criterion
$$ \mathrm{BIC} = -2\ell(\hat{\boldsymbol{\theta}}) + p_K\log n, $$
where $p_K$ is the number of free parameters. For a $d$-dimensional GMM with full covariances, $p_K = K-1 + Kd + Kd(d+1)/2$ (weights, means, and the upper triangles of the covariances, minus one for the constraint $\sum_k\pi_k=1$). Diagonal or tied covariances reduce $p_K$ substantially. The smaller the BIC, the better. The Akaike information criterion (AIC) replaces $p_K\log n$ with $2p_K$ and so penalises complexity less aggressively; for clustering it tends to pick larger $K$ than BIC. In practice one fits the GMM for a range of $K$, typically $K=1$ up to $K=20$, and picks the elbow or minimum of the BIC curve.
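Scanning $K$ against the BIC is a short loop in practice; scikit-learn's GaussianMixture exposes a bic() method that implements the same $-2\ell + p_K\log n$ criterion. A sketch, with the range of $K$ and the number of restarts as choices to tune:

```python
# Model selection by BIC over a range of component counts.
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_max=20, seed=0):
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=5, random_state=seed).fit(X)
        bics.append(gmm.bic(X))             # smaller is better
    best_k = int(np.argmin(bics)) + 1       # +1 because k starts at 1
    return best_k, bics
```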
Cross-validation on held-out log-likelihood is an alternative that avoids the asymptotic assumptions behind BIC but requires several refits and so is more compute-intensive. The integrated completed likelihood (ICL) augments BIC with an entropy term that penalises models whose responsibilities are diffuse; this often produces more interpretable clusters when the goal is partitioning rather than density estimation.
A more principled non-parametric alternative is the Dirichlet process Gaussian mixture, which places a stick-breaking prior on an infinite mixture and lets the data determine the effective $K$. The variational Bayesian implementation (e.g. scikit-learn's BayesianGaussianMixture with weight_concentration_prior_type="dirichlet_process") instantiates a large $K_{\max}$ and prunes components whose posterior weight collapses, effectively learning $K$ by Occam's razor in a single fit. The concentration hyperparameter trades off the prior preference for fewer or more clusters.
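A minimal sketch of that usage, with the upper bound on $K$ and the concentration value chosen purely for illustration:

```python
# Dirichlet-process mixture: instantiate K_max components, let the weights prune.
from sklearn.mixture import BayesianGaussianMixture

bgmm = BayesianGaussianMixture(
    n_components=20,                                   # K_max, an upper bound
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,                    # smaller favours fewer clusters
    max_iter=500,
    random_state=0,
)
# After bgmm.fit(X), the effective K is the number of components whose
# entries in bgmm.weights_ are not negligibly small.
```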
Where GMMs are used
- Speech and audio. Voice activity detection separates speech frames from non-speech by modelling each as a Gaussian mixture in cepstral feature space; the likelihood ratio is then thresholded against a calibrated decision boundary. Phoneme acoustic models in classical hidden-Markov-model speech recognisers used GMMs at every state to model the emission density, with tens of mixture components per state and tied covariances across phonetic context. GMMs also underlie speaker verification systems via universal background models and i-vector pipelines.
- Background subtraction in video. The pixel-wise GMM model of Stauffer and Grimson fits a mixture per pixel over time; each new frame is classified as background or foreground by likelihood under the learned mixture, with adaptive update rules that let the background model track gradual lighting changes while still flagging moving objects.
- Anomaly detection. Once a GMM has been fitted to in-distribution data, the marginal log-density gives a principled novelty score: the low-density tails are unlikely under any component, and points falling there are flagged for review. The probabilistic framing yields calibrated thresholds rather than ad-hoc distances, and per-component responsibilities can identify which mode a near-anomaly is closest to; a minimal code sketch of this recipe follows the list.
- Soft clustering in genuinely overlapping data. When clusters genuinely overlap (gene-expression profiles across cell types, customer-segment behaviour, sensor data with operating-mode mixtures), the soft assignments produced by a GMM convey real information that hard k-means labels would discard. Downstream tasks can use the responsibilities as features.
- Density estimation. Beyond clustering, GMMs are flexible density models, able to approximate any smooth density given enough components, and are used as building blocks inside larger probabilistic systems, including normalising flow priors, variational autoencoder priors, and importance-sampling proposals in particle filters.
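As promised in the anomaly-detection item above, here is a minimal sketch of that recipe with scikit-learn; the component count and the one-percent threshold quantile are illustrative choices, not prescriptions:

```python
# Novelty scoring: threshold the per-point log-density of a fitted GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_novelty_scorer(X_train, n_components=5, quantile=0.01, seed=0):
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          n_init=5, random_state=seed).fit(X_train)
    # Calibrate the threshold as a low quantile of the training log-density.
    threshold = np.quantile(gmm.score_samples(X_train), quantile)
    def is_anomaly(X_new):
        return gmm.score_samples(X_new) < threshold    # True where density is low
    return is_anomaly
```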
What you should take away
- A Gaussian mixture model treats each point as a draw from one of $K$ Gaussians, with the cluster label hidden; fitting the model recovers the means, covariances, weights, and a posterior probability $\gamma_{ik}$ that point $i$ belongs to cluster $k$.
- The EM algorithm fits the model by alternating an E-step that computes responsibilities from the current parameters, and an M-step that re-estimates parameters as weighted maximum-likelihood fits using those responsibilities; each iteration weakly increases the log-likelihood.
- EM is coordinate ascent on the evidence lower bound (ELBO); the bound is tight whenever $q_i$ equals the true posterior, which is exactly what the E-step enforces.
- K-means is the hard, isotropic, vanishing-variance limit of a GMM; the GMM softens the assignment and lets clusters be elliptical, rotated and unequally sized, at the cost of more parameters and care over singularities.
- Choose $K$ by BIC, cross-validation, or a Dirichlet-process variant; in practice always restart EM from several initialisations, regularise covariances with a small ridge, and consider tied or diagonal covariance forms in high dimensions.