Also known as: EM, Baum-Welch (HMM case)
Expectation–Maximisation (EM) is a general algorithm for maximum-likelihood estimation in models with latent (unobserved) variables. For observed data $X$, latent variables $Z$, and parameters $\theta$, the marginal likelihood
$$P(X | \theta) = \sum_Z P(X, Z | \theta)$$
is often intractable to maximise directly. EM instead iteratively tightens and maximises a lower bound on the marginal log-likelihood, the evidence lower bound (ELBO).
E-step: compute the posterior over latents under current parameters $\theta^{(t)}$:
$$Q_Z(Z) = P(Z | X, \theta^{(t)})$$
M-step: maximise the expected complete-data log-likelihood under this posterior:
$$\theta^{(t+1)} = \arg\max_\theta \mathbb{E}_{Z \sim Q_Z}\!\left[\log P(X, Z | \theta)\right]$$
EM monotonically increases the marginal log-likelihood: $\log P(X | \theta^{(t+1)}) \geq \log P(X | \theta^{(t)})$. Under mild regularity conditions it converges to a stationary point (typically a local maximum) rather than the global optimum, so multiple random restarts are commonly used in practice.
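The monotonicity guarantee follows from the standard decomposition of the marginal log-likelihood, valid for any distribution $q$ over $Z$:
$$\log P(X | \theta) = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{P(X, Z | \theta)}{q(Z)}\right]}_{\text{ELBO}(q,\, \theta)} + \mathrm{KL}\!\left(q(Z) \,\|\, P(Z | X, \theta)\right)$$
The E-step sets $q(Z) = P(Z | X, \theta^{(t)})$, driving the KL term to zero so the bound touches the log-likelihood; the M-step then increases the ELBO in $\theta$, which can only increase $\log P(X | \theta)$.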
Canonical applications:
Gaussian mixture model. $K$ Gaussian components, weights $\pi_k$, means $\mu_k$, covariances $\Sigma_k$. Latent $z_n \in \{1, \ldots, K\}$ is the component assignment for data point $x_n$.
- E-step: $\gamma_{nk} = P(z_n = k | x_n) = \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k) / \sum_j \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)$ (responsibilities).
- M-step: $\pi_k = \frac{1}{N} \sum_n \gamma_{nk}$, $\mu_k = \sum_n \gamma_{nk} x_n / \sum_n \gamma_{nk}$, $\Sigma_k = \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top / \sum_n \gamma_{nk}$.
K-means is a limiting case: fix equal weights and spherical covariances $\Sigma_k = \sigma^2 I$, then let $\sigma \to 0$, so the soft responsibilities collapse to hard assignments.
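A minimal NumPy sketch of EM for a Gaussian mixture, following the updates above (function and variable names here are illustrative, not taken from any particular library):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to data X (N x D) by EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise: uniform weights, K random data points as means, identity covariances.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = P(z_n = k | x_n), computed in log space.
        log_w = np.stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
            for k in range(K)
        ], axis=1)
        log_w -= log_w.max(axis=1, keepdims=True)   # stabilise before exponentiating
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates of weights, means, covariances.
        Nk = gamma.sum(axis=0)                      # effective counts per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)            # ridge to avoid singular covariances

    return pi, mu, Sigma
```

Working with log-densities in the E-step avoids numerical underflow when responsibilities are tiny, and the small ridge added to each covariance guards against a component collapsing onto a single point.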
Hidden Markov Model: E-step uses the forward-backward algorithm to compute posterior probabilities of latent states; M-step updates transition and emission distributions. The combined procedure is the Baum-Welch algorithm.
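As a sketch, the E-step's posterior state marginals for a discrete-emission HMM can be computed with a scaled forward-backward pass like the one below (names and the scaling scheme are assumptions; the full M-step also needs the pairwise marginals $\xi_t(i, j)$, omitted here):

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """E-step of Baum-Welch: gamma[t, i] = P(z_t = i | obs).

    obs: length-T sequence of observation indices
    A:   (S, S) transitions, A[i, j] = P(z_{t+1} = j | z_t = i)
    B:   (S, M) emissions,   B[i, o] = P(x_t = o | z_t = i)
    pi:  (S,) initial state distribution
    """
    T, S = len(obs), A.shape[0]
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    c = np.zeros(T)                       # per-step scaling factors (avoid underflow)

    # Forward pass: alpha[t, i] proportional to P(x_1..x_t, z_t = i).
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass: beta[t, i] proportional to P(x_{t+1}..x_T | z_t = i).
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                  # posterior state marginals
    gamma /= gamma.sum(axis=1, keepdims=True)
    log_likelihood = np.log(c).sum()
    return gamma, log_likelihood
```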
Latent Dirichlet Allocation: variational EM, where the E-step uses variational inference to approximate the intractable posterior over topics.
Variational EM (used when E-step is intractable): replace the exact posterior $P(Z | X, \theta)$ with a variational approximation $q(Z)$, jointly optimising $q$ and $\theta$ to maximise the ELBO. The VAE is a special case: an amortised variational EM with $q$ parameterised by an encoder network.
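Written as an objective, with $\phi$ the encoder (variational) parameters, the quantity being maximised is the familiar ELBO:
$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{q_\phi(Z | X)}\!\left[\log P(X | Z, \theta)\right] - \mathrm{KL}\!\left(q_\phi(Z | X) \,\|\, P(Z)\right) \leq \log P(X | \theta)$$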
Stochastic EM processes mini-batches and updates incrementally; useful for very large datasets.
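One common formulation (an online-EM-style update; exact recipes vary across references) interpolates the expected sufficient statistics computed on each mini-batch $\mathcal{B}_t$ with a decaying step size $\rho_t$, then recomputes the M-step from the running average:
$$\hat{s}^{(t+1)} = (1 - \rho_t)\,\hat{s}^{(t)} + \rho_t\,\bar{s}\!\left(\mathcal{B}_t, \theta^{(t)}\right)$$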
Related terms: Maximum Likelihood Estimation, KL Divergence, Variational Autoencoder, Gaussian Mixture Model, Hidden Markov Model
Discussed in:
- Chapter 5: Statistics