Also known as: EM, Baum-Welch (HMM case)
Expectation–Maximisation (EM) is a general algorithm for maximum-likelihood estimation in models with latent (unobserved) variables. For observed data $X$, latent variables $Z$, and parameters $\theta$, the marginal likelihood
$$P(X | \theta) = \sum_Z P(X, Z | \theta)$$
is often intractable to maximise directly. EM instead iteratively tightens and maximises a lower bound on the marginal log-likelihood, the evidence lower bound (ELBO).
E-step: compute the posterior over latents under current parameters $\theta^{(t)}$:
$$Q_Z(Z) = P(Z | X, \theta^{(t)})$$
M-step: maximise the expected complete-data log-likelihood under this posterior:
$$\theta^{(t+1)} = \arg\max_\theta \mathbb{E}_{Z \sim Q_Z}\!\left[\log P(X, Z | \theta)\right]$$
EM monotonically increases the marginal log-likelihood: $\log P(X | \theta^{(t+1)}) \geq \log P(X | \theta^{(t)})$. Under mild regularity conditions it converges to a stationary point (typically a local maximum) rather than the global optimum, so multiple random restarts are commonly used in practice.
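The monotonicity guarantee follows from the standard decomposition of the marginal log-likelihood, valid for any distribution $q$ over $Z$:
$$\log P(X | \theta) = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{P(X, Z | \theta)}{q(Z)}\right]}_{\text{ELBO}(q,\, \theta)} + \mathrm{KL}\!\left(q(Z) \,\|\, P(Z | X, \theta)\right)$$
The E-step sets $q(Z) = P(Z | X, \theta^{(t)})$, driving the KL term to zero so the bound touches the log-likelihood; the M-step then increases the ELBO in $\theta$, which can only increase $\log P(X | \theta)$.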
Canonical applications:
Gaussian mixture model. $K$ Gaussian components, weights $\pi_k$, means $\mu_k$, covariances $\Sigma_k$. Latent $z_n \in \{1, \ldots, K\}$ is the component assignment for data point $x_n$.
- E-step: $\gamma_{nk} = P(z_n = k | x_n) = \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k) / \sum_j \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)$ (responsibilities).
- M-step: $\pi_k = \frac{1}{N} \sum_n \gamma_{nk}$, $\mu_k = \sum_n \gamma_{nk} x_n / \sum_n \gamma_{nk}$, $\Sigma_k = \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top / \sum_n \gamma_{nk}$.
K-means is a limiting case: fix equal weights and spherical covariances $\Sigma_k = \sigma^2 I$, then let $\sigma \to 0$, so the soft responsibilities collapse to hard assignments.
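A minimal NumPy sketch of EM for a Gaussian mixture, following the updates above (function and variable names here are illustrative, not taken from any particular library):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to data X (N x D) by EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise: uniform weights, K random data points as means, identity covariances.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = P(z_n = k | x_n), computed in log space.
        log_w = np.stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
            for k in range(K)
        ], axis=1)
        log_w -= log_w.max(axis=1, keepdims=True)   # stabilise before exponentiating
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates of weights, means, covariances.
        Nk = gamma.sum(axis=0)                      # effective counts per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)            # ridge to avoid singular covariances

    return pi, mu, Sigma
```

Working with log-densities in the E-step avoids numerical underflow when responsibilities are tiny, and the small ridge added to each covariance guards against a component collapsing onto a single point.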
Hidden Markov Model: E-step uses the forward-backward algorithm to compute posterior probabilities of latent states; M-step updates transition and emission distributions. The combined procedure is the Baum-Welch algorithm.
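As a sketch, the E-step's posterior state marginals for a discrete-emission HMM can be computed with a scaled forward-backward pass like the one below (names and the scaling scheme are assumptions; the full M-step also needs the pairwise marginals $\xi_t(i, j)$, omitted here):

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """E-step of Baum-Welch: gamma[t, i] = P(z_t = i | obs).

    obs: length-T sequence of observation indices
    A:   (S, S) transitions, A[i, j] = P(z_{t+1} = j | z_t = i)
    B:   (S, M) emissions,   B[i, o] = P(x_t = o | z_t = i)
    pi:  (S,) initial state distribution
    """
    T, S = len(obs), A.shape[0]
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    c = np.zeros(T)                       # per-step scaling factors (avoid underflow)

    # Forward pass: alpha[t, i] proportional to P(x_1..x_t, z_t = i).
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass: beta[t, i] proportional to P(x_{t+1}..x_T | z_t = i).
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                  # posterior state marginals
    gamma /= gamma.sum(axis=1, keepdims=True)
    log_likelihood = np.log(c).sum()
    return gamma, log_likelihood
```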
Latent Dirichlet Allocation: variational EM, where the E-step uses variational inference to approximate the intractable posterior over topics.
Variational EM (used when E-step is intractable): replace the exact posterior $P(Z | X, \theta)$ with a variational approximation $q(Z)$, jointly optimising $q$ and $\theta$ to maximise the ELBO. The VAE is a special case: an amortised variational EM with $q$ parameterised by an encoder network.
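Written as an objective, with $\phi$ the encoder (variational) parameters, the quantity being maximised is the familiar ELBO:
$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{q_\phi(Z | X)}\!\left[\log P(X | Z, \theta)\right] - \mathrm{KL}\!\left(q_\phi(Z | X) \,\|\, P(Z)\right) \leq \log P(X | \theta)$$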
Stochastic EM processes mini-batches and updates incrementally; useful for very large datasets.
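One common formulation (an online-EM-style update; exact recipes vary across references) interpolates the expected sufficient statistics computed on each mini-batch $\mathcal{B}_t$ with a decaying step size $\rho_t$, then recomputes the M-step from the running average:
$$\hat{s}^{(t+1)} = (1 - \rho_t)\,\hat{s}^{(t)} + \rho_t\,\bar{s}\!\left(\mathcal{B}_t, \theta^{(t)}\right)$$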
Related terms: Maximum Likelihood Estimation, KL Divergence, Variational Autoencoder, Gaussian Mixture Model, Hidden Markov Model
Discussed in:
- Chapter 5: Statistics