Also known as: Earth Mover's Distance, EMD, Kantorovich distance
The Wasserstein-$p$ distance (or Earth Mover's Distance) between probability distributions $\mu, \nu$ on a metric space $(\mathcal{X}, d)$ is
$$W_p(\mu, \nu) = \!\left(\inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}[d(x, y)^p]\right)^{1/p}$$
where $\Gamma(\mu, \nu)$ is the set of joint distributions (couplings) with marginals $\mu$ and $\nu$. Intuitively: the minimum cost of transporting mass from $\mu$ to $\nu$, where transporting one unit from $x$ to $y$ costs $d(x, y)^p$.
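For discrete measures the infimum over couplings is a finite linear program. A minimal sketch (illustrative names, assuming NumPy/SciPy; not from the text):

```python
# Illustrative sketch: for discrete measures the coupling problem is a linear program.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein_lp(x, y, mu, nu, p=1):
    """Exact W_p between mass mu on points x and mass nu on points y."""
    C = cdist(x, y) ** p                      # cost matrix d(x_i, y_j)^p
    n, m = C.shape
    A_eq = []
    for i in range(n):                        # row sums: sum_j gamma[i, j] = mu[i]
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):                        # column sums: sum_i gamma[i, j] = nu[j]
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1 / p)                 # (min E[d^p])^{1/p}

# Half the mass at 0 and 1, versus half the mass at 0.5 and 2.
x = np.array([[0.0], [1.0]]); y = np.array([[0.5], [2.0]])
print(wasserstein_lp(x, y, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.75
```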
For $p = 1$, the Kantorovich–Rubinstein duality gives a tractable dual form:
$$W_1(\mu, \nu) = \sup_{f: \|f\|_L \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)]$$
where the supremum is over 1-Lipschitz functions. This dual formulation is what Wasserstein GAN exploits.
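In one dimension the optimal coupling is the monotone (sorted) matching, so $W_1$ between equal-size empirical samples is just the mean gap between sorted values. A small sketch, assuming SciPy (whose `wasserstein_distance` computes exactly this empirical $W_1$):

```python
# One-dimensional illustration: sorted matching is optimal in 1-D.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=2000)
b = rng.normal(2.0, 1.0, size=2000)

w1_sorted = np.abs(np.sort(a) - np.sort(b)).mean()   # match i-th smallest to i-th smallest
w1_scipy = wasserstein_distance(a, b)                # SciPy's empirical W_1
print(w1_sorted, w1_scipy)                           # both close to 2.0 (the mean shift)
```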
Why use Wasserstein?
The KL divergence $D_\mathrm{KL}(\mu \| \nu)$ has problematic properties for training generative models: it is infinite whenever $\nu$ assigns zero probability to a region where $\mu$ assigns positive probability, and it provides a poor gradient signal when $\mu$ and $\nu$ have disjoint support (common in early GAN training). The Wasserstein distance, by contrast, has several attractive properties (illustrated numerically after the list below):
- Finite even for distributions with disjoint support.
- Continuous in $\mu$ and $\nu$ in many natural senses (under mild conditions $W_p$ metrises weak convergence).
- Provides meaningful gradients even when distributions don't overlap.
- Rooted in optimal transport, which has a rich classical literature.
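A tiny numerical illustration of these points (assumed toy setup, not from the text): all of $\mu$'s mass at $x = 0$, all of $\nu$'s mass at $x = 3$, so the supports are disjoint.

```python
# KL blows up for disjoint supports; W_1 stays finite and tracks the gap.
import numpy as np
from scipy.stats import wasserstein_distance

grid = np.array([0.0, 1.0, 2.0, 3.0])
mu = np.array([1.0, 0.0, 0.0, 0.0])      # point mass at x = 0
nu = np.array([0.0, 0.0, 0.0, 1.0])      # point mass at x = 3

def kl(p, q):
    mask = p > 0
    if np.any(q[mask] == 0):              # mu puts mass where nu has none
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl(mu, nu))                                   # inf
print(wasserstein_distance(grid, grid, mu, nu))     # 3.0, shrinking smoothly as nu slides toward mu
```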
Wasserstein GAN (Arjovsky, Chintala, Bottou 2017):
$$\min_G \max_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\mathrm{data}}[f(x)] - \mathbb{E}_{z \sim p(z)}[f(G(z))]$$
The "discriminator" is replaced by a 1-Lipschitz critic $f$ that estimates the Wasserstein distance. The generator minimises the estimated distance.
Lipschitz constraint enforcement:
- Weight clipping: clip critic weights to $[-c, c]$ after each update. Used in the original WGAN; crude, and it limits critic capacity.
- Gradient penalty (WGAN-GP, Gulrajani et al. 2017): penalise $(\|\nabla f\| - 1)^2$ at random interpolations between real and generated samples (see the sketch after this list). The standard modern formulation.
- Spectral normalisation (Miyato et al. 2018): normalise weight matrices to have spectral norm $\leq 1$. Enforces the Lipschitz property layer by layer, hence a bounded Lipschitz constant overall.
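A sketch of the gradient-penalty term mentioned above (illustrative names; in the usual formulation it is added to the critic loss with a weight, commonly $\lambda = 10$):

```python
import torch

def gradient_penalty(critic, real, fake):
    # Random interpolation points between real and generated samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()     # penalise deviation from ||grad f|| = 1
```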
Stability: WGAN trains far more stably than the original GAN, and the loss is interpretable: it estimates the Wasserstein distance between the data and generator distributions and decreases as samples improve during training. It is not a panacea, though; diffusion models have largely displaced GANs of every variant for high-quality image generation.
Sinkhorn distance: entropy-regularised Wasserstein distance computed by the Sinkhorn algorithm, at roughly $O(n^2)$ cost per iteration rather than the $O(n^3)$ of the exact linear program. Used for differentiable optimal transport in modern deep learning.
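A compact sketch of the Sinkhorn iterations for discrete measures $\mu, \nu$ with cost matrix $C$ (illustrative names and toy data):

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, reg=0.05, n_iters=1000):
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):                 # alternate rescaling to match the marginals
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    gamma = u[:, None] * K * v[None, :]      # regularised (approximately optimal) coupling
    return np.sum(gamma * C)

# Toy example: half the mass at 0 and 1 versus half at 0.5 and 2, cost |x - y|.
# The regularised cost approaches the exact value 0.75 as reg -> 0.
x = np.array([0.0, 1.0]); y = np.array([0.5, 2.0])
C = np.abs(x[:, None] - y[None, :])
print(sinkhorn_cost(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C))
```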
Applications beyond GAN:
- Domain adaptation: align source and target feature distributions.
- Optimal transport for clustering and matching.
- Word Mover's Distance (Kusner et al. 2015) for text similarity.
- Wasserstein autoencoders (Tolstikhin et al. 2018).
Related terms: KL Divergence, Generative Adversarial Network
Discussed in:
- Chapter 14: Generative Models