Also known as: Earth Mover's Distance, EMD, Kantorovich distance
The Wasserstein-$p$ distance (or Earth Mover's Distance) between probability distributions $\mu, \nu$ on a metric space $(\mathcal{X}, d)$ is
$$W_p(\mu, \nu) = \!\left(\inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}[d(x, y)^p]\right)^{1/p}$$
where $\Gamma(\mu, \nu)$ is the set of joint distributions (couplings) with marginals $\mu$ and $\nu$. Intuitively: the minimum cost of transporting mass from $\mu$ to $\nu$, where transporting one unit from $x$ to $y$ costs $d(x, y)^p$.
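For discrete measures the infimum over couplings is a finite linear program. A minimal sketch (illustrative names, assuming NumPy/SciPy; not from the text):

```python
# Illustrative sketch: for discrete measures the coupling problem is a linear program.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein_lp(x, y, mu, nu, p=1):
    """Exact W_p between mass mu on points x and mass nu on points y."""
    C = cdist(x, y) ** p                      # cost matrix d(x_i, y_j)^p
    n, m = C.shape
    A_eq = []
    for i in range(n):                        # row sums: sum_j gamma[i, j] = mu[i]
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):                        # column sums: sum_i gamma[i, j] = nu[j]
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1 / p)                 # (min E[d^p])^{1/p}

# Half the mass at 0 and 1, versus half the mass at 0.5 and 2.
x = np.array([[0.0], [1.0]]); y = np.array([[0.5], [2.0]])
print(wasserstein_lp(x, y, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.75
```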
For $p = 1$, the Kantorovich–Rubinstein duality gives a tractable dual form:
$$W_1(\mu, \nu) = \sup_{f: \|f\|_L \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)]$$
where the supremum is over 1-Lipschitz functions. This dual formulation is what Wasserstein GAN exploits.
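In one dimension the optimal coupling is the monotone (sorted) matching, so $W_1$ between equal-size empirical samples is just the mean gap between sorted values. A small sketch, assuming SciPy (whose `wasserstein_distance` computes exactly this empirical $W_1$):

```python
# One-dimensional illustration: sorted matching is optimal in 1-D.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=2000)
b = rng.normal(2.0, 1.0, size=2000)

w1_sorted = np.abs(np.sort(a) - np.sort(b)).mean()   # match i-th smallest to i-th smallest
w1_scipy = wasserstein_distance(a, b)                # SciPy's empirical W_1
print(w1_sorted, w1_scipy)                           # both close to 2.0 (the mean shift)
```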
Why use Wasserstein?
The KL divergence $D_\mathrm{KL}(\mu \| \nu)$ has problematic properties for training generative models: it is infinite whenever $\nu$ assigns zero probability to a region where $\mu$ assigns positive probability, and it provides a poor gradient signal when $\mu$ and $\nu$ have disjoint support (common in early GAN training). The Wasserstein distance, by contrast, has several attractive properties (illustrated numerically after the list below):
- Finite even for distributions with disjoint support.
- Continuous in $\mu$ and $\nu$ in many natural senses (under mild conditions $W_p$ metrises weak convergence).
- Provides meaningful gradients even when distributions don't overlap.
- Rooted in optimal transport, which has a rich classical literature.
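A tiny numerical illustration of these points (assumed toy setup, not from the text): all of $\mu$'s mass at $x = 0$, all of $\nu$'s mass at $x = 3$, so the supports are disjoint.

```python
# KL blows up for disjoint supports; W_1 stays finite and tracks the gap.
import numpy as np
from scipy.stats import wasserstein_distance

grid = np.array([0.0, 1.0, 2.0, 3.0])
mu = np.array([1.0, 0.0, 0.0, 0.0])      # point mass at x = 0
nu = np.array([0.0, 0.0, 0.0, 1.0])      # point mass at x = 3

def kl(p, q):
    mask = p > 0
    if np.any(q[mask] == 0):              # mu puts mass where nu has none
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl(mu, nu))                                   # inf
print(wasserstein_distance(grid, grid, mu, nu))     # 3.0, shrinking smoothly as nu slides toward mu
```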
Wasserstein GAN (Arjovsky, Chintala, Bottou 2017):
$$\min_G \max_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\mathrm{data}}[f(x)] - \mathbb{E}_{z \sim p(z)}[f(G(z))]$$
The "discriminator" is replaced by a 1-Lipschitz critic $f$ that estimates the Wasserstein distance. The generator minimises the estimated distance.
Lipschitz constraint enforcement:
- Weight clipping: clip critic weights to $[-c, c]$ after each update. Used in the original WGAN; crude, and it limits critic capacity.
- Gradient penalty (WGAN-GP, Gulrajani et al. 2017): penalise $(\|\nabla f\| - 1)^2$ at random interpolations between real and generated samples (see the sketch after this list). The standard modern formulation.
- Spectral normalisation (Miyato et al. 2018): normalise weight matrices to have spectral norm $\leq 1$. Enforces the Lipschitz property layer by layer, hence a bounded Lipschitz constant overall.
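A sketch of the gradient-penalty term mentioned above (illustrative names; in the usual formulation it is added to the critic loss with a weight, commonly $\lambda = 10$):

```python
import torch

def gradient_penalty(critic, real, fake):
    # Random interpolation points between real and generated samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()     # penalise deviation from ||grad f|| = 1
```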
Stability: WGAN trains far more stably than the original GAN, and the loss is interpretable: it estimates the Wasserstein distance between the data and generator distributions and decreases as samples improve during training. It is not a panacea, though; diffusion models have largely displaced GANs of every variant for high-quality image generation.
Sinkhorn distance: entropy-regularised Wasserstein distance computed by the Sinkhorn algorithm, at roughly $O(n^2)$ cost per iteration rather than the $O(n^3)$ of the exact linear program. Used for differentiable optimal transport in modern deep learning.
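A compact sketch of the Sinkhorn iterations for discrete measures $\mu, \nu$ with cost matrix $C$ (illustrative names and toy data):

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, reg=0.05, n_iters=1000):
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):                 # alternate rescaling to match the marginals
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    gamma = u[:, None] * K * v[None, :]      # regularised (approximately optimal) coupling
    return np.sum(gamma * C)

# Toy example: half the mass at 0 and 1 versus half at 0.5 and 2, cost |x - y|.
# The regularised cost approaches the exact value 0.75 as reg -> 0.
x = np.array([0.0, 1.0]); y = np.array([0.5, 2.0])
C = np.abs(x[:, None] - y[None, :])
print(sinkhorn_cost(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C))
```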
Applications beyond GAN:
- Domain adaptation: align source and target feature distributions.
- Optimal transport for clustering and matching.
- Word Mover's Distance (Kusner et al. 2015) for text similarity.
- Wasserstein autoencoders (Tolstikhin et al. 2018).
Related terms: KL Divergence, Generative Adversarial Network
Discussed in:
- Chapter 14: Generative Models