14.15 Practical considerations
Sample quality versus diversity
Every generative model trades off these two properties. Maximum-likelihood training (autoregressive models, flows, VAEs) tends to favour coverage: the model places mass on all modes, even at the cost of fuzzy detail. Adversarial training (GANs) tends to favour quality: samples are sharp, but mode collapse is endemic. Diffusion models are unusual in striking a good balance between the two. Classifier-free guidance provides a knob that explicitly trades coverage for quality at inference time.
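The classifier-free guidance knob can be sketched in a few lines. This is a minimal illustration, not a full sampler: `eps_cond` and `eps_uncond` stand in for the conditional and unconditional noise predictions of a hypothetical diffusion network, and `w` is the guidance weight (with $w = 0$ recovering the conditional model).

```python
import numpy as np

def guided_noise_estimate(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate away from the unconditional
    prediction toward the conditional one. Larger w sharpens samples
    (quality) at the cost of diversity (coverage)."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy check with placeholder noise predictions.
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
print(guided_noise_estimate(eps_c, eps_u, 2.0))  # -> [3. 6.]
```

At each denoising step the guided estimate replaces the plain conditional prediction; nothing else in the sampler changes, which is why the trade-off can be set at inference time.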
Evaluating generative models
Likelihood is meaningful for explicit-density models, but GANs define no likelihood and diffusion models provide only a variational lower bound on it. The community has largely converged on two metrics:
- Inception Score (IS) (Salimans et al., 2016): exponential of the expected KL divergence between the per-sample class distribution from a pretrained Inception network and the marginal class distribution. High IS means each sample is confidently classified (sharpness) and the marginal is uniform across classes (diversity).
- Fréchet Inception Distance (FID) (Heusel et al., 2017): fits a Gaussian to the Inception feature distribution of real and generated samples, and computes the squared Wasserstein distance between the two Gaussians: $\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. Lower is better. FID has become the default metric for image generation.
Both metrics depend on the choice of pretrained network and have known failure modes (memorisation of training data lowers FID, but the samples are not novel). Recent work uses CLIP features (CLIP-FID) or human studies as alternatives.
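Given feature matrices for real and generated samples, the FID formula above reduces to a few lines of linear algebra. The sketch below assumes features have already been extracted (here they are just random stand-ins); the trace of $(\Sigma_r \Sigma_g)^{1/2}$ is computed from the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative when both factors are covariance matrices.

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between Gaussians fitted to two
    feature sets (rows = samples, columns = features)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sig_r = np.cov(feats_real, rowvar=False)
    sig_g = np.cov(feats_gen, rowvar=False)
    # tr((Σr Σg)^{1/2}) = sum of square roots of the eigenvalues of Σr Σg.
    eigvals = np.linalg.eigvals(sig_r @ sig_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sig_r) + np.trace(sig_g) - 2.0 * tr_sqrt

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))      # stand-in for Inception features
shifted = real + 1.0                   # shift every feature by 1
print(fid(real, real) < 1e-8)          # identical sets -> FID ≈ 0
print(abs(fid(real, shifted) - 8.0) < 1e-6)  # mean shift of 1 in 8 dims -> FID = 8
```

The second check illustrates the formula directly: identical covariances cancel, leaving $\|\mu_r - \mu_g\|^2 = 8$.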
Common failure modes
- Mode collapse (GANs): the generator covers only a few modes of the data distribution. Fixes: WGAN-GP, minibatch discrimination, packed discriminator inputs.
- Posterior collapse (VAEs): the approximate posterior collapses to the prior, so the latent code carries no information about $x$ and the decoder ignores $z$. Fixes: KL annealing, free bits, weakening the decoder (e.g., restricting an autoregressive decoder's receptive field), β-VAE with $\beta < 1$.
- Slow sampling (diffusion): hundreds of network evaluations per sample. Fixes: DDIM, DPM-Solver, distillation (consistency models, progressive distillation), latent diffusion.
- Memorisation: any model with sufficient capacity may memorise training examples. Audit with nearest-neighbour searches; add noise to training data; bound model capacity.
- Bias amplification: generative models reproduce, and often amplify, biases in training data. Crucial for any application involving people.
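To make the slow-sampling fix concrete, here is a minimal numpy sketch of the deterministic DDIM update ($\eta = 0$). All names are illustrative: `eps_model(x, t)` stands in for a trained noise-prediction network, and `alphas_bar` for the cumulative products $\bar{\alpha}_t$ of the noise schedule.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM update. The clean image x0 is first
    estimated from the predicted noise, then re-noised to the earlier
    timestep -- this is what lets DDIM take large jumps."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps_pred

def sample(eps_model, shape, alphas_bar, n_steps, seed=0):
    """Sample with n_steps network evaluations instead of one per
    training timestep, by striding through the schedule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)
    ts = np.linspace(len(alphas_bar) - 1, 0, n_steps + 1).astype(int)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        x = ddim_step(x, eps_model(x, t), alphas_bar[t], alphas_bar[t_prev])
    return x
```

With a 1000-step training schedule, `n_steps=20` gives a 50× speed-up at some cost in sample quality; the distillation methods above push this further, down to one or a few steps.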
Hyperparameter sensitivity
Diffusion models are forgiving: the same hyperparameters work across many datasets. GANs are notoriously brittle; the region of viable learning rates, Adam $\beta_1$ values, batch sizes, and architecture choices is narrow. VAEs are intermediate. When choosing an architecture, factor in the engineering cost of getting it to work, not just the asymptotic sample quality.