Exercises
Conceptual
14.1. Distinguish explicit-density from implicit-density generative models. Place the following in the correct category: PixelCNN, RealNVP, GAN, VAE, DDPM, energy-based model.
14.2. Why is the marginal $p_\theta(x) = \int p_\theta(x \mid z)p(z)\,dz$ intractable for a VAE? What goes wrong if you try to estimate it by Monte Carlo with $z \sim p(z)$?
14.3. State the change-of-variables formula in $d$ dimensions. Why does a coupling-layer architecture make the Jacobian determinant tractable?
14.4. A GAN's optimal discriminator has the form $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_g(x))$. What does $D^*$ predict when $p_{\text{data}} = p_g$ everywhere? What does this imply about the generator's loss at equilibrium?
14.5. Explain why DDPM training is more stable than GAN training. Name two specific failure modes of GANs that diffusion models do not suffer from.
14.6. What is posterior collapse in a VAE? Why is it more likely with a powerful autoregressive decoder?
14.7. State the closed-form expression for $q(x_t \mid x_0)$ in DDPM. Why does this matter for training?
14.8. Classifier-free guidance amplifies the gap between conditional and unconditional noise predictions. What does the guidance scale $w$ trade off?
14.9. Why does latent diffusion achieve a $\sim 48\times$ compute reduction relative to pixel-space diffusion? Where does the cost go?
14.10. Explain the Inception Score (IS) and the Fréchet Inception Distance (FID). Give one situation in which each can mislead.
Derivations
14.11. Derive the ELBO for a latent-variable model from $\log p(x) = \log \int p(x, z) \, dz$ by introducing a variational distribution $q(z \mid x)$ and applying Jensen's inequality.
14.12. Derive the closed-form KL divergence between two diagonal Gaussians $\mathcal{N}(\mu_1, \sigma_1^2 I)$ and $\mathcal{N}(\mu_2, \sigma_2^2 I)$ in $d$ dimensions.
14.13. Show that the optimal discriminator of a GAN is $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_g(x))$. Derive the equivalence between the GAN value at $D^*$ and twice the JS divergence (minus a constant).
14.14. Prove the closed-form forward marginal of DDPM: $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1 - \bar\alpha_t)I)$.
14.15. Derive the form of $\mu_\theta(x_t, t)$ in terms of a noise-prediction network $\epsilon_\theta(x_t, t)$, starting from the closed-form true posterior $q(x_{t-1} \mid x_t, x_0)$.
14.16. Show that the score function of the DDPM forward marginal is $\nabla_{x_t} \log q(x_t \mid x_0) = -(x_t - \sqrt{\bar\alpha_t}x_0)/(1 - \bar\alpha_t)$.
14.17. A coupling layer applies $y_a = x_a$, $y_b = x_b \odot \exp(s(x_a)) + t(x_a)$. Derive the form of the Jacobian and confirm that its log-determinant is $\sum_i s_i(x_a)$.
14.18. Derive the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance, and explain why a 1-Lipschitz constraint on the critic is necessary.
14.19. Show that for two identical-variance diagonal Gaussians $\mathcal{N}(\mu_1, \sigma^2 I)$ and $\mathcal{N}(\mu_2, \sigma^2 I)$, the KL divergence reduces to $\|\mu_1 - \mu_2\|^2/(2\sigma^2)$.
14.20. Starting from $V(D^*, G) = -2\log 2 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$, find the global minimum of the GAN value as a function of $G$. What is $V(D^*, G^*)$?
Programming
14.21. Implement and train the VAE of §14.4 on MNIST. Plot the loss curve, the reconstruction of test images at epochs 1, 5, 10, and a $10\times 10$ grid of samples drawn from $\mathcal{N}(0, I)$ and decoded.
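A possible starting point for 14.21, assuming PyTorch; the hidden width 400 and latent size 20 here are placeholder choices, not necessarily those of §14.4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=784, d_h=400, d_z=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())
        self.fc_mu = nn.Linear(d_h, d_z)
        self.fc_logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def loss_fn(x, logits, mu, logvar):
    # Negative ELBO: Bernoulli reconstruction term plus analytic KL to N(0, I)
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```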
14.22. Modify your VAE to a β-VAE with $\beta \in \{0.1, 1, 5, 20\}$. Plot reconstruction error and KL term separately. What happens to sample diversity at $\beta = 20$?
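For 14.22, the only change is the weight on the KL term; a sketch reusing the pieces above, returning the two terms separately so they can be plotted as asked:

```python
def beta_loss_fn(x, logits, mu, logvar, beta):
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl, rec.detach(), kl.detach()
```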
14.23. Implement the toy GAN of §14.5 on MNIST. Tune Adam's $\beta_1$: try $\beta_1 = 0.5$ and $\beta_1 = 0.9$, and compare training stability.
14.24. Implement WGAN-GP. Replace the discriminator's BCE with the Wasserstein critic loss and add the gradient penalty term. Compare FID against the original GAN after 50 epochs.
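A sketch of the gradient-penalty term for 14.24, assuming PyTorch image batches and a critic `D`; the coefficient $\lambda = 10$ follows Gulrajani et al. (2017):

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # Random interpolation between real and generated samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    grads, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    # Penalise deviation of the per-sample gradient norm from 1
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```

The critic loss is then `D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)`, with no sigmoid on the critic's output.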
14.25. Implement a RealNVP coupling-layer flow on the two-moons toy distribution. Plot the learned density.
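A minimal affine coupling layer for 14.25, implementing exactly the transform of 14.17 (2-D input for two-moons; the hidden width is arbitrary):

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    # y_a = x_a,  y_b = x_b * exp(s(x_a)) + t(x_a);  log|det J| = sum_i s_i(x_a)
    def __init__(self, d=2, d_h=64, flip=False):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(nn.Linear(d // 2, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d))  # outputs s and t stacked

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        if self.flip:
            xa, xb = xb, xa
        s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(s) + t
        out = torch.cat([yb, xa] if self.flip else [xa, yb], dim=1)
        return out, s.sum(dim=1)  # per-sample log-determinant
```

Stack several of these with alternating `flip` so every coordinate gets transformed, and train by maximising $\log \mathcal{N}(y; 0, I)$ plus the accumulated log-determinants.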
14.26. Train the from-scratch DDPM of §14.16 on MNIST for 20 epochs. Generate 64 samples. Compare with samples produced by DDIM with 50 steps.
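For 14.26, the training step is just the noise-prediction objective, using the closed-form marginal of 14.14; a sketch assuming a network `eps_model(x_t, t)` and a precomputed $\bar\alpha$ schedule:

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    # alpha_bar: (T,) tensor of cumulative products of the alpha_t schedule
    t = torch.randint(0, alpha_bar.size(0), (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # sample q(x_t | x_0) directly
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```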
14.27. Add classifier-free guidance to the DDPM. Train conditioned on the digit class with 10% probability of dropping the label. Generate samples for class 7 at guidance scales $w \in \{0, 1, 3, 7\}$.
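For 14.27, one common form of the guided prediction, in the convention where $w = 0$ recovers the plain conditional model; assumes `eps_model` accepts an optional label `y`:

```python
def guided_eps(eps_model, x_t, t, y, w):
    # Amplify the conditional-unconditional gap by the guidance scale w
    eps_uncond = eps_model(x_t, t, y=None)  # label dropped, as during training
    eps_cond = eps_model(x_t, t, y=y)
    return eps_uncond + (1 + w) * (eps_cond - eps_uncond)
```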
14.28. Implement DDIM sampling on a model trained as a DDPM (no retraining required). Compare sample quality at $S \in \{1000, 100, 50, 20, 10\}$ steps.
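The deterministic DDIM update for 14.28 ($\eta = 0$), written as a sketch that reuses the DDPM-trained `eps_model` and `alpha_bar` from above:

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alpha_bar):
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device, dtype=torch.long)
    eps = eps_model(x_t, t_batch)
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # implied x_0
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```

Iterating this over a decreasing subsequence of $S$ timesteps gives the $S$-step sampler.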
14.29. On CIFAR-10, train the DDPM with a deeper U-Net (4 resolution levels, channels $[64, 128, 256, 512]$). Compute FID against the test set.
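For the FID in 14.29 (and 14.24), the distance itself is closed-form once you have Inception-feature means and covariances for the two sample sets; a sketch assuming those statistics are already computed:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    # Frechet distance between Gaussians fit to Inception features
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```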
14.30. Replace the VAE's MLP encoder/decoder with convolutional networks. Train on CIFAR-10. Compare reconstructions and samples to the MLP version.
Open-ended
14.31. Read Song et al. (2021). Reformulate the DDPM you trained in 14.26 as a discretisation of the variance-preserving SDE. Verify the equivalence empirically by comparing samples.
14.32. Implement consistency-model distillation of your trained DDPM. Show one-step sampling.
14.33. Choose one application domain (audio, molecules, video, text) and survey how each of the six families in §14.2 has been applied. Which dominates and why?