Solutions to selected exercises

14.1. Explicit-density: PixelCNN (tractable), RealNVP (tractable), VAE (variational). Implicit-density: GAN. Energy-based models are explicit but unnormalised. DDPM is variational/score-based, sometimes classed as variational explicit (since it has a likelihood lower bound), sometimes as score-based.

14.4. When $p_{\text{data}} = p_g$ everywhere, $D^*(x) = 1/2$ for all $x$. The discriminator is at chance, the generator receives no gradient signal, and the game has converged. The value is $V(D^*, G^*) = -2\log 2$.

14.7. $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1 - \bar\alpha_t)I)$. This matters because we can sample $x_t$ in one step instead of simulating the entire forward chain $x_0 \to x_1 \to \cdots \to x_t$. Without it, training would cost $O(T)$ per gradient step instead of $O(1)$.
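A quick empirical check of the $O(1)$ claim (a sketch; the linear $\beta$ schedule and toy one-dimensional data below are assumptions, not from the text): drawing $x_t$ directly from the closed form reproduces the statistics of simulating the full chain.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alphas, alpha_bar = 1 - betas, torch.cumprod(1 - betas, dim=0)

x0 = torch.full((100_000,), 2.0)                 # toy 1-D "data"
t = 500

# O(1): draw x_t directly from the closed form
xt_direct = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(x0)

# O(T): simulate the forward chain step by step
xt_chain = x0.clone()
for s in range(t + 1):
    xt_chain = alphas[s].sqrt() * xt_chain + betas[s].sqrt() * torch.randn_like(xt_chain)

# Both should match N(2*sqrt(abar_t), 1 - abar_t)
print(xt_direct.mean().item(), xt_chain.mean().item())
print(xt_direct.var().item(), xt_chain.var().item())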

14.11. Start from $\log p(x) = \log \int p(x, z)\, dz = \log \int q(z \mid x)\,\frac{p(x, z)}{q(z \mid x)}\,dz = \log \mathbb{E}_{q}[p(x, z)/q(z \mid x)]$. By Jensen's inequality applied to the concave $\log$, $\log \mathbb{E}_q[\cdot] \geq \mathbb{E}_q[\log \cdot]$, giving $\log p(x) \geq \mathbb{E}_q[\log p(x, z) - \log q(z \mid x)]$, the ELBO. Equality holds iff $q(z \mid x) = p(z \mid x)$.
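A toy numerical illustration of the bound (a hypothetical model, chosen so the evidence is known exactly): with $z \sim \mathcal{N}(0,1)$ and $x \mid z \sim \mathcal{N}(z,1)$, the marginal is $p(x) = \mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$, so the ELBO should be tight for that $q$ and strictly smaller for any other.

import torch
torch.manual_seed(0)

x = torch.tensor(1.5)                            # an observed data point
log_px = torch.distributions.Normal(0.0, 2**0.5).log_prob(x)  # exact log-evidence

def elbo(q_mu, q_sigma, n=200_000):
    q = torch.distributions.Normal(q_mu, q_sigma)
    z = q.sample((n,))
    log_joint = (torch.distributions.Normal(0.0, 1.0).log_prob(z)
                 + torch.distributions.Normal(z, 1.0).log_prob(x))
    return (log_joint - q.log_prob(z)).mean()

print(log_px.item())                             # exact value
print(elbo(x / 2, 0.5**0.5).item())              # q = true posterior: bound is tight
print(elbo(torch.tensor(0.0), 1.0).item())       # mismatched q: strictly lower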

14.12. Both Gaussians have diagonal covariance, so the integral factorises across dimensions. For a single dimension with Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$: $$\mathrm{KL} = \int q \log q - \int q \log p$$ $$= -\frac{1}{2}\log(2\pi\sigma_1^2) - \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma_2^2) + \mathbb{E}_q\!\left[\frac{(z - \mu_2)^2}{2\sigma_2^2}\right]$$ The expectation evaluates to $(\sigma_1^2 + (\mu_1 - \mu_2)^2)/(2\sigma_2^2)$. Combine: $\mathrm{KL} = \log(\sigma_2/\sigma_1) + (\sigma_1^2 + (\mu_1 - \mu_2)^2)/(2\sigma_2^2) - 1/2$. Sum over $d$ dimensions for the multivariate diagonal case.
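A sanity check of the closed form against torch.distributions (a sketch with arbitrary random parameters):

import torch
from torch.distributions import Normal, kl_divergence

mu1, s1 = torch.randn(4), torch.rand(4) + 0.5
mu2, s2 = torch.randn(4), torch.rand(4) + 0.5

# Closed form, summed over the d = 4 independent dimensions
closed = (torch.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5).sum()
library = kl_divergence(Normal(mu1, s1), Normal(mu2, s2)).sum()
print(torch.allclose(closed, library))           # True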

14.13. With a fixed $G$ implying generator distribution $p_g$, the integrand of $V(D, G)$ at a point $x$ is $p_{\text{data}}(x)\log D(x) + p_g(x)\log(1 - D(x))$. Pointwise maximisation in $D$: differentiate and set to zero, giving $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_g(x))$. Substituting: $$V(D^*, G) = \mathbb{E}_{p_d}[\log p_d/(p_d + p_g)] + \mathbb{E}_{p_g}[\log p_g/(p_d + p_g)]$$ Add and subtract $\log 2$: each expectation becomes $-\log 2 + \mathbb{E}_{p_\bullet}[\log(2p_\bullet/(p_d + p_g))] = -\log 2 + \mathrm{KL}(p_\bullet \| (p_d + p_g)/2)$. Sum: $V(D^*, G) = -2\log 2 + 2\mathrm{JSD}(p_d \| p_g)$. Minimum at $p_g = p_d$, value $-2\log 2$.
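The identity is easy to verify numerically on a pair of discrete distributions (the values below are arbitrary):

import torch

pd = torch.tensor([0.5, 0.3, 0.2])
pg = torch.tensor([0.2, 0.3, 0.5])
m = (pd + pg) / 2

kl = lambda p, q: (p * (p / q).log()).sum()
jsd = 0.5 * kl(pd, m) + 0.5 * kl(pg, m)

d_star = pd / (pd + pg)                          # optimal discriminator
v = (pd * d_star.log()).sum() + (pg * (1 - d_star).log()).sum()

print(torch.allclose(v, -2 * torch.tensor(2.0).log() + 2 * jsd))  # True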

14.14. By induction. For $t = 1$: $q(x_1 \mid x_0) = \mathcal{N}(\sqrt{\alpha_1}x_0, \beta_1 I) = \mathcal{N}(\sqrt{\bar\alpha_1}x_0, (1-\bar\alpha_1)I)$ since $\bar\alpha_1 = \alpha_1$ and $1 - \alpha_1 = \beta_1$. Inductive step: assume $q(x_{t-1} \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}}x_0, (1-\bar\alpha_{t-1})I)$. Then $x_{t-1} = \sqrt{\bar\alpha_{t-1}}x_0 + \sqrt{1-\bar\alpha_{t-1}}\eta$ with $\eta \sim \mathcal{N}(0, I)$. By definition $x_t = \sqrt{\alpha_t}x_{t-1} + \sqrt{\beta_t}\xi = \sqrt{\alpha_t \bar\alpha_{t-1}}x_0 + \sqrt{\alpha_t(1 - \bar\alpha_{t-1})}\eta + \sqrt{\beta_t}\xi$. The two noise terms are independent Gaussians; their sum is Gaussian with variance $\alpha_t(1-\bar\alpha_{t-1}) + \beta_t = \alpha_t - \alpha_t \bar\alpha_{t-1} + 1 - \alpha_t = 1 - \bar\alpha_t$. So $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)$.
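The key variance identity in the inductive step can be checked across an entire schedule in a couple of lines (linear schedule assumed):

import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas, alpha_bar = 1 - betas, torch.cumprod(1 - betas, dim=0)

# alpha_t * (1 - abar_{t-1}) + beta_t = 1 - abar_t for every t > 1
lhs = alphas[1:] * (1 - alpha_bar[:-1]) + betas[1:]
print(torch.allclose(lhs, 1 - alpha_bar[1:]))    # True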

14.16. $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}x_0, (1-\bar\alpha_t)I)$. Log-density: $\log q = -(1/2)(x_t - \sqrt{\bar\alpha_t}x_0)^\top (x_t - \sqrt{\bar\alpha_t}x_0)/(1-\bar\alpha_t) + \text{const}$. Gradient wrt $x_t$: $-(x_t - \sqrt{\bar\alpha_t}x_0)/(1-\bar\alpha_t) = -\sqrt{1-\bar\alpha_t}\epsilon/(1-\bar\alpha_t) = -\epsilon/\sqrt{1-\bar\alpha_t}$, recovering the score-noise relationship.
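Autograd gives an independent check (a sketch; the value of $\bar\alpha_t$ is arbitrary): differentiate the Gaussian log-density at $x_t$ and compare with $-\epsilon/\sqrt{1-\bar\alpha_t}$.

import torch

alpha_bar_t = torch.tensor(0.6)                  # arbitrary abar_t
x0, eps = torch.randn(5), torch.randn(5)
xt = (alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps).requires_grad_(True)

log_q = torch.distributions.Normal(
    alpha_bar_t.sqrt() * x0, (1 - alpha_bar_t).sqrt()
).log_prob(xt).sum()
score = torch.autograd.grad(log_q, xt)[0]

print(torch.allclose(score, -eps / (1 - alpha_bar_t).sqrt(), atol=1e-5))  # True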

14.17. $J = \begin{pmatrix} \partial y_a/\partial x_a & \partial y_a/\partial x_b \\ \partial y_b/\partial x_a & \partial y_b/\partial x_b \end{pmatrix} = \begin{pmatrix} I & 0 \\ * & \mathrm{diag}(\exp(s(x_a))) \end{pmatrix}$. Triangular, so $\det J = \prod_i \exp(s_i(x_a))$ and $\log|\det J| = \sum_i s_i(x_a)$, regardless of the form of the (potentially complicated) lower-left block. Note: this is precisely why the coupling architecture exists: it delivers a tractable Jacobian without constraining the form of $s$ and $t$.
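A minimal coupling layer makes the claim concrete (a sketch: the linear maps standing in for $s$ and $t$ are hypothetical placeholders; any networks would do, since only $\partial y_b/\partial x_b$ enters the determinant):

import torch

d = 4                                            # split a 2d-dim input into x_a, x_b
s_net, t_net = torch.nn.Linear(d, d), torch.nn.Linear(d, d)

def coupling(x):
    xa, xb = x[:d], x[d:]
    s = s_net(xa)
    y = torch.cat([xa, xb * torch.exp(s) + t_net(xa)])
    return y, s.sum()                            # log|det J| = sum_i s_i(x_a)

x = torch.randn(2 * d)
y, logdet = coupling(x)

# Brute-force comparison against the full Jacobian
J = torch.autograd.functional.jacobian(lambda v: coupling(v)[0], x)
print(torch.allclose(torch.slogdet(J).logabsdet, logdet, atol=1e-5))  # True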

14.19. Substitute $\sigma_1 = \sigma_2 = \sigma$ into the result of 14.12: $\log(\sigma/\sigma) + (\sigma^2 + (\mu_1 - \mu_2)^2)/(2\sigma^2) - 1/2 = (\mu_1 - \mu_2)^2/(2\sigma^2)$. Summing over $d$ dimensions: $\|\mu_1 - \mu_2\|^2/(2\sigma^2)$. This is the mean-squared error up to a constant, the reason DDPM's KL terms collapse to MSE on the noise.
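This collapse is what makes the practical training loop so simple. A minimal sketch of one training step under this simplification (the linear schedule is an assumed default, and model stands for any $\epsilon$-predictor such as the U-Net of §14.16):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

def loss_step(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over trailing dims
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return torch.nn.functional.mse_loss(model(xt, t), eps)  # the KL terms, collapsed to MSE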

14.21. [Implementation skeleton]

import torch
from torchvision.utils import save_image

device = "cuda" if torch.cuda.is_available() else "cpu"
model = train(epochs=10)  # see §14.4
model.eval()

# Reconstructions: encode-decode 16 test digits, save x and x_hat side by side
x_test = next(iter(test_loader))[0][:16].to(device)
with torch.no_grad():
    x_hat, _, _ = model(x_test)
save_image(torch.cat([x_test, x_hat.view_as(x_test)]), "recon.png", nrow=16)

# Samples: decode 100 prior draws into a 10x10 grid
z = torch.randn(100, 32, device=device)
with torch.no_grad():
    samples = model.decode(z).view(-1, 1, 28, 28).cpu()
save_image(samples, "samples.png", nrow=10)

After 10 epochs, expect a negative ELBO of roughly 95 nats per image, recognisable digits, and smooth interpolation between samples.

14.26. [Implementation pointer.] After training the DDPM of §14.16, the sampling time is dominated by 1000 forward passes of the U-Net. Implementing the DDIM sampler involves a single change: instead of stepping $t \to t-1$, you step over a subset $\{\tau_S, \tau_{S-1}, \ldots, \tau_1\}$, and the update uses the formula $x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}}\hat x_0 + \sqrt{1 - \bar\alpha_{\tau_{i-1}}}\,\epsilon_\theta$, where $\hat x_0 = (x_{\tau_i} - \sqrt{1 - \bar\alpha_{\tau_i}}\epsilon_\theta)/\sqrt{\bar\alpha_{\tau_i}}$ and $\sigma_t = 0$ for fully deterministic DDIM.
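A minimal sketch of that loop, assuming model(x, t) is the trained $\epsilon$-predictor of §14.16 and alpha_bar the usual cumulative product of the $\alpha_t$:

import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, taus):
    # taus: increasing subset of {0, ..., T-1}, e.g. list(range(0, 1000, 20))
    x = torch.randn(shape)                       # pure noise at t = taus[-1]
    for i in range(len(taus) - 1, 0, -1):
        t, t_prev = taus[i], taus[i - 1]
        eps = model(x, torch.full((shape[0],), t))
        # Predict x0, then jump deterministically to the earlier step (sigma_t = 0)
        x0_hat = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t_prev].sqrt() * x0_hat + (1 - alpha_bar[t_prev]).sqrt() * eps
    return x                                     # sample at taus[0] (approx. x_0 when taus[0] = 0)

With taus = list(range(0, 1000, 20)), this costs 50 U-Net passes instead of 1000.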
