Exercises
Exercise 15.1. Starting from the Hoffmann et al. parametric form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$, derive the compute-optimal scaling exponents $a$ and $b$ such that $N_{\text{opt}} \propto C^a$ and $D_{\text{opt}} \propto C^b$. Show that when $\alpha = \beta$ both exponents equal $1/2$.
Exercise 15.2. Suppose you have a compute budget of $10^{24}$ FLOPs for pre-training a Transformer language model. Using the Chinchilla rule ($D \approx 20 N$) and $C = 6 N D$, find the compute-optimal $N$ and $D$. Compare with Kaplan et al.'s recommendation, which would set $N \propto C^{0.73}$: how do the two recommendations differ?
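A starter sketch for the arithmetic in Exercise 15.2, treating $C = 6ND$ and $D = 20N$ as exact (values and variable names are just illustrative):

```python
# Starter sketch for Exercise 15.2, assuming C = 6*N*D and D = 20*N hold exactly.
C = 1e24                   # compute budget in FLOPs
# C = 6*N*(20*N) = 120*N**2  =>  N = sqrt(C / 120)
N_opt = (C / 120) ** 0.5   # compute-optimal parameter count
D_opt = 20 * N_opt         # compute-optimal token count
print(f"N_opt ≈ {N_opt:.2e} parameters, D_opt ≈ {D_opt:.2e} tokens")
```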
Exercise 15.3. Sardana et al. (2023) argued that compute-optimal training is the wrong objective if a model will serve many inference tokens. Set up the augmented optimisation problem where total cost is training compute plus expected inference compute over $T_{\text{inf}}$ inference tokens. Show that the optimal $D / N$ ratio increases with $T_{\text{inf}}$.
Exercise 15.4. Schaeffer et al. (2023) argue that exact-match accuracy creates the appearance of phase transitions. Suppose the per-token error rate falls as $\varepsilon(C) = C^{-0.05}$ and a task requires $k = 8$ correct tokens. Plot exact-match accuracy $(1 - \varepsilon)^k$ and per-token accuracy $1 - \varepsilon$ against $C$ on log–log axes. Identify the apparent "phase transition" in the exact-match curve.
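A plotting sketch for Exercise 15.4; the compute range below is an arbitrary assumption, chosen only so that both regimes of the curve are visible:

```python
# Sketch for Exercise 15.4: per-token vs exact-match accuracy on log-log axes.
import numpy as np
import matplotlib.pyplot as plt

C = np.logspace(6, 26, 400)       # compute budget (FLOPs); range is an assumption
eps = C ** -0.05                  # per-token error rate from the exercise
k = 8                             # number of tokens that must all be correct
plt.loglog(C, 1 - eps, label="per-token accuracy")
plt.loglog(C, (1 - eps) ** k, label=f"exact-match accuracy (k={k})")
plt.xlabel("compute C (FLOPs)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```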
Exercise 15.5. For a 70 B-parameter dense Transformer trained on 1.4 T tokens, estimate the total training FLOPs and the wall-clock time on a 4096-GPU H100 cluster running at 50% MFU (each H100 delivers ~989 TFLOP/s of BF16 throughput).
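A back-of-the-envelope sketch for Exercise 15.5, assuming the standard $C \approx 6ND$ approximation and the throughput figures given in the exercise:

```python
# Back-of-the-envelope sketch for Exercise 15.5, using C ≈ 6*N*D.
N, D = 70e9, 1.4e12
train_flops = 6 * N * D                        # total training compute
cluster_flops_per_s = 4096 * 989e12 * 0.50     # 4096 H100s at 50% MFU
seconds = train_flops / cluster_flops_per_s
print(f"{train_flops:.2e} FLOPs, ≈ {seconds / 86400:.1f} days")
```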
Exercise 15.6. Show that the Bradley–Terry model is invariant to additive shifts of the reward: if $r(x, y) \to r(x, y) + c(x)$, then the preference probabilities are unchanged. Discuss the practical consequences for reward-model training.
Exercise 15.7. Derive the gradient of the reward-model log-likelihood
$$ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}\bigl[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\bigr] $$
with respect to $\phi$. Express your answer using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
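A numerical cross-check for Exercise 15.7 (it is not the derivation itself): the toy reward tensors below are stand-ins, and the closed-form gradient you derive should reproduce what autograd reports.

```python
# Numerical cross-check for Exercise 15.7; the toy rewards are stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r_w = torch.randn(5, requires_grad=True)   # stand-in for r_phi(x, y_w)
r_l = torch.randn(5, requires_grad=True)   # stand-in for r_phi(x, y_l)
loss = -F.logsigmoid(r_w - r_l).mean()
loss.backward()
print(r_w.grad, r_l.grad)   # your derived expression should match these exactly
```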
Exercise 15.8. Starting from the KL-constrained RL objective, derive the optimal policy $\pi^*(y \mid x) = Z(x)^{-1} \pi_{\text{ref}}(y \mid x) \exp(r(x, y) / \beta)$. State all assumptions.
Exercise 15.9. Show that, in the DPO derivation, the partition function $Z(x)$ cancels exactly in the difference of implicit rewards between $y_w$ and $y_l$, but does not cancel in the absolute value of either reward. What practical consequence does this have for reward-model-style use of a DPO-trained model?
Exercise 15.10. Implement the DPO loss in PyTorch given access to four log-probability tensors lp_pol_w, lp_pol_l, lp_ref_w, lp_ref_l. Make sure your implementation is numerically stable.
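One possible sketch for Exercise 15.10, assuming the four tensors hold sequence-level (summed) log-probabilities of shape `[batch]`; `beta` is a hyperparameter:

```python
# One possible sketch for Exercise 15.10; assumes sequence-level log-probs of
# shape [batch] and uses logsigmoid for numerical stability.
import torch.nn.functional as F

def dpo_loss(lp_pol_w, lp_pol_l, lp_ref_w, lp_ref_l, beta=0.1):
    # Implicit reward margin of the chosen over the rejected completion,
    # measured relative to the reference policy.
    logits = beta * ((lp_pol_w - lp_ref_w) - (lp_pol_l - lp_ref_l))
    return -F.logsigmoid(logits).mean()
```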
Exercise 15.11. Compare DPO and SimPO. SimPO uses
$$ \mathcal{L}_{\text{SimPO}} = -\log \sigma\bigl(\beta_{\text{S}} (\bar\ell_w - \bar\ell_l) - \gamma\bigr), $$
where $\bar\ell$ is the length-normalised log-probability and $\gamma$ is a margin. Argue whether SimPO can be cast as a form of DPO with implicit reward $\beta_{\text{S}} \bar\ell$ and no reference model, and state the conditions under which the two objectives are equivalent.
Exercise 15.12. GRPO uses group-relative advantages $\hat A_i = (r_i - \mathrm{mean}_j r_j) / \mathrm{std}_j r_j$. Show that this estimator has zero mean within each group. Argue why this matters for variance reduction in policy gradient estimation.
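A quick numerical sanity check for Exercise 15.12 (not a proof, and the random rewards are arbitrary):

```python
# Quick numerical sanity check for Exercise 15.12 (not a proof).
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(16)                   # rewards for one group of rollouts
A = (r - r.mean()) / r.std()
print(A.mean())                      # ~0 up to floating-point error
```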
Exercise 15.13. A reasoning model is trained with GRPO on maths problems with verifiable rewards. The reward is $1$ if the final answer is correct and $0$ otherwise. The model's pass rate on the training distribution rises from $30\%$ at the start of training to $80\%$. Discuss how the variance of the GRPO advantage estimate changes through training and the implications for sample efficiency.
Exercise 15.14. Best-of-$N$ accuracy with a perfect verifier is $1 - (1 - p)^N$, where $p$ is the per-sample success rate. Best-of-$N$ accuracy with an imperfect verifier whose precision and recall both equal $\rho$ is much harder to compute. Set up the model and derive an upper bound.
Exercise 15.15. Lightman et al. (2023) showed PRMs outperform ORMs in best-of-$N$ at large $N$. Construct a simple toy example with 100 problems, two reasoning chains per problem, where the ORM and PRM make different selections for at least 30 of them, and argue why the PRM's selections are better.
Exercise 15.16. An induction head copies $B$ as a prediction whenever it sees the pattern $\ldots A B \ldots A$ in context. Construct a 12-token context in which the induction head should fire and predict a specific token. Trace which positions the previous-token head and induction head attend to.
Exercise 15.17. Explain why chain-of-thought prompting cannot help on a problem solvable in a single forward pass and a constant time budget. Conversely, explain why CoT can help on problems with high serial complexity, even with the same model.
Exercise 15.18. A Constitutional AI run uses a constitution with 16 principles. Each critique step samples one principle uniformly at random. Show that, if the model's first-pass response satisfies $k$ of the 16 principles, then the expected number of critique steps until each of the $16 - k$ unsatisfied principles has been sampled at least once is $16 \sum_{i=1}^{16-k} 1/i$.
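A Monte Carlo check for Exercise 15.18 (not a proof); the choice $k = 10$ is just an example:

```python
# Monte Carlo check for Exercise 15.18. With k = 10 the closed form predicts
# 16 * (1/1 + 1/2 + ... + 1/6) = 39.2.
import random

def steps_until_covered(n=16, k=10):
    needed = set(range(n - k))           # the n - k unsatisfied principles
    steps = 0
    while needed:
        needed.discard(random.randrange(n))
        steps += 1
    return steps

k, trials = 10, 20_000
estimate = sum(steps_until_covered(k=k) for _ in range(trials)) / trials
exact = 16 * sum(1 / i for i in range(1, 16 - k + 1))
print(estimate, exact)
```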
Exercise 15.19. Implement a minimal ReAct loop in pseudocode with three tools: search_web(query), calculate(expression), and get_weather(city). Define the chat template, show the reasoning–action–observation interleaving, and define a stopping rule.
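A minimal starter skeleton for Exercise 15.19: the tool bodies and the `llm()` call are placeholders you would replace with a real model call and real tools, and the action-parsing regex is one possible convention, not a prescribed format.

```python
# Minimal starter skeleton for Exercise 15.19; llm() and the tools are stubs.
import re

def search_web(query): return f"[stub web results for: {query}]"
def calculate(expression): return str(eval(expression))   # toy only; never eval untrusted input
def get_weather(city): return f"[stub weather for: {city}]"

TOOLS = {"search_web": search_web, "calculate": calculate, "get_weather": get_weather}

def llm(prompt):
    # Placeholder: in the exercise this is a chat-templated call to your model,
    # which should emit "Thought: ... Action: tool(arg)" or "Final Answer: ...".
    return "the request is trivial. Final Answer: 42"

def react(question, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += f"Thought: {step}\n"
        if "Final Answer:" in step:                        # stopping rule
            return step.split("Final Answer:")[-1].strip()
        m = re.search(r"Action:\s*(\w+)\((.*)\)", step)    # e.g. Action: calculate(2+2)
        if m and m.group(1) in TOOLS:
            obs = TOOLS[m.group(1)](m.group(2))
            transcript += f"Observation: {obs}\n"
    return None

print(react("What is 2 + 2?"))
```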
Exercise 15.20. Hybrid retrieval with BM25 and a dense encoder. Given a query, you have BM25 scores $b_i$ and dense cosine similarity $d_i$ for each candidate. Reciprocal rank fusion combines rankings as $s_i = 1/(k + r_i^B) + 1/(k + r_i^D)$, where $r_i^B$ and $r_i^D$ are the BM25 and dense ranks and $k$ is a constant. Argue why RRF is robust to score-scale differences between BM25 and cosine.
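A small sketch for Exercise 15.20 illustrating why RRF ignores raw score scales: only the orderings enter the fused score (the example scores are made up).

```python
# Sketch for Exercise 15.20: RRF works on ranks, so the raw score scales drop out.
def rrf(bm25_scores, dense_scores, k=60):
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return {i: r + 1 for r, i in enumerate(order)}     # rank 1 = best
    rb, rd = ranks(bm25_scores), ranks(dense_scores)
    return [1 / (k + rb[i]) + 1 / (k + rd[i]) for i in range(len(bm25_scores))]

# Wildly different score scales, but only the orderings matter.
print(rrf([12.3, 7.1, 0.4], [0.91, 0.88, 0.12]))
```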
Exercise 15.21. A 1 M-token context allows you to fit ~3 MB of text. Estimate the cost (in tokens, dollars, and latency) of a single query at 2026 prices ($\sim\$3$ per million tokens of input, ~25 ms time-to-first-token for the first 1 K tokens, scaling linearly thereafter).
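A back-of-the-envelope sketch for Exercise 15.21, using only the price and latency figures given in the exercise:

```python
# Back-of-the-envelope sketch for Exercise 15.21, using the figures in the text.
tokens = 1_000_000
cost_usd = tokens / 1e6 * 3.0            # $3 per million input tokens
ttft_s = 0.025 * (tokens / 1_000)        # 25 ms per 1K tokens, scaled linearly
print(f"input cost ≈ ${cost_usd:.2f}, time-to-first-token ≈ {ttft_s:.0f} s")
```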
Exercise 15.22. Compare GraphRAG and flat-chunk RAG on a question that requires combining information from three documents. Construct a small example (three short documents, one query) where GraphRAG returns useful subgraph context and flat RAG fails.
Exercise 15.23. Whisper uses an encoder–decoder architecture. Why does it use this rather than a decoder-only design? Discuss the trade-offs in terms of latency, streaming, and accuracy.
Exercise 15.24. A diffusion video model generates 16-frame clips at 720p. Estimate the FLOPs cost per generated second, using the formula for diffusion sampling cost (model FLOPs $\times$ number of denoising steps, often 25–50). Compare to autoregressive video generation.
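A sketch for Exercise 15.24; the per-call FLOPs figure and the frame rate are placeholder assumptions you should replace with estimates for a concrete model.

```python
# Sketch for Exercise 15.24; per-call FLOPs and fps below are assumptions.
model_flops_per_call = 5e13      # assumed FLOPs for one denoising pass over the clip
denoising_steps = 50             # the exercise gives 25-50
frames, fps = 16, 8              # 16-frame clip; 8 fps is an assumption
clip_flops = model_flops_per_call * denoising_steps
flops_per_generated_second = clip_flops / (frames / fps)
print(f"{flops_per_generated_second:.2e} FLOPs per generated second of video")
```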
Exercise 15.25. Suppose MMLU saturates at 95% and a new benchmark reports a frontier model at 92%. What can you infer about the new benchmark? Discuss in terms of discriminative power and contamination risk.
Exercise 15.26. The "lost-in-the-middle" effect (Liu et al., 2023): in a 32 K-token context, accuracy on a question with the answer at position $k$ is highest when $k$ is near the start or the end. Propose three architectural or training interventions that might mitigate this effect.
Exercise 15.27. Compute the parameter savings from grouped-query attention with 8 KV heads versus 64 query heads, for a 70 B model with hidden dim 8192 and 80 layers. Express both in absolute parameters and as a percentage of the full attention block.
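A starter sketch for Exercise 15.27, assuming head dimension $8192 / 64 = 128$ and separate Q, K, V, O projections with no biases (both are assumptions you may want to vary):

```python
# Starter sketch for Exercise 15.27; assumes head_dim = hidden / 64 = 128 and
# separate Q, K, V, O projections with no biases.
hidden, n_q, n_kv, layers = 8192, 64, 8, 80
head_dim = hidden // n_q
kv_params_mha = 2 * hidden * n_q * head_dim     # K and V projections, 64 KV heads
kv_params_gqa = 2 * hidden * n_kv * head_dim    # K and V projections, 8 KV heads
saved_per_layer = kv_params_mha - kv_params_gqa
full_block = 4 * hidden * hidden                # Q, K, V, O under full MHA
print(f"saved {saved_per_layer * layers:.3e} params total, "
      f"{100 * saved_per_layer / full_block:.1f}% of each attention block")
```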
Exercise 15.28. A frontier lab releases a benchmark with 1000 problems. A researcher discovers that 50 of the problems appear in a popular pre-training corpus released two years before the benchmark. Describe two reasonable ways to handle this contamination, and the trade-offs of each.
Exercise 15.29. Train and DPO-tune a 100 M-parameter model on a single 24 GB GPU. Use the FineWeb-Edu pre-training mix (or any similar quality web subset) and a small SFT dataset. Report perplexity, MMLU-Pro, GSM8K and IFEval at three checkpoints: base, SFT, DPO. Discuss the gains.
Exercise 15.30. The DeepSeek-R1 paper claims that long chain-of-thought emerges spontaneously under GRPO with verifiable rewards, even from a base model with no SFT (R1-Zero). Replicate at small scale: take a 1 B base, set up GRPO on GSM8K with exact-match rewards, and observe whether average completion length grows during training.
Exercise 15.31. Compare the engineering complexity of PPO-RLHF and DPO. Specifically, count the number of model copies in memory, the number of optimisation passes per training step, and the failure modes you would monitor in production.
Exercise 15.32. Process reward models versus outcome reward models: design a synthetic experiment to test whether, on a fixed budget, PRM-guided best-of-$N$ outperforms ORM-guided best-of-$N$ as a function of $N$. State your hypothesis and the metric.