Solutions to selected exercises

Solution 15.1. Lagrangian:

$$ \mathcal{L}(N, D, \lambda) = E + A N^{-\alpha} + B D^{-\beta} - \lambda(6 N D - C). $$

Setting derivatives to zero:

$$ -\alpha A N^{-\alpha - 1} = 6 \lambda D, \qquad -\beta B D^{-\beta - 1} = 6 \lambda N. $$

Dividing the first by the second:

$$ \frac{\alpha A}{\beta B} \cdot \frac{D^{\beta + 1}}{N^{\alpha + 1}} = \frac{D}{N}. $$

Rearranging gives $N^\alpha / D^\beta = \mathrm{const}$, so $D \propto N^{\alpha / \beta}$. Combining with $C \propto N D$ yields $N \propto C^{\beta / (\alpha + \beta)}$ and $D \propto C^{\alpha / (\alpha + \beta)}$. When $\alpha = \beta$, both exponents equal $1/2$.
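
As a numerical sanity check (a sketch, not part of the exercise: the constants $E, A, B, \alpha, \beta$ below are illustrative values in the spirit of the Hoffmann et al. fit), minimising the loss along the constraint recovers the predicted exponent $\beta/(\alpha+\beta)$:

import numpy as np

# Illustrative constants (assumed here, not given by the exercise).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def optimal_N(C):
    # Minimise L(N, C/6N) = E + A N^-alpha + B (C/6N)^-beta on a log grid.
    N = np.logspace(6, 14, 20001)
    L = E + A * N**-alpha + B * (C / (6 * N))**-beta
    return N[np.argmin(L)]

Cs = np.array([1e20, 1e22, 1e24])
slope = np.polyfit(np.log(Cs), [np.log(optimal_N(C)) for C in Cs], 1)[0]
print(f"fitted exponent {slope:.3f} vs beta/(alpha+beta) = {beta/(alpha+beta):.3f}")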

Solution 15.2. Substituting $D = 20N$ into $C = 6ND$ gives $C = 120 N^2$, so $N = \sqrt{C/120}$. At $C = 10^{24}$ this is $N \approx 9.1 \times 10^{10}$ parameters with $D \approx 1.83 \times 10^{12}$ tokens, i.e. a 91 B model trained on 1.83 T tokens. Kaplan's $N \propto C^{0.73}$ rule would put the optimum at something closer to $N \approx 5 \times 10^{11}$, $D \approx 3 \times 10^{11}$: a roughly five-times-larger model trained on a sixth of the data (the GPT-3 shape). The Chinchilla-shaped model achieves strictly lower loss at the same compute.
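
The arithmetic, checked directly:

C = 1e24
N = (C / 120) ** 0.5      # from C = 6 * N * (20 N) = 120 N^2
D = 20 * N
print(f"N = {N:.2e} params, D = {D:.2e} tokens")   # ~9.13e10, ~1.83e12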

Solution 15.4. With $\varepsilon = C^{-0.05}$ and $k = 8$, exact-match accuracy is $(1 - C^{-0.05})^8$. On a log-log plot against $C$, this stays near zero for small $C$ (when $\varepsilon \approx 1$), then rises sharply through a knee once $\varepsilon$ falls to roughly $1/k = 1/8$ (i.e., $C \approx 8^{20} \approx 10^{18}$), and saturates near 1. The per-token accuracy $1 - \varepsilon$ rises smoothly throughout. The "phase transition" in exact-match is purely a consequence of compounding $k$ per-token successes, and $k$ is a property of the metric, not the model.
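
A short script tabulating both curves (the compute grid is arbitrary):

import numpy as np

k = 8
for C in np.logspace(6, 24, 7):
    eps = C ** -0.05
    print(f"C = 1e{int(round(np.log10(C))):2d}:  "
          f"per-token = {1 - eps:.3f}  exact-match = {(1 - eps) ** k:.4f}")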

Solution 15.5. $C = 6 \cdot 70 \times 10^9 \cdot 1.4 \times 10^{12} \approx 5.9 \times 10^{23}$ FLOPs. A 4096-GPU H100 cluster delivers $4096 \cdot 989 \times 10^{12} \approx 4.05 \times 10^{18}$ FLOPs/s peak, $2.0 \times 10^{18}$ FLOPs/s at 50% MFU. Wall-clock: $5.9 \times 10^{23} / 2.0 \times 10^{18} \approx 2.95 \times 10^5$ seconds $\approx 82$ hours $\approx 3.4$ days.
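
The same arithmetic as a script (the prose rounds $C$ to $5.9 \times 10^{23}$ and the throughput to $2.0 \times 10^{18}$ before dividing, hence its slightly larger 82 h):

C = 6 * 70e9 * 1.4e12                 # 5.88e23 training FLOPs
peak = 4096 * 989e12                  # cluster peak FLOP/s
seconds = C / (0.5 * peak)            # 50% MFU
print(f"{C:.2e} FLOPs -> {seconds:.2e} s "
      f"= {seconds / 3600:.0f} h = {seconds / 86400:.1f} days")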

Solution 15.6. Substituting $r(x, y) \to r(x, y) + c(x)$ into the Bradley–Terry probability:

$$ \sigma\bigl((r(x, y_w) + c(x)) - (r(x, y_l) + c(x))\bigr) = \sigma(r(x, y_w) - r(x, y_l)), $$

which is unchanged. Practical consequence: only differences of rewards within a prompt are identifiable, so a reward model trained on preferences is determined only up to an arbitrary per-prompt additive constant. This is fine for ranking but means absolute reward values from one prompt are not comparable to those from another.
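
A numeric illustration (random rewards and a random per-prompt offset, purely for demonstration):

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
r_w, r_l, c = rng.normal(size=3)            # rewards and a per-prompt offset
print(sigma(r_w - r_l))                     # preference probability
print(sigma((r_w + c) - (r_l + c)))         # identical: c cancels exactly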

Solution 15.7. Let $\Delta = r_\phi(x, y_w) - r_\phi(x, y_l)$. Then $\mathcal{L} = -\log \sigma(\Delta)$ and

$$ \nabla_\phi \mathcal{L} = -\frac{\sigma'(\Delta)}{\sigma(\Delta)} \nabla_\phi \Delta = -\bigl(1 - \sigma(\Delta)\bigr) \Bigl(\nabla_\phi r_\phi(x, y_w) - \nabla_\phi r_\phi(x, y_l)\Bigr), $$

using $\sigma'(\Delta) = \sigma(\Delta)\bigl(1 - \sigma(\Delta)\bigr)$.

The gradient is weighted by the probability that the model currently assigns the wrong preference, $1 - \sigma(\Delta)$, so easy correctly-classified examples contribute little. This is the same shape as the gradient of binary logistic regression.
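
A finite-difference check of the weighting factor (the point $\Delta = 1.3$ is arbitrary):

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

delta, h = 1.3, 1e-6
numeric = (-np.log(sigma(delta + h)) + np.log(sigma(delta - h))) / (2 * h)
analytic = -(1 - sigma(delta))
print(numeric, analytic)                    # both ~ -0.2142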

Solution 15.8. Form the Lagrangian for the constrained problem

$$ \max_\pi \int \pi(y \mid x)\bigl[r(x, y) - \beta \log \pi(y \mid x) + \beta \log \pi_{\text{ref}}(y \mid x)\bigr] dy $$

subject to $\int \pi = 1$. Functional derivative with respect to $\pi(y \mid x)$:

$$ r(x, y) - \beta \log \pi(y \mid x) - \beta + \beta \log \pi_{\text{ref}}(y \mid x) - \lambda(x) = 0. $$

Solving for $\pi$:

$$ \pi^*(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp\bigl(r(x, y)/\beta - 1 - \lambda(x)/\beta\bigr). $$

Absorbing the constants into $Z(x)$ via the normalisation gives the stated form. Assumptions: $\pi_{\text{ref}}$ has full support over the response space (otherwise the KL is undefined); the KL is taken as $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, which is mode-seeking.
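
A discrete sanity check, as a sketch: on a 10-outcome toy distribution (randomly generated here), $\pi^* \propto \pi_{\text{ref}} \exp(r/\beta)$ should beat any other distribution on the KL-regularised objective:

import numpy as np

rng = np.random.default_rng(0)
K, beta = 10, 0.5
pi_ref = rng.dirichlet(np.ones(K))
r = rng.normal(size=K)

pi_star = pi_ref * np.exp(r / beta)         # closed form, before normalising
pi_star /= pi_star.sum()                    # Z(x) is the normaliser

def objective(pi):                          # E_pi[r] - beta * KL(pi || pi_ref)
    return np.sum(pi * (r - beta * np.log(pi / pi_ref)))

gaps = [objective(pi_star) - objective(rng.dirichlet(np.ones(K)))
        for _ in range(1000)]
print(f"smallest margin over 1000 random distributions: {min(gaps):.4f}")  # >= 0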

Solution 15.10. Numerically stable DPO loss:

import torch
import torch.nn.functional as F

def dpo_loss(lp_pol_w, lp_pol_l, lp_ref_w, lp_ref_l, beta=0.1):
    # Inputs are sequence log-probs: policy/reference, winning/losing response.
    log_ratio_w = lp_pol_w - lp_ref_w       # log pi(y_w|x) - log pi_ref(y_w|x)
    log_ratio_l = lp_pol_l - lp_ref_l       # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (log_ratio_w - log_ratio_l)
    # F.logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(logits).mean()

F.logsigmoid(x) is equivalent to $\log \sigma(x)$ but computed stably: for large negative $x$, $\sigma(x)$ underflows to zero and $\log \sigma(x)$ would return $-\infty$, while for large positive $x$ the naive form loses precision as $\sigma(x)$ rounds to 1.
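
A usage sketch with made-up sequence log-probs (all tensor values below are arbitrary):

import torch

# dpo_loss as defined above; a batch of 4 preference pairs.
lp_pol_w = torch.tensor([-12.3, -40.1, -7.8, -22.0])
lp_pol_l = torch.tensor([-14.0, -39.5, -9.1, -25.2])
lp_ref_w = torch.tensor([-12.9, -41.0, -8.0, -23.1])
lp_ref_l = torch.tensor([-13.1, -40.2, -8.8, -24.9])
print(dpo_loss(lp_pol_w, lp_pol_l, lp_ref_w, lp_ref_l))   # scalar loss tensor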

Solution 15.12. Within a group of $G$ samples, the empirical advantage estimate is

$$ \hat A_i = \frac{r_i - \mu}{\sigma}, \qquad \mu = \tfrac{1}{G} \sum_j r_j, \quad \sigma = \mathrm{std}(r). $$

Sum: $\sum_i \hat A_i = (\sum_i r_i - G \mu)/\sigma = 0$. Variance reduction follows because the policy gradient estimator is

$$ \hat g = \frac{1}{G} \sum_i \hat A_i \nabla_\theta \log \pi_\theta(y_i \mid x). $$

Subtracting a baseline that is constant within the group (here $\mu$) leaves the estimator unbiased, since $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta] = 0$ by the standard log-derivative trick, and reduces variance to the extent that the group mean tracks the typical reward for that prompt. Group relativisation works particularly well when the absolute reward scale varies wildly between prompts.
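
A numeric illustration (Gaussian rewards with an arbitrary mean and scale):

import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(loc=5.0, scale=2.0, size=16)     # one group of G = 16 rewards
A = (r - r.mean()) / r.std()
print(f"sum of advantages = {A.sum():.1e}")     # ~0 up to float error
print(f"reward variance {r.var():.2f} -> advantage variance {A.var():.2f}")  # 1.00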

Solution 15.13. Early in training, $r_i$ is mostly 0 with a few 1s. With pass rate $p$, a success gets advantage $+\sqrt{(1-p)/p}$ and a failure $-\sqrt{p/(1-p)}$, so the rare successes carry a large, well-defined gradient signal. Late in training, with $p = 0.8$, most $r_i$ are 1 and the std shrinks: the rare 0-reward samples get a large negative advantage ($\hat A = -\sqrt{p/(1-p)} = -2$) while 1-reward samples get only $+0.5$. Sample efficiency degrades: most of the gradient comes from rare failure cases, and the policy gradient becomes increasingly noisy, as the sketch below shows numerically. Practical mitigations include curriculum (move to harder problems) and rejection sampling (drop near-100% pass-rate problems from the buffer).
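
A sketch of how the group-normalised advantages behave as the pass rate $p$ rises (binary rewards, group size 64, deterministic counts for clarity):

import numpy as np

G = 64
for p in (0.05, 0.5, 0.8, 0.95):
    r = np.zeros(G)
    r[: round(p * G)] = 1.0                     # binary rewards at pass rate p
    A = (r - r.mean()) / (r.std() + 1e-8)
    print(f"p = {p:.2f}:  A(success) = {A.max():+.2f}  A(failure) = {A.min():+.2f}")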

Solution 15.27. Standard multi-head attention with 64 heads, head dim $128$ (so $H \cdot d_h = 8192$): per-layer projections are $W_Q, W_K, W_V \in \mathbb{R}^{8192 \times 8192}$, each $6.7 \times 10^7$ params; total $K + V$ projection: $1.34 \times 10^8$. With GQA-8: $W_K, W_V \in \mathbb{R}^{8192 \times 1024}$, each $8.4 \times 10^6$ params; total $K + V$ projection: $1.68 \times 10^7$. Saving per layer: $1.18 \times 10^8$ params. Across 80 layers: $9.4 \times 10^9$ params saved, about 13% of the 70 B total. The KV cache savings at inference time are equally important: 8× smaller, allowing 8× more concurrent users at the same memory budget.
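
The same parameter counting as a script:

d_model, d_head, n_heads, n_layers = 8192, 128, 64, 80

def kv_params(n_kv_heads):
    # W_K and W_V each map d_model -> n_kv_heads * d_head.
    return 2 * d_model * n_kv_heads * d_head

saved_per_layer = kv_params(n_heads) - kv_params(8)      # MHA vs GQA-8
saved = saved_per_layer * n_layers
print(f"saved {saved:.3g} params ({saved / 70e9:.1%} of 70 B)")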

Solution 15.31. PPO-RLHF holds four full models in memory: policy, reference, reward model, and value model, plus optimiser state for the policy and value networks. Each training step requires four forward passes and two backward passes. Failure modes include reward hacking (the policy finds adversarial inputs to the reward model), KL blow-up (the policy drifts so far from the reference that the KL penalty cannot pull it back), value-model divergence, and instability from the importance ratio. DPO holds two models in memory: policy and reference, plus optimiser state for the policy. Each step requires two forward passes and one backward pass. Failure modes include preference-data quality (DPO has no slack against systematic annotator bias), reference-policy quality (a bad SFT checkpoint poisons everything downstream), and the blow-up identified in the IPO analysis, where unanimous preferences drive the implicit reward margin towards infinity. DPO is roughly $5\times$ cheaper per training step and an order of magnitude simpler in distributed-training plumbing.
