Solution sketches

Exercise 12.1. Under the uniform distribution, $\log P(w_t \mid w_{\lt t}) = \log(1/V) = -\log V$ for every position. Average per-token negative log-likelihood is $\log V$, and perplexity is $\exp(\log V) = V$. The general identity follows from $\mathrm{PPL} = \exp\bigl( -\frac{1}{M} \sum \log P \bigr)$ exactly.

Exercise 12.2. $P(\text{the cat sat}) = 0.1 \cdot 0.05 \cdot 0.2 \cdot 0.1 = 10^{-4}$. There are 4 transitions. Average negative log-likelihood is $-\frac{1}{4} \log(10^{-4}) = \log 10$. Perplexity $= \exp(\log 10) = 10$.
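A quick numerical check of both perplexity calculations; a minimal Python sketch, with the vocabulary size for 12.1 chosen arbitrarily for illustration.

```python
import math

# Exercise 12.1: uniform model over a vocabulary of size V (V = 50,000 is illustrative).
V = 50_000
avg_nll_uniform = -math.log(1 / V)      # equals log V at every position
print(math.exp(avg_nll_uniform))        # ~ 50000.0: perplexity equals V

# Exercise 12.2: the four conditional probabilities used in the solution.
probs = [0.1, 0.05, 0.2, 0.1]
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(avg_nll))                # ~ 10.0
```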

Exercise 12.3. $\sum_{w_2} (c(w_1, w_2) + 1) = c(w_1) + V$ (since $\sum_{w_2} c(w_1, w_2) = c(w_1)$ and there are $V$ values of $w_2$). Hence the smoothed conditional sums to $1$.
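A small numerical check of the normalisation; the bigram counts below are made up for illustration.

```python
import numpy as np

V = 5
counts = np.array([3, 0, 1, 0, 6], dtype=float)   # c(w1, w2) for a fixed w1 (toy values)

# Add-one (Laplace) smoothed conditional P(w2 | w1).
p_smoothed = (counts + 1) / (counts.sum() + V)    # denominator is c(w1) + V
print(p_smoothed.sum())                           # -> 1.0
```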

Exercise 12.6. Differentiating the second term of $\mathcal{L}_{\mathrm{NS}}$ with respect to $u_{w_k}$ gives $\partial / \partial u_{w_k} \log \sigma(-u_{w_k}^\top v_{w_t}) = -\sigma(u_{w_k}^\top v_{w_t}) v_{w_t}$, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\frac{d}{dz} \log \sigma(-z) = -\sigma(z)$. The gradient pushes $u_{w_k}$ in the $-v_{w_t}$ direction with magnitude $\sigma(u_{w_k}^\top v_{w_t})$, which is large precisely when the noise word is currently judged to be a likely context word.
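A finite-difference check of this gradient on random vectors; the dimensions and vectors below are arbitrary, chosen only to test the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
u_k, v_t = rng.normal(size=d), rng.normal(size=d)   # noise-word and centre-word vectors

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
term = lambda u: np.log(sigmoid(-u @ v_t))           # the second term of L_NS

analytic = -sigmoid(u_k @ v_t) * v_t                 # the gradient derived above

eps = 1e-6
numeric = np.array([(term(u_k + eps * e) - term(u_k - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(analytic - numeric)))            # ~ 1e-10: the two agree
```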

Exercise 12.11. The step-Jacobian is $J = \mathrm{diag}(1 - \tanh^2 z) W_{hh}$. The diagonal entries lie in $(0, 1]$ with $1$ attained only at $z = 0$. The spectral radius is not submultiplicative in general, so we work in the spectral norm: $\|J\|_2 \le \|\mathrm{diag}(1 - \tanh^2 z)\|_2\, \|W_{hh}\|_2 = \max_i(1 - \tanh^2 z_i)\, \|W_{hh}\|_2$. Since $\rho(\cdot) \le \|\cdot\|_2$ for any matrix, $\rho(J) \le \|J\|_2 \le \max_i(1 - \tanh^2 z_i)\, \|W_{hh}\|_2$, with equality attained, for example, when $z = 0$ (so the saturation factor is one) and $W_{hh}$ is normal. If $\|W_{hh}\|_2 < 1$, the product of step-Jacobians shrinks geometrically and the gradient vanishes.
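A numerical illustration of the bound and the geometric shrinkage; $W_{hh}$ and the pre-activations are random, and rescaling $\|W_{hh}\|_2$ to $0.9$ is an assumption made only for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
H, T = 16, 50
W_hh = rng.normal(size=(H, H))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)       # rescale so ||W_hh||_2 = 0.9 < 1

product, bound = np.eye(H), 1.0
for _ in range(T):
    z = rng.normal(size=H)                   # pre-activations at this step (illustrative)
    J = np.diag(1 - np.tanh(z) ** 2) @ W_hh  # the step-Jacobian
    product = J @ product
    bound *= np.max(1 - np.tanh(z) ** 2) * np.linalg.norm(W_hh, 2)

print(np.linalg.norm(product, 2), bound)     # both tiny; the product's norm respects the bound
```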

Exercise 12.14. With the gates treated as constants, $c_t = f_t \odot c_{t-1} + (\text{terms not depending on } c_{t-1})$. Hence $\partial c_t / \partial c_{t-1} = \mathrm{diag}(f_t)$. By induction, $\partial c_t / \partial c_{t-k} = \mathrm{diag}(f_t \odot f_{t-1} \odot \cdots \odot f_{t-k+1})$.
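A quick check with made-up gate values: with the gates held fixed, accumulating one $\mathrm{diag}(f)$ per step reproduces the elementwise product of the forget gates.

```python
import numpy as np

rng = np.random.default_rng(0)
H, k = 4, 3
forget_gates = [rng.uniform(0.8, 1.0, size=H) for _ in range(k)]   # toy gate values

jacobian = np.eye(H)
for f in forget_gates:
    jacobian = np.diag(f) @ jacobian         # each step multiplies by diag(f_t)

expected = np.diag(np.prod(forget_gates, axis=0))
print(np.allclose(jacobian, expected))       # True
```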

Exercise 12.15. Recurrence: $c_t = 0.99 c_{t-1} + 0.01$. This is a linear first-order recurrence with fixed point $c^* = 0.01 / (1 - 0.99) = 1$. Solution: $c_T = 1 - (1 - c_0)(0.99)^T$, which with $c_0 = 0$ gives $c_T = 1 - 0.99^T$. The limit as $T \to \infty$ is $1$.
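Unrolling the recurrence numerically confirms the closed form and the limit; the sketch assumes $c_0 = 0$, as in the simplification above.

```python
c0 = 0.0
c = c0
for T in range(1, 501):
    c = 0.99 * c + 0.01                                    # one step of the recurrence
    assert abs(c - (1 - (1 - c0) * 0.99 ** T)) < 1e-12     # matches the closed form
print(c)                                                   # ~ 0.993 after 500 steps, approaching 1
```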

Exercise 12.17. A bidirectional RNN's hidden state at position $t$ depends on $w_{t+1}, \ldots, w_T$, so the resulting "next-token" prediction $P(w_{t+1} \mid h_t)$ is conditioned on information including $w_{t+1}$ itself. The chain-rule factorisation requires strict causality, which the bidirectional model violates.

Exercise 12.20. Original $\alpha = \mathrm{softmax}(1, 2, 3, -1) \approx (0.089, 0.242, 0.657, 0.012)$. Halving the scores ($\tau = 0.5$ in the multiplicative sense, equivalently temperature 2 in the divisive sense) flattens to $\approx (0.174, 0.287, 0.474, 0.064)$. Doubling sharpens to $\approx (0.016, 0.117, 0.867, 0.0003)$. Sharper attention concentrates mass on the highest-scoring position, which is more decisive but less robust to small errors in the scores.
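The three distributions above can be reproduced with a temperature-scaled softmax; a minimal sketch where halving the scores corresponds to temperature 2 and doubling to temperature 0.5.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                             # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [1, 2, 3, -1]
print(softmax(scores))                       # ~ (0.089, 0.242, 0.657, 0.012)
print(softmax(scores, temperature=2.0))      # halved scores: ~ (0.174, 0.287, 0.474, 0.064)
print(softmax(scores, temperature=0.5))      # doubled scores: ~ (0.016, 0.117, 0.867, 0.0003)
```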

Exercise 12.25. Cumulative probabilities are $0.5, 0.8, 1.0$. The standard convention is to add tokens in order until the cumulative mass meets or exceeds $p$. For $p = 0.7$ the nucleus is $\{w_1, w_2\}$ (since $0.5 < 0.7 \le 0.8$). For $p = 0.8$ nucleus $= \{w_1, w_2\}$. For $p = 0.95$ nucleus $= \{w_1, w_2, w_3\}$.
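A minimal sketch of the selection rule used above (tokens added in descending probability until the cumulative mass meets or exceeds $p$); the small tolerance only guards against floating-point rounding.

```python
def nucleus(probs, p, tol=1e-12):
    """Smallest prefix of tokens, by descending probability, whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p - tol:                  # 0.5 + 0.3 is slightly below 0.8 in floats
            break
    return chosen

probs = [0.5, 0.3, 0.2]                      # w1, w2, w3 from the exercise
for p in (0.7, 0.8, 0.95):
    print(p, nucleus(probs, p))              # [0, 1], [0, 1], [0, 1, 2]
```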

Exercise 12.26. As beam width grows, the search comes closer to returning the model's exact most-likely sequence, and maximum-likelihood-trained models systematically assign too much probability to short and degenerate outputs (the empty sequence often has high likelihood). Narrow beams, by failing to find that mode, happen to return longer and more varied outputs. The fundamental issue is a mismatch between maximum-likelihood training and what we actually want at decoding time.

Exercise 12.28. $\tau = 1$: $P_\tau(w) = e^{z_w} / \sum_w e^{z_w}$, exactly the model's softmax. $\tau \to 0$: $P_\tau$ concentrates on $\arg\max_w z_w$ because the difference $z_{w^*} - z_w$ is amplified to infinity by division by $\tau$, sending all but the argmax to probability zero in the softmax.

Exercise 12.31. Embedding table: $V d$. LSTM layer 1: $4 H (d + H + 1) \approx 4Hd + 4H^2$ for the four gates, with $d$ the embedding dimension and $H$ the hidden size. LSTM layer 2: $4 H (H + H + 1) = 8H^2 + 4H$. Output projection: $H V$. Total: $V d + 4H(d + H + 1) + 8H^2 + 4H + HV$, which simplifies to $Vd + HV + 4Hd + 12H^2$ ignoring the bias terms. With $V = 30000$, $d = 256$, $H = 1024$ this evaluates to roughly $7.7\text{M} + 30.7\text{M} + 1.0\text{M} + 12.6\text{M} \approx 52$ M parameters, with the output projection dominating, the LSTM weights second, and the embedding third.
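The count can be reproduced in a few lines; a short sketch following the gate-level breakdown above (output bias omitted, matching the $HV$ term in the text).

```python
V, d, H = 30_000, 256, 1024

embedding = V * d                            # 7.68M
lstm1 = 4 * H * (d + H + 1)                  # four gates, input size d: ~5.2M
lstm2 = 4 * H * (H + H + 1)                  # four gates, input size H: ~8.4M
output = H * V                               # output projection: 30.72M

total = embedding + lstm1 + lstm2 + output
print(f"{total / 1e6:.1f}M parameters")      # ~ 52.0M
```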

Exercise 12.32. Transformer is $T^2 d$ per layer; RNN is $T d^2$ per layer. They are equal at $T = d$. For $T < d$ the Transformer is cheaper; for $T > d$ the RNN is cheaper per layer, though in practice modern hardware much prefers the Transformer's parallel structure. With $d = 1024$ a sequence of 2048 tokens has Transformer cost $\approx 4 \cdot 10^9$ and RNN cost $\approx 2 \cdot 10^9$ FLOPs per layer; the RNN is theoretically half the FLOPs, but cannot be parallelised across $T$, so wall-clock time on a GPU may be many times worse.
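A small sketch tabulating the two per-layer costs for a few sequence lengths; these are the same order-of-magnitude estimates as in the text, not measured FLOPs.

```python
d = 1024
for T in (512, 1024, 2048, 8192):
    transformer = T ** 2 * d                 # attention cost per layer ~ T^2 * d
    rnn = T * d ** 2                         # recurrent cost per layer ~ T * d^2
    print(f"T={T:5d}  transformer={transformer:.2e}  rnn={rnn:.2e}  ratio={transformer / rnn:.2f}")
# Equal at T = d = 1024; at T = 2048 the Transformer needs ~2x the FLOPs,
# but it parallelises across T while the RNN cannot.
```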
