Exercises
The exercises are labelled by type: routine (R), thought (T), implementation (I), open-ended (O).
Exercise 12.1 (R). Verify by hand that perplexity equals $V$ for a model that assigns the uniform distribution over a vocabulary of size $V$. Show that perplexity is the exponential of average per-token cross-entropy.
Exercise 12.2 (R). A bigram language model trained on a corpus assigns $P(\text{the} \mid \langle\text{BOS}\rangle) = 0.1$, $P(\text{cat} \mid \text{the}) = 0.05$, $P(\text{sat} \mid \text{cat}) = 0.2$, $P(\langle\text{EOS}\rangle \mid \text{sat}) = 0.1$. Compute the perplexity of the sentence "the cat sat".
Exercise 12.3 (R). Show that under add-one (Laplace) smoothing, the bigram probability $\hat P_{+1}(w_2 \mid w_1) = (c(w_1, w_2) + 1) / (c(w_1) + V)$ defines a valid probability distribution: $\sum_{w_2} \hat P_{+1}(w_2 \mid w_1) = 1$.
Exercise 12.4 (R). Compute by hand the bigram counts and add-one-smoothed bigram probabilities for the corpus "I am Sam. Sam I am. I do not like green eggs and ham." Treat sentences as separate, with $\langle\text{BOS}\rangle$ and $\langle\text{EOS}\rangle$ inserted.
Exercise 12.5 (T). Why does the "linear analogy" property, vec("king") − vec("man") + vec("woman") $\approx$ vec("queen"), emerge in word2vec? Sketch the connection to the implicit factorisation of the shifted PMI matrix (Levy and Goldberg 2014).
Exercise 12.6 (T). Show that for word2vec with negative sampling, the gradient of the per-pair loss with respect to a noise-word's context vector $u_{w_k}$ is $-\sigma(u_{w_k}^\top v_{w_t}) v_{w_t}$. Interpret the magnitude and direction.
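To check the expression you derive, compare it against autograd on random vectors. The sketch below assumes the noise-word term is written as $\log\sigma(-u_{w_k}^\top v_{w_t})$; adapt the sign if the chapter defines the per-pair loss as its negation.

```python
# Numerical check for Exercise 12.6, assuming the noise-word term log σ(-u_k·v_t).
import torch

torch.manual_seed(0)
d = 8
v_t = torch.randn(d)                       # centre-word vector (held fixed)
u_k = torch.randn(d, requires_grad=True)   # noise-word context vector

term = torch.log(torch.sigmoid(-(u_k @ v_t)))
term.backward()

analytic = -torch.sigmoid(u_k.detach() @ v_t) * v_t    # -σ(u_k·v_t) v_t
print(torch.allclose(u_k.grad, analytic, atol=1e-6))   # expect True
```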
Exercise 12.7 (R). Run BPE training by hand on the corpus "abab abc abc bc" with a target vocabulary of size 8 (initial characters plus $K$ merges). Show the merges in order.
Exercise 12.8 (R). Suppose a WordPiece tokeniser is trained with the score $\mathrm{score}(x, y) = c(xy) / (c(x) c(y))$. Why does this score favour pairs that are not just frequent but informative? Give a small numerical example where pair frequency and pair score disagree on which to merge.
Exercise 12.9 (I). Implement a simple BPE training routine in Python. Train it on the first chapter of Pride and Prejudice. Report the first 20 merges in order.
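A minimal starting point, assuming whitespace pre-tokenisation and no end-of-word marker (both of which you may want to add):

```python
# Minimal BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(vocab, pair):
    """Replace each occurrence of `pair` with the concatenated symbol."""
    merged = Counter()
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(text, num_merges):
    words = Counter(text.split())
    vocab = Counter({tuple(w): f for w, f in words.items()})  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = apply_merge(vocab, best)
        merges.append(best)
    return merges

print(train_bpe("low low lower lowest newer newest", num_merges=10))
```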
Exercise 12.10 (R). For a vanilla RNN with $H = 1$, $W_{hh} = w$, $W_{xh} = 1$, $b_h = 0$, and $\tanh$ non-linearity, derive a closed-form expression for $h_t$ in terms of $h_0$ and the inputs $x_1, \ldots, x_t$ in the special case where every $x_t = 0$ (i.e., free dynamics).
Exercise 12.11 (T). Show that the spectral norm of the step-Jacobian of a $\tanh$ RNN is at most $\sigma_{\max}(W_{hh})$, the largest singular value of $W_{hh}$, with equality only in the limit of zero pre-activations. Conclude that simple RNNs with $\sigma_{\max}(W_{hh}) < 1$ have gradients that vanish exponentially in the time lag.
Exercise 12.12 (T). Derive $\partial \mathcal{L}_t / \partial b_h$ for a vanilla RNN and show that it, too, is a sum of terms each involving a product of step-Jacobians. (This shows that the vanishing-gradient problem affects all parameters, not just $W_{hh}$.)
Exercise 12.13 (I). Implement gradient clipping in PyTorch (or read the source of torch.nn.utils.clip_grad_norm_). Train a vanilla RNN on the adding problem (sequence length $T = 50$) with and without clipping. Report training stability.
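A sketch of where the clipping call goes in the training step; the model, loss, optimiser, and adding-problem batches are assumed to be defined elsewhere.

```python
# One training step with gradient-norm clipping; model, criterion, optimizer,
# and the adding-problem batch (x, y) are assumed to exist already.
import torch

def training_step(model, optimizer, criterion, x, y, clip_norm=1.0):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most clip_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
    return loss.item()
```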
Exercise 12.14 (T). In an LSTM, derive $\partial c_t / \partial c_{t-1}$, $\partial c_t / \partial c_{t-2}$, …, $\partial c_t / \partial c_{t-k}$ assuming $f_t, i_t, \tilde c_t$ are constants (do not depend on $c_{t-1}$). Conclude that $\partial c_t / \partial c_{t-k}$ is the elementwise product of the forget gates $f_{t-k+1}, \ldots, f_t$.
Exercise 12.15 (R). Given an LSTM with $H = 1$, suppose at every step $f_t = 0.99$, $i_t = 0.01$, $\tilde c_t = 1$, $c_0 = 0$. Find $c_T$ in closed form as a function of $T$. What is $\lim_{T \to \infty} c_T$?
Exercise 12.16 (T). Show that the GRU update $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t$ can be reinterpreted as a weighted convex combination, and discuss the limit cases $z_t = 0$ and $z_t = 1$.
Exercise 12.17 (T). Why does a bidirectional RNN not define a generative language model? Answer by reference to the chain-rule factorisation of $P(w_{1:T})$.
Exercise 12.18 (I). Build a bidirectional LSTM tagger for English part-of-speech tagging using the Penn Treebank. Use 100-dimensional GloVe embeddings and report token-level accuracy on the standard development split.
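One possible module skeleton, assuming you load the treebank and the GloVe matrix yourself (names and sizes below are placeholders):

```python
# Skeleton of a bidirectional LSTM tagger; hyperparameters are placeholders.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256,
                 pretrained_embeddings=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if pretrained_embeddings is not None:             # e.g. a GloVe weight matrix
            self.embed.weight.data.copy_(pretrained_embeddings)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)     # forward + backward states

    def forward(self, token_ids):                          # token_ids: (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))       # (batch, seq_len, 2*hidden_dim)
        return self.out(states)                            # per-token tag scores
```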
Exercise 12.19 (T). In a vanilla seq2seq model without attention, how does the gradient from a target token $y_t$ to an early input token $x_1$ flow? Identify each step at which the gradient may shrink, and explain how attention shortens the path.
Exercise 12.20 (R). Compute the attention weights $\alpha_{tj}$ from scores $e_t = (1, 2, 3, -1)$. Then recompute after multiplying all scores by $\tau = 0.5$ and again by $\tau = 2.0$. Comment on the effect of "temperature" on attention sharpness.
Exercise 12.21 (T). Bahdanau-style additive attention has roughly $3H^2 + H$ parameters when the encoder is bidirectional with hidden size $2H$ feeding a decoder hidden size $H$ (for $\mathbf{W}_a \in \mathbb{R}^{H\times H}$, $\mathbf{U}_a \in \mathbb{R}^{H\times 2H}$, and $\mathbf{v}_a \in \mathbb{R}^H$). Luong-style dot-product attention has zero learnable attention parameters (when the encoder and decoder use the same hidden size). What is the trade-off?
Exercise 12.22 (I). Implement Bahdanau attention as a PyTorch nn.Module. Train an encoder–decoder on the Tatoeba English–French pairs, with and without attention. Compare BLEU.
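A sketch of the attention module itself, following the parameter shapes quoted in Exercise 12.21 with the attention dimension set to $H$; the surrounding encoder and decoder are yours to write.

```python
# Additive (Bahdanau-style) attention; dec_dim = H, enc_dim = 2H, attn_dim = H
# matches the shapes quoted in Exercise 12.21.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v_a(torch.tanh(
            self.W_a(dec_state).unsqueeze(1) + self.U_a(enc_states)
        )).squeeze(-1)                                     # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)            # attention distribution
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights                            # context: (batch, enc_dim)
```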
Exercise 12.23 (T). Show that for the CTC formulation, the number of distinct paths $\pi$ that collapse to a length-$L$ label sequence given $T$ frames is at least $\binom{T - L}{L}$ when blanks are required between repeated labels, and discuss why the forward–backward algorithm is necessary rather than enumeration.
Exercise 12.24 (R). Carry out the CTC forward recursion by hand for $T = 4$ and target sequence "ab", with hypothetical per-frame probabilities $p_t(\varnothing) = 0.4$, $p_t(a) = 0.4$, $p_t(b) = 0.2$ (constant across $t$). The blank-interleaved target has five positions, so compute $\alpha_t(s)$ for $s = 1, \ldots, 5$ and $t = 1, \ldots, 4$.
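You can check your hand-computed table against a short script that runs the forward recursion over the blank-interleaved target (a sketch, not a reference implementation):

```python
# CTC forward recursion over the blank-interleaved target, for checking hand work.
import numpy as np

def ctc_forward(frame_probs, labels, blank="-"):
    ext = [blank]
    for l in labels:
        ext += [l, blank]                     # e.g. "ab" -> [-, a, -, b, -]
    T, S = len(frame_probs), len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = frame_probs[0][blank]       # start in the leading blank ...
    alpha[0, 1] = frame_probs[0][ext[1]]      # ... or in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s >= 1 else 0.0)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]      # may skip the blank between distinct labels
            alpha[t, s] = frame_probs[t][ext[s]] * a
    return alpha

frame = {"-": 0.4, "a": 0.4, "b": 0.2}        # constant per-frame probabilities
print(ctc_forward([frame] * 4, ["a", "b"]))
```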
Exercise 12.25 (R). Suppose a model assigns probabilities $(0.5, 0.3, 0.2)$ to its top three tokens at some step. Compute the top-$p$ nucleus for $p = 0.7$, $p = 0.8$, $p = 0.95$.
Exercise 12.26 (T). Why does beam search with a wide beam often produce worse output than a narrow beam on translation tasks (the "beam search curse")?
Exercise 12.27 (I). Implement length-normalised beam search and compare BLEU against unnormalised beam search on a translation task with mixed sentence lengths.
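When comparing, it helps to fix one normalisation rule; a common choice (an assumption here, not something the chapter prescribes) is the GNMT-style length penalty:

```python
# GNMT-style length-normalised beam score; alpha around 0.6-0.7 is a typical setting.
def normalised_score(sum_log_prob, length, alpha=0.6):
    penalty = ((5.0 + length) / 6.0) ** alpha
    return sum_log_prob / penalty
```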
Exercise 12.28 (R). Show that temperature sampling at $\tau = 1$ recovers ancestral sampling and at $\tau \to 0$ recovers greedy decoding.
Exercise 12.29 (I). Train the character-level LSTM in §12.15 on the complete works of Shakespeare. Vary hidden_dim over $\{64, 128, 256, 512\}$ and report training perplexity, sample quality, and wall-clock time.
Exercise 12.30 (I). Implement truncated BPTT for the character-level LSTM, carrying hidden state forward across batches. Compare training stability and final perplexity to the version where hidden state is reset every batch.
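A sketch of the state-carrying loop, assuming a model whose forward returns (logits, hidden) and batches laid out in corpus order:

```python
# Truncated BPTT: carry the LSTM state across batches but detach it so no
# gradient flows past the batch boundary. Assumes model(x, hidden) -> (logits, hidden).
import torch

def train_epoch(model, optimizer, criterion, batches, clip_norm=1.0):
    hidden = None
    for x, y in batches:                                  # batches in corpus order
        if hidden is not None:
            hidden = tuple(h.detach() for h in hidden)    # LSTM state is (h, c)
        optimizer.zero_grad()
        logits, hidden = model(x, hidden)
        loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
```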
Exercise 12.31 (T). Roughly how many parameters does a 2-layer LSTM with hidden size 1024 have, as a function of the input embedding dimension $d$ and vocabulary size $V$? Show your working, accounting for the embedding table, the LSTM cells, and the output projection.
Exercise 12.32 (T). The Transformer (Chapter 13) has $O(T^2 d)$ time and memory complexity per layer for sequence length $T$ and hidden size $d$. A recurrent network has $O(T d^2)$. For what values of $T$ and $d$ does each architecture have lower complexity?
Exercise 12.33 (O). Read Chen, Firat et al. (2018), "The best of both worlds: combining recent advances in neural machine translation". Identify three findings about the relative strengths of recurrent and Transformer architectures and discuss how each might generalise (or not) to modern LLM-scale settings.