Exercises

Conceptual

Exercise 13.1 Explain in your own words why a fixed-size encoder hidden state was a bottleneck in early seq2seq models, and why attention removed it.

Exercise 13.2 Self-attention is permutation-equivariant. Demonstrate this with a small example: take any 3 × 3 input, compute attention, then permute the input rows and verify that the output rows permute the same way (assuming no positional encoding).
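
A minimal self-check, assuming PyTorch and no positional encoding (`attn` here is plain softmax($\mathbf{Q}\mathbf{K}^\top / \sqrt{d}$)$\,\mathbf{V}$ with $\mathbf{Q} = \mathbf{K} = \mathbf{V} = \mathbf{x}$):

```python
import torch

def attn(x):
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # Q = K = V = x
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(3, 3)
perm = torch.tensor([2, 0, 1])                    # an arbitrary row permutation
# Permuting rows before attention should equal permuting rows after.
print(torch.allclose(attn(x)[perm], attn(x[perm]), atol=1e-6))  # expect True
```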

Exercise 13.3 Why does the dot-product variance scale with $d_k$? Derive it from first principles assuming independent, zero-mean, unit-variance components.

Exercise 13.4 Consider a Transformer with $d_\text{model} = 512$, $h = 8$ heads. What is $d_k$? What would happen to representational capacity per head if you increased $h$ to 64?

Exercise 13.5 Explain why causal masking allows training on a full sequence in a single forward pass, even though inference is autoregressive.

Exercise 13.6 Describe in plain English (no equations) what positional encoding does and why a Transformer without it cannot distinguish word order.

Exercise 13.7 Compare BERT and GPT in terms of: pretraining objective, architecture (encoder/decoder/both), fine-tuning style, and typical use cases.

Exercise 13.8 What is in-context learning, and why is it remarkable?

Exercise 13.9 Why is naive attention $O(n^2)$ in memory, and why is this the dominant constraint on long-context Transformers?

Exercise 13.10 Explain how FlashAttention can be exact while using only $O(n)$ memory.

Numerical

Exercise 13.11 Compute by hand the attention output for $\mathbf{Q} = \mathbf{K} = \mathbf{V} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $d_k = 2$.
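
One way to check your hand computation afterwards (PyTorch assumed):

```python
import torch

Q = K = V = torch.eye(2)
scores = Q @ K.T / 2 ** 0.5            # d_k = 2, so divide by sqrt(2)
print(torch.softmax(scores, dim=-1) @ V)
# Each row should be roughly [0.670, 0.330], permuted per row.
```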

Exercise 13.12 For a Transformer with $L = 24$ layers, $d = 1024$, $V = 50000$, estimate the parameter count using the $12 L d^2 + 2 V d$ rule.
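
A sketch of the arithmetic as a reusable checker (it also applies to Exercise 13.17). The constant 12 bundles the per-layer attention ($4d^2$) and FFN ($8d^2$) weights, and $2Vd$ assumes untied input and output embeddings:

```python
def approx_params(L, d, V):
    # 12 L d^2 for the Transformer blocks, 2 V d for the embeddings
    return 12 * L * d**2 + 2 * V * d

print(f"{approx_params(24, 1024, 50_000) / 1e6:.0f}M")
```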

Exercise 13.13 Estimate the training compute (in FLOPs) for a 13B-parameter model trained on 2T tokens.
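
A one-line estimate, assuming the standard $C \approx 6ND$ approximation (roughly $2N$ FLOPs per token in the forward pass and $4N$ in the backward pass):

```python
N, D = 13e9, 2e12                # parameters, training tokens
print(f"{6 * N * D:.1e} FLOPs")  # ~1.6e23
```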

Exercise 13.14 A model has 70B parameters, $L = 80$, $h = 64$, $d_k = 128$. In fp16, how much KV cache (in MB) does each token consume? For a 32K-token context, what is the total KV cache size?
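
The cache arithmetic as a sketch: two tensors (K and V) per layer, each of $h \cdot d_k$ values at 2 bytes per value in fp16:

```python
L, h, d_k, bytes_fp16 = 80, 64, 128, 2
per_token = 2 * L * h * d_k * bytes_fp16       # K and V, across all layers
print(per_token / 1e6, "MB per token")          # ~2.6 MB
print(per_token * 32_768 / 1e9, "GB at 32K")    # ~86 GB
```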

Exercise 13.15 Sinusoidal positional encoding: compute PE(0, 0), PE(0, 1), PE(1, 0), PE(1, 1) for $d_\text{model} = 4$.
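
A checker for your hand-derived values, using the sine-on-even, cosine-on-odd convention from the original Transformer paper:

```python
import math

def pe(pos, i, d_model=4):
    # Even dimensions use sin, odd use cos, at the same frequency per pair.
    angle = pos / 10000 ** (2 * (i // 2) / d_model)
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

for pos in (0, 1):
    for i in (0, 1):
        print(f"PE({pos},{i}) = {pe(pos, i):.4f}")
```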

Exercise 13.16 RoPE: write out the rotation matrix applied to query dimensions $(q_0, q_1)$ at position $m$ with base frequency $\omega_0 = 1$.

Exercise 13.17 A model has $V = 32000$, $d = 4096$, $L = 32$. Compute the parameter count to within 5%.

Exercise 13.18 Chinchilla scaling: for a fixed compute budget of $10^{22}$ FLOPs, what is the compute-optimal $(N, D)$ pair?
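
A sketch assuming the common Chinchilla rule of thumb $D \approx 20N$ together with $C \approx 6ND$, which gives $N = \sqrt{C/120}$:

```python
C = 1e22
N = (C / 120) ** 0.5
print(f"N ~ {N / 1e9:.1f}B params, D ~ {20 * N / 1e9:.0f}B tokens")
```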

Implementation

Exercise 13.19 Implement scaled dot-product attention in PyTorch in fewer than 10 lines. Verify against torch.nn.functional.scaled_dot_product_attention on a small random input.
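
A verification harness you can build on; `my_sdpa` is a hypothetical name for the function you write:

```python
import torch
import torch.nn.functional as F

def my_sdpa(q, k, v):
    raise NotImplementedError  # your <10-line implementation goes here

# Shapes: (batch, heads, seq, head_dim)
q, k, v = (torch.randn(2, 4, 16, 32) for _ in range(3))
ref = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(my_sdpa(q, k, v), ref, atol=1e-5)
print("match")
```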

Exercise 13.20 Modify the CausalSelfAttention class from §13.10 to use F.scaled_dot_product_attention instead of the manual implementation. Confirm the outputs match to floating-point tolerance.

Exercise 13.21 Replace the sinusoidal positional encoding in the §13.10 GPT with learned positional embeddings. Train both on tiny Shakespeare for 5000 steps and compare validation loss.

Exercise 13.22 Implement RMSNorm and SwiGLU. Replace LayerNorm and GELU-FFN in the §13.10 GPT and retrain. Does it train? Does it converge faster?
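
One common formulation to compare your implementation against (a sketch in the style of LLaMA; hidden sizes are up to you):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, no mean subtraction.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w1 = nn.Linear(d, hidden, bias=False)  # gate branch
        self.w2 = nn.Linear(d, hidden, bias=False)  # value branch
        self.w3 = nn.Linear(hidden, d, bias=False)  # down-projection

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))
```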

Exercise 13.23 Implement multi-query attention (a single shared K/V projection for all heads). Estimate the savings in KV cache size for a 12-layer, 12-head model.
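
A back-of-envelope sketch: MQA caches one K/V head per layer instead of $h$, so the per-token cache shrinks by a factor of $h$:

```python
L, h, d_k = 12, 12, 64            # assumed head dim, e.g. d = 768
mha = 2 * L * h * d_k             # cached values per token with full MHA
mqa = 2 * L * 1 * d_k             # single shared K/V head per layer
print(mha / mqa)                  # 12.0x smaller
```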

Exercise 13.24 Build a tiny encoder-only Transformer for binary text classification (e.g. IMDB sentiment). Pretrain with masked LM on a small corpus, fine-tune on the classification task. Report accuracy.

Exercise 13.25 Add a simple speculative-decoding loop to the §13.10 GPT: train a 1-layer "draft" model, then use it to propose 4 tokens at a time which a 4-layer "target" model verifies. Measure the throughput improvement.

Theory

Exercise 13.26 Show that for any fixed offset $k$, the sinusoidal positional encoding at position $\text{pos} + k$ is a linear function of the encoding at position $\text{pos}$, with the linear map depending only on $k$.

Exercise 13.27 Show that the RoPE rotated dot product depends only on the relative position $n - m$, using the property $\mathcal{R}_m^\top \mathcal{R}_n = \mathcal{R}_{n - m}$.

Exercise 13.28 Suppose attention scores are computed without the $\sqrt{d_k}$ scaling. Show that as $d_k \to \infty$, the softmax converges to a one-hot distribution at the argmax (under independent, zero-mean, unit-variance assumptions).
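
An empirical companion to the proof (PyTorch assumed): without scaling, the logits have variance $d_k$, and the softmax typically saturates as $d_k$ grows:

```python
import torch

torch.manual_seed(0)
for d_k in (4, 64, 1024):
    q = torch.randn(1, d_k)
    K = torch.randn(16, d_k)
    probs = torch.softmax(q @ K.T, dim=-1)     # no 1/sqrt(d_k) scaling
    print(d_k, round(probs.max().item(), 3))   # max prob -> 1 as d_k grows
```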

Exercise 13.29 Derive the per-token inference FLOP count $\sim 2N$ for a decoder-only Transformer.

Exercise 13.30 Multi-head attention with $h$ heads and $d_k = d_v = d / h$ uses $4 d^2$ projection parameters (ignoring biases) regardless of $h$. Show this by writing out the per-head parameter counts and summing.

Exercise 13.31 Show that pre-norm and post-norm Transformers are not mathematically equivalent: construct a small example where the same weights produce different outputs.

Exercise 13.32 Estimate the asymptotic complexity of Performer attention with feature-map dimension $r$. Why is it $O(n)$ in $n$?
