Selected solutions
Solution 13.3. Let $q_i, k_i$ be independent random variables with mean 0 and variance 1. Then $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i] \mathbb{E}[k_i] = 0$. The variance of the sum $\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i$ is $\sum_i \operatorname{Var}(q_i k_i)$. Each term has $\operatorname{Var}(q_i k_i) = \mathbb{E}[(q_i k_i)^2] - 0 = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] = 1 \cdot 1 = 1$. So $\operatorname{Var}(\mathbf{q}^\top \mathbf{k}) = d_k$.
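A quick Monte Carlo check of this result (a sketch in PyTorch; standard normal entries are just one choice satisfying the mean-0, variance-1 assumption):

import torch

d_k, trials = 64, 100_000
q = torch.randn(trials, d_k)                 # entries with mean 0, variance 1
k = torch.randn(trials, d_k)
scores = (q * k).sum(dim=-1)                 # q^T k for each trial
print(scores.var().item())                   # close to d_k = 64
print((scores / d_k ** 0.5).var().item())    # close to 1 after 1/sqrt(d_k) scaling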
Solution 13.4. $d_k = 512 / 8 = 64$. With $h = 64$, $d_k = 8$, far below the rule of thumb of $\geq 32$. Each head would have a very narrow representation; attention scores would be dominated by noise, and pruning experiments suggest most of those heads would converge to redundant or trivial behaviour.
Solution 13.5. With a causal mask, the output at position $i$ depends only on positions $\leq i$. So the model's prediction at position $i$ is a function of $w_1, \dots, w_i$. The training target at position $i$ is $w_{i+1}$. A single forward pass on $w_1, \dots, w_n$ produces $n$ next-token predictions, each conditioned on the correct prefix, in parallel. At inference, you must generate one token, append it, and re-run; that breaks the parallelism in the time dimension.
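A minimal sketch of both ingredients, assuming PyTorch and toy token ids: the lower-triangular causal mask, and the targets obtained by shifting the inputs one position.

import torch

n = 5
tokens = torch.arange(n + 1)                  # toy ids standing in for w_1 ... w_{n+1}
causal_mask = torch.tril(torch.ones(n, n))    # position i may attend only to j <= i
inputs = tokens[:-1]                          # w_1 ... w_n, fed in one forward pass
targets = tokens[1:]                          # w_2 ... w_{n+1}, all n labels in parallel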
Solution 13.11. $\mathbf{Q} \mathbf{K}^\top = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$. Divide by $\sqrt{2}$: $\begin{bmatrix} 0.707 & 0 \\ 0 & 0.707 \end{bmatrix}$. Softmax row 1: $[e^{0.707}, e^0] = [2.028, 1]$, normalised $[0.670, 0.330]$. Row 2 by symmetry: $[0.330, 0.670]$. Multiply by $\mathbf{V}$: row 1 = $0.670 \cdot (1, 0) + 0.330 \cdot (0, 1) = (0.670, 0.330)$. Row 2 = $(0.330, 0.670)$.
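The same numbers can be reproduced in a few lines of PyTorch (assuming, as in the worked solution, that $\mathbf{V}$ is also the $2 \times 2$ identity):

import torch
import torch.nn.functional as F

Q = K = V = torch.eye(2)
scores = Q @ K.T / 2 ** 0.5          # [[0.707, 0], [0, 0.707]]
attn = F.softmax(scores, dim=-1)     # [[0.670, 0.330], [0.330, 0.670]]
print(attn @ V)                      # unchanged, since V is the identity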
Solution 13.12. Layers: $12 \cdot 24 \cdot 1024^2 = 12 \cdot 24 \cdot 1{,}048{,}576 = 301{,}989{,}888 \approx 302$M. Embeddings: $2 \cdot 50000 \cdot 1024 = 102{,}400{,}000 \approx 102$M. Total $\approx 404$M.
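A small helper makes the accounting explicit (a sketch; the function name is illustrative, and the $12 L d^2$ rule folds each layer's attention and FFN weights together, as above):

def transformer_params(L, d_model, vocab):
    layers = 12 * L * d_model ** 2        # attention + FFN weights, all layers
    embeddings = 2 * vocab * d_model      # input embedding + output unembedding
    return layers + embeddings

print(transformer_params(24, 1024, 50_000))   # 404,389,888  ~ 404M (this solution)
print(transformer_params(32, 4096, 32_000))   # 6,704,594,944 ~ 6.7B (Solution 13.17)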
Solution 13.13. Training FLOPs $\approx 6 N D = 6 \cdot 13 \times 10^9 \cdot 2 \times 10^{12} = 1.56 \times 10^{23}$ FLOPs.
Solution 13.14. Per token: $2 \cdot L \cdot h \cdot d_k \cdot 2$ bytes $= 2 \cdot 80 \cdot 64 \cdot 128 \cdot 2 = 2{,}621{,}440$ bytes $= 2.5$ MB, where one factor of 2 counts the K and V caches and the other is 2 bytes per value in fp16. For 32K tokens: $32{,}768 \cdot 2.5$ MB $\approx 80$ GB.
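The same arithmetic as a sketch in Python, under the assumptions above (fp16 values, K and V cached for every layer and head):

L, h, d_k, bytes_per_value = 80, 64, 128, 2
per_token = 2 * L * h * d_k * bytes_per_value    # K and V caches, all layers
print(per_token / 2 ** 20)                       # ~2.5 MB per token
print(32_768 * per_token / 2 ** 30)              # ~80 GB for a 32K-token context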
Solution 13.15. $\omega_0 = 10000^{-0/4} = 1$, $\omega_1 = 10000^{-2/4} = 1/100$. So PE(0, 0) = $\sin(0) = 0$, PE(0, 1) = $\cos(0) = 1$, PE(0, 2) = $\sin(0) = 0$, PE(0, 3) = $\cos(0) = 1$. PE(1, 0) = $\sin(1) \approx 0.841$, PE(1, 1) = $\cos(1) \approx 0.540$, PE(1, 2) = $\sin(0.01) \approx 0.010$, PE(1, 3) = $\cos(0.01) \approx 1.000$.
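A sketch computing the full $d_\text{model} = 4$ encoding for positions 0 and 1, using the $\omega_i$ defined above:

import torch

d_model = 4
pos = torch.arange(2).unsqueeze(1).float()                    # positions 0 and 1
omega = 10000.0 ** (-torch.arange(0, d_model, 2) / d_model)   # [1, 0.01]
pe = torch.zeros(2, d_model)
pe[:, 0::2] = torch.sin(pos * omega)   # even dimensions
pe[:, 1::2] = torch.cos(pos * omega)   # odd dimensions
print(pe)   # row 0: [0, 1, 0, 1]; row 1: [0.841, 0.540, 0.010, 1.000]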
Solution 13.16. $\begin{bmatrix} \cos m & -\sin m \\ \sin m & \cos m \end{bmatrix}$.
Solution 13.17. Layers: $12 \cdot 32 \cdot 4096^2 = 12 \cdot 32 \cdot 16{,}777{,}216 = 6{,}442{,}450{,}944 \approx 6.44$B. Embeddings: $2 \cdot 32000 \cdot 4096 = 262{,}144{,}000 \approx 262$M. Total $\approx 6.7$B.
Solution 13.18. Chinchilla recommends $D \approx 20 N$. The compute is $C = 6 N D = 120 N^2$. So $N = \sqrt{C / 120} = \sqrt{10^{22}/120} \approx \sqrt{8.33 \times 10^{19}} \approx 9.13 \times 10^9$. So $N \approx 9$B parameters and $D \approx 180$B tokens.
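The same calculation as a short sketch (assuming the $D \approx 20N$ rule and $C = 6ND$):

C = 1e22                    # compute budget in FLOPs
N = (C / 120) ** 0.5        # from C = 6 * N * (20 * N) = 120 * N^2
D = 20 * N
print(f"{N:.2e} parameters, {D:.2e} tokens")   # ~9.1e9 and ~1.8e11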
Solution 13.19.
import math
import torch.nn.functional as F

def sdpa(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v
    d_k = q.size(-1)
    s = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # positions where the mask is 0 must not be attended to
        s = s.masked_fill(mask == 0, float('-inf'))
    return F.softmax(s, dim=-1) @ v
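A usage sketch, tying this back to Solution 13.5 with a causal mask built from torch.tril (the shapes are illustrative):

import torch

n, d_k = 4, 8
q = k = v = torch.randn(n, d_k)
causal = torch.tril(torch.ones(n, n))    # 1 where attention is allowed
out = sdpa(q, k, v, mask=causal)         # out[i] depends only on positions <= i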
Solution 13.26. $\sin(\omega(\text{pos}+k)) = \sin(\omega \text{pos})\cos(\omega k) + \cos(\omega \text{pos})\sin(\omega k)$ and similarly for cosine. Stacking the $(\sin, \cos)$ pair as a 2-vector, the encoding at $\text{pos}+k$ equals a rotation by $\omega k$ applied to the encoding at $\text{pos}$. The rotation matrix depends only on $k$, not $\text{pos}$. Stacking all $d_\text{model}/2$ pairs gives a block-diagonal linear map $T_k$ such that $\text{PE}(\text{pos}+k) = T_k\, \text{PE}(\text{pos})$.
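A numerical check of this identity for a single frequency pair (a sketch; the values of $\omega$, pos and $k$ are arbitrary):

import math
import torch

omega, pos, k = 0.3, 5.0, 2.0

def pe_pair(p):
    # one (sin, cos) pair of the sinusoidal encoding at position p
    return torch.tensor([math.sin(omega * p), math.cos(omega * p)])

T_k = torch.tensor([[ math.cos(omega * k), math.sin(omega * k)],
                    [-math.sin(omega * k), math.cos(omega * k)]])
print(torch.allclose(pe_pair(pos + k), T_k @ pe_pair(pos)))   # True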
Solution 13.27. Inner product: $\langle \mathcal{R}_m \mathbf{q}, \mathcal{R}_n \mathbf{k} \rangle = \mathbf{q}^\top \mathcal{R}_m^\top \mathcal{R}_n \mathbf{k}$. Since $\mathcal{R}$ are 2-D rotations, $\mathcal{R}_m^\top \mathcal{R}_n = \mathcal{R}_{-m} \mathcal{R}_n = \mathcal{R}_{n-m}$. So the inner product equals $\mathbf{q}^\top \mathcal{R}_{n-m} \mathbf{k}$, which depends only on $n-m$.
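The same fact, checked numerically for one 2-D pair using the rotation matrix from Solution 13.16 (a sketch with arbitrary angles):

import math
import torch

def R(theta):
    # 2-D rotation by angle theta
    return torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])

q, k = torch.randn(2), torch.randn(2)
m, n = 3.0, 7.0
lhs = (R(m) @ q) @ (R(n) @ k)          # <R_m q, R_n k>
rhs = q @ (R(n - m) @ k)               # q^T R_{n-m} k
print(torch.allclose(lhs, rhs, atol=1e-6))   # True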
Solution 13.28. Without scaling, dot-product magnitudes grow as $\sqrt{d_k}$: by Solution 13.3 the logits have variance $d_k$. The softmax of $(z_1, \dots, z_n)$ approaches the indicator of the argmax as the scale of the logits grows, because the gap between the largest and second-largest logit also grows (in probability) like $\sqrt{d_k}$, so the softmax mass concentrates almost entirely on the argmax and the distribution saturates. This is exactly why the $1/\sqrt{d_k}$ scaling is necessary: it keeps the logit variance at 1 regardless of $d_k$.
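A quick illustration of the saturation (a sketch; the logits are drawn with standard deviation $\sqrt{d_k}$ to mimic unscaled dot products):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
for d_k in (4, 64, 1024):
    logits = torch.randn(10) * d_k ** 0.5          # std sqrt(d_k), as for q^T k
    print(d_k, F.softmax(logits, dim=-1).max().item())
# the largest probability approaches 1 as d_k grows; dividing by sqrt(d_k) undoes this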
Solution 13.29. A forward pass through one Transformer layer for one token costs roughly $4 d^2$ multiply-accumulates for the attention projections and $8 d^2$ for the FFN, together $12 d^2$ per layer per token. Over $L$ layers that is $12 L d^2 \approx N$ multiply-accumulates (excluding embedding parameters, which contribute to forward FLOPs at roughly the same rate). Counting each multiply-add as 2 FLOPs, as is conventional, per-token forward FLOPs are $\sim 2N$.
Solution 13.30. Per head: $\mathbf{W}^Q_j, \mathbf{W}^K_j, \mathbf{W}^V_j$ each of size $d \times d_k$, with $d_k = d/h$. Each is $d \cdot d/h = d^2/h$ params. Three of them per head: $3 d^2 / h$. Over $h$ heads: $3 d^2$. Plus the output projection $\mathbf{W}^O \in \mathbb{R}^{d \times d}$: $d^2$. Total: $4 d^2$, independent of $h$.
Solution 13.32. Performer factors attention as $\phi(\mathbf{Q}) (\phi(\mathbf{K})^\top \mathbf{V})$. The inner product $\phi(\mathbf{K})^\top \mathbf{V}$ is a sum over $n$ of outer products of $r$- and $d_v$-dim vectors, costing $O(n r d_v)$. Multiplying $\phi(\mathbf{Q})$ ($n \times r$) by the result ($r \times d_v$) costs $O(n r d_v)$. Total: $O(n r d_v)$, linear in $n$.
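The associativity that yields the linear cost, as a sketch (phi is left abstract; the uniform random features below are placeholders for Performer's actual feature map):

import torch

n, r, d_v = 1024, 64, 64
phi_q = torch.rand(n, r)     # phi(Q), n x r (placeholder features)
phi_k = torch.rand(n, r)     # phi(K), n x r
V = torch.randn(n, d_v)

kv = phi_k.T @ V             # r x d_v,  cost O(n r d_v)
out = phi_q @ kv             # n x d_v,  cost O(n r d_v); no n x n matrix is ever formed
print(out.shape)             # torch.Size([1024, 64])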