A GPT-style language model factorises the joint probability of a sequence as a product of conditionals:
$$P_\theta(x_1, \ldots, x_T) = \prod_{t=1}^T P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$
Training minimises the negative log-likelihood on a corpus $\mathcal{D}$:
$$\mathcal{L}(\theta) = -\sum_{x \in \mathcal{D}} \sum_{t=1}^T \log P_\theta(x_t \mid x_{<t})$$
equivalently, it minimises the cross-entropy of each next-token prediction.
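A minimal sketch of this objective in PyTorch (framework choice, shapes, and the function name are illustrative, not from the source): position $t$'s logits are scored against token $t+1$, so `F.cross_entropy` computes exactly the inner sum above.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a token sequence under the model.

    logits: (batch, T, vocab) -- position t's logits predict token t+1
    tokens: (batch, T)        -- the input token ids themselves
    """
    pred = logits[:, :-1, :]   # drop last position: nothing to predict after x_T
    target = tokens[:, 1:]     # drop first token: x_1 has no preceding context
    # cross_entropy = -log softmax(pred)[target], averaged over all positions
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```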
Architecture: a stack of decoder-only Transformer blocks with causal self-attention, in which position $t$ attends only to positions $\leq t$. Each block computes:
$$z^{(l)} = \mathrm{LayerNorm}(h^{(l-1)})$$
$$a_t^{(l)} = h_t^{(l-1)} + \mathrm{MultiHeadAttn}^{\mathrm{causal}}(z^{(l)})_t$$
$$h_t^{(l)} = a_t^{(l)} + \mathrm{FFN}(\mathrm{LayerNorm}(a_t^{(l)}))$$
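As a sketch, the pre-norm block above maps onto PyTorch roughly as follows (`d_model`, `n_heads`, and the use of `nn.MultiheadAttention` are assumed choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm Transformer decoder block with causal self-attention."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Boolean causal mask: True above the diagonal blocks attention,
        # so position t attends only to positions <= t.
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        z = self.ln1(h)
        attn_out, _ = self.attn(z, z, z, attn_mask=mask)
        h = h + attn_out                  # residual connection around attention
        return h + self.ffn(self.ln2(h))  # residual connection around the FFN
```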
Final logits: $\ell_t = h_t^{(L)} W_\mathrm{out}^\top$, with vocabulary distribution $P(x_{t+1} \mid x_{1:t}) = \mathrm{softmax}(\ell_t)$. Tied embeddings ($W_\mathrm{out} = W_\mathrm{embed}$) are common, saving one $|V| \times d$ weight matrix, a significant fraction of the parameters when the vocabulary is large.
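In PyTorch, tying amounts to sharing one weight tensor between the embedding and the output projection; a hedged sketch (module names and sizes are illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 50257, 512                      # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)             # W_embed: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # W_out: vector -> logits
lm_head.weight = embed.weight  # tie W_out = W_embed: one shared (vocab, d) tensor
```

Both matrices have shape $(|V|, d)$, so the assignment shares storage and gradients rather than copying.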
Sampling from a trained model:
- Greedy: $x_{t+1} = \arg\max P(\cdot \mid x_{1:t})$. Often produces repetitive or boring text.
- Temperature: $P_T(x) \propto P(x)^{1/T}$. $T < 1$ sharpens, $T > 1$ smooths.
- Top-$k$: restrict to the $k$ most probable tokens, renormalise.
- Top-$p$ (nucleus): keep the smallest set whose cumulative probability exceeds $p$.
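All four strategies amount to transforming the next-token distribution before drawing from it; a combined sketch (function name and argument conventions are illustrative):

```python
from typing import Optional
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0,
                top_k: Optional[int] = None, top_p: Optional[float] = None) -> int:
    """Draw one token id from a (vocab,)-shaped logits vector."""
    if temperature == 0.0:                 # greedy: argmax of the distribution
        return int(logits.argmax())
    probs = torch.softmax(logits / temperature, dim=-1)  # T<1 sharpens, T>1 smooths
    if top_k is not None:                  # keep only the k most probable tokens
        kth = torch.topk(probs, top_k).values[-1]
        probs = torch.where(probs >= kth, probs, torch.zeros_like(probs))
    if top_p is not None:                  # nucleus: smallest set with mass > p
        sorted_p, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_p, dim=-1)
        keep = cum - sorted_p < top_p      # include tokens until mass first exceeds p
        probs = torch.zeros_like(probs).scatter(0, idx[keep], sorted_p[keep])
    probs = probs / probs.sum()            # renormalise over the surviving tokens
    return int(torch.multinomial(probs, 1))
```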
Training compute: GPT-3 (175B parameters) trained on 300B tokens used approximately $3.14 \times 10^{23}$ FLOPs (consistent with the rule of thumb $C \approx 6ND$), on the order of hundreds of GPU-years on the V100-class hardware of the time. The Kaplan and Chinchilla scaling laws relate loss to parameter count $N$ and dataset size $D$:
$$L(N, D) \approx E + \frac{A}{N^{0.34}} + \frac{B}{D^{0.28}}$$
with irreducible loss $E$ and constants $A$, $B$ fitted empirically; the exponents shown come from the Chinchilla fit, while Kaplan et al.'s earlier analysis used a different parameterisation and exponents.
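A quick check of the $C \approx 6ND$ estimate, plus a hedged evaluation of the loss formula above (the default constants are the published Chinchilla fit, included for illustration only):

```python
# Rule of thumb: training compute C ~= 6 * N * D FLOPs.
N, D = 175e9, 300e9                   # GPT-3 parameters and training tokens
print(f"C ~= {6 * N * D:.2e} FLOPs")  # ~3.15e23, matching the quoted figure

def scaling_loss(n_params: float, n_tokens: float,
                 E: float = 1.69, A: float = 406.4, B: float = 410.7) -> float:
    """L(N, D) = E + A/N^0.34 + B/D^0.28 (Chinchilla-form fit)."""
    return E + A / n_params**0.34 + B / n_tokens**0.28

print(f"L(GPT-3) ~= {scaling_loss(N, D):.2f} nats/token")
```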
In-context learning: at inference, the model conditions on a prompt containing worked examples and a new query, predicting the answer purely from the conditional distribution it learned during pre-training, with no gradient updates. The mechanism is the subject of ongoing research; one prominent hypothesis is that attention layers implement an implicit form of gradient descent over the in-context examples, though this remains debated.
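For concreteness, the "examples" in in-context learning are simply part of the conditioning string; nothing is trained (the task and formatting below are illustrative):

```python
# A few-shot prompt: the examples live entirely in the model's context window.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: bread -> French: pain\n"
    "English: water -> French:"
)
# The model completes the pattern from P(x_{t+1} | prompt) alone:
# ordinary conditional generation, with no gradient updates.
```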
Related terms: GPT, Alec Radford, Transformer, Cross-Entropy Loss, Scaling Laws
Discussed in:
- Chapter 13: Attention & Transformers