A GPT-style language model factorises the joint probability of a sequence as a product of conditionals:
$$P_\theta(x_1, \ldots, x_T) = \prod_{t=1}^T P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$
Training minimises the negative log-likelihood on a corpus $\mathcal{D}$:
$$\mathcal{L}(\theta) = -\sum_{x \in \mathcal{D}} \sum_{t=1}^T \log P_\theta(x_t \mid x_{<t})$$
equivalently, it minimises the cross-entropy of each next-token prediction.
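A minimal sketch of this objective in PyTorch (framework choice, shapes, and the function name are illustrative, not from the source): position $t$'s logits are scored against token $t+1$, so `F.cross_entropy` computes exactly the inner sum above.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a token sequence under the model.

    logits: (batch, T, vocab) -- position t's logits predict token t+1
    tokens: (batch, T)        -- the input token ids themselves
    """
    pred = logits[:, :-1, :]   # drop last position: nothing to predict after x_T
    target = tokens[:, 1:]     # drop first token: x_1 has no preceding context
    # cross_entropy = -log softmax(pred)[target], averaged over all positions
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```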
Architecture: a stack of decoder-only Transformer blocks with causal self-attention, in which position $t$ attends only to positions $\leq t$. Each block computes:
$$z^{(l)} = \mathrm{LayerNorm}(h^{(l-1)})$$
$$a_t^{(l)} = h_t^{(l-1)} + \mathrm{MultiHeadAttn}^{\mathrm{causal}}(z^{(l)})_t$$
$$h_t^{(l)} = a_t^{(l)} + \mathrm{FFN}(\mathrm{LayerNorm}(a_t^{(l)}))$$
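As a sketch, the pre-norm block above maps onto PyTorch roughly as follows (`d_model`, `n_heads`, and the use of `nn.MultiheadAttention` are assumed choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm Transformer decoder block with causal self-attention."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Boolean causal mask: True above the diagonal blocks attention,
        # so position t attends only to positions <= t.
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        z = self.ln1(h)
        attn_out, _ = self.attn(z, z, z, attn_mask=mask)
        h = h + attn_out                  # residual connection around attention
        return h + self.ffn(self.ln2(h))  # residual connection around the FFN
```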
Final logits: $\ell_t = h_t^{(L)} W_\mathrm{out}^\top$, with vocabulary distribution $P(x_{t+1} \mid x_{1:t}) = \mathrm{softmax}(\ell_t)$. Tied embeddings ($W_\mathrm{out} = W_\mathrm{embed}$) are common, saving one $|V| \times d$ weight matrix, a significant fraction of the parameters when the vocabulary is large.
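In PyTorch, tying amounts to sharing one weight tensor between the embedding and the output projection; a hedged sketch (module names and sizes are illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 50257, 512                      # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)             # W_embed: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # W_out: vector -> logits
lm_head.weight = embed.weight  # tie W_out = W_embed: one shared (vocab, d) tensor
```

Both matrices have shape $(|V|, d)$, so the assignment shares storage and gradients rather than copying.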
Sampling from a trained model:
- Greedy: $x_{t+1} = \arg\max P(\cdot \mid x_{1:t})$. Often produces repetitive or boring text.
- Temperature: $P_T(x) \propto P(x)^{1/T}$. $T < 1$ sharpens, $T > 1$ smooths.
- Top-$k$: restrict to the $k$ most probable tokens, renormalise.
- Top-$p$ (nucleus): keep the smallest set whose cumulative probability exceeds $p$.
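All four strategies amount to transforming the next-token distribution before drawing from it; a combined sketch (function name and argument conventions are illustrative):

```python
from typing import Optional
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0,
                top_k: Optional[int] = None, top_p: Optional[float] = None) -> int:
    """Draw one token id from a (vocab,)-shaped logits vector."""
    if temperature == 0.0:                 # greedy: argmax of the distribution
        return int(logits.argmax())
    probs = torch.softmax(logits / temperature, dim=-1)  # T<1 sharpens, T>1 smooths
    if top_k is not None:                  # keep only the k most probable tokens
        kth = torch.topk(probs, top_k).values[-1]
        probs = torch.where(probs >= kth, probs, torch.zeros_like(probs))
    if top_p is not None:                  # nucleus: smallest set with mass > p
        sorted_p, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_p, dim=-1)
        keep = cum - sorted_p < top_p      # include tokens until mass first exceeds p
        probs = torch.zeros_like(probs).scatter(0, idx[keep], sorted_p[keep])
    probs = probs / probs.sum()            # renormalise over the surviving tokens
    return int(torch.multinomial(probs, 1))
```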
Training compute: GPT-3 (175B parameters) trained on 300B tokens used approximately $3.14 \times 10^{23}$ FLOPs (consistent with the rule of thumb $C \approx 6ND$), on the order of hundreds of GPU-years on the V100-class hardware of the time. The Kaplan and Chinchilla scaling laws relate loss to parameter count $N$ and dataset size $D$:
$$L(N, D) \approx E + \frac{A}{N^{0.34}} + \frac{B}{D^{0.28}}$$
with irreducible loss $E$ and constants $A$, $B$ fitted empirically; the exponents shown come from the Chinchilla fit, while Kaplan et al.'s earlier analysis used a different parameterisation and exponents.
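A quick check of the $C \approx 6ND$ estimate, plus a hedged evaluation of the loss formula above (the default constants are the published Chinchilla fit, included for illustration only):

```python
# Rule of thumb: training compute C ~= 6 * N * D FLOPs.
N, D = 175e9, 300e9                   # GPT-3 parameters and training tokens
print(f"C ~= {6 * N * D:.2e} FLOPs")  # ~3.15e23, matching the quoted figure

def scaling_loss(n_params: float, n_tokens: float,
                 E: float = 1.69, A: float = 406.4, B: float = 410.7) -> float:
    """L(N, D) = E + A/N^0.34 + B/D^0.28 (Chinchilla-form fit)."""
    return E + A / n_params**0.34 + B / n_tokens**0.28

print(f"L(GPT-3) ~= {scaling_loss(N, D):.2f} nats/token")
```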
In-context learning: at inference, the model conditions on a prompt containing worked examples and a new query, predicting the answer purely from the conditional distribution it learned during pre-training, with no gradient updates. The mechanism is the subject of ongoing research; one prominent hypothesis is that attention layers implement an implicit form of gradient descent over the in-context examples, though this remains debated.
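For concreteness, the "examples" in in-context learning are simply part of the conditioning string; nothing is trained (the task and formatting below are illustrative):

```python
# A few-shot prompt: the examples live entirely in the model's context window.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: bread -> French: pain\n"
    "English: water -> French:"
)
# The model completes the pattern from P(x_{t+1} | prompt) alone:
# ordinary conditional generation, with no gradient updates.
```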
Related terms: GPT, Alec Radford, Transformer, Cross-Entropy Loss, Scaling Laws
Discussed in:
- Chapter 13: Attention & Transformers