13.12 GPT: autoregressive LM and in-context learning

If you have ever typed a question into ChatGPT, asked Claude for help with an essay, or watched Gemini summarise a document, you have already used a GPT-style model. The acronym stands for Generative Pre-trained Transformer, and the family it names is now the dominant paradigm in artificial intelligence. Every frontier chatbot in 2026 (OpenAI's GPT line, Anthropic's Claude, Google DeepMind's Gemini, Meta's Llama, Mistral, Alibaba's Qwen, DeepSeek, 01.AI's Yi) descends from the same architectural recipe and the same training objective.

The recipe is surprisingly simple to describe. Take a transformer, throw away the encoder, keep only the decoder, and train the result to do exactly one thing: given a sequence of words, guess what word comes next. Then make it bigger. Then bigger again. Train it on more text. Bigger still. Somewhere along the way, behaviours start to appear that nobody explicitly trained the model to perform: translation, arithmetic, code generation, even rudimentary reasoning. This is the GPT story, and this section walks through it from the original 117-million-parameter GPT-1 in 2018 to the trillion-parameter reasoning models of the mid-2020s.

This section pairs with §13.11, which covered BERT, the encoder-only sibling trained with masked language modelling. The two models were designed in roughly the same year by different teams, and they made opposite architectural choices. BERT reads the whole sentence at once and is good at understanding text. GPT reads strictly left-to-right and is good at producing text. The world ultimately chose generation, and §15 will pick up where this section leaves off, covering how raw GPT-style "base models" are turned into the polite, helpful assistants you actually talk to.

Decoder-only architecture

A GPT model is a stack of identical transformer blocks, typically between twelve and a hundred and twenty of them, sitting between an embedding layer at the bottom and a vocabulary projection at the top. There is no separate encoder branch and no cross-attention. Every block contains the same two sub-layers you have already met: causal self-attention and a position-wise feed-forward network, each wrapped in a residual connection and a layer-norm.

The single architectural detail that makes the model "decoder-only" is the causal mask in the attention layer. When the network computes the representation of the token at position $t$, it is only allowed to attend to positions $1, 2, \dots, t$. Positions $t+1$ onwards are masked out by setting their attention scores to $-\infty$ before the softmax, which means they receive zero weight. Information flows strictly from past to future, never backwards. This is what lets the model train and run autoregressively: every position is a valid prediction target, because each one only ever saw the tokens that came before it.
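Here is a minimal PyTorch sketch of the mask (shapes and tensor names are illustrative, not taken from any particular codebase). Scores above the diagonal are set to $-\infty$, so after the softmax those positions receive exactly zero weight.

```python
import torch
import torch.nn.functional as F

T, d = 5, 8                              # sequence length, head dimension
q, k = torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d**0.5                # (T, T) raw attention scores
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))  # hide positions > t

weights = F.softmax(scores, dim=-1)      # each row sums to 1 over the past
print(weights[0])                        # position 1 can attend only to itself
```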

After the final transformer block, the output of shape (sequence length, $d_\text{model}$) is multiplied by a linear projection of shape ($d_\text{model}$, vocabulary size). In most modern GPTs this projection shares its weights with the input embedding matrix, which saves parameters and acts as a mild regulariser. The result is a tensor of logits, one real number for every possible next token at every position. A softmax across the vocabulary axis turns those logits into a probability distribution, and the largest entry is the model's best guess for what should come next. That is the entire forward pass: tokens in, probabilities out, no encoder, no cross-attention, no special heads.
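To make the whole forward pass concrete, here is a hedged sketch of a toy decoder-only model. All identifiers (TinyGPT and friends) are illustrative, and PyTorch's TransformerEncoderLayer with a causal mask stands in for the GPT block: it contains exactly the two sub-layers described above and no cross-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    """Skeleton forward pass: token ids in, next-token logits out."""
    def __init__(self, vocab_size=50257, d_model=256, n_heads=4,
                 n_layers=4, n_ctx=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(n_ctx, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model,
            batch_first=True, norm_first=True)  # pre-norm, as in GPT-2 onwards
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.ln_f = nn.LayerNorm(d_model)

    def forward(self, idx):                     # idx: (batch, T) token ids
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.blocks(h, mask=mask)           # causally-masked blocks
        h = self.ln_f(h)
        # Weight tying: reuse the input embedding as the output projection.
        return F.linear(h, self.tok_emb.weight) # (batch, T, vocab_size)

logits = TinyGPT()(torch.randint(0, 50257, (1, 16)))
print(logits.shape)                             # torch.Size([1, 16, 50257])
```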

Pretraining: next-token prediction

The training objective is the simplest one in the language-modelling literature. Given a sequence of tokens $x_1, x_2, \dots, x_T$, the model assigns a probability to the whole sequence by the chain rule of probability:

$$p(x_1, x_2, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}).$$

We train the model to maximise the log-likelihood of real text under this factorisation, which is the same as minimising the cross-entropy loss

$$\mathcal{L} \;=\; -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$

That is it. There are no labels to collect, no hand-curated supervision, no explicit task definitions. The training data is the internet, scrubbed of duplicates and the worst of the spam: Common Crawl, Wikipedia, books, code repositories, scientific papers, forum posts. Modern frontier runs use somewhere between one and twenty trillion tokens. Each training step picks a chunk of text, shifts it by one position to make a target, runs it through the model, computes the cross-entropy loss across every position in parallel (the causal mask makes this safe), and back-propagates.
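One training step looks roughly like the following sketch, reusing the TinyGPT toy model from above. Random token ids stand in for real text, and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

model = TinyGPT()                           # toy model from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

chunk = torch.randint(0, 50257, (8, 65))    # (batch, T+1) token ids
inputs, targets = chunk[:, :-1], chunk[:, 1:]    # shift by one position

logits = model(inputs)                      # (batch, T, vocab), all positions
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       targets.reshape(-1)) # mean cross-entropy per token
loss.backward()
opt.step()
opt.zero_grad()
print(f"loss: {loss.item():.2f}")           # ≈ ln(50257) ≈ 10.8 at initialisation
```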

The genius of next-token prediction is that almost everything you might want a language model to know is implicitly encoded in the task. To predict the word after "The capital of France is" the model must learn geography. To predict the next line of a Python function it must learn syntax and intent. To predict what comes after a half-finished proof it must learn mathematics. The objective is a single scalar, the log-likelihood per token, but the capabilities required to drive it down are open-ended. Scale that objective up enough and you have a system that has, in some statistical sense, modelled the joint distribution of all human writing.

Scaling: GPT-1 to GPT-4

The story of GPT is, more than anything, a story of scaling. Every generation looks almost identical to the previous one as a piece of code; the differences are in the parameter count, the training-token count, and the engineering required to run that bigger thing on real hardware.

| Model | Year | Parameters | Training tokens | Notable |
| --- | --- | --- | --- | --- |
| GPT-1 | 2018 | 117M | ~5B | Showed pretraining + fine-tuning works for NLP |
| GPT-2 | 2019 | 1.5B | ~10B | Zero-shot tasks; original release withheld |
| GPT-3 | 2020 | 175B | ~300B | In-context learning emerges with scale |
| GPT-3.5 | 2022 | undisclosed | undisclosed | Instruction tuning + RLHF; powered ChatGPT |
| GPT-4 | 2023 | ~1T (MoE, rumoured) | ~13T | Multimodal input, large jump in reasoning |
| GPT-4o | 2024 | undisclosed | undisclosed | Native text/vision/audio in one model |
| Llama 3.1 405B | Jul 2024 | 405B (dense) | ~15T | Largest open-weight dense model |
| DeepSeek-V3 | Dec 2024 | 671B total / 37B active (MoE) | 14.8T | FP8 training, ~$5.6M reported cost |
| o-series | 2024–25 | undisclosed | undisclosed | Long chain-of-thought, RL on reasoning |
| Llama 4 herd | Apr 2025 | Scout / Maverick / Behemoth | undisclosed | Multimodal MoE family |
| GPT-5 | Aug 2025 | undisclosed | undisclosed | Frontier model; training compute undisclosed |
| Gemini 3.1 Pro | Feb 2026 | undisclosed | undisclosed | 1M-token context |
| Claude Opus 4.7 | Apr 2026 | undisclosed | undisclosed | ~1M-token context as standard |

GPT-1 was a quiet paper. It demonstrated that a single transformer could be pretrained on raw text and then fine-tuned on a handful of downstream tasks (natural language inference, question answering, classification, similarity), beating bespoke models. BERT, released a few months later, made a bigger splash and dominated the leaderboards for two years.

GPT-2, in 2019, was a structural copy of GPT-1 made roughly thirteen times larger. The headline result was that, given the right prompt, the model could perform tasks it had never been explicitly trained on. Feed it "English: hello\nFrench:" and it would continue with "bonjour". OpenAI initially withheld the largest version, citing misuse concerns, and released it in stages over the following months.

GPT-3, in 2020, was the moment the scaling hypothesis arrived. At 175 billion parameters it was about a hundred times larger than GPT-2 and used roughly a thousand times more compute, requiring on the order of $3 \times 10^{23}$ floating-point operations to train. The architecture was unchanged save for an alternation of dense and locally-windowed sparse attention layers borrowed from the Sparse Transformer paper. What changed was qualitative: GPT-3 could be steered by examples in its prompt rather than by gradient updates.

GPT-3.5 added instruction tuning and reinforcement learning from human feedback on top of GPT-3 and became, in late 2022, the engine of the original ChatGPT. GPT-4 in 2023 was substantially larger, accepted images as input, and was widely believed (though never officially confirmed) to use a Mixture-of-Experts architecture. GPT-4o in 2024 was natively multimodal across text, audio and vision in a single network. The o-series (o1, o3 and their successors) introduced reasoning models that produce long internal chains of thought before answering, trained with reinforcement learning on verifiable problems.

Every other frontier laboratory followed the same recipe. Anthropic's Claude, Google's Gemini, Meta's Llama, Mistral's open-weight models, DeepSeek's V3 and R1, Alibaba's Qwen: all are decoder-only transformers trained on next-token prediction at trillion-token scale. The architectural debate that animated the 2018–2020 period is settled. Differences between frontier labs lie in data quality, post-training, alignment and inference engineering, not in the structure of the network.

In-context learning

The most surprising single result in the GPT-3 paper was a phenomenon that came to be called in-context learning, often abbreviated ICL. Suppose you want GPT-3 to translate English to French. You do not need to fine-tune it. You simply hand it a prompt that looks like this:

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>

GPT-3 will continue with "fromage". It has inferred the task from the three demonstrations in the context, completed it for the new input, and stopped, all without a single weight being updated. The model's parameters are the same before and after. The "learning" lives entirely in the forward pass.

This was unprecedented. Earlier models could only solve a task by being trained for it. GPT-3 could be reprogrammed at inference time using nothing but well-chosen text. Crucially, the effect strengthened sharply with scale. Below about ten billion parameters, few-shot prompting barely worked; above a hundred billion, it worked across translation, arithmetic, named-entity recognition, summarisation, code completion and rough reasoning. The capability appeared to emerge with size.

Mechanistic interpretability has since traced part of this phenomenon to induction heads, pairs of attention heads that together implement the algorithm "find a previous occurrence of the current token in the context and copy whatever followed it." Induction heads form during training in a narrow window that coincides with a sudden drop in the loss curve, and they appear to be a building block, though not the whole story, of in-context learning.
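The copying rule itself is short enough to write out in ordinary Python. The toy function below re-implements what the pair of heads computes, not the heads themselves, and the token list is an invented example.

```python
def induction_rule(context, current):
    """Toy version of what an induction head computes: find the most
    recent earlier occurrence of `current` and predict the token that
    followed it."""
    for i in range(len(context) - 1, 0, -1):
        if context[i - 1] == current:
            return context[i]
    return None                  # no earlier occurrence: nothing to copy

tokens = ["Mr", "Dursley", "was", "proud", "of", "Mr"]
print(induction_rule(tokens, tokens[-1]))   # -> "Dursley"
```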

In-context learning is also the seed of nearly everything that followed in the prompt-driven world. Few-shot prompting, chain-of-thought prompting (where you ask the model to "think step by step" and watch reasoning quality jump), retrieval-augmented generation (where you stuff relevant documents into the context window before the question), tool use, and the entire chatbot user experience all rest on the same observation: a sufficiently large autoregressive language model uses its context window as a working memory and a programmable interface.

Sampling strategies

Once a model is trained, you still have to choose how to turn its probability distributions into actual text. The model gives you a distribution over the next token at every step; the decoding strategy picks one. The choice matters more than beginners usually expect, and the same model can sound mechanical, creative, or incoherent depending on it.

Greedy decoding picks the highest-probability token at every step. It is deterministic and cheap, but it is also dull and prone to repetition loops. Models have a habit of getting stuck repeating a phrase forever once that phrase becomes locally very probable.
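A minimal greedy loop, assuming any model that maps token ids to next-token logits (the TinyGPT sketch above will do); the function name is illustrative.

```python
import torch

@torch.no_grad()
def greedy_decode(model, ids, max_new_tokens=20):
    """Repeatedly append the single most probable next token."""
    for _ in range(max_new_tokens):
        logits = model(ids)                       # (1, T, vocab)
        next_id = logits[0, -1].argmax()          # best token at the end
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return ids

print(greedy_decode(TinyGPT(), torch.randint(0, 50257, (1, 4))))
```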

Beam search keeps the top $k$ partial sequences alive at each step, expanding all of them and pruning back to $k$. It improves on greedy for tasks with a single correct answer such as machine translation, but for open-ended generation it tends to produce strangely bland text: beam search systematically prefers high-probability sequences, which are not the same as interesting ones.

Temperature sampling divides the logits by a temperature $\tau$ before the softmax. With $\tau = 1$ you sample from the model's true distribution. Lower temperatures sharpen the distribution and make the model more conservative; higher temperatures flatten it and make the model more adventurous. In the limit $\tau \to 0$ you recover greedy decoding.

Top-k sampling restricts the choice to the $k$ most probable tokens at each step and renormalises. Top-p sampling, also called nucleus sampling, instead keeps the smallest set of tokens whose cumulative probability exceeds $p$, the "nucleus" of the distribution, and samples within it. Nucleus sampling adapts automatically to how peaked or flat each step is, and is the default in most production systems, often with $p$ around 0.9 and a modest temperature near 0.7.
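The following helper sketches temperature, top-k and top-p in one place. Names and defaults are illustrative, and production implementations differ in details such as batching and tie-breaking.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.7, top_k=None, top_p=None):
    """Sample one token id from a 1-D vector of next-token logits."""
    logits = logits / temperature            # sharpen (<1) or flatten (>1)
    if top_k is not None:                    # keep only the k best tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:                    # nucleus: smallest set with mass >= p
        sorted_logits, order = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        before = torch.cumsum(probs, dim=-1) - probs    # mass before each token
        logits[order[before >= top_p]] = float("-inf")  # drop the tail
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)

# e.g. next_id = sample_next(model(ids)[0, -1], temperature=0.7, top_p=0.9)
```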

Newer variants include min-p sampling, typical sampling, and speculative decoding (in which a small "draft" model proposes several tokens at once and a large "verifier" model accepts or rejects them in a single pass, dramatically speeding up inference without changing the output distribution).
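Of these, min-p is simple enough to sketch: keep only tokens whose probability is at least some fraction of the most probable token's, so the cutoff adapts to how confident the model is at each step. The threshold value below is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def min_p_filter(logits, p_min=0.05):
    """Min-p: keep tokens at least p_min times as probable as the single
    most probable token; mask out everything else."""
    probs = F.softmax(logits, dim=-1)
    keep = probs >= p_min * probs.max()
    return logits.masked_fill(~keep, float("-inf"))
```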

Where GPT-style models live in 2026

By the time you are reading this, the GPT paradigm has comprehensively colonised modern AI. Every frontier chatbot you can talk to is a decoder-only transformer trained on next-token prediction and then post-trained for helpfulness and safety. The major closed-weight families are OpenAI's GPT and o-series, Anthropic's Claude, Google DeepMind's Gemini, and xAI's Grok. The major open-weight families are Meta's Llama, Mistral, Alibaba's Qwen, DeepSeek, and 01.AI's Yi. All of them share the architecture described in this chapter.

The variation is in the periphery. Some use Mixture-of-Experts in the feed-forward layers (DeepSeek-V3, GPT-4 by reputation, Mixtral). Some are dense throughout (most Llama and Qwen variants). Some have native vision and audio paths (GPT-4o, Gemini, Claude 4 with image input). Some are trained for very long context using rotary position embeddings or other position schemes (Gemini's million-token context, Claude's 200k). Some have an extra layer of reinforcement learning on verifiable problems to produce long chains of thought (o1, o3, DeepSeek-R1, Claude 4 Sonnet thinking). Underneath all of these decisions sits the same backbone, a stack of causally-masked transformer blocks predicting the next token.

What this means for you, as someone learning the field, is that mastering one decoder-only transformer is essentially mastering the skeleton of every modern LLM. The differences are real and they matter for engineering, but they do not require a new mental model.

What you should take away

  1. GPT means decoder-only transformer trained on next-token prediction. Take the transformer, drop the encoder, mask attention to be causal, project to vocabulary logits, train with cross-entropy on raw text. That is the architecture and that is the objective.
  2. The scaling story matters more than any single innovation. GPT-1 to GPT-4 is essentially the same model made bigger and trained on more data. Capabilities appeared as scale grew, not as new tricks were added.
  3. In-context learning is the defining surprise of GPT-3. A large enough autoregressive model can learn tasks from examples in its prompt without any weight updates. This is the foundation of prompting, few-shot learning, and chain-of-thought.
  4. Decoding strategy shapes how the model sounds. Greedy is deterministic and dull, beam search is bland for open generation, nucleus (top-p) sampling with a modest temperature is the modern default for chat. The same model will feel completely different depending on the choice.
  5. Almost every frontier LLM in 2026 is a GPT-style model. Claude, Gemini, Llama, Qwen, DeepSeek, Mistral, Yi: all decoder-only transformers trained on next-token prediction. §15 will cover how these base models become the polite assistants you actually use.
