15.10 In-context learning and few-shot prompting

When Brown and colleagues released GPT-3 in 2020 (Brown et al., 2020), the demonstration that captured the field was short. Type a heading, paste a few examples of a task, leave the last example unfinished, and the model would complete it. No fine-tuning. No gradient steps. No data pipeline. Show three English-to-French pairs, ask for a fourth translation, and the answer arrived. Show a few examples of arithmetic, sentiment classification, or unscrambling letters, and the same pattern held. The phenomenon was named in-context learning (ICL), and it has reshaped how practitioners think about adapting language models. For most users today, "using an LLM" means writing a prompt, not training a model. The weights are fixed; the adaptation lives in the prompt window.

The previous sections traced how models acquire their dispositions through training: pre-training on text (§15.3), supervised fine-tuning (§15.4), and reinforcement learning from human feedback (§15.5). All of those stages alter the parameters. ICL works in the opposite regime. The parameters are frozen at inference time, and behaviour shifts because the input changes. This makes ICL practically transformative: adapting a deployed model to a new task takes seconds rather than days, and the same model can serve thousands of distinct downstream uses simultaneously without ever being retrained. It is also intellectually puzzling, because nothing in classical machine learning theory predicts that a fixed function should learn from inputs alone. The classical framing assumed that a model's parameters encoded its hypothesis; ICL forces a richer view in which the prompt itself becomes part of the hypothesis, decoded on the fly by attention. This section sketches what ICL looks like, what it appears to be doing under the bonnet, where it fails, and how to wield it in practice.

Symbols Used Here
$x$, input; $y$, output; $\{(x_i, y_i)\}$, in-context examples

What ICL looks like

A few-shot prompt is a list of input-output pairs followed by a fresh input. The canonical illustration from the GPT-3 paper translates English into French:

Translate to French:
sea otter -> loutre de mer
plush giraffe -> girafe en peluche
cheese -> ?

The model continues the pattern and produces fromage. Nothing about the architecture has changed since the prompt began. The same set of weights that, moments earlier, was completing a half-finished poem is now translating, because the context conditions the next-token distribution towards plausible French nouns.
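
The mechanics are nothing more than string concatenation. A minimal sketch in Python, using only the standard library; the function name and the arrow separator are illustrative choices, not part of any model API:

# Assemble a few-shot prompt from (input, output) demonstration pairs.
def build_few_shot_prompt(instruction, demonstrations, query, sep=" -> "):
    lines = [instruction]
    for x, y in demonstrations:
        lines.append(f"{x}{sep}{y}")
    lines.append(f"{query}{sep}")   # leave the final slot unfinished for the model
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("plush giraffe", "girafe en peluche")]
prompt = build_few_shot_prompt("Translate to French:", demos, "cheese")
# The model's continuation of this string is the few-shot answer.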

The same template generalises across domains. A handful of (review, sentiment) pairs turns the model into a sentiment classifier; a few (Python, English) pairs turns it into a code explainer; a few (symptom list, differential) pairs turns it into a rough clinical reasoner. Brown and colleagues evaluated this across twenty-four NLP benchmarks and found a consistent pattern: zero-shot performance was often poor, performance climbed sharply between zero, one, and a handful of examples, and the curve plateaued by sixteen to thirty-two demonstrations. The number of demonstrations that helped scaled with model size: small models gained little from extra examples, while larger models extracted ever more signal. ICL became the default deployment pattern for GPT-3, and it has remained central even as instruction tuning and chat fine-tuning have made many tasks accessible from a plain description with no examples.

The convention of zero-shot, one-shot, and few-shot prompting comes from this paper. Zero-shot means a task description with no demonstrations; one-shot means exactly one (input, output) pair; few-shot means a handful, usually fewer than thirty-two. The terminology is now applied loosely to anything from in-context demonstrations to retrieval-augmented prompts, but the original distinction is useful because the three regimes behave differently: zero-shot tests whether the model already knows the task implicitly, one-shot tests whether a single example is enough to disambiguate the format, and few-shot tests whether more examples genuinely sharpen the inferred mapping.
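
In terms of the sketch above, the three regimes differ only in how many demonstration pairs are supplied:

zero_shot = build_few_shot_prompt("Translate to French:", [], "cheese")
one_shot  = build_few_shot_prompt("Translate to French:", demos[:1], "cheese")
few_shot  = build_few_shot_prompt("Translate to French:", demos, "cheese")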

Why it works

Three families of explanation, none of them complete on their own, dominate the literature.

The first is Bayesian. Pre-training data implicitly mixes thousands of latent tasks: translation, classification, code completion, casual chat. Xie and colleagues (2022) modelled the pre-training distribution as a mixture over these tasks and showed that ICL behaves like posterior inference. The few-shot examples narrow the model's belief about which task is being requested; the final input is then completed under the conditioned posterior. Formally, if the prior over tasks is $p(t)$ and each task induces a distribution $p(x, y \mid t)$, then conditioning on the demonstrations $\{(x_i, y_i)\}$ produces a posterior $p(t \mid \{(x_i, y_i)\})$ over tasks, and the prediction for the new input is the posterior-averaged conditional. This view explains several robust observations: the structure of the demonstrations matters more than their specific content, the marginal benefit of additional examples diminishes as the posterior concentrates, and diverse examples within a task help more than redundant ones. It also explains why bigger models do better ICL: they capture a richer prior over tasks.
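
A toy numeric illustration of this framing, with invented likelihood values: suppose the prior splits evenly between a translation task and a copying task, and each demonstration pair is far more likely under translation. The posterior concentrates within a handful of examples:

# Posterior over two latent tasks after observing k demonstrations, assuming
# each pair has likelihood 0.20 under "translate" and 0.01 under "copy".
prior = {"translate": 0.5, "copy": 0.5}
lik = {"translate": 0.20, "copy": 0.01}

for k in range(4):
    unnorm = {t: prior[t] * lik[t] ** k for t in prior}
    z = sum(unnorm.values())
    posterior = {t: v / z for t, v in unnorm.items()}
    print(k, {t: round(p, 3) for t, p in posterior.items()})
# k = 0 reproduces the prior; by k = 3 essentially all mass sits on "translate".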

The second family is mechanistic. Olsson and colleagues (2022) identified the induction head, a circuit composed of two attention heads working in tandem. The first head, the previous-token head, attends from each position to its immediate predecessor. The second head, the induction head proper, uses that information to attend from the current token at position $t$ back to the token that followed the previous occurrence of the same token elsewhere in the context. The composition implements a simple rule: if the prompt earlier showed "$A B$" and now shows "$A$", predict $B$. This is the minimal pattern-matching circuit, and its sudden appearance during pre-training corresponds to a small but visible bump in the loss curve. Induction heads are task-agnostic copy machines, which is precisely why they generalise to novel symbol mappings the model never saw during training.
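
The rule can be written in a few lines. Real induction heads compute a soft, learned version of this over embeddings rather than exact string matches; the sketch only reproduces the input-output behaviour:

# "If the context earlier contained A B and the current token is A, predict B."
def induction_predict(tokens):
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the current
    # token and copy whatever followed it.
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None

print(induction_predict(["the", "cat", "sat", "down", ".", "the", "cat"]))  # -> "sat"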

The third family argues that ICL implements a learning algorithm within the forward pass. Garg and colleagues (2022) showed that Transformers trained on linear regression problems can solve fresh regression tasks given input-output pairs in context, matching the behaviour of an actual gradient-descent learner. Akyürek and colleagues (2023) extended the result to kernel regression and showed that the network's attention layers can be reinterpreted as performing implicit gradient steps on the demonstrations. Each forward pass through the stack of attention layers corresponds, under the right construction, to a few iterations of gradient descent on the in-context regression objective, with the demonstrations playing the role of training data and the query playing the role of a test point. The picture that has emerged is pluralistic: ICL is not a single mechanism but a learned algorithm whose internal shape depends on the task, with induction-head copying, Bayesian conditioning, and gradient-style updates all playing roles in different settings. On symbol-mapping tasks the induction circuit dominates; on regression-shaped tasks the implicit-gradient view fits best; on natural-language tasks where the prompt selects among familiar styles the Bayesian framing is most predictive.
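
The reference learner the Transformer is hypothesised to emulate is easy to state explicitly. A sketch of plain gradient descent on the in-context pairs, using numpy; this is the baseline algorithm the papers compare against, not the Transformer's actual computation:

import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_star = rng.normal(size=d)             # the unseen task: y = x . w_star
X = rng.normal(size=(n, d))             # in-context demonstration inputs
y = X @ w_star                          # in-context demonstration outputs

w = np.zeros(d)                         # start from scratch at the beginning of the "forward pass"
for _ in range(200):                    # each step ~ one implicit update inside the stack
    w -= 0.3 * X.T @ (X @ w - y) / n    # gradient step on the in-context squared loss

x_query = rng.normal(size=d)
print(float(x_query @ w), float(x_query @ w_star))   # prediction vs ground truth, nearly identical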

Limits

ICL has well-documented brittleness. Format sensitivity is the most prominent. Small changes to whitespace, capitalisation, the choice of separator (-> versus :), or the position of newlines can swing accuracy by tens of percentage points. Order sensitivity is closely related: shuffling the same set of demonstrations can flip predictions, particularly in classification tasks where the final example exerts a recency bias. Label-distribution sensitivity means that the empirical frequency of labels in the few-shot block leaks into the model's predictions, a problem when classes are imbalanced. Modern instruction-tuned models reduce these effects substantially compared with raw base models, but they have not eliminated them.
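
This brittleness is cheap to measure before deployment. A sketch of the kind of harness involved; in practice each variant would be sent to the model and scored against a small held-out labelled set (the example task and separators here are invented):

import itertools

sentiment_demos = [("great food, will return", "positive"),
                   ("cold and overpriced", "negative"),
                   ("service was fine", "positive")]
separators = [" -> ", ": ", " => "]

def make_prompt(pairs, sep, query):
    body = "\n".join(f"{x}{sep}{y}" for x, y in pairs)
    return f"{body}\n{query}{sep}"

# Every combination of separator and demonstration order for a single query.
variants = [make_prompt(list(order), sep, "the soup was lukewarm")
            for sep in separators
            for order in itertools.permutations(sentiment_demos)]
print(len(variants))   # 3 separators x 3! orderings = 18 prompts to score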

A deeper limit is that ICL does not change the weights. Whatever the model learned during pre-training is what it can call upon. Tasks whose distribution is genuinely outside the training corpus (a novel cipher with no analogue, a private taxonomy, a niche scientific notation) cannot be summoned by demonstrations alone. Long-tail tasks need either retrieval (§15.14), tools (§15.13), or fine-tuning. The quadratic cost of attention also bounds how many demonstrations fit in the window: with a million-token context the practical ceiling is generous, but a thousand demonstrations of a domain-specific task still cost real money at inference time, and the marginal value of the thousandth example is small. Finally, the absence of weight updates means the model does not consolidate what it has just done. Each fresh conversation starts blank, and lessons learnt the hard way in one session are forgotten by the next unless they are written into the prompt or stored in an external memory.

A subtler failure mode is silent miscalibration. Because the model produces fluent, confident output regardless of whether the demonstrations are sufficient, users can mistake confidence for competence. The model that translates cheese as fromage will, with the same tone, hallucinate plausible-looking French for a word that does not exist, or apply a spurious pattern from poorly chosen demonstrations. Industrial deployments compensate with held-out evaluation sets, A/B testing of prompt variants, and downstream verification (code execution, retrieval over ground-truth corpora, or human review) rather than relying on the prompt alone.

Chain of thought

Wei and colleagues (2022) noticed that prompts which included worked-through reasoning steps before the final answer dramatically improved performance on multi-step problems. Showing the model a few maths word problems alongside their step-by-step solutions, then asking a fresh question, produced step-by-step solutions in reply, and accuracy rose sharply on benchmarks like GSM8K and arithmetic word problems. The effect was strongest on tasks that required composing several inferences, weakest on tasks that needed a single look-up. Kojima and colleagues (2022) then showed that the simpler instruction "Let's think step by step" achieved much of the same gain in zero-shot mode, suggesting that the few-shot examples were not teaching the model to reason from scratch but cueing it to access reasoning behaviour it had already learnt during pre-training from textbooks, tutorials, and worked-example collections.
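
A chain-of-thought few-shot prompt, in the style of the translation example above; the problem and its worked solution here are illustrative:

Q: A shop sells pencils in boxes of 12. Maya buys 3 boxes and gives away 7 pencils. How many does she have left?
A: 3 boxes of 12 is 36 pencils. Giving away 7 leaves 36 - 7 = 29. The answer is 29.
Q: A bus has 4 rows of 6 seats and 5 seats are empty. How many seats are occupied?
A:

The model's continuation mimics the worked solution, reasoning step by step before committing to a number.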

Mechanistically, chain-of-thought (CoT) buys the model more compute. A single forward pass has fixed depth; intermediate tokens act as a scratchpad, letting the model spread a multi-step calculation across many forward passes that condition on earlier scratch work. The architectural argument is simple: a Transformer of depth $d$ can perform at most $d$ sequential reasoning steps in a single token's worth of compute, so problems that need more steps have to spread the work across multiple generated tokens, each of which conditions on the previous tokens through attention. CoT prompting was the precursor to the reasoning-RL recipes covered in §15.7. The reasoning models trained with verifiable rewards (o1, o3, R1, Claude with extended thinking, Gemini Deep Think) internalise the CoT habit so deeply that no prompt is needed: they think by default, often for tens of thousands of tokens, exploring multiple candidate solutions, checking arithmetic, and revising errors before committing to a final answer. Section 15.11 develops CoT as a topic in its own right, including the open question of whether the produced reasoning is faithful to what the model is actually computing or merely a plausible narrative laid alongside the real computation.

Practical prompt engineering

The day-to-day craft of ICL is closer to interface design than to research. A handful of heuristics have stabilised across the practitioner literature. Make the demonstrations look like the deployment distribution: if production inputs are emails, demonstrate on emails; if they are clinical notes, demonstrate on clinical notes. Keep formatting consistent across demonstrations and the final query: separators, casing, line breaks, label vocabulary. Choose diverse demonstrations rather than redundant ones; coverage of the input space matters more than the count, and a handful of well-chosen edge cases often beats twenty near-duplicates. Mind the order: place the most representative example near the end if the model exhibits recency bias, or randomise across calls if a single ordering biases predictions. For multi-step problems, include CoT in the demonstrations; for classification, omit it because explicit reasoning can make the model second-guess straightforward labels. Test with several prompt variants before fixing one in production, because the gap between a mediocre and an excellent prompt on the same task is routinely ten or more accuracy points, and this gap rarely closes through model upgrades alone.
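
The "choose diverse demonstrations" heuristic can be automated. A sketch using greedy farthest-point selection over embeddings; toy_embed is a stand-in for whatever sentence-embedding model is available, not a real API:

import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real sentence-embedding model: a deterministic pseudo-random vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def select_diverse(candidates, embed, k):
    # Greedy farthest-point selection: prefer demonstrations that cover the
    # input space over ones that cluster around a single region.
    vecs = np.stack([embed(x) for x, _ in candidates])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    chosen = [0]                                   # start from an arbitrary candidate
    while len(chosen) < k:
        dist = 1.0 - vecs @ vecs[chosen].T         # cosine distances to the chosen set
        nearest = dist.min(axis=1)
        nearest[chosen] = -np.inf                  # never re-pick a chosen item
        chosen.append(int(nearest.argmax()))       # farthest from everything chosen so far
    return [candidates[i] for i in chosen]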

A final practical note: ICL composes naturally with retrieval. Rather than hand-curating demonstrations once and reusing them, deployed systems often retrieve the $k$ most similar examples from a labelled corpus given the current query, splice them into the prompt, and run the model. This dynamic ICL, sometimes called retrieval-augmented in-context learning, gets the best of both worlds: the demonstrations always look like the current input, the corpus can be updated without retraining, and the model still adapts on the fly. The cost is a retrieval index and the engineering to keep it fresh; the benefit is that the same base model can serve hundreds of specialised tasks from a single deployment.
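
A sketch of the retrieval step, under the same stand-in-embedding assumption as above; a production system would use a proper vector index rather than brute-force similarity:

import numpy as np

def retrieve_demonstrations(query, corpus, embed, k=4):
    # corpus: list of (input, output) pairs with trusted labels.
    # Return the k pairs whose inputs are most similar to the current query.
    q = embed(query)
    q = q / np.linalg.norm(q)
    vecs = np.stack([embed(x) for x, _ in corpus])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = vecs @ q                      # cosine similarity to the query
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

# Splice the retrieved pairs into the prompt, then call the model as usual:
# demos = retrieve_demonstrations(user_input, labelled_corpus, toy_embed)
# prompt = build_few_shot_prompt("Classify the sentiment:", demos, user_input)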

What you should take away

  1. ICL is adaptation without weight updates: the model behaves differently because the prompt is different, while the parameters stay frozen.
  2. Three explanations work together: Bayesian conditioning over latent tasks, induction-head pattern copying, and implicit gradient-style updates inside the attention stack.
  3. ICL is brittle to format, order, and label distribution; instruction tuning softens this but does not remove it.
  4. Long-tail tasks, very long demonstration sets, and anything outside the pre-training prior remain out of reach for ICL alone, and need retrieval, tools, or fine-tuning instead.
  5. Chain-of-thought turns ICL from a one-shot guess into a multi-step computation, and is the prompt-time precursor to the reasoning models trained in §15.7.
