15.2 Emergent abilities and the mirage critique

One of the strangest claims to come out of the early-2020s scaling era is that some abilities arrive without warning. A one-billion-parameter model gets three-digit arithmetic almost entirely wrong. A hundred-billion-parameter model trained on the same recipe gets it almost entirely right. The same pattern was reported for multi-step word problems, for following instructions phrased as natural language, for chain-of-thought reasoning, for theory-of-mind questions and for a long list of multilingual tasks. Wei et al. (2022) called these jumps emergent abilities and offered a working definition: a capability is emergent if it is absent in smaller models and present in larger ones, with the transition appearing sharp rather than smooth. Schaeffer, Miranda and Koyejo (2023) replied that the apparent sharpness is largely an artefact of the metric, and that the underlying log-likelihood is scaling smoothly all along. Both papers are partly right. Capabilities scale predictably on log-perplexity; some downstream tasks look discontinuous when scored with hard-threshold metrics; and a small number of genuine circuit-level transitions sit underneath, distinct from either story.

Section 15.1 told the scaling story in the language of compute and loss: feed a Transformer enough tokens with enough parameters, on enough hardware, and the next-token cross-entropy comes down along a power law that holds across orders of magnitude. This section turns from the loss curve to the question of what the loss curve buys you. The user does not care about per-token cross-entropy; they care about whether the model can divide two four-digit numbers, write a working SQL query, follow an instruction with a negation in it, or admit when it does not know. Translating bits of perplexity into useful behaviour is where the controversy lives.

Wei et al. (2022), emergent abilities

Wei and colleagues plotted benchmark accuracy against model scale for more than a hundred BIG-Bench tasks and looked for tasks where smaller models did no better than chance and larger models did substantially better. They defined emergence informally; it was a description of a shape on a graph, not a theory, and they gave examples. Three-digit addition by GPT-3 sits flat at zero accuracy from 100 M up to roughly 13 B parameters, then climbs steeply to over fifty per cent at 175 B. Multi-step word problems on the GSM8K dataset show a similar curve. Modular arithmetic, transliteration into the International Phonetic Alphabet, and Persian question-answering all display the same pattern: a long flat region near the floor, then an elbow, then a rise.

The implications are unsettling if taken at face value. If capabilities really do appear without warning, you cannot extrapolate the safety properties of small models to large ones. A 7 B model that cannot write working exploit code tells you almost nothing about a 700 B model trained on the same data. Evaluation suites built on smaller checkpoints would systematically underestimate frontier risk, and a lab might, in principle, train through a sharp transition into a dangerous capability without noticing until it tested the final model. Wei et al. were careful not to make the strongest version of this claim, but their figures were the empirical anchor for a year of safety arguments built around it.

The paper also catalogued emergence in prompting strategies. Few-shot in-context learning is roughly useless below about 6 B parameters and works increasingly well above that. Chain-of-thought prompting (appending "let us think step by step" or showing worked examples) actually hurts small models on arithmetic and word problems, because the longer outputs introduce more places to make mistakes, and it only starts to help once the model is large enough that the per-step error rate is low. This second observation is important: emergence in Wei's sense was not only about what models can do at scale but about which prompting tricks become useful at scale. A teaching trick that backfires at 1 B can become the dominant lever at 70 B, which means the recipe for getting useful work out of a model is not stable across scales.
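
A back-of-the-envelope error model makes the crossover plausible. The numbers below are invented for illustration, not measurements from any model: a "direct" answer is one hard step, while chain-of-thought trades it for several easier steps, every one of which must be right.

```python
# Toy error model for the chain-of-thought crossover. All error rates are
# invented for illustration: "direct" answers the question in one hard step;
# chain-of-thought trades it for several easier steps, every one of which
# must be correct for the final answer to be right.

def direct_success(hard_step_error: float) -> float:
    return 1 - hard_step_error

def cot_success(easy_step_error: float, steps: int) -> float:
    return (1 - easy_step_error) ** steps

# Hypothetical error rates: the small model is shaky even on easy steps,
# the large model is reliable on easy steps and decent on the hard one.
for name, hard_err, easy_err in [("small", 0.70, 0.30), ("large", 0.40, 0.05)]:
    print(name,
          "direct:", round(direct_success(hard_err), 2),
          "chain-of-thought:", round(cot_success(easy_err, steps=6), 2))
```

On these made-up numbers the small model falls from 0.30 to 0.12 by reasoning aloud, while the large model climbs from 0.60 to 0.74, which is the shape of the observation without any claim about the true error rates.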

It is worth being careful about what Wei et al. did and did not claim. They did not propose a mechanism. They did not argue that the underlying loss was discontinuous. They did not predict at which scale a given task would emerge. The contribution was empirical: here are the curves, here are the tasks, here is the shape, and the shape is hard to ignore. The interpretive overreach happened in the secondary literature, where "emergent" came to suggest something almost magical, a phase transition whose outcome no extrapolation could have anticipated. That stronger reading is what Schaeffer et al. went on to dismantle.

Schaeffer et al. (2023), the mirage critique

Schaeffer, Miranda and Koyejo titled their reply "Are Emergent Abilities of Large Language Models a Mirage?" and gave a sharp, mostly correct, partly overstated answer: yes, in many cases, the apparent emergence is a property of the scoring metric, not of the model. Their argument has two parts and a constructive demonstration.

First, the metric matters. Many of the tasks Wei et al. flagged are scored by exact-match accuracy. You get one point if every digit, every token, every character of the answer is correct, and zero otherwise. Suppose the model's per-token error rate $\varepsilon$ falls smoothly with scale, as the loss does. For a $k$-token answer, the probability of getting every token right is roughly $(1-\varepsilon)^k$. That function is flat near the floor while $\varepsilon$ is large, sigmoidal across the regime where $\varepsilon \approx 1/k$, and saturates near the ceiling once $\varepsilon$ is small. The shape on the plot is hockey-stick, but the underlying improvement in $\varepsilon$ is smooth, monotone and entirely predictable from the scaling laws.
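
A few lines of arithmetic show how strong the effect is. The error schedule below is an invented illustration, not a fitted scaling law: the per-token error rate simply halves with each tenfold increase in parameters, yet exact-match accuracy on an eight-token answer sits near the floor for the smallest models and then climbs steeply.

```python
# Minimal sketch of the metric argument: a smoothly halving per-token error
# rate produces a hockey-stick exact-match curve once the whole k-token answer
# has to be right. The error schedule is an assumption for illustration only.

def exact_match_probability(epsilon: float, k: int) -> float:
    """Chance that all k answer tokens are correct, assuming independent
    per-token errors at rate epsilon."""
    return (1 - epsilon) ** k

sizes = ["100M", "1B", "10B", "100B", "1T"]             # hypothetical model sizes
epsilons = [0.5 / (2 ** i) for i in range(len(sizes))]  # 0.50, 0.25, 0.125, ...

for size, eps in zip(sizes, epsilons):
    acc = exact_match_probability(eps, k=8)             # e.g. an 8-token answer
    print(f"{size:>5}  per-token error {eps:.3f}  exact match {acc:.3f}")
```

The per-token improvement is the same factor of two at every step, but on the exact-match axis the first step looks like nothing and the middle steps look like a take-off.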

Second, axis choices conceal smoothness. Plot loss against compute on log–log axes and you see a straight line. Plot exact-match accuracy against compute on log–linear axes and you see a phase transition. The transition is real on the chosen axes, but it does not reflect a discontinuity in the model. Schaeffer et al. demonstrated this constructively by taking the same model outputs and rescoring them with continuous metrics (token-level edit distance, partial credit, Brier score) and showing that the curves became smooth. They also reproduced a "phase transition" in a synthetic model where, by construction, nothing discrete is happening, simply by choosing a hard-threshold metric. The point is not that models fail to improve with scale (they obviously do), but that the appearance of suddenness is at least partly an optical illusion produced by the rubric.
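
A toy version of the rescoring move, with invented predictions, shows what changes. Judged by exact match, every output but the last scores zero; judged with simple partial credit, the same outputs improve steadily.

```python
# The same (invented) predictions scored two ways: a hard-threshold metric and
# a softer per-character one. Only the scoring changes; the outputs do not.

def exact_match(pred: str, target: str) -> float:
    return 1.0 if pred == target else 0.0

def char_overlap(pred: str, target: str) -> float:
    """Fraction of aligned characters that match: a crude partial-credit score,
    standing in for edit distance or a Brier-style metric."""
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target), 1)

target = "123456"
# Hypothetical outputs from successively larger models, steadily closer to right.
predictions = ["908271", "128496", "123756", "123450", "123456"]

for pred in predictions:
    print(pred, "exact:", exact_match(pred, target),
          "partial:", round(char_overlap(pred, target), 2))
```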

The mirage paper is now widely cited and has reshaped how careful labs report capability evaluations. It argued, and most practitioners now accept, that you should track a continuous metric alongside any binary one, that the elbow on a log–linear accuracy plot is rarely the sign of anything physical, and that "emerged at $N$ parameters" is sloppy phrasing unless you can show a continuous proxy with the same shape. The mirage critique also has a quieter implication for safety: an evaluator who sees flat zero accuracy at small scale should not conclude that the model has zero capability, because the per-token error rate may be falling steadily underneath, and the capability may surface as soon as the answer length is short enough or the rubric is generous enough.

What's really going on

Both papers are partly right, and the synthesis is more interesting than either taken alone. Three things are true at once.

First, the next-token cross-entropy really is smooth in scale. Across six orders of magnitude in parameters and tokens, the loss falls along a power law with a small irreducible offset. Whatever else changes during scaling, the loss curve is well behaved.
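
One standard way to write the claim, in the parameterisation popularised by Hoffmann et al. (2022), is

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the parameter count, $D$ the number of training tokens, $E$ the small irreducible offset, and $A$, $B$, $\alpha$, $\beta$ fitted constants. Nothing in that functional form can produce a kink; any sharpness has to come from what sits on top of the loss.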

Second, a great many "emergent" capabilities are exactly what Schaeffer et al. described. Multi-digit arithmetic looks emergent because exact-match accuracy on a $k$-digit answer is a $(1-\varepsilon)^k$ function of a smoothly improving error rate. Most multi-step benchmarks have the same structure. If the gradient of the loss is informing every weight every step, there is no reason to expect a phase transition in the underlying competence; the transition is in the scoring rubric.

Third, some genuine threshold effects survive the critique. In-context learning of arbitrary new symbol mappings, being told that "wug" means "dog" and using the new mapping coherently for the rest of the prompt, appears to require a circuit-level transition that not all training runs reach (Chan et al., 2022). Olsson et al. (2022) showed that induction heads, the attention pattern responsible for in-context copying and pattern completion, form abruptly during a narrow window of training, and the loss curve has a small but visible bump at exactly that point. These are genuine examples of discrete events in the model, not in the scoring rubric: the aggregate loss curve stays nearly smooth, but within that narrow window of training the transition is sharp, and it corresponds to identifiable circuits assembling. A handful of further capabilities (robust chain-of-thought generalisation, certain instruction-following behaviours, multi-hop tool use) also appear to require scale in a way that does not vanish under softer metrics, though here the evidence is messier and the literature is still moving.

The summary is therefore: the loss is smooth, most "emergent" benchmark curves are artefacts of hard-threshold scoring on top of smoothly falling per-token error, and a small but real residue of phase-like transitions corresponds to circuit-formation events visible to mechanistic interpretability. Calling everything "emergent" was an overclaim; calling everything "a mirage" was an undersell.

Practical implication

For an engineer or researcher working at sub-frontier scale, this debate has concrete consequences. Do not try to demonstrate complex multi-step reasoning on a 1 B model and conclude that the architecture does not work; at that scale the per-token error rate puts you on the flat part of the $(1-\varepsilon)^k$ curve, and the same architecture at 70 B may behave entirely differently. Conversely, do not assume that scaling up will deliver proportional gains on every metric; some plateaus are real, some capabilities are bounded by data rather than parameters, and some downstream tasks are saturated by 7 B and unmoved by 70 B.

For evaluation, the operational rules are: track loss, not just accuracy, during training; report a continuous metric alongside any binary benchmark; check whether an apparent jump survives a softer rubric before calling it emergent; and run capability evaluations on a ladder of scales rather than only on the final checkpoint, so that you can see the shape of the curve rather than a single endpoint. For safety arguments, the relevant question is not "did this ability emerge?" but "is there a continuous metric on which it is climbing, and how fast?"; if the answer is yes and fast, you should plan for the next scale-up to push further along the same curve. For research, the most fruitful targets are the small set of genuine circuit-level transitions (induction heads, in-context learning of novel mappings, possibly some reasoning circuits), because those are the places where mechanistic interpretability has something concrete to bite into and where the smoothness of the loss curve hides discrete structure underneath.
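
A toy version of the rubric check from the evaluation rules above makes it concrete. The scores below are invented: exact match jumps sharply at the largest checkpoint, but the continuous companion has been climbing steadily, so under this crude rule of thumb the jump is reported as a metric artefact rather than an emergent ability.

```python
# Sketch of the "does the jump survive a softer rubric?" check. The ladder of
# checkpoints and all scores are invented; the threshold is a crude rule of
# thumb, not a published criterion.

checkpoints = ["125M", "1B", "7B", "70B"]            # hypothetical scale ladder
exact_match_scores = [0.00, 0.01, 0.03, 0.62]        # binary benchmark metric
partial_credit_scores = [0.11, 0.27, 0.45, 0.71]     # continuous companion metric

def apparent_jumps(binary, continuous, names, jump=0.30):
    """Yield each binary-metric jump together with whether the continuous
    proxy jumped by as much at the same step."""
    for i in range(1, len(binary)):
        if binary[i] - binary[i - 1] >= jump:
            yield names[i], (continuous[i] - continuous[i - 1]) >= jump

for ckpt, survives in apparent_jumps(exact_match_scores, partial_credit_scores, checkpoints):
    verdict = "sharp on the soft rubric too" if survives else "likely a metric artefact"
    print(f"apparent jump at {ckpt}: {verdict}")
```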

What you should take away

  1. Wei et al. (2022) defined emergent abilities as capabilities absent in small models and sharply present in large ones; their plots showed dozens of benchmark tasks with this hockey-stick shape.
  2. Schaeffer et al. (2023) showed that hard-threshold metrics like exact-match accuracy can manufacture apparent emergence on top of a smoothly falling per-token error rate, and that switching to continuous metrics often makes the curves gradual.
  3. The next-token cross-entropy is smooth in scale across six orders of magnitude; most "emergent" benchmark curves are properties of the scoring rubric, not of the model.
  4. A genuine residue remains: induction-head formation, in-context learning of arbitrary symbol mappings and possibly some reasoning behaviours reflect real circuit-level transitions, with induction-head formation visible as a small bump in the training loss curve.
  5. In practice, track a continuous metric alongside any binary benchmark, evaluate at a ladder of scales, and reserve the word "emergent" for cases where a softer rubric still shows a sharp transition.
