Glossary

Emergent Abilities

Emergent abilities of large language models, characterised by Wei et al. (2022) in "Emergent Abilities of Large Language Models" (TMLR), are capabilities that appear sharply with scale rather than improving smoothly: a task can be essentially absent at one parameter count and dramatically present at the next. The empirical phenomenon, and the controversy surrounding its interpretation, has been one of the central topics of frontier-model research since 2022.

The empirical phenomenon

Wei and colleagues catalogued dozens of tasks on which model performance was near-random until some scale threshold and then jumped to substantial competence. Examples include:

  • Three-digit arithmetic (e.g. $384 + 729$): essentially zero accuracy below GPT-3 scale, $>50\%$ accuracy at GPT-3 175B and beyond.
  • Transliteration between scripts.
  • Modular arithmetic and word unscrambling.
  • Instruction following without fine-tuning.
  • Chain-of-thought reasoning: smaller models do not benefit from (and may be hurt by) CoT prompting; larger models gain substantially.
  • Performance on hard BIG-Bench tasks (logical deduction, multi-step word problems).

The characteristic plots, flat near random across orders of magnitude of scale and then turning sharply upward, shaped the scaling-laws narrative: capability progress with scale is not always smooth, and qualitative jumps in functionality may appear at thresholds that cannot be predicted from below.

Why this mattered for the field

Emergence has been one of the central observations driving frontier-model investment. If ability $A$ first appears at $10^{23}$ FLOPs, then ability $B$, currently absent, might appear at $10^{25}$, and ability $C$ at $10^{27}$. The argument by extrapolation is that future capabilities will continue to emerge unpredictably, justifying ever-larger compute budgets. This logic has informed multi-billion-dollar training runs at OpenAI, Anthropic, Google DeepMind and Meta.

The Schaeffer challenge

Schaeffer, Miranda and Koyejo (NeurIPS 2023), in "Are Emergent Abilities of Large Language Models a Mirage?", argued that many "emergent" capabilities are artefacts of evaluation metrics. Discontinuous metrics (exact-match accuracy, BLEU above a threshold, multiple-choice top-1) have a threshold structure that converts smooth underlying improvements into apparent step functions:

  • For a $k$-digit arithmetic task, exact-match accuracy is roughly $\Pr[\text{correct token}]^k$. Small smooth improvements in the per-token probability raise this product from $\sim 0$ to substantial values along an S-curve (simulated in the sketch after this list).
  • Replacing exact-match with token-edit distance or per-token log-likelihood smooths the curves and removes most "emergence".
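
A minimal numerical sketch of this argument (with made-up numbers, not Schaeffer et al.'s data): a per-token accuracy that improves smoothly with scale looks like a sudden jump once an exact-match metric raises it to the $k$-th power.

```python
import numpy as np

# Illustrative only: per-token accuracy p is assumed to improve smoothly
# (a logistic curve in log-compute); the scale axis and constants are
# made up, not fitted to any real model family.
flops = np.logspace(20, 26, 13)                    # illustrative scale axis
p = 1 / (1 + np.exp(-2 * (np.log10(flops) - 23)))  # smooth per-token accuracy
k = 4                                              # answer length in tokens

exact_match = p ** k  # thresholded metric: looks like sudden "emergence"
per_token = p         # graded metric: the same improvement, but smooth

for f, pt, em in zip(flops, per_token, exact_match):
    print(f"{f:.0e} FLOPs  per-token={pt:.3f}  exact-match={em:.3f}")
```

Printed side by side, the per-token column climbs steadily across the whole range while the exact-match column sits near zero before shooting upward: exactly the flat-then-sharp shape of the emergence plots.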

The current state of the debate

The challenge is partly correct and partly contested:

  • Some "emergence" is genuinely metric-induced. Switching to graded metrics often eliminates the discontinuity.
  • Some capabilities do appear non-smoothly even under continuous metrics: for instance, in-context learning of certain algorithmic tasks, or grokking (Power et al., 2022), where a model suddenly generalises after a long period of apparent memorisation.
  • Phase transitions in training dynamics (e.g. the formation of induction heads at a particular training step, Olsson et al. 2022) suggest there are real qualitative shifts in model behaviour that are not just metric artefacts; a toy illustration of the induction pattern follows this list.
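
As a toy illustration of the induction pattern (a pure-Python stand-in, not an actual attention head): an induction head completes a context of the form $\ldots[A][B]\ldots[A]$ with $[B]$, copying whatever followed the previous occurrence of the current token.

```python
# Toy stand-in for the behaviour (not the mechanism) of an induction head:
# find the most recent earlier occurrence of the current token and predict
# the token that followed it.
def induction_predict(context):
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):  # scan backwards
        if context[i] == last:
            return context[i + 1]
    return None  # no earlier occurrence: the pattern offers no prediction

print(induction_predict(list("the cat sat. the ca")))  # -> 't'
```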

Related phenomena

  • Grokking (Power et al., 2022): delayed generalisation in which a model suddenly transitions from memorisation to generalisation long after training loss has converged (a minimal training sketch follows this list).
  • Phase transitions in mechanistic interpretability: induction-head formation, the sudden appearance of factual recall circuits.
  • Emergent multilingualism: smaller models perform poorly across languages, larger ones develop cross-lingual transfer.
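
A minimal sketch of the kind of experiment in which grokking is reported, assuming PyTorch. The MLP architecture and hyperparameters below are illustrative stand-ins (Power et al. used a small transformer), and whether and when the delayed validation jump appears is sensitive to these choices.

```python
import torch
import torch.nn as nn

# Modular addition with strong weight decay, in the spirit of
# Power et al. (2022). Illustrative hyperparameters; long-running.
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P                        # a + b mod P

perm = torch.randperm(len(pairs))
half = len(pairs) // 2
train_i, val_i = perm[:half], perm[half:]

class ModAdd(nn.Module):
    def __init__(self, p=P, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(),
                                 nn.Linear(256, p))
    def forward(self, x):                 # x: (batch, 2) integer operands
        return self.mlp(self.emb(x).flatten(1))

model = ModAdd()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_001):                # full-batch training
    opt.zero_grad()
    loss_fn(model(pairs[train_i]), labels[train_i]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr = (model(pairs[train_i]).argmax(-1) == labels[train_i]).float().mean().item()
            va = (model(pairs[val_i]).argmax(-1) == labels[val_i]).float().mean().item()
        print(f"step {step:>6}  train_acc={tr:.2f}  val_acc={va:.2f}")
# Grokking signature: train_acc reaches 1.00 early while val_acc sits near
# chance, then val_acc jumps towards 1.00 many thousands of steps later.
```

Weight decay is the ingredient most often reported as driving the delayed generalisation; without it the validation jump typically arrives much later or not at all.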

Whether emergence is real, an evaluation artefact, or some mixture remains an active research question. Practically, it has motivated frontier labs to push for ever-larger scale on the hypothesis that future capabilities will continue to emerge. The phenomenon also feeds into safety arguments: dangerous capabilities (deception, autonomous replication) might appear suddenly without warning, complicating the case for predictive evaluations.

Related terms: Scaling Laws, Chain-of-Thought, In-Context Learning
