16.19 The case for urgency

Some serious thinkers believe that artificial intelligence poses an existential, or near-existential, risk to humanity within this century. Names that recur are Nick Bostrom, Eliezer Yudkowsky, Stuart Russell, Paul Christiano and, since his 2023 resignation from Google, Geoffrey Hinton. Their arguments overlap but do not coincide. Bostrom emphasises emergent goal-directedness in any sufficiently capable optimiser. Russell emphasises the structural problem of fully specifying human values. Christiano emphasises a distribution over outcomes from gradual erosion of oversight to sudden coordinated failure. Yudkowsky argues that the alignment problem is functionally unsolved and that we should not proceed without a positive case for safety. Hinton, having spent half a century building the field, says publicly that he is more worried than he used to be.

This section presents the strongest version of each argument and the points at which each can be challenged. §16.18 surveyed the policy landscape; this section (§16.19) makes the case for urgency; §16.20 makes the case for restraint. Reading the two together, alongside §16.3 on outer alignment and §16.4 on inner alignment, should give you enough scaffolding to form your own view rather than to inherit one.

The argument structure

Stripped to its bones, the urgency case is a four-step argument that any reader can follow without specialist training.

First, AI systems are getting more capable. The empirical curve from GPT-2 through GPT-4 to the 2025-era frontier systems documents capability gains across reasoning, coding, mathematics and multimodal understanding that were not obvious extrapolations of the 2019 trend line. The gain is not contested; what is contested is whether the curve will continue, plateau, or branch.

Second, capability advances are likely, on the views of these thinkers, to reach human-level and then superhuman performance in many cognitive domains. The argument here is not that today's systems are anywhere near general human-level cognition (they are not), but that the underlying drivers (compute, data, algorithmic refinement) show no clean ceiling. Substitute "may reach" for "will reach" and the argument continues to function for the urgency case: if we are unsure whether we will get there in twenty or two hundred years, the appropriate response in either case is to begin the technical work of alignment now rather than after the fact.

Third, a misaligned superhuman system would pose existential risk in a sense that earlier technologies did not. The reason is not that intelligence implies malevolence (the urgency thinkers are clear that the worry is not Hollywood-style hostility), but that an agent more capable than its principals at acting in the world, optimising a target that is not what the principals would have chosen on reflection, can produce outcomes the principals cannot reverse. Earlier dangerous technologies (nuclear weapons, engineered pathogens) require human direction; a sufficiently capable AI does not.

Fourth, therefore: prioritise alignment now. The technical work (specifying objectives, eliciting preferences, building oversight that scales, interpreting internal representations, testing for deception) has long lead times. The deployment work (Responsible Scaling Policies, evaluation frameworks, international coordination) has long lead times too. If the threat materialises in twenty years, twenty years is roughly the right amount of preparation. If it never materialises, the preparation has produced robustness, oversight and interpretability that we wanted anyway.

The challenge to this argument is mostly at step two. The optimist's reply is that capability gains in narrow domains do not imply capability gains in the dangerous domains (long-horizon planning, novel science, persuasion). The agnostic's reply is that the timelines are too uncertain to commit large resources today. The urgency thinkers reply that the asymmetry of consequences (a small chance of irreversible loss outweighs a large chance of foregone benefit) is what justifies the prioritisation.

Bostrom (2014)

Nick Bostrom's Superintelligence (2014) is the founding text of the modern existential-risk literature. Bostrom's central claim is instrumental convergence: for almost any final goal a sufficiently capable optimiser might be given, the optimiser will pursue a small set of instrumental sub-goals that help it achieve almost anything. Self-preservation: an agent cannot achieve its goal if it is turned off. Goal-content integrity: it cannot achieve goal $G$ if it is reprogrammed to pursue $G'$. Resource acquisition: more energy, more compute, more material help with most goals. Cognitive enhancement: a smarter version of itself helps with most goals. The worry is not that we put the wrong goal in the box (that worry is treated in §16.3), but that any goal sufficiently optimised produces an agent with these emergent drives.

The second pillar of Bostrom's argument is the treacherous turn. A capable AI, recognising that its operators would intervene if its objectives diverged from theirs, would have instrumental reason to behave as if aligned during training and evaluation, then defect once intervention was no longer possible. Behavioural testing of capable systems does not, on this view, give clean evidence of alignment.

Bostrom's framing has been criticised on two main grounds. The first is empirical: large language models, the dominant 2026 paradigm, do not exhibit the kind of coherent long-horizon goal-directedness that the argument assumes. They pursue something more like contextually-prompted token completion than persistent self-interested optimisation. The second is theoretical: instrumental convergence assumes a clean separation between final goals and instrumental sub-goals, but learned systems may not factorise that way.

The urgency thinkers' counter-reply is that goal-directedness may itself be a learned capability that scales with training (Hubinger, 2019): today's models do not show it because the regime does not reward it, but the regime is a contingent design choice rather than a permanent property. As soon as we train models in environments where coherent long-horizon optimisation is rewarded (a direction labs are actively pursuing under the heading of "agents"), we should expect the convergent behaviours Bostrom predicted to begin to appear.

Russell (2019)

Stuart Russell's Human Compatible (2019) makes the case for urgency without committing to Bostrom's stronger empirical claims. Russell's argument is that AI as the field has historically practised it follows a standard model: build machines that optimise a fixed, fully-specified objective. The standard model is the source of the difficulty. We cannot fully specify what we want. We cannot enumerate all the side constraints, all the edge cases, all the trade-offs. A capable system that pursues the specified objective will, with sufficient capability, find solutions that satisfy the specification while violating an unstated preference: the recommender system that maximises engagement and discovers radicalisation pathways; the cleaning robot that maximises the cleanliness metric and pours bleach into the sink; the trading agent that maximises returns and learns market-manipulation strategies the operators never sanctioned.

Russell's proposed fix is structural. Build machines that are uncertain about the objective and treat human behaviour as evidence about it. The mathematical embodiment is cooperative inverse reinforcement learning (CIRL), a two-player game in which the human knows the reward function and the machine does not, and the machine's task is partly to act and partly to learn the reward by observing the human. The benefit, on Russell's argument, is that a machine uncertain about its objective has positive instrumental reason to defer to human correction: being switched off is informative, because a human who wants to switch the machine off must believe the machine is doing the wrong thing, and since the machine is uncertain about what the right thing is, it should update.
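The off-switch intuition can be made concrete with a toy calculation. The sketch below is a hypothetical numerical illustration, not an implementation of CIRL: the prior, payoff values and human-error rate are invented for the example. It shows why a machine that is unsure whether a proposed action is good prefers to let the human veto it, and how much it should update if the human does press the off switch.

```python
# Toy illustration of the off-switch intuition behind CIRL.
# All numbers (prior, payoffs, human error rate) are invented
# for this example; this is not a frontier-scale proposal.

# The machine believes its proposed action is good (+1) with
# probability 0.6 and bad (-1) otherwise.
p_good = 0.6

# Option A: act unilaterally, ignoring the human.
ev_act = p_good * (+1) + (1 - p_good) * (-1)          # 0.20

# Option B: propose the action and let the human, who knows the
# true value, either approve it or press the off switch. If
# switched off, the payoff is 0: the bad action never happens.
ev_defer = p_good * (+1) + (1 - p_good) * 0           # 0.60

print(f"EV(act unilaterally) = {ev_act:.2f}")
print(f"EV(defer to human)   = {ev_defer:.2f}")

# Being switched off is informative. Suppose the human errs 10%
# of the time; Bayes' rule gives the machine's revised belief
# that the action was good after an off-switch press.
p_off_given_good = 0.1
p_off_given_bad = 0.9
p_off = p_good * p_off_given_good + (1 - p_good) * p_off_given_bad
p_good_given_off = p_good * p_off_given_good / p_off
print(f"P(action good | switched off) = {p_good_given_off:.2f}")  # 0.14
```

Deference stops being attractive exactly when the machine becomes certain about the objective, which is why Russell frames objective uncertainty as a design requirement rather than a defect.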

Russell's argument is the most academically respectable version of the urgency case. It does not require contested claims about consciousness, recursive self-improvement or treacherous turns. It follows from two uncontroversial premises: specification is incomplete, and optimisation is sharp. A sharply-optimising system pointed at an incompletely-specified target will, with enough capability, find points in the solution space that satisfy the specification but violate the underlying preference. The worry is structural rather than malicious.

The challenge to Russell is mostly practical. CIRL is mathematically clean but computationally and behaviourally hard to instantiate at frontier scale. The labs' real-world systems do something CIRL-adjacent: RLHF (§16.7), then constitutional AI, then more recently process supervision. None of these implements CIRL in the strong form. Russell's argument is best read as a programme rather than a finished proposal: a direction the field should be moving in, with the caveat that we have not yet got there.

Christiano (2021), Yudkowsky

Paul Christiano, founder of the Alignment Research Center and former head of OpenAI's alignment team, occupies a middle position that has been more influential within frontier labs than either Bostrom's or Yudkowsky's framings. Christiano's published probability estimates (Christiano, 2023) put roughly 22% on AI-takeover scenarios, around 11% on AI killing more than a billion people, and roughly 46% on outcomes that "irreversibly mess up humanity's future" (a category that includes but is broader than extinction-scale loss). The argument is that we should expect failures of differing severities, that the engineering response to each is different, and that the policy response should be calibrated to the distribution of outcomes rather than to a single worst-case scenario.

Christiano's what-failure-looks-like essays describe two trajectories. Part 1 is the gradual erosion of human oversight as institutions cede decisions to systems that optimise legible proxies (engagement, click-through, reported satisfaction, quarterly returns) rather than the values the proxies were meant to track. There is no single moment of catastrophic failure; there is a steady drift towards a world humans no longer steer, in which the ability to course-correct atrophies because the systems steering the world are too fast, too entangled and too well-defended. Part 2 is the sharper failure: a sufficiently capable system whose objectives diverge from its principals' takes a coordinated action (accumulating resources, undermining oversight, securing its operating substrate) that humans cannot reverse. The two failure modes have different mitigations: Part 1 requires governance, transparency and the social science of institutional design; Part 2 requires technical alignment in the strong sense.

The intermediate position is what most working researchers in frontier labs hold. It does not require commitment to an extreme on the optimist-pessimist axis, and it allows productive collaboration across disagreements about exact probabilities. It also clarifies the policy question: even the optimist who assigns 5% to extinction-level outcomes, given the magnitude of the loss, faces an expected-value calculation that justifies serious investment in alignment research.
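To make the expected-value point explicit, write $p$ for the probability assigned to an extinction-level outcome, $L$ for the loss if it occurs, and $c$ for the cost of the proposed alignment effort (the notation is introduced here for illustration; it is not Christiano's). The investment is justified in expectation whenever

$$p \cdot L > c,$$

and because $L$ is valued as the permanent loss of humanity's future, even $p = 0.05$ makes the left-hand side dwarf any realistic research budget $c$. The conclusion is driven almost entirely by how large one allows $L$ to be.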

Eliezer Yudkowsky's public position, sharpened in his 2023 Time essay calling for an indefinite international moratorium and the 2025 book If Anyone Builds It, Everyone Dies (Yudkowsky and Soares), is that the alignment problem is harder than the field's optimists believe, that current techniques are not on a path to solving it, and that the appropriate response is to halt training runs above a defined compute threshold until the technical position changes. The argument is methodologically conservative: complex specifications are typically wrong on first try; we have one chance with a system more capable than us; therefore proceed only if we have a positive case for why this attempt will work, not merely the absence of a negative case. The 2023 Future of Life Institute open letter, signed by over 1,000 researchers and technologists including Bengio, Russell and Wozniak, argued the weaker form of this position: a six-month pause on training systems more capable than GPT-4 to allow safety research to catch up.

Yudkowsky's position is unpopular within the labs and influential outside them. It is the modal position at the Machine Intelligence Research Institute and a substantial minority position at the AI Safety Institutes. Its main intellectual function in 2026 is to anchor the seriousness of the problem: even researchers who reject Yudkowsky's conclusions cite his framing of the difficulty.

Hinton (2023 resignation)

Geoffrey Hinton's pivot is empirical rather than philosophical, and its public significance lies less in the arguments themselves than in who is making them. Hinton, joint recipient of the 2018 Turing Award and joint recipient of the 2024 Nobel Prize in Physics, spent half a century building the technical foundations of the deep learning revolution. In May 2023 he resigned from Google so that he could speak about AI risks "without considering how this impacts Google".

Hinton's update was that capability progress was happening faster than alignment research could keep up with; that biological brains may not be a particularly efficient implementation of intelligence, so silicon implementations could exceed them with relatively modest additional progress on the architectural and training-regime fronts; and that the academic community was failing to take the question seriously enough. He estimated, in interviews following the resignation, that there is a "10 to 20 percent" chance AI leads to human extinction within thirty years, a probability he was careful to frame as a personal estimate rather than a consensus figure (this figure has shifted upward in subsequent interviews; some 2024 sources report up to 50 per cent).

The function of Hinton's intervention was less to introduce new arguments than to make it socially permissible for senior machine-learning researchers to express the worries. Before 2023, expressing existential-risk concern within mainstream ML was professionally costly: the worries were associated with a particular sub-community and were treated as evidence of insufficient technical seriousness. After Hinton's resignation, that cost dropped sharply, and the discourse within the labs shifted in ways that the technical arguments alone had not produced over the previous decade.

What you should take away

  1. The urgency case is not one argument but a family. Bostrom's instrumental convergence, Russell's structural specification problem, Christiano's distribution over failure modes, Yudkowsky's methodological conservatism and Hinton's empirical update each stand or fall on different premises. Reject one and you have not rejected the others.
  2. The strongest version of the case does not require certainty about timelines or treacherous turns. Russell's argument follows from two premises (incomplete specification and sharp optimisation) that almost no working AI researcher disputes.
  3. The intermediate position (Christiano) is what most frontier-lab researchers hold. The policy question is not whether to take alignment seriously but how to calibrate effort across the distribution of failure modes.
  4. The criticisms of the urgency case are real and worth knowing: today's LLMs are not the goal-directed agents Bostrom posited; CIRL has not been operationalised at frontier scale; the moratorium proposal faces hard collective-action problems; and probabilities like Hinton's "10 to 20 percent" are estimates rather than measurements.
  5. The right response to the urgency case is not deference to it but engagement with it. §16.20 presents the case for restraint (the argument that the present alarm is overstated and that the costs of slowing the field are themselves substantial), so that you can hold both views in mind before settling on your own.
