16.17 Responsible Scaling Policies
Responsible Scaling Policies, or RSPs, are public commitments by frontier AI laboratories that tie continued development to the demonstrated ability to manage specific risks. The basic shape is the same across labs. Each policy lists capability thresholds: concrete, behaviour-based descriptions of what a model can do, such as "the model can autonomously replicate itself across servers" or "the model can substantially uplift a non-expert attempting to synthesise a chemical or biological weapon". Each threshold is paired with a set of safety, security and oversight commitments that must be in place before a model crossing that threshold can be trained further or deployed. If the lab cannot meet the commitments, the policy says, it will pause. Anthropic published the first such document in September 2023; OpenAI's Preparedness Framework and DeepMind's Frontier Safety Framework followed within a year and use compatible vocabulary. All three are voluntary, lab-specific and regularly revised, and they sit awkwardly between a moral promise and an industrial standard.
In §16.16 we looked at the harms generated when synthetic media flows out of a model into the world: deepfakes, watermarking, content provenance. RSPs zoom out from any one harm and ask the question one level up. Given that capability is rising and that some capabilities are dangerous in kind rather than just in degree, what does a frontier lab owe the public about the conditions under which it will keep building? The answer that the labs have collectively converged on, for now, is the RSP.
What an RSP says
An RSP has three logical components, regardless of which lab wrote it. First, a list of capability thresholds. These are not benchmark numbers and not measures of intelligence; they are dangerous-capability descriptors. A typical threshold reads like a description of what an attacker could do with the model in their hands, or what the model could do unsupervised. The published thresholds in 2026 cluster around four families: cyber-offence (the model can write and deploy novel exploits at scale), CBRN uplift (the model materially helps a moderately competent actor build a chemical, biological, radiological or nuclear weapon), autonomous replication (the model can copy itself and acquire resources without human help), and large-scale persuasion or manipulation (the model can run sophisticated influence operations at low cost).
Second, a list of safety commitments attached to each threshold. These commitments are operational, not aspirational. They specify what evaluation suite must be passed, what security posture must be in place to protect model weights from theft, what monitoring is required during deployment, what red-teaming must be completed, and what governance approval is required before a model crossing the threshold can be released. The commitments scale: a model crossing a low threshold needs the kinds of controls the lab uses today, while a model crossing a high threshold may require third-party verification, hardware security modules, restricted deployment, and a formal safety case, a structured argument that the residual risk is acceptable.
Third, a pause condition. If the lab develops a model that crosses a threshold for which the safety commitments cannot be met, the policy says the lab will stop further training and deployment of that model class until the commitments can be met. The pause is what distinguishes an RSP from a marketing brochure. The whole point is to put a procedural brake on development that the lab agrees to in advance, when the temptation is weakest, and that becomes binding when the temptation is strongest. Whether the brake actually holds when commercial pressure intensifies is the open question that hangs over every RSP, and we shall return to it under Critiques.
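The three components fit together as a simple decision procedure. A minimal sketch in Python may make the logic concrete; the threshold names, commitment labels and evaluation results here are invented for exposition, not taken from any lab's published policy:

```python
from dataclasses import dataclass

# Illustrative sketch only: threshold names, commitment labels and evaluation
# results are invented for exposition, not drawn from any lab's actual policy.

@dataclass
class Threshold:
    name: str                     # dangerous-capability descriptor
    required_commitments: set     # controls that must be operational before crossing it

def release_decision(eval_results, thresholds, commitments_in_place):
    """Apply the pause condition: if the model crosses any threshold whose
    required commitments are not all in place, pause that model class."""
    for t in thresholds:
        crossed = eval_results.get(t.name, False)
        if crossed and not t.required_commitments <= commitments_in_place:
            return "pause"        # commitments unmet: stop further training and deployment
    return "proceed"              # no threshold crossed without its controls

# Hypothetical evaluation outcome and security posture
thresholds = [
    Threshold("cbrn_uplift", {"external red-teaming", "weight security", "safety case"}),
    Threshold("autonomous_replication", {"deployment containment", "third-party verification"}),
]
print(release_decision({"cbrn_uplift": True}, thresholds,
                       {"external red-teaming", "weight security"}))
# -> "pause": the safety-case commitment is missing
```

The point of writing it down this way is that the decision rule is fixed before the evaluation results arrive; the contested parts in practice are the evaluations themselves and who gets to say whether a threshold was crossed.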
Anthropic's RSP
Anthropic's RSP, originally published in September 2023 and revised twice since, is built around the idea of AI Safety Levels (ASLs), borrowed by analogy from the biosafety community's BSL-1 to BSL-4 framework. The analogy is not perfect: ASL-1 systems are not "safe" in any deep sense, and the levels are about deployment controls rather than physical containment. But the borrowing carries the right intuition: as the thing you are working with becomes more dangerous, the conditions under which you can work with it must become more demanding.
ASL-1 describes systems that pose no meaningful uplift in catastrophic capability. Smaller language models, narrow image classifiers, traditional machine learning pipelines and most pre-2020 systems sit at this level. Anthropic does not claim ASL-1 is risk-free, only that the risks are not the catastrophic ones the policy is designed to govern.
ASL-2 describes systems that show rudimentary signs of dangerous capability but that do not materially uplift a malicious actor over what they could obtain from freely available resources such as a search engine or a published textbook. As of 2026, ASL-2 is the level at which most deployed frontier models sit. The required controls are standard: red-teaming before release, published use policies, abuse monitoring, refusal behaviours for clearly harmful requests, and ordinary information security practices.
ASL-3 is where things tighten. An ASL-3 model is one that does materially uplift a malicious actor in CBRN or autonomous-cyber-offence capability, or that shows meaningful agentic capability that could be misused. The required controls include hardened security against weight exfiltration (the assumption being that nation-state-level actors will try to steal the weights), mandatory pre-deployment red-teaming by external evaluators including the AI Safety Institutes (AISIs), restricted API access with stronger authentication and monitoring, and explicit deployment containment. Anthropic's published transparency documents through 2024 and 2025 reported several "near misses" in which models scored above ASL-3 thresholds on internal evaluations and release was delayed until the corresponding controls were operational.
ASL-4 is reserved for systems posing autonomous catastrophic risk: models that could escape during training or deployment, autonomously acquire resources at scale, or undermine human oversight in ways the lab cannot reliably detect. Required controls at ASL-4 include formal safety cases, interpretability-based oversight (not merely behavioural red-teaming), and no deployment without third-party verification. As of 2026 no system has been classified as ASL-4 and the controls themselves are not fully developed.
ASL-5 sits at the top: systems exceeding human capability across all relevant domains. The current RSP says explicitly that ASL-5 controls are not yet defined, and the policy commitment is to define them before any system reaches the level. The frankness of this admission is itself worrying: the commitment to invent the brake before driving the car relies on the lab knowing how fast the car is going.
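To see the escalation in one place, here is a rough paraphrase of the levels in Python. The control labels are summaries of the prose above, not Anthropic's actual control list, and the assumption that controls accumulate as the level rises is made purely for illustration:

```python
# Rough paraphrase of the levels described above; control labels are summaries,
# not Anthropic's actual control list, and the accumulation rule is an assumption.
ASL_CONTROLS = {
    1: set(),                                   # no catastrophic-risk controls required
    2: {"pre-release red-teaming", "use policy", "abuse monitoring",
        "refusal behaviours", "standard infosec"},
    3: {"hardened weight security", "external evaluation (AISIs)",
        "restricted API access", "deployment containment"},
    4: {"formal safety case", "interpretability-based oversight",
        "third-party verification"},
    5: None,                                    # controls not yet defined by the policy
}

def controls_required(asl):
    """Return the full control set for a level, assuming controls accumulate
    from ASL-2 upward; None means the policy has not defined them yet."""
    if ASL_CONTROLS[asl] is None:
        return None
    required = set()
    for level in range(2, asl + 1):
        required |= ASL_CONTROLS[level]
    return required

print(sorted(controls_required(3)))   # ASL-2 controls plus the ASL-3 additions
print(controls_required(5))           # None: the brake has not been designed yet
```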
OpenAI's Preparedness Framework
OpenAI's Preparedness Framework, first published in late 2023 and updated in 2024 and 2025, organises things slightly differently. Instead of nested safety levels, it scores models in four categories: cybersecurity, chemical/biological/radiological/nuclear, persuasion, and model autonomy. Each category is rated on a four-level scale (low, medium, high, critical) based on capability evaluations specific to that category.
The deployment rule is a function of the post-mitigation score. A model whose highest category sits at medium or below can be deployed under standard release procedures. A model that scores high in any category may be deployed but only with category-specific mitigations and additional safeguards. A model that scores critical in any category cannot be deployed at all under the current framework, and further training of such a model is restricted until the company can show that the score can be brought down through mitigations. OpenAI publishes summary scorecards alongside major releases, including the elicitation methods used during evaluation, which allows external observers to see roughly which categories drove a release decision and which mitigations were applied.
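The rule reduces to a function of the worst post-mitigation category score. A short illustrative sketch, with the category names taken from the text and the return strings paraphrased rather than quoted from OpenAI's framework:

```python
# Minimal sketch of the deployment rule described above. Category names follow
# the text; the return strings are paraphrases, not OpenAI's exact wording.
LEVELS = ["low", "medium", "high", "critical"]

def deployment_decision(post_mitigation_scores):
    """post_mitigation_scores maps each tracked category (cybersecurity, cbrn,
    persuasion, model_autonomy) to one of the four levels."""
    worst = max(post_mitigation_scores.values(), key=LEVELS.index)
    if worst == "critical":
        return "no deployment; further training restricted until mitigations lower the score"
    if worst == "high":
        return "deploy only with category-specific mitigations and additional safeguards"
    return "deploy under standard release procedures"   # medium or below

# Hypothetical post-mitigation scorecard
print(deployment_decision({
    "cybersecurity": "medium",
    "cbrn": "high",
    "persuasion": "low",
    "model_autonomy": "medium",
}))
# -> deploy only with category-specific mitigations and additional safeguards
```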
DeepMind's Frontier Safety Framework, the third major published policy, takes a similar shape: a set of Critical Capability Levels across CBRN, cyber-offence, autonomy and machine-learning R&D, paired with mitigations that scale with the level. The three frameworks are not identical and the thresholds are not cross-calibrated (a model that scores "high" in OpenAI's persuasion category is not necessarily ASL-3 in Anthropic's vocabulary), but the structure is the same, and informal coordination between the labs and the AISIs has been pulling the categories into rough alignment over time.
Critiques
The critiques of RSPs are persistent and largely fair. The first is that they are voluntary. There is no regulator that can compel a lab to honour its policy and no court in which the policy can be enforced as a contract. A lab can rewrite its own RSP at any time; in practice all three labs have done so, sometimes loosening commitments in the process. The published change-logs are useful but they are also embarrassing.
The second is that the thresholds are self-defined. The lab decides what counts as "materially uplifting a malicious actor" and the lab designs the evaluations that decide whether a model crosses the threshold. As the METR review of frontier evaluations (METR, 2025) observed, the same model can score very differently depending on the harness, the prompt, the tooling, the chain-of-thought structure and the elicitation budget. Publicly cited benchmark scores systematically underestimate the capability available to a determined adversary with a better scaffold.
The third is that capability descriptions are fuzzy. "Substantially uplift" is not a number. "Autonomous replication" admits of many readings, ranging from "the model can copy a file" to "the model can rent compute, monetise itself, and survive shutdown attempts". Whether a given evaluation result crosses a threshold often comes down to expert judgement, which means it comes down to people whose careers depend on the lab.
The fourth, which Yudkowsky and others press hardest, is that commercial pressure runs entirely in the direction of under-classification. A lab that declares its model ASL-3 must absorb cost: external red-teaming, deployment delays, restricted access, hardware security upgrades. A lab that declares the same model ASL-2 ships sooner. The structural incentive is to find a reading of the evaluation that supports the cheaper label. The defenders of RSPs reply that this incentive has not yet caused obvious capture, that the published transparency documents do report uncomfortable findings, and that voluntary commitments with imperfect enforcement are better than no commitments at all. Whether that defence holds for another five years of capability growth is the live question.
Where this is going
External accountability is starting to fill the gap. The EU AI Act, in force from August 2024 with phased application through 2026, imposes requirements on general-purpose AI models above a compute threshold: model documentation, transparency on training data, copyright compliance, and for "systemic-risk" models, model evaluations, adversarial testing, incident reporting and cybersecurity safeguards. The United States has the NIST AI Risk Management Framework, several executive orders setting reporting requirements for frontier training runs, and emerging Congressional interest. Japan, Canada and the United Kingdom have variations on these themes.
The most important institutional development is the AI Safety Institutes. The UK AISI was established in late 2023; the US AISI followed in 2024; Japan, Canada, Singapore and the EU have analogues either operating or planned. AISIs evaluate frontier models pre-release under formal access agreements, run their own elicitation methods, and publish technical findings. They do not yet have the power to block a release, but they constitute the first external technical capacity that can independently challenge a lab's own evaluation results. Over time, the most plausible path is that AISI evaluation becomes a precondition for deployment under statute, with RSPs as the lab-internal apparatus that feeds into that external review.
What you should take away
- An RSP commits a lab in advance. It pairs concrete dangerous-capability thresholds with operational safety commitments and a pause condition that triggers when the commitments cannot be met.
- The three big frameworks share structure but not calibration. Anthropic's ASL-1 to ASL-5, OpenAI's four-by-four category-and-level scorecard, and DeepMind's Critical Capability Levels rhyme rather than agree, and a model called "high" by one lab is not automatically the same as one called "ASL-3" by another.
- Capability evaluations are the load-bearing bit, and they are fragile. Elicitation matters more than the published number; a determined adversary will get a higher score than the release notes suggest.
- Voluntary commitments have structural weaknesses. Self-defined thresholds, self-designed evaluations and commercial pressure to under-classify are not theoretical concerns, and revisions to published RSPs over time tend to loosen rather than tighten.
- External accountability is arriving. AISIs, the EU AI Act, NIST and analogous institutions are moving from observation toward soft enforcement, and the most likely future is a hybrid in which lab RSPs feed into a regulated pre-deployment review rather than being the only line of defence.