Glossary

METR and RE-Bench

METR (Model Evaluation & Threat Research, formerly ARC Evals) is a non-profit AI evaluation lab spun out of the Alignment Research Center in 2023. It runs the most-watched agentic capability evaluation programme outside the frontier labs themselves, with formal evaluation contracts with Anthropic, OpenAI, and Google DeepMind. Its flagship benchmark is RE-Bench (Research Engineering Benchmark), introduced in 2024.

RE-Bench consists of 7 hand-crafted research engineering tasks drawn from real ML R&D workflows: optimising a CUDA kernel, scaffolding a fine-tuning pipeline, reverse-engineering a training run, debugging an RL implementation, and so on. Each task takes a human ML engineer between 4 and 32 hours to complete and ships with a hand-graded scoring rubric that yields a continuous achievement score (not just success/failure). Models are run as agents in a sandboxed Linux environment with shell access, a file system, Python, and GPU compute.
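
As a rough illustration, the sketch below shows the shape such a task definition might take. This is a hypothetical Python rendering, not METR's actual harness or schema; all names and values are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Hypothetical shape of an RE-Bench-style task (illustrative only)."""
    name: str
    instructions: str                   # prompt given to the agent
    human_hours: float                  # reference human completion time
    score: Callable[[dict], float]      # continuous score over the agent's workspace

# Toy task standing in for a kernel-optimisation environment. The dict
# stands in for the sandboxed file system the real harness would expose.
kernel_task = Task(
    name="optimise_kernel",
    instructions="Improve the throughput of kernel.py without changing its outputs.",
    human_hours=8.0,
    score=lambda ws: ws.get("throughput_gbps", 0.0),
)

workspace = {"throughput_gbps": 410.0}  # state the agent leaves behind
print(kernel_task.score(workspace))     # 410.0: graded on a continuum, not pass/fail
```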

METR also publishes time-horizon evaluations, which ask: what is the longest task a model can complete coherently? Their 2025 retrospective showed that since GPT-4 (March 2023), the time horizon for autonomous task completion has roughly doubled every 4–7 months. GPT-4 reliably completed tasks of around 8 minutes; Claude 3.5 Sonnet, 30 minutes; o1, ~2 hours; Claude 4 Opus and o3, around 4–8 hours on RE-Bench-style tasks. If the trend continues, frontier systems would be completing 40–80-hour tasks autonomously by 2027, comparable to one to two weeks of full-time human work.
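
The extrapolation behind that projection is simple doubling arithmetic. The sketch below works it through, assuming a ~6-hour horizon in mid-2025 and a 6-month doubling time; both are illustrative choices within the ranges quoted above, not METR's exact figures.

```python
from datetime import date

def projected_horizon_hours(h0_hours: float, t0: date, t: date,
                            doubling_months: float) -> float:
    """Extrapolate an exponential time-horizon trend from a reference point."""
    elapsed_months = (t.year - t0.year) * 12 + (t.month - t0.month)
    return h0_hours * 2 ** (elapsed_months / doubling_months)

# Assumed inputs: a ~6-hour horizon in mid-2025, doubling every 6 months.
print(projected_horizon_hours(6.0, date(2025, 7, 1), date(2027, 1, 1), 6.0))
# 6 * 2^(18/6) = 48 hours by early 2027, inside the 40–80 hour projection
```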

Performance trajectory. On RE-Bench with an 8-hour budget, human ML engineers score around 1.0 (the per-task normalisation baseline). Claude 3.5 Sonnet (Oct 2024) scored 0.20; OpenAI o1, 0.40; Claude 4 Opus, ~0.85. In mid-2025, OpenAI o3 with full agent scaffolding matched or exceeded the human baseline (~1.05) on the same budget, METR's first formal "human parity at hours-scale R&D" finding.
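
The normalisation can be illustrated with a toy calculation. The sketch below assumes a simple linear scheme in which the unmodified starting solution maps to 0.0 and the human-expert mean to 1.0; the raw values are made up, and this is not METR's published scoring code.

```python
def normalise_score(raw: float, starting: float, human_mean: float) -> float:
    """Map a raw task score onto the benchmark scale: the unmodified starting
    solution becomes 0.0 and the human-expert mean becomes 1.0."""
    return (raw - starting) / (human_mean - starting)

# Made-up raw values: a kernel task where the starting solution scores 250,
# human experts average 650, and an agent reaches 410.
print(normalise_score(410.0, 250.0, 650.0))  # 0.4, the level reported above for o1
```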

Known issues. The benchmark is expensive to run ($100k+ in compute and human-evaluation costs per system per round), and the task set is small (7 tasks). The tasks are also specific to ML R&D, so RE-Bench is not a general capability test. METR releases redacted summaries of its pre-deployment evaluations rather than detailed traces, which limits independent reproduction. Time-horizon extrapolations have also been controversial: critics argue they overweight the recent jump delivered by reasoning models.

Modern relevance. METR and RE-Bench are the canonical reference points in the AI R&D automation discussion, frequently cited in AISI reports, EU AI Act compliance work, and AGI-timelines discourse. Their evaluations are increasingly invoked in pre-deployment safety reviews.

Reference: METR, "Evaluating Frontier Models for Dangerous Capabilities", technical reports, 2023–2025; metr.org.

Related terms: SWE-Bench, ARC-AGI, OpenAI o3, Claude 4 Family
