Glossary

SWE-Bench

SWE-Bench is a benchmark introduced by Carlos Jimenez and colleagues at Princeton University and the University of Chicago in late 2023, with the peer-reviewed version appearing at ICLR 2024 (Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?"). It evaluates AI systems on the practical task of fixing real software bugs.

Construction. Each task is built from a closed issue and the merged pull request that resolved it, drawn from one of twelve popular open-source Python repositories (django, sympy, scikit-learn, matplotlib, requests, sphinx, astropy, pytest, flask, pylint, xarray, seaborn). The benchmark provides:

  • A natural-language issue description.
  • The repository state immediately before the fix.
  • A set of tests that the merged PR makes pass (and which fail beforehand).

The model must generate a patch that, applied to the repository, causes the failing tests to pass without breaking other tests. Evaluation is binary per task and fully automated.
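
The check itself is mechanical. The sketch below illustrates the per-task logic in plain Python, assuming a checked-out base commit, a candidate patch file, and the two test lists each task ships with (the issue's previously failing tests and a sample of previously passing ones); the official harness runs the same steps inside per-repository containerised environments, so treat this as an outline rather than the real tooling.

    import subprocess

    def evaluate_task(repo_dir, patch_file, fail_to_pass, pass_to_pass):
        """Outline of the per-task check; argument names are illustrative."""
        # Apply the model-generated patch to the repository at its base commit.
        applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
        if applied.returncode != 0:
            return False  # a patch that does not apply scores zero

        def all_pass(test_ids):
            # pytest exits with code 0 only when every selected test passes.
            result = subprocess.run(["python", "-m", "pytest", *test_ids],
                                    cwd=repo_dir, capture_output=True)
            return result.returncode == 0

        # The issue's tests must now pass, and existing tests must not break.
        return all_pass(fail_to_pass) and all_pass(pass_to_pass)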

Variants.

  • SWE-Bench Lite is a 300-task subset focused on shorter, more localised fixes, used for cheap iteration.
  • SWE-Bench Verified, released by OpenAI in August 2024, is a 500-task subset audited by human software engineers to remove ambiguous specifications and brittle tests. It has become the standard frontier evaluation, since results on the original benchmark were sometimes distorted by these data quality issues.
  • SWE-Bench Multimodal adds visual UI bug-fix tasks.
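
The benchmark and its subsets are distributed as datasets on the Hugging Face Hub, so a task's raw ingredients can be inspected directly. A minimal sketch, assuming the Python datasets library and the dataset identifiers as published by the Princeton NLP group (field names follow the released schema):

    from datasets import load_dataset

    # The 500-task Verified subset; the Lite subset and the full benchmark use
    # "princeton-nlp/SWE-bench_Lite" and "princeton-nlp/SWE-bench" respectively.
    verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

    task = verified[0]
    print(task["instance_id"])        # unique task identifier
    print(task["repo"])               # source repository
    print(task["base_commit"])        # repository state before the fix
    print(task["problem_statement"])  # natural-language issue text
    print(task["FAIL_TO_PASS"])       # tests the fix must make pass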

Difficulty. Tasks involve navigating large, unfamiliar codebases, understanding existing abstractions, locating the relevant files, formulating a fix, and keeping it consistent with the project's style and test suite. Repositories average thousands of files and hundreds of thousands of lines of code. The tests are real and unforgiving.

Solver landscape. SWE-Bench catalysed the AI software engineer wave.

  • Devin (Cognition Labs, March 2024) used SWE-Bench in its launch demo, scoring around 13.9%.
  • AutoCodeRover, SWE-Agent and Aider were early open-source agents that pushed scores into the 20-30% range.
  • Anthropic's Claude 3.5 Sonnet with a custom harness reached 49% on SWE-Bench Verified in late 2024.
  • Claude Code (Anthropic's CLI agent) with Claude Sonnet 4 and Claude Opus 4 pushed scores past 70% on Verified through 2025.
  • OpenAI Codex (2025 generation) and GPT-5-based agents are competitive at the top of the leaderboard.

Significance. SWE-Bench is the most influential evaluation of agentic coding. It tests not just code generation but tool use, planning, retrieval, and the ability to recover from failure. Strong scores correlate with real productivity gains for developers, which made SWE-Bench numbers a standard headline metric in 2024-2025 model releases. Critics note that test-passing is a loose proxy for code quality and that some tasks can be solved by superficial edits, motivating the Verified subset and ongoing benchmark refinement.

The benchmark's design (real repositories, real issues, executable tests) is now widely copied for other domains.

Related terms: Devin / AI Software Engineer, OpenAI Codex (2025 generation), Claude 4 Family, Reasoning Model Training
