17.7 Code generation
Of all the knowledge-work applications of artificial intelligence, code generation has produced the most visible and most heavily measured productivity changes. The transformation has been swift. In June 2021, when GitHub Copilot first appeared in technical preview, the idea that a language model could write usable production code was a curiosity. Five years later, almost every professional software engineer has access to a coding assistant, the most ambitious organisations are running autonomous coding agents in continuous integration, and benchmark scores on real GitHub issues have moved from a few per cent to over three quarters. The productivity gains on routine work (autocomplete, scaffolding, test writing, refactoring, documentation) are now well established at roughly 25 to 50 per cent. The gains on hard problems (novel algorithms, large architectural changes, debugging across unfamiliar codebases) remain modest and depend heavily on the quality of human review.
The previous section described how machine learning is reshaping materials science by searching a vast combinatorial space for stable compounds. This section turns to a different kind of search: the search through the space of programs that satisfy a specification. Code generation sits at the intersection of natural-language understanding and formal reasoning, and it has become the proving ground for agentic systems: language models that plan, take tool actions, observe results and iterate.
Tools
The landscape of code-generation tools in 2026 spans editor plug-ins, full integrated development environments, terminal-based agents and cloud-hosted autonomous workers. The first widely used product was GitHub Copilot, launched in June 2021 in technical preview and made generally available in June 2022. Copilot was originally based on OpenAI's Codex model, a 12-billion-parameter GPT-3 derivative fine-tuned on public GitHub repositories. It began as inline autocompletion, suggesting the next few lines as you typed, and has since added Copilot Chat (a conversational sidebar), Copilot Workspace (an issue-to-pull-request agent), and Copilot Edits (multi-file changes). By 2024, GitHub reported over 1.3 million paid individual subscribers and more than 50,000 organisations using Copilot for Business.
Cursor, released by Anysphere in 2023, is a fork of Visual Studio Code with much deeper artificial-intelligence integration: a Composer panel that can plan and apply multi-file edits, an indexed view of the whole repository, and a tab-completion model trained specifically for code. Cursor has become the favoured environment for many professional engineers who want agentic editing without leaving their IDE.
Claude Code, released by Anthropic in 2025, is a terminal-based agent. Rather than living inside an editor, it operates from the command line with full filesystem and shell access, and is designed for longer autonomous runs: running tests, opening pull requests, performing migrations. It pairs with the Claude family of language models and emphasises explicit human-in-the-loop checkpoints.
Devin, announced by Cognition Labs in March 2024, was the first product to be marketed as an autonomous "AI software engineer" capable of taking a task from issue to merged pull request. Devin runs in a sandboxed cloud environment with its own browser, shell and editor, and it became the public face of the agentic-coding wave even before its underlying scores caught up with the hype.
Replit Agent, launched in 2024 inside Replit's cloud IDE, takes a natural-language description and builds a runnable project, deploys it, and iterates on user feedback, emphasising the path from idea to deployed application. It is particularly popular with non-professional builders.
Codex (the 2025 product) is OpenAI's coding-specialised reasoning agent, distinct from the 2021 model of the same name. It runs as a cloud-hosted worker that can be given a repository, a task and a budget, and returns a pull request. By early 2026 it was scoring above 80 per cent on SWE-Bench Verified in published evaluations.
Other significant tools include Aider (open-source CLI), Cline (open-source VS Code extension), Continue (open-source IDE plug-in), Windsurf (Codeium's IDE), and JetBrains' AI Assistant. The market is unusually crowded because the underlying models are commodities and most of the differentiation lies in the scaffolding and user experience.
Benchmarks
Three benchmarks dominate the literature. HumanEval, introduced by Chen and colleagues at OpenAI in 2021, contains 164 hand-written Python problems with unit tests. It captures the easy end of code generation: short, self-contained functions with clear specifications. HumanEval saturated quickly: GPT-4 scored 67 per cent in 2023, and by 2025 the strongest models were above 95 per cent. It is now used mainly as a sanity check.
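HumanEval results are reported with the pass@k metric from the same paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch of the unbiased estimator (the function name and the example numbers are ours; the formula and its numerically stable form follow Chen et al. 2021):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for one problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:  # every size-k draw must contain at least one passing sample
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 37 of which pass -> estimated pass@10
print(round(pass_at_k(200, 37, 10), 3))
```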
SWE-Bench, introduced by Jimenez and colleagues at Princeton in late 2023, evaluates whether a system can produce a passing patch for a real GitHub issue against a real codebase. It draws its issues from twelve popular Python projects and requires the model to navigate a multi-file repository, understand the bug, write a fix, and pass the project's existing test suite. SWE-Bench Verified is a 500-issue subset curated for solvability and clear test signal. The progression has been rapid. In late 2023, GPT-4 with simple prompting solved roughly 2 per cent of SWE-Bench. In March 2024, Devin scored 13.86 per cent on a 25 per cent random subset of SWE-Bench (SWE-Bench Verified itself was released later that year). Later in 2024, Anthropic's Claude 3.5 Sonnet with the open-source SWE-agent scaffold reached 33 per cent. By late 2024, Claude 3.5 Sonnet (new) and OpenAI's o1 each reached roughly 49 per cent, and OpenAI's o3 reached 71 per cent. By 2025, OpenAI's Codex agent and Anthropic's Claude 4 family pushed published numbers above 80 per cent. The April 2026 leaderboard shows the strongest published configurations clearing 82 per cent (preview models reportedly reach above 90 per cent).
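The logic behind a SWE-Bench evaluation is simple to sketch, even though the official harness, which isolates every run in a per-repository container and also checks that previously passing tests still pass, is not. A deliberately simplified illustration, with the function signature and argument names invented for this sketch rather than taken from the real harness:

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, base_commit: str,
                   model_patch: str, fail_to_pass: list[str]) -> bool:
    """Simplified SWE-Bench-style check: does the model's patch make the
    issue's failing tests pass?  (Containerisation and the check that
    previously passing tests still pass are omitted here.)"""
    def run(*cmd: str) -> subprocess.CompletedProcess:
        return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)

    run("git", "checkout", base_commit)              # the state the issue was filed against
    (repo_dir / "model.patch").write_text(model_patch)
    if run("git", "apply", "model.patch").returncode != 0:
        return False                                 # the patch does not even apply
    # run only the tests the reference fix was required to make pass
    result = run("python", "-m", "pytest", "-q", *fail_to_pass)
    return result.returncode == 0
```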
Codeforces competitive-programming ratings have become a third anchor. OpenAI's o3 reached an estimated rating of around 2700 in 2025, placing it in the top 0.1 per cent of human competitors and at the "international grandmaster" tier. This kind of performance reflects deep algorithmic reasoning under contest constraints, not the broader engineering capability that SWE-Bench attempts to capture, but it is a useful upper bound on what agents can do when the problem is well specified.
Productivity studies
Three results define what is empirically known about coding-assistant productivity. Peng and colleagues' 2023 randomised study at GitHub assigned 95 developers a JavaScript HTTP-server task; Copilot users completed it 55 per cent faster than the control group, with the largest gains for less experienced developers. The result has been replicated for narrower coding tasks (autocompletion, boilerplate, test scaffolding), with most studies reporting 25 to 55 per cent speed-ups.
A second key result is more cautionary. A 2023 Stanford study by Perry and colleagues found that developers using an AI assistant produced more security vulnerabilities than those without one, and were more confident their code was secure. The headline number, that AI-assisted developers wrote less secure code, has been contested in follow-up work, but the broader finding is consistent: assistants produce plausible-looking code that humans review less carefully than they would review their own first drafts.
A third strand examines aggregate workflow effects. Brynjolfsson, Li and Raymond's 2023 NBER paper on a contact-centre deployment of an LLM-based agent assistant (a different setting, but methodologically the cleanest large-N field study) found a 14 per cent productivity increase, again concentrated among novice workers. McKinsey's 2023 evaluation of generative-AI coding tools reported up to 50 per cent time savings on documentation, 45 to 50 per cent on code generation, and 20 to 30 per cent on refactoring, alongside more modest gains on planning, design and complex debugging. The pattern across all of these studies is consistent: large gains on routine, specifiable work; smaller gains on tasks that require system-level understanding; concentration of gains among less experienced workers.
Limits
The summary of where coding agents fail is unsentimental. Hallucinations remain common: agents invent API calls that do not exist, import modules that are not installed, and produce confidently wrong syntax in less common languages such as Rust, Zig, OCaml and Haskell. Hallucinations are most damaging when the resulting code compiles but is semantically wrong: a hallucinated function with a plausible signature is harder to catch than one that fails to parse.
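A constructed illustration of the distinction (the function and its bug are invented for this example): the code below parses, runs, and reads like textbook exponential backoff, yet the first retry waits zero seconds and the delays grow only linearly, because 2 * attempt was written where 2 ** attempt was meant.

```python
import time

def fetch_with_backoff(fetch, retries: int = 5, base_delay: float = 0.5):
    """Retry `fetch` with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            return fetch()
        except IOError:
            # looks right, is wrong: linear growth, and the first retry
            # sleeps base_delay * 2 * 0 == 0 seconds
            time.sleep(base_delay * 2 * attempt)
    raise IOError("all retries failed")
```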
Security is a persistent weakness. Models trained on public code have learned the bad habits along with the good. They produce SQL injection, cross-site scripting, hard-coded secrets, weak random-number generation, missing authorisation checks, and unsafe deserialisation. The security context is rarely visible in the immediate prompt, and assistants do not by default ask "who is the attacker?" Security review must remain a human responsibility, and agents should be configured with a security-aware system prompt and ideally a static-analysis pass.
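The most common of these weaknesses is also the easiest to show. A minimal sketch using Python's standard-library sqlite3: the first query interpolates user input directly into the SQL text, the pattern assistants often emit because it is so common in public training data, while the second passes the input as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name: str):
    # vulnerable: the input becomes part of the SQL text itself
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # parameterised query: the driver binds the value, never rewrites the query
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(find_user_unsafe("' OR '1'='1"))  # injection: returns every row
print(find_user_safe("' OR '1'='1"))    # returns nothing
```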
Architecture is the next limit. Agents handle local edits well (change this function, add this test, refactor this loop) but struggle with cross-cutting changes: introducing a new abstraction across twenty files, reshaping a database schema and the migrations that depend on it, or replacing a synchronous interface with an asynchronous one. The failure mode is usually partial: the agent makes the change in the obvious places and silently leaves five callers untouched, breaking the build or, worse, leaving the bug latent.
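A small constructed example of the latent case, using asyncio: the agent converts the data-access function to a coroutine but misses one caller, and nothing fails until that caller is actually exercised at runtime.

```python
import asyncio

# After the agent's partial refactor the interface is asynchronous...
async def load_user(user_id: int) -> dict:
    await asyncio.sleep(0)                  # stands in for an async database call
    return {"id": user_id, "name": "alice"}

# ...but one caller the agent missed still uses the old synchronous style.
def greeting(user_id: int) -> str:
    user = load_user(user_id)               # returns a coroutine; it is never awaited
    return f"Hello, {user['name']}"

async def main():
    print(await load_user(1))               # updated call sites work
    try:
        print(greeting(1))                  # the missed call site fails only when exercised
    except TypeError as exc:
        print("latent bug surfaces at runtime:", exc)

asyncio.run(main())
```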
Hard problems still need humans. Designing a novel algorithm, debugging a Heisenbug that surfaces only under load, choosing between two valid architectural directions, deciding what should not be built: these remain firmly human. Agents are excellent assistants for the engineer who already knows what to do; they are unreliable substitutes for the engineer who does not.
Where this is going
The trajectory is clear. Scaffolding is improving (better planning, better self-review, longer effective horizons), and base models continue to scale. Tool use is becoming richer: agents now run tests, browse documentation, query databases and exercise APIs. Multi-agent setups, where one agent writes code and another reviews it, are appearing in production. Cost is collapsing: by 2026, running an autonomous coding agent on a typical issue costs on the order of a few dollars.
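A sketch of the write-review pattern described above. The generate and run_tests callables are placeholders, since every vendor's API differs; the point is the control flow, not the calls.

```python
def write_review_loop(task: str, generate, run_tests, max_rounds: int = 3) -> str:
    """Illustrative two-role loop.  `generate(role_prompt, context)` stands in
    for whatever model API is in use; `run_tests(patch)` returns (passed, log)."""
    patch, feedback = "", "no patch yet"
    for _ in range(max_rounds):
        patch = generate(
            "You are the author. Produce a unified diff that completes the task.",
            f"Task: {task}\nReviewer feedback: {feedback}",
        )
        passed, log = run_tests(patch)
        feedback = generate(
            "You are the reviewer. Approve only if the patch is correct and safe; "
            "otherwise list concrete problems.",
            f"Task: {task}\nPatch:\n{patch}\nTests passed: {passed}\nLog:\n{log}",
        )
        if passed and feedback.strip().lower().startswith("approve"):
            return patch
    return patch  # best effort after the round budget is exhausted
```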
The hard line remains "fix this real bug end-to-end without supervision." On SWE-Bench Verified the strongest published scores currently sit in the low eighties, still far short of replacing a junior engineer working in an unfamiliar codebase, but the curve has not flattened. If the next eighteen months follow the last eighteen, ninety per cent on SWE-Bench is plausible, and the bottleneck shifts to harder benchmarks (SWE-Bench Multimodal, Multi-SWE-Bench across languages, longer-horizon tasks) and to questions that benchmarks cannot capture: taste, judgement, the willingness to say no.
What you should take away
- Code generation is the most heavily measured AI application; productivity gains of roughly 25 to 50 per cent on routine work are well established and concentrated among less experienced developers.
- The dominant tools (GitHub Copilot, Cursor, Claude Code, Devin, Replit Agent, OpenAI's 2025 Codex) differ more in scaffolding and user experience than in underlying model capability.
- SWE-Bench scores have moved from about 2 per cent in late 2023 to above 80 per cent on the Verified subset by 2026, demonstrating the rapid maturation of agentic coding.
- The persistent failure modes are hallucinated APIs, insecure code that looks plausible, partial cross-cutting refactors, and confident wrongness on novel algorithmic problems.
- Senior engineers remain essential for design, security review, and architectural judgement; the question of whether junior software engineering survives as an entry point to the profession is unsettled and economically important.