HumanEval was introduced by Chen and colleagues at OpenAI in the 2021 Codex paper. It contains 164 hand-written Python programming problems, each consisting of a function signature, a natural-language docstring (typically including a few worked input/output examples), and a hidden suite of unit tests. The model is prompted with the signature and docstring and must complete the function body; a solution counts as correct only if it passes every hidden test.
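To make the task format concrete, here is a minimal sketch of a problem in the HumanEval style. The problem, its tests, and the `check` harness name are invented for illustration, not taken from the actual benchmark; the model sees only the signature and docstring and must produce the body, which is then run against tests it never saw.

```python
from typing import List

# Illustrative problem in the HumanEval style (invented, not one of the 164 tasks).
# The model is shown the signature and docstring and must generate the body.
def double_odds(numbers: List[int]) -> List[int]:
    """Return the list with every odd element doubled and even elements unchanged.
    >>> double_odds([1, 2, 3])
    [2, 2, 6]
    """
    # --- a model-generated completion would start here ---
    return [x * 2 if x % 2 else x for x in numbers]

# Hidden unit tests (never shown to the model); the completion must pass all of them.
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 2, 6]
    assert candidate([]) == []
    assert candidate([4, 6]) == [4, 6]

check(double_odds)
```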
The headline metric is pass@k: the probability that at least one of k independent samples passes all tests. Pass@1, pass@10, and pass@100 are reported, computed with an unbiased estimator: draw n >= k samples per problem (e.g. n = 200), count the number c that pass all tests, and estimate pass@k as 1 - C(n-c, k)/C(n, k). Problems span string manipulation, list processing, basic algorithms, and elementary numerics; canonical solutions average around 7 lines of code.
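A minimal sketch of that estimator, following the formula above (the sample counts in the example call are made up for illustration):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.
    n: samples drawn, c: samples that passed all tests, k: evaluation budget (k <= n).
    Computes 1 - C(n-c, k) / C(n, k) as a numerically stable product."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a passing one.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples drawn for a problem, 37 pass all tests.
print(pass_at_k(200, 37, 1))   # = 1 - 163/200 = 0.185
print(pass_at_k(200, 37, 10))  # roughly 0.88
```

The reported benchmark score is this per-problem estimate averaged over all 164 problems.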
Performance trajectory. The original Codex 12B scored 28.8% pass@1 in 2021. PaLM-Coder reached 36% in 2022. GPT-4 crossed 67% at its March 2023 release. Claude 2 reached 71.2%. The 2024 frontier (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 405B) all reported 88–92%, with Claude 3.5 Sonnet specifically crossing 92%. By mid-2025, OpenAI o1, o3, DeepSeek-V3, and Qwen 2.5 Coder 32B all reported 94–98%, and the benchmark is widely considered saturated.
Known issues. HumanEval problems and solutions have been public on GitHub since 2021 and appear in nearly every code training corpus. With only 164 problems, a single mis-graded problem shifts the score by roughly 0.6 percentage points. Many problems admit multiple valid solutions of varying style, and pass@k captures functional correctness only, not idiomatic quality, security, or efficiency. The benchmark also evaluates only isolated functions, not multi-file projects, dependencies, or real-world software-engineering tasks, a gap addressed by SWE-Bench.
Modern relevance. HumanEval still appears on most model cards for comparability with earlier results, but it is no longer a discriminative benchmark. Modern coding evaluation has largely shifted to SWE-Bench Verified, LiveCodeBench, MBPP+, and the competitive-programming benchmarks (CodeForces, USACO).
Reference: Chen et al., "Evaluating Large Language Models Trained on Code," arXiv:2107.03374, 2021.
Related terms: MBPP, SWE-Bench, CodeForces and Competitive Programming, LiveBench
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics