Glossary

MBPP

MBPP (Mostly Basic Programming Problems), released by Austin and colleagues at Google in 2021, contains 974 short Python programming problems. Each problem consists of a one- to two-sentence natural-language description, a reference solution, and three assert-based test cases that generated code must pass; the asserts implicitly fix the expected function name and signature. Problems target the level of an entry-level programmer: list and string manipulation, arithmetic, basic data-structure usage, and standard library calls.
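For illustration, here is a minimal sketch of an MBPP-style record and its grading loop. The problem text and asserts below are hypothetical stand-ins (not a verbatim dataset item), but the fields mirror the dataset's schema, and grading really is this simple: execute the candidate, then run every assert.

```python
# Illustrative MBPP-style record; the task and asserts are invented
# for this example, but the field layout matches the dataset.
problem = {
    "text": "Write a python function to find the sum of the digits of a non-negative integer.",
    "test_list": [
        "assert digit_sum(0) == 0",
        "assert digit_sum(19) == 10",
        "assert digit_sum(1234) == 10",
    ],
}

# A model-generated candidate solution, supplied as source text.
candidate = """
def digit_sum(n):
    return sum(int(d) for d in str(n))
"""

# Grading: define the candidate, then run each assert; any failure = fail.
namespace = {}
exec(candidate, namespace)
for test in problem["test_list"]:
    exec(test, namespace)
print("All three tests passed.")
```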

The standard split designates task IDs 11–510 as the 500-problem test set, with the remaining problems used for few-shot prompting (IDs 1–10), validation (IDs 511–600), and fine-tuning (IDs 601–974). Evaluation uses the same pass@1 / pass@k metric as HumanEval. A more rigorous variant, MBPP+, was released by the EvalPlus team in 2023: it adds many additional unit tests per problem (including edge cases that the original three tests miss) and re-grades models accordingly. MBPP+ scores are typically 10–15 points lower than MBPP scores for the same model.
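For reference, the pass@k metric is usually computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval/Codex paper.

    n: total samples generated per problem
    c: number of samples that pass all tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 30 passing -> pass@1 = 0.15, pass@10 ≈ 0.81
print(pass_at_k(200, 30, 1), pass_at_k(200, 30, 10))
```

The benchmark-level score is this estimate averaged over all test problems.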

Performance trajectory. Codex (12B) scored 47.3% in 2021. GPT-4 reached 80.1% at release. By mid-2024, Claude 3.5 Sonnet had reached 90%+, Llama 3.1 405B 88%, and GPT-4o 89%. The MBPP+ variant is more discriminative: as of late 2025, frontier models cluster in the 75–85% range on MBPP+, whereas MBPP itself is saturated above 90%.

Known issues. As HumanEval's sister benchmark, MBPP shares the same contamination risks: the dataset has been on the open web since release. The original three-tests-per-problem grading is famously lenient; many "passing" solutions fail edge cases that human reviewers would catch immediately, which is why MBPP+ is now the recommended variant (see the sketch below). Like HumanEval, MBPP measures only single-function correctness, not multi-file or repository-level work.
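To see why three asserts under-specify a problem, consider a sketch modeled on MBPP's non-prime-checker task (wording paraphrased): the trial-division solution below passes all three original-style tests yet mishandles n = 1, exactly the kind of edge case MBPP+ adds.

```python
def is_not_prime(n):
    # Trial division: True if any divisor exists in [2, sqrt(n)].
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return True
    return False

# Original-style tests: all three pass.
assert is_not_prime(2) == False
assert is_not_prime(10) == True
assert is_not_prime(35) == True

# MBPP+-style edge case: 1 is not prime, but the loop body never
# runs for n = 1, so the function wrongly returns False here.
assert is_not_prime(1) == True  # AssertionError
```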

Modern relevance. MBPP and MBPP+ remain standard low-cost coding benchmarks for new small open models (Qwen Coder, DeepSeek-Coder, StarCoder, Granite, Code Llama) but are no longer informative at the frontier. They are also widely used as a training signal in instruction-tuning and code fine-tuning recipes, where rapid iteration on a cheap-to-grade benchmark matters more than discrimination at the absolute top of the leaderboard.

Reference: Austin et al., "Program Synthesis with Large Language Models", arXiv:2108.07732, 2021; Liu et al., "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation", NeurIPS 2023 (the EvalPlus / MBPP+ paper).

Related terms: HumanEval, SWE-Bench, Codeforces and Competitive Programming
