LiveBench, introduced by White and colleagues in mid-2024, is a contamination-resistant benchmark designed around the principle that questions should be drawn from sources released after models' training cutoffs. The benchmark refreshes its questions monthly, sourcing new items from arXiv preprints, recent IMO/Putnam problems, news articles, USACO contests, and freshly released datasets. The historical leaderboard retains older question sets so models can be compared on a fixed slice, but the headline score uses only recent questions.
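The filter below is a minimal sketch of that principle, not LiveBench's actual pipeline: it assumes each candidate question carries a hypothetical `source_date` field and keeps only items whose source appeared after a given model's training cutoff.

```python
from datetime import date

def eligible_questions(questions, training_cutoff):
    """Keep only questions whose source appeared after the model's training cutoff."""
    return [q for q in questions if q["source_date"] > training_cutoff]

# Invented candidate pool; ids and dates are illustrative only.
candidates = [
    {"id": "amc-2024-p17", "source_date": date(2024, 11, 8)},
    {"id": "arxiv-2403.01234", "source_date": date(2024, 3, 2)},
]

# For a model with an April 2024 cutoff, only the November item survives the filter.
print([q["id"] for q in eligible_questions(candidates, date(2024, 4, 30))])
```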
The benchmark covers six task categories: Math (AMC/AIME/Olympiad-style), Reasoning (Web of Lies, Zebra Puzzles, spatial reasoning), Coding (LiveCodeBench-style problems from new contests), Language (typo fixing, plot unscrambling), Data Analysis (table transformation, summarisation), and Instruction Following (precise multi-constraint instructions). Each category has its own automatic grader, and the headline score is the unweighted mean of the six category scores.
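A minimal sketch of that aggregation, assuming each category grader already yields an average on a 0–100 scale (the numbers below are invented):

```python
from statistics import mean

# Illustrative category averages on a 0-100 scale; values are made up.
category_scores = {
    "math": 61.2,
    "reasoning": 58.0,
    "coding": 54.5,
    "language": 49.3,
    "data_analysis": 57.8,
    "instruction_following": 72.1,
}

# Headline score = unweighted mean across the six category averages.
headline = mean(category_scores.values())
print(f"headline score: {headline:.1f}")
```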
Performance trajectory. At launch (June 2024), GPT-4o topped the leaderboard at 53.7, with Claude 3.5 Sonnet at 52.4, Gemini 1.5 Pro at 39.6, and the open-weights frontier (Llama 3.1 405B) at 40.5. As reasoning models arrived, scores rose quickly: OpenAI o1 reached 62.8 (October 2024), Claude 3.5 Sonnet (new) 57.1, DeepSeek-R1 66.3, OpenAI o3 79.4, and Gemini 2.5 Pro 73; by late 2025, Claude 4 Opus, Grok 4, and GPT-5 all cluster in the 75–82 range.
Known issues. The monthly refresh is the benchmark's main strength but also a weakness: scores are not directly comparable across months, so longitudinal claims require careful slice-aware analysis. Some categories (Language, Instruction Following) are noisier than others. The reliance on automatic graders introduces its own biases. As LiveBench grows in influence, labs may begin training on similar-distribution data, gradually eroding the contamination guarantee.
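One way to perform such a slice-aware comparison, sketched here with invented records rather than LiveBench's data format: group per-slice scores by the month the questions were released and compare two models only within months where both were evaluated.

```python
from collections import defaultdict

# Invented per-slice records: (model, question-release month, average score on that slice).
results = [
    ("model_a", "2025-06", 0.71),
    ("model_b", "2025-06", 0.68),
    ("model_a", "2025-07", 0.64),
    # model_b was never evaluated on the 2025-07 slice.
]

by_month = defaultdict(dict)
for model, month, score in results:
    by_month[month][model] = score

# Report score differences only for months where both models were evaluated.
for month in sorted(by_month):
    scores = by_month[month]
    if "model_a" in scores and "model_b" in scores:
        print(month, round(scores["model_a"] - scores["model_b"], 3))
```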
Modern relevance. LiveBench is the most-cited contamination-resistant benchmark of 2024–2026 and a primary reference point in most frontier-model launch announcements. It complements MMLU-Pro (broader knowledge coverage) and Chatbot Arena (human preference) by providing a contamination-resistant, automatically graded headline score.
Reference: White et al., "LiveBench: A Challenging, Contamination-Free LLM Benchmark", arXiv 2024; livebench.ai.
Related terms: MMLU-Pro, Chatbot Arena, CodeForces and Competitive Programming, AIME
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics