Chatbot Arena (originally LMSYS Chatbot Arena, since 2024 spun out as LM Arena) is a live, crowdsourced, pairwise human-preference benchmark for chat models. Users visit the Arena site (originally chat.lmsys.org, now lmarena.ai), type a prompt, and receive responses from two anonymous models side-by-side. The user picks the better response (or "tie" / "both bad"); only after voting are the two model identities revealed. Votes accumulate into an Elo-style rating (the scoring later switched to a Bradley-Terry logistic model), and the leaderboard ranks models by their estimated rating with bootstrap confidence intervals.
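The rating computation can be made concrete with a small sketch. A no-intercept logistic regression over signed model indicators is exactly the Bradley-Terry model; the vote format, toy model names, regularisation, and Elo-like scaling below are illustrative assumptions, not the leaderboard's exact pipeline.

```python
# Minimal sketch: Bradley-Terry ratings from pairwise votes via logistic
# regression. Toy data and the Elo-like scaling are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each vote: (model_a, model_b, winner), winner in {"a", "b"}.
# Ties / "both bad" votes are simply dropped in this sketch.
votes = [
    ("gpt-4", "vicuna-13b", "a"),
    ("claude-1", "gpt-4", "b"),
    ("vicuna-13b", "claude-1", "a"),
    ("gpt-4", "claude-1", "a"),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
col = {m: i for i, m in enumerate(models)}

# One row per vote: +1 in model_a's column, -1 in model_b's column;
# the label is 1 when model_a won.
X = np.zeros((len(votes), len(models)))
y = np.zeros(len(votes))
for r, (a, b, winner) in enumerate(votes):
    X[r, col[a]], X[r, col[b]] = 1.0, -1.0
    y[r] = 1.0 if winner == "a" else 0.0

bt = LogisticRegression(fit_intercept=False, C=1.0).fit(X, y)

# Map log-strengths onto an Elo-like scale (400 / ln 10 and a 1000 offset
# are conventional choices, assumed here for readability).
ratings = 400.0 / np.log(10.0) * bt.coef_[0] + 1000.0
for m in sorted(models, key=lambda m: -ratings[col[m]]):
    print(f"{m:>12}  {ratings[col[m]]:7.1f}")
```

A production pipeline also has to handle ties, vote deduplication, and per-category slicing; the sketch keeps only decisive votes for brevity.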
The benchmark has distinctive strengths: prompts come from real users rather than a curated question set, so it captures the distribution shift toward how people actually use chatbots; anonymity removes brand bias; and scale (over 2 million votes by late 2024) keeps confidence intervals tight. Ratings update continuously and are reported per category (Hard prompts, Coding, Math, Long Queries, Multilingual, Vision Arena), so a model can rank very differently across slices.
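The confidence intervals are obtained by bootstrapping the vote log: resample votes with replacement, refit the rating model, and read off percentile intervals. A hedged sketch under the same assumptions as the previous example (the 200-round count and 95% level are illustrative choices):

```python
# Sketch: bootstrap confidence intervals for the ratings, by resampling the
# vote log with replacement and refitting the Bradley-Terry model each time.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ratings(votes, models):
    # Same regression as the previous sketch, wrapped as a function.
    col = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(votes), len(models)))
    y = np.array([1.0 if w == "a" else 0.0 for *_, w in votes])
    for r, (a, b, _) in enumerate(votes):
        X[r, col[a]], X[r, col[b]] = 1.0, -1.0
    coef = LogisticRegression(fit_intercept=False, C=1.0).fit(X, y).coef_[0]
    return 400.0 / np.log(10.0) * coef + 1000.0

def bootstrap_ci(votes, models, rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < rounds:
        sample = [votes[i] for i in rng.integers(len(votes), size=len(votes))]
        # A resample needs both outcomes present to fit; at Arena scale
        # (millions of votes) this check never triggers.
        if len({w for *_, w in sample}) < 2:
            continue
        samples.append(fit_ratings(sample, models))
    lo, hi = np.percentile(np.stack(samples), [2.5, 97.5], axis=0)
    return dict(zip(models, zip(lo, hi)))
```

With only a handful of toy votes the intervals come out very wide, which is the point: the Arena's vote volume is what shrinks them enough for the ranking to be meaningful.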
Performance trajectory. When the Arena launched in May 2023, GPT-4 topped the leaderboard at an Elo of ~1180, with Claude 1, Vicuna-13B, and PaLM 2 trailing well below. By mid-2024 the GPT-4o / Claude 3.5 Sonnet / Gemini 1.5 Pro cluster sat at ~1270 and the open-weights frontier (Llama 3.1 405B, Qwen 2.5 72B) at ~1240. OpenAI's o1-preview entered at ~1340 in September 2024. By late 2025, Gemini 2.5 Pro, Claude 4 Opus, o3, and GPT-5 all clustered at ~1400–1450. The top of the leaderboard moves by 10–30 Elo points with each major release.
Known issues. Chatbot Arena measures human preference, not correctness: verbose, friendly, well-formatted answers can beat tersely correct ones. Coverage is heavily English-skewed (though the multilingual sub-arena partially addresses this). Stylistic gaming (longer answers, markdown formatting, bullet points) demonstrably inflates Elo without improving truthfulness, and several labs have been accused of optimising directly for the Arena rather than for downstream usefulness, as sketched below. The pre-launch alias practice (labs testing unreleased models under secret names) has also become controversial: it gives the lab early signal but biases the leaderboard against recently launched competitors.
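One way to see (and partially correct for) the style effect is to add a style covariate such as response-length difference to the same logistic model: if that coefficient soaks up much of the signal, longer answers are winning votes regardless of which model wrote them. A hedged sketch; the length feature, its normalisation, and the toy data are assumptions for illustration, not the leaderboard's style-control recipe.

```python
# Sketch: probing the style effect by adding a length-difference covariate
# to the Bradley-Terry design matrix from the earlier sketches.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical vote records, now also carrying the two response lengths (tokens).
votes = [
    ("model-x", "model-y", 812, 240, "a"),
    ("model-y", "model-x", 150, 700, "b"),
    ("model-x", "model-z", 640, 610, "a"),
    ("model-z", "model-y", 220, 580, "b"),
]

models = sorted({m for a, b, *_ in votes for m in (a, b)})
col = {m: i for i, m in enumerate(models)}

X = np.zeros((len(votes), len(models) + 1))   # last column: length difference
y = np.zeros(len(votes))
for r, (a, b, len_a, len_b, winner) in enumerate(votes):
    X[r, col[a]], X[r, col[b]] = 1.0, -1.0
    X[r, -1] = (len_a - len_b) / 1000.0       # crude normalisation
    y[r] = 1.0 if winner == "a" else 0.0

fit = LogisticRegression(fit_intercept=False, C=1.0).fit(X, y)
model_strengths, length_effect = fit.coef_[0][:-1], fit.coef_[0][-1]
print("length coefficient:", round(length_effect, 3))
# A large positive length coefficient means longer answers win votes on their
# own; the model columns then give style-adjusted strengths.
```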
Modern relevance. Despite its limitations, Chatbot Arena is the single most-watched live LLM leaderboard in 2024–2026 and the de facto headline metric for "is this new release any good in practice?". It complements automated benchmarks such as MMLU-Pro and LiveBench by capturing real-user satisfaction.
Reference: Chiang et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference", ICML 2024; live leaderboard at lmarena.ai.
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics