Glossary

OpenAI o3

OpenAI o3 is a frontier reasoning model announced by OpenAI on 20 December 2024 as the successor to o1. It extended the o-series paradigm of allocating substantial test-time compute to deliberate, step-by-step reasoning before producing a final answer.

Headline result. On the ARC-AGI semi-private evaluation, o3 reached 87.5% in its high-compute configuration and 75.7% in low-compute mode, against a prior frontier of roughly 30% for the best public systems. ARC-AGI was designed by François Chollet to resist memorisation and reward fluid abstraction, so the leap was widely interpreted as evidence that scaling reasoning compute, not just pre-training, unlocks new capabilities. o3 also posted strong results on FrontierMath (around 25%, where prior models scored under 2%), Codeforces (Elo above 2700), and GPQA Diamond (about 87%).

Method. Like o1, o3 is trained with reinforcement learning on chain-of-thought: the base model is rewarded for producing reasoning traces that lead to correct, verifiable outputs in domains such as mathematics, code execution, and formal proofs. The training signal comes from automatically checkable rewards (unit tests, theorem checkers, numerical answers) rather than human preference labels alone. At inference, o3 generates long internal reasoning sequences, often using thousands of thinking tokens, before emitting a short user-visible answer.
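OpenAI has not published o3's training details, but the core idea of a verifiable reward, a signal computed by automatically checking the model's output rather than by human preference, can be sketched in a few lines. The function and test format below are purely illustrative, not part of any OpenAI recipe:

```python
# Toy illustration of a verifiable reward signal: the reward is 1.0
# only if the model's candidate program passes all automatic checks
# (here, simple unit tests), with no human preference labels involved.
# All names and the test format are hypothetical.

def verifiable_reward(candidate_code: str, test_cases: list) -> float:
    """Return 1.0 if candidate_code defines solve() passing all tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # run the model's proposed program
        solve = namespace["solve"]
        for args, expected in test_cases:
            if solve(*args) != expected:  # any failed check zeroes the reward
                return 0.0
        return 1.0
    except Exception:
        return 0.0                        # syntax errors or crashes earn no reward

# A correct and an incorrect candidate for "add two numbers"
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

In a reasoning-RL loop, a reward of this shape is what the policy is optimised against: reasoning traces that end in outputs passing the checker are reinforced, regardless of how the intermediate thinking reads.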

Compute costs. OpenAI disclosed that the low-compute ARC-AGI run cost roughly $20 per task; the high-compute configuration used about 172 times more compute per task, costing thousands of dollars per task. This made o3 the first widely discussed example of inference-time scaling as a deliberate product axis: customers could pay more per query to get higher accuracy, inverting the previous decade's trend of pushing compute into pre-training.
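As a quick sanity check on the figures above, and assuming cost scales roughly linearly with compute (an assumption, since OpenAI disclosed the multiplier but not an exact high-compute price):

```python
# Back-of-envelope estimate from the disclosed ARC-AGI figures.
low_cost_per_task = 20    # USD per task, low-compute run (disclosed)
compute_multiplier = 172  # high- vs low-compute (disclosed)

# Assumes cost scales linearly with compute -- an inference, not a disclosure.
high_cost_estimate = low_cost_per_task * compute_multiplier
print(f"~${high_cost_estimate:,} per task")
```

which lands in the "thousands of dollars per task" range stated above.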

o3-mini and successors. A smaller, cheaper o3-mini shipped in early 2025 with selectable reasoning effort levels (low, medium, high). Through 2025 OpenAI iterated on the line with o3-pro and integrated o3-class reasoning into the GPT-5 family and the 2025-generation Codex agent products.

Significance. o3 is widely cited as the moment the field accepted that reasoning training is a distinct, compounding axis of capability beyond next-token pre-training and RLHF. It also catalysed the open replication efforts that produced DeepSeek R1 weeks later, since R1 used the same reinforcement-learning-on-reasoning recipe applied to an open base model.

The model raised fresh safety questions: long hidden chains of thought are harder to monitor, and o3's ability to plan multi-step actions made it a key building block for autonomous coding agents and the broader 2025 wave of computer-using AI.

Related terms: Reasoning Model Training, Thinking Tokens, Test-Time Compute Scaling, Chain-of-Thought, DeepSeek R1-Zero
