The visible-versus-hidden thinking-tokens decision is the architectural choice frontier reasoning models face: whether to expose their internal chain-of-thought tokens to users and API consumers, or to keep them concealed behind summaries. As of 2026 the major systems split cleanly: Claude 4's extended thinking, DeepSeek's R1-Zero and R1, and Google's Gemini 2.5 Thinking expose the full chain; OpenAI's o3 and the o-series in general hide it (see o1-hidden-cot).
The technical surface is similar in all systems: between the user prompt and the final answer, the model generates extended reasoning tokens (typically 1k to 100k) that explore the problem, try approaches, backtrack, and converge on an answer. The only difference is whether those tokens are returned in the API response.
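The distinction can be made concrete with a minimal sketch. This is an illustrative model of the server-side choice, not any real provider's API: the field names (`thinking`, `answer`) and the `expose_thinking` flag are assumptions.

```python
# Hypothetical sketch: the same generation, surfaced two ways.
# Field names ("thinking", "answer") are illustrative, not a real API.

def build_response(thinking: str, answer: str, expose_thinking: bool) -> dict:
    """Assemble an API response from the model's raw generation.

    A "visible" provider returns the reasoning tokens alongside the
    answer; a "hidden" provider drops (or summarises) them server-side.
    """
    response = {"answer": answer}
    if expose_thinking:
        response["thinking"] = thinking  # full chain-of-thought
    return response

raw_thinking = "Try induction... wait, let me reconsider... base case holds."
raw_answer = "The claim holds for all n by induction."

visible = build_response(raw_thinking, raw_answer, expose_thinking=True)
hidden = build_response(raw_thinking, raw_answer, expose_thinking=False)

assert "thinking" in visible and "thinking" not in hidden
```

In both branches the reasoning tokens were generated and billed; the only difference is whether the consumer ever sees them.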
Arguments for visible tokens.
Trust and verification. Users can read the chain and confirm the model's reasoning is sound, especially in high-stakes settings (medical, legal, scientific). A user-visible chain that contains a clear logical error is far more useful than an opaque wrong answer.
Debuggability. When the model is wrong, the chain pinpoints where reasoning went off-track. Anthropic's documentation explicitly recommends inspecting the chain when Claude's answer is surprising.
Pedagogy. For tutoring, code review, and exposition, the chain is itself the product: students learn from seeing how the model approaches a problem.
Honesty research. If the chain is hidden, there is no external pressure to make it faithful to the actual reasoning; visible chains create a shared accountability surface.
Arguments for hidden tokens.
Distillation defence. The chain is the highest-value training signal a competitor could harvest. Once visible, it can be used to fine-tune smaller models: DeepSeek-R1's release seeded a wave of distilled open models within days. OpenAI hides chains in part to deny this avenue.
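The harvesting step is mechanically simple, which is the point of the defence. A hedged sketch of how visible chains become supervised fine-tuning data: the `<think>` tag format loosely follows the R1 convention, but the exact schema here is an assumption for illustration.

```python
# Hypothetical sketch of the distillation risk: harvested
# (prompt, chain, answer) triples become supervised fine-tuning
# examples for a smaller model. The tag format is illustrative.

def to_sft_example(prompt: str, chain: str, answer: str) -> dict:
    """Format one harvested trace as a chat-style training pair."""
    # Wrap the reasoning in think tags so the student model learns
    # to produce its own chain before answering (R1-style convention).
    target = f"<think>{chain}</think>\n{answer}"
    return {"prompt": prompt, "completion": target}

ex = to_sft_example(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "408",
)
assert ex["completion"].startswith("<think>")
```

Hiding the chain removes the middle element of each triple, leaving only (prompt, answer) pairs, which carry far less of the reasoning signal.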
Faithfulness vs performativity. If the chain is visible, the model may learn to perform reasoning rather than to actually reason, making the chain a polished artefact rather than genuine internal exploration. Hiding the chain allegedly preserves its authenticity.
Safety filtering. Hidden chains can contain unsafe intermediate steps (considered-and-rejected harmful approaches, profanity, etc.) that a final-answer filter cleans up. Surfacing them risks exposing intermediate content the user should not see.
Cost and clarity. Most users do not want to read 50k tokens of model thought; surfacing them clutters the product.
Empirical observations. When DeepSeek's R1 was released with full visible chains, the chains turned out to be highly informative and largely faithful: readable English, clear reasoning, occasionally amusing self-talk ("wait, let me reconsider..."). Faithfulness research (Anthropic's Reasoning Models Don't Always Say What They Think, 2025) has found that even visible chains are not always perfectly faithful: models can reach conclusions that don't follow from the visible chain. Still, visible chains are far more faithful than no chain at all.
Hybrid options. Some systems offer both: a visible "thinking" panel by default, with the option to suppress it, plus an internal summary in the API. Anthropic's extended thinking allows the user to set a token budget for thinking and view the trace at will. This is increasingly the developer-preferred pattern.
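The hybrid pattern can be sketched client-side: the trace is always present in the response, and visibility is a rendering decision. The block structure below loosely mirrors content-block-style APIs such as Anthropic's extended thinking, but the exact shape is an assumption for illustration, not the real API.

```python
# Hypothetical sketch of the hybrid pattern: the response always
# carries the trace; the "thinking" panel is collapsible at render
# time. The block schema is illustrative, not any real API's.

def render(blocks: list[dict], show_thinking: bool) -> str:
    """Render content blocks, optionally suppressing the thinking panel."""
    parts = []
    for block in blocks:
        if block["type"] == "thinking" and not show_thinking:
            continue  # suppressed: the trace exists but is not displayed
        parts.append(block["text"])
    return "\n".join(parts)

blocks = [
    {"type": "thinking", "text": "[thinking] Consider edge cases first..."},
    {"type": "text", "text": "Final answer: handle the empty list separately."},
]

assert "[thinking]" not in render(blocks, show_thinking=False)
assert "[thinking]" in render(blocks, show_thinking=True)
```

The design choice is that suppression happens at the presentation layer rather than the generation layer, so debuggability and pedagogy are preserved for users who opt in while the default view stays uncluttered.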
The decision will probably converge over time as distillation defences become independent of chain visibility (e.g. through training-pipeline secrecy rather than chain secrecy) and as users come to expect transparent reasoning as a baseline product feature.
Related terms: Chain-of-Thought, o1 / Reasoning Models, o1's Hidden Chain of Thought, Claude 4 Family, OpenAI o3, DeepSeek R1-Zero, Process Supervision
Discussed in:
- Chapter 16: Ethics & Safety, Visible vs Hidden Reasoning