Glossary

Thinking Tokens

Thinking tokens are special token sequences that mark the boundary between a model's internal reasoning and its final user-facing output. They are the lexical scaffolding of the reasoning-training paradigm: they tell the model where to think, where to answer, and let training and inference systems treat the two regions differently.

Form. The exact tokens vary by model. DeepSeek R1 uses <think>...</think> followed by <answer>...</answer>. The OpenAI o-series uses internal chain-of-thought delimiters that are not exposed through the API by default. Claude's extended thinking uses an internal block that surfaces summary thinking traces in the response. The pattern is consistent: a reasoning region of arbitrary length is followed by a comparatively short answer region.
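To make the form concrete, here is what a completion in the DeepSeek-R1-style tag convention looks like, together with a minimal sketch of splitting it into its two regions (the regex and variable names are illustrative, not taken from any particular library):

    import re

    # A completion in the <think>...</think> / <answer>...</answer> convention.
    completion = (
        "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
        "<answer>408</answer>"
    )

    # Separate the reasoning region from the answer region so downstream code
    # (reward functions, clients) can treat them differently.
    match = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                      completion, re.DOTALL)
    thinking, answer = match.group(1), match.group(2)
    print(answer)  # -> 408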

Training role. Reasoning training rewards the model based on the content of the answer region, not the reasoning region. This decouples the two:

  • The reasoning region is unconstrained: the model can think in any style, length or language, provided it leads to a correct answer.
  • The answer region is constrained by format and verifier requirements (a number, a code block, a proof).

During RL, gradients propagate back through both regions, so the model learns reasoning patterns that statistically lead to correct answers. The thinking tokens themselves are handled specially: they are typically masked out of certain losses, and format rewards check only that the answer region is well-formed.
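A minimal sketch of this decoupling, assuming R1-style tags and an exact-match verifier; the function name and reward values are illustrative rather than drawn from any published training recipe:

    import re

    TAGS = re.compile(
        r"<think>.*?</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
        re.DOTALL,
    )

    def reward(completion: str, reference: str) -> float:
        """Score only the answer region; the reasoning region is never inspected."""
        match = TAGS.search(completion)
        if match is None:
            return 0.0                    # format reward: tags must be present and well-formed
        answer = match.group("answer").strip()
        return 1.0 if answer == reference.strip() else 0.0   # outcome reward

Because the reward never reads the reasoning region, the policy is free to develop whatever thinking style best leads to a correct, well-formatted answer, which is exactly the decoupling described above.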

Inference role. Thinking tokens enable several deployment patterns.

  • Variable test-time compute: the model autonomously decides how long to think based on problem difficulty, allocating more thinking tokens to harder questions. This implements test-time compute scaling without explicit configuration.
  • Visible vs hidden reasoning: production systems can choose to expose the thinking region (Claude 4's extended thinking, Gemini 2.0 Flash Thinking) for interpretability, or hide it (the OpenAI o-series production API) to protect the training signal and discourage distillation. Some APIs expose only a summary of the reasoning.
  • Streaming UX: clients can show "Thinking..." to the user while the reasoning region streams, then render the answer cleanly once the closing tag arrives (a minimal sketch follows this list).
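A simplified client-side sketch of that streaming pattern, again assuming R1-style tags in the raw stream; the helper name and chunking are hypothetical and not tied to any real SDK, and the sketch buffers the whole stream rather than rendering incrementally:

    def render_stream(chunks):
        """Show a placeholder while the reasoning region streams, then
        render only the answer region once the closing tag has arrived."""
        buffer, announced = "", False
        for chunk in chunks:
            buffer += chunk
            if not announced and "<think>" in buffer:
                print("Thinking...")   # placeholder shown while the model reasons
                announced = True
        # Discard the reasoning region and strip the answer tags before rendering.
        answer = buffer.split("</think>", 1)[-1]
        print(answer.replace("<answer>", "").replace("</answer>", "").strip())

    render_stream(["<think>Let me work", " this out...</think>", "<answer>408</answer>"])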

Distillation and security. Visible thinking tokens can be harvested by competitors and used as training data for distilled reasoning models, which is one reason OpenAI hides the o-series traces. DeepSeek R1's open thinking traces immediately seeded a wave of distilled R1-style reasoning models built on smaller base models. The trade-off between transparency and competitive moat shapes how labs choose to expose the region.

Interpretability. Whether thinking tokens reflect the model's "real" reasoning is an open research question. Mechanistic interpretability work has shown that hidden internal computation can diverge from the surface chain of thought, and that models can produce deceptive reasoning that justifies an answer reached for other reasons. Anthropic's faithfulness studies on Claude's extended thinking, and similar work on the o-series, form an active research thread.

Position. As of early 2026, thinking tokens are a standard architectural and product feature of frontier models. They are a small lexical innovation with large consequences: they enable the reasoning-training paradigm, give users a new compute knob, and turn chain-of-thought from a prompting trick into a first-class part of the model's interface.

Related terms: Chain-of-Thought, Reasoning Model Training, OpenAI o3, DeepSeek R1-Zero, Test-Time Compute Scaling
