Hero graphic, two panels. Left panel titled "What they all do: reasoning is a knob" lists effort tiers billed at output rates (OpenAI reasoning.effort, Gemini thinkingLevel, Anthropic effort plus budget_tokens), summarized reasoning from all three vendors, and per-call overrides. Right panel titled "What Anthropic alone does: thinking is content" lists first-class blocks that are signed and round-tripped, thinking returned as content that must pass back unchanged across tool_use → tool_result → tool_use, and a signature field on each block; strip it and the API silently disables thinking for that turn.

LLM Fundamentals: Part 6 -- Extended Thinking

ai llm-fundamentals

This is Part 6 of the LLM Fundamentals series.

In Post 5, I drew a line between prompt engineering and harness engineering. Prompt engineering reduces ambiguity in what you send to the model. Harness engineering builds systems around it. Extended thinking sits right on that boundary, because it started as a prompting technique and became an API feature.

By 2026, all three major vendors expose reasoning as a tunable parameter. OpenAI’s Responses API takes reasoning.effort from minimal through xhigh. Gemini 2.5 takes a thinkingBudget token integer; Gemini 3.x replaced it with thinkingLevel. Anthropic’s extended thinking takes a budget_tokens integer plus an effort tier. The interesting distinction is not whether reasoning is controllable. It is everywhere. The distinction is how reasoning interacts with tool use. Anthropic returns thinking blocks as first-class content that flow through tool-use chains alongside text and must be passed back unchanged. OpenAI and Gemini return reasoning summaries the developer can ignore. That design choice is what makes this post worth its own entry.

Before extended thinking, developers used a prompting trick called chain of thought: add “think step by step” to the prompt and the model would reason its way to an answer. It worked because generating intermediate reasoning tokens conditions the probability distribution (from Post 2) toward better final tokens. But you could not control depth, separate reasoning from the response, or keep reasoning from eating the context window. Extended thinking solves all three by moving reasoning from prose into a parameter.

Extended Thinking: Chain of Thought as Infrastructure

Extended thinking takes the same idea and moves it into the API. Instead of asking the model to reason in its response, you flip a parameter and Claude gets a dedicated reasoning scratchpad. It produces thinking content blocks with its internal reasoning, followed by text content blocks with the actual response.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[
        {"role": "user", "content": "Prove that there are infinitely many primes."}
    ],
)

With thinking: {type: "adaptive"}, Claude dynamically determines when and how much to think based on complexity. A simple factual question might skip thinking entirely. A proof or a multi-step debugging problem gets deep reasoning. You do not need to decide the budget upfront.
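Those two block types come back in response.content, reasoning first. A minimal sketch of pulling them apart, with attribute names following the Python SDK's content-block shapes:

# Thinking blocks carry the reasoning; text blocks carry the user-facing answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)  # log it, do not show it to users
    elif block.type == "text":
        print("[answer]", block.text)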

This is the shift I described in Post 5. What was once prose in a prompt (“please reason step by step”) is now a parameter in the harness (thinking: {type: "adaptive"}). Same cognitive mechanism, different control surface.

How Tokens Flow

Extended thinking produces two types of output. Thinking tokens contain the internal reasoning. Text tokens contain the response you show to users. Both count as output tokens for billing, and both compete for your max_tokens budget.

Two stacked horizontal bars within a 16000 max_tokens budget. Top bar labeled "billed" shows a wide amber thinking segment (7000 tokens), a narrower emerald text segment (2000 tokens), and unused space. Bottom bar labeled "shown in the response" shows a much smaller summary segment (~500 tokens) and the same emerald text segment. The gap between the two amber segments is highlighted with a violet dashed line and the note "collapsed in response, billed in full". A footnote explains that on Opus 4.7 the default is display:omitted and you must set display:summarized to see anything; either way the bill is unchanged.

Most Claude 4.x models return summarized thinking by default, not the raw internal reasoning. On Claude Opus 4.7 the default is omitted, so you have to set display: "summarized" explicitly to see anything in the response. Either way, you are billed for the full thinking tokens Claude generated, not the summary. Billed output tokens will not match what you see in the response, which I found confusing the first time I checked my usage dashboard.

For conversations that do not need visible reasoning at all, you can set display: "omitted" to skip streaming thinking content entirely. You still pay for the thinking tokens, but time-to-first-text-token drops significantly because the server skips streaming the thinking blocks.
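A sketch of both display modes and the billing check, assuming display sits alongside type inside the thinking parameter (the exact placement may differ in your SDK version):

# Surface a summary of the reasoning (explicit on Opus 4.7, default elsewhere).
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive", "display": "summarized"},
    messages=[{"role": "user", "content": "Review this migration plan."}],
)

# Or skip streaming thinking entirely: still billed, first text arrives sooner.
# thinking={"type": "adaptive", "display": "omitted"}

# output_tokens counts every thinking token generated, not the summary you see.
print(response.usage.output_tokens)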

Controlling Depth with Effort

Not every problem deserves deep reasoning, and deep reasoning is not cheap. Effort levels let you tune how much Claude thinks:

Level    Behavior
max      Always thinks deeply, no constraints
xhigh    Always thinks deeply with extended exploration. Opus 4.7 only; recommended starting point for coding and agentic work
high     Always thinks (default)
medium   Moderate thinking, may skip for simple queries
low      Minimal thinking, prioritizes speed

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={"type": "adaptive"},
    output_config={"effort": "medium"},
    messages=[
        {"role": "user", "content": "Classify this support ticket."}
    ],
)

I use medium as my default for agentic workflows where Claude makes many calls in sequence. Most of those calls are straightforward tool invocations that do not benefit from deep reasoning. Reserving high or max for genuinely hard steps keeps total token spend under control without sacrificing quality where it matters.

Tool Use During Thinking

In agentic workflows, extended thinking earns its cost through interleaved thinking and tool calls. Claude reasons, calls a tool, reasons about the result, calls another tool, then produces a final answer. Each thinking block is bounded by the budget you set, but the model can think multiple times in one response. On Sonnet 4.5 and earlier this requires the anthropic-beta: interleaved-thinking-2025-05-14 header. On Opus 4.6+ adaptive thinking enables it automatically.
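On those older models, the opt-in looks roughly like this, using the Python SDK's extra_headers escape hatch and the enabled-plus-budget thinking shape those models take (tools and messages stand for whatever your agent loop defines):

# Sonnet 4.5 and earlier: interleaved thinking is gated behind a beta header.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    tools=tools,
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=messages,
)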

This is where the explicit-toggle design pays off. You decide which agents deserve a reasoning scratchpad. A data-fetch agent that hits one endpoint and returns the result does not need it. A code-debugging agent walking a stack trace, grepping the codebase, then proposing a fix benefits noticeably. I default to enabling thinking on agents that touch multiple tools and disabling it on agents that do one obvious thing.

The protocol requirement is load-bearing: pass each thinking block back unchanged in the next request, including its signature field. The signature is an encrypted handle the server uses to reconstruct the prior reasoning. Strip or alter it and the API still returns 200, but thinking is silently disabled for that turn. There is no error, just a quiet quality drop. Detect it by checking whether thinking blocks come back in the response.
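Here is a hedged sketch of the loop, with a hypothetical run_tool helper standing in for your dispatch logic. The load-bearing line is the one that appends response.content verbatim:

messages = [{"role": "user", "content": "Find and fix the failing test."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=16000,
        thinking={"type": "adaptive"},
        tools=tools,
        messages=messages,
    )
    # Pass the assistant turn back unchanged: thinking blocks, signatures and all.
    # Filtering or reserializing these blocks is what silently disables thinking.
    messages.append({"role": "assistant", "content": response.content})

    # Detect the silent failure mode. With adaptive thinking a simple turn may
    # skip legitimately; thinking vanishing on every turn means a dropped signature.
    if not any(block.type == "thinking" for block in response.content):
        print("warning: no thinking blocks returned; check signature handling")

    if response.stop_reason != "tool_use":
        break

    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),  # run_tool is hypothetical
            }
            for block in response.content
            if block.type == "tool_use"
        ],
    })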

What Happens in Multi-Turn Conversations

Two side-by-side cards. Left card titled "Opus 4.5+ / Sonnet 4.6+ — thinking kept by default" shows three turns, each with a user / thinking / text block, all rendered solid. Right card titled "older Opus / Sonnet, all Haiku — thinking stripped between turns" shows three turns where the thinking blocks from prior turns are ghosted and struck through; only the current turn's thinking is solid. Left card footer reads "continuity: turn 7 can ref turn 3"; right card footer reads "no memory of turns 1-2 reasoning."

Behavior here changed mid-2025 and the surface still surprises people. On older Opus and Sonnet models and all Haiku models, the API automatically strips prior-turn thinking blocks from later-turn context. On Opus 4.5+ and Sonnet 4.6+, prior thinking is kept by default; you opt into stripping with the clear_thinking_20251015 context-editing strategy.
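A sketch of opting back into stripping on a newer flagship, assuming the strategy rides in a context_management parameter (the field layout is an assumption; check the context-editing docs for the exact shape):

# Opt into the older behavior: drop prior-turn thinking blocks from context.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    context_management={"edits": [{"type": "clear_thinking_20251015"}]},
    messages=messages,
)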

The flagship-tier behavior buys continuity. Claude can reference the reasoning that led to a turn-3 decision when you push back in turn 7. The cost is that prior thinking tokens count against your later-turn context window. On Haiku and older flagships, the older stripping default trades that continuity for window savings. Pick deliberately based on the agent’s job.

Remove Manual CoT When Extended Thinking Is On

If you have been using chain of thought prompting (“think step by step,” “show your work,” “reason through this”), remove those instructions when you enable extended thinking. Having both active creates redundancy at best and conflict at worst. Claude will reason in its dedicated thinking scratchpad and then repeat or contradict that reasoning in the response because your prompt told it to show its work.

I spent an afternoon debugging exactly this problem. My prompt included “walk through your reasoning before answering” and I had extended thinking enabled. Claude was thinking thoroughly in the thinking block, then restating a compressed version of the same reasoning in the text block, sometimes reaching a different conclusion. Removing the prompt-level instruction and letting extended thinking handle all the reasoning fixed it immediately.

When to Use It

Extended thinking improves accuracy on problems that benefit from step-by-step reasoning: math, logic, code analysis, complex classification with many categories, and tasks where the model needs to consider multiple factors before committing to an answer. I have seen measurable accuracy gains on all of these.

Simple tasks do not benefit. “What is the capital of France?” does not need a reasoning scratchpad. Classification into three obvious categories does not need it. At medium effort, adaptive thinking handles this automatically by skipping thinking when it is unnecessary; at high and above, Claude always thinks regardless. Match effort level to task complexity, not to a blanket policy.

What This Sets Up

Extended thinking gives Claude internal reasoning before it responds. All three major vendors expose reasoning controls, but Anthropic’s design treats thinking as a first-class content type that flows through tool-use chains alongside text. That makes the cost-quality tradeoff explicit, at the price of a protocol detail (preserve the signature field) that breaks silently if you get it wrong.

Structured output, the topic of the next post in the series, constrains what that response looks like. Together they form a pattern I use constantly: think hard about the problem in the thinking block, then deliver a machine-readable answer in a guaranteed schema. Reasoning plus structure.