The default-on trap.
Teams adopt extended thinking the same way they adopted GPU acceleration a decade ago: turn it on everywhere, assume more compute is always better, and watch the bill climb while quality refuses to move.
The failure mode is consistent. An engineer reads the docs, flips the parameter from off to a generous budget, and fans it out across every call the service makes. A week later the team is paying three to five times what they paid the prior week, on-call reports slightly higher P95 latency, and nobody can point to a task that got better. The tickets that used to be fast are fast. The tickets that were bad are still bad.
The issue is not the feature. Extended thinking is a real capability and it earns its keep on real work. The issue is that teams treat it as a global switch instead of a per-task budget, and they measure cost without measuring the quality delta that would justify the cost.
We wrote this playbook so that nobody has to learn about the default-on trap by paying for it.
What extended thinking actually buys.
It buys the model room to plan before it writes the answer. That room shows up in your usage as extra output tokens you pay for but never see in the final response.
The mechanism is simple to describe and easy to misuse. When extended thinking is on, the model produces a private reasoning trace before producing the answer the user receives. You do not get the reasoning tokens back in your response. You do pay for them, at output-token rates. The size of that private trace scales with the budget you set.
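In the Anthropic Messages API this is a single request parameter. Here is a minimal sketch using the Python SDK; the model id is a placeholder and the parameter names should be checked against the current docs before you copy anything.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-20250514",                        # placeholder model id
    max_tokens=8000,                                       # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4000},   # the per-call thinking budget
    messages=[{"role": "user", "content": "Triage this incident report: ..."}],
)

# Collect only the visible text blocks for the user-facing answer.
answer = "".join(block.text for block in resp.content if block.type == "text")

# usage.output_tokens includes the thinking tokens you were billed for.
print(answer, resp.usage.output_tokens)
```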
For tasks that benefit from planning, this changes outcomes in measurable ways. We see step-through correctness on multi-hypothesis problems move from around 70 to 75 percent up into the high 80s. We see tool-use sequences stop double-booking the same tool. We see architectural recommendations that consider two or three viable options before committing, instead of the first idea dressed up with confidence.
For tasks that do not benefit, the trace is overhead. You pay for 600 tokens of internal deliberation to produce 40 tokens of "yes, route to billing," and you have bought nothing.
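The arithmetic on that example, with a placeholder output-token price standing in for whatever your model actually charges:

```python
# Placeholder rate; substitute your model's real output-token price.
price_per_output_token = 15 / 1_000_000   # e.g. $15 per million output tokens

thinking_tokens = 600   # private deliberation, billed but never shown
visible_tokens = 40     # the actual "yes, route to billing"

with_thinking = (thinking_tokens + visible_tokens) * price_per_output_token
without = visible_tokens * price_per_output_token
print(f"{with_thinking / without:.0f}x the cost for the same 40-token answer")  # 16x
```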
The four tasks where it pays.
We turn extended thinking on for four categories of work. Not five. Not "whenever it feels complex." Four.
Multi-hypothesis analysis.
Tasks where the correct answer requires considering several candidate explanations and ruling some out. Incident triage. Attack-path reasoning in security findings. Root-cause investigation in distributed systems. Differential diagnosis-shaped work in any domain. Without thinking room, the model picks the first plausible story and defends it. With thinking room, it enumerates and prunes.
Multi-step tool orchestration.
Agentic flows where the model decides which tool to call next based on the result of the last call. If the task takes four or more tool calls and mistakes compound, extended thinking pays for itself by cutting the wasted-tool-call rate. We have measured roughly a 40 percent reduction in redundant calls on our internal research agent, which more than covers the thinking-token cost.
Code that has to fit a system.
Refactors across a large file, changes that have to respect a schema and a test suite, migrations that need to preserve invariants. Not greenfield code. Not "write me a function that does X." Specifically code that must fit, and where the model is asked to reason about fit before writing.
Long-form editorial with structure constraints.
Document generation where the output has an enforced shape: specific sections, required evidence, a claim that has to be defended across paragraphs. Our Signature Handbook generation uses extended thinking for the analysis chapters specifically because those chapters have to argue a position, not just describe one. We do not use it for the appendix that lists environment variables.
The four tasks where it bleeds.
The pattern is symmetric. If the shape of the task does not match the categories above, extended thinking is almost certainly a line item with no return.
Classification with a short taxonomy.
"Is this ticket billing, abuse, feature request, or other." Haiku without thinking solves this at 94 percent accuracy on our data. Opus with a generous thinking budget solves it at 95 percent, for roughly thirty times the cost. The one percentage point is not nothing, but it is not worth the multiplier. Spend the money on a better labeling pipeline and train the cheap model better.
Template filling.
Form letters, status updates, polite rejections, meeting-invite drafts, onboarding emails. The model is being asked to style a fixed payload. There is no analysis. Thinking tokens here are pure waste.
Retrieval-bound question answering.
When the answer lives in the retrieved context and the task is "quote or summarize what is in the context," thinking does not help. In fact it occasionally hurts, because the model will invent a framing instead of quoting the source. Turn it off. Invest in retrieval quality instead.
Anything latency-sensitive under 800 ms.
Interactive UX. Typeahead. Live chat where the user is watching a cursor blink. Extended thinking adds latency proportional to the budget, and for these flows any added latency is visible. We ship these on Haiku without thinking and optimize the retrieval and prompt cache instead.
Budget design: token caps as a first-class control.
When we do turn thinking on, we set a cap and we measure. "On" is not a budget. A budget is a number, a task, and a scorer that tells us whether the number is right.
Our default starting points, tuned on our own pipeline:
- Multi-hypothesis analysis: start at 4,000 thinking tokens, measure, tune up or down.
- Tool orchestration (4+ tools): start at 6,000, give the model room to plan sequences.
- Structural code changes: start at 3,000, bump to 8,000 only on scorer regression.
- Long-form editorial with structure: start at 5,000, cap at 10,000.
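Those starting points carry into code as a small policy table. A sketch under our assumptions: the category names are ours, and the hard caps on the first two rows are illustrative placeholders rather than numbers from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThinkingBudget:
    start: int   # default budget_tokens for the category
    cap: int     # hard upper bound; alert if a call ever needs this much

BUDGETS = {
    "multi_hypothesis_analysis": ThinkingBudget(start=4_000, cap=8_000),   # cap is a placeholder
    "tool_orchestration":        ThinkingBudget(start=6_000, cap=12_000),  # cap is a placeholder
    "structural_code_change":    ThinkingBudget(start=3_000, cap=8_000),   # bump only on scorer regression
    "structured_long_form":      ThinkingBudget(start=5_000, cap=10_000),
}

def thinking_config(task_category: str) -> dict:
    # Unknown categories raise KeyError on purpose: no entry, no thinking.
    budget = BUDGETS[task_category]
    return {"type": "enabled", "budget_tokens": budget.start}
```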
The cap matters more than the target. A runaway thinking budget can 10x your bill on a single pathological input. Always set an upper bound. Always alert when you hit it.
The scorer matters even more than the cap. If you cannot describe how you will measure whether thinking helped, do not turn it on. Pick a task, define pass criteria, run 50 examples with thinking off, run 50 with thinking on, compare. If the delta is under three points and the cost is over 2x, you just failed your own ROI test. Turn it off.
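A sketch of that test as an offline harness; `run_task` and `passes` stand in for your own call wrapper and scorer, and the thresholds encode the under-three-points, over-2x rule.

```python
def roi_test(examples, run_task, passes, thinking_budget=4_000,
             min_delta=3.0, max_cost_ratio=2.0):
    """Run the same examples with thinking off and on, then apply the ROI rule:
    a delta under min_delta points at a cost ratio over max_cost_ratio is a fail."""
    def evaluate(thinking):
        passed, cost = 0, 0.0
        for ex in examples:                        # e.g. 50 held-out examples
            answer, usd = run_task(ex, thinking)   # your wrapper returns (output, dollar cost)
            passed += passes(ex, answer)           # your scorer returns 0 or 1
            cost += usd
        return 100 * passed / len(examples), cost

    off_rate, off_cost = evaluate(thinking=None)
    on_rate, on_cost = evaluate(thinking=thinking_budget)

    delta = on_rate - off_rate
    cost_ratio = on_cost / off_cost if off_cost else float("inf")
    failed = delta < min_delta and cost_ratio > max_cost_ratio
    return {"delta_points": delta, "cost_ratio": cost_ratio, "keep_thinking": not failed}
```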
A three-tier policy we actually ship.
Here is the policy we embed in every client's product development handbook. It is boring and it works.
Tier 1: Interactive. Thinking off.
Anything a human is watching in real time. Chat, typeahead, inline suggestions, UI copy rewrites, email drafts. Latency budget under 1.5 seconds end-to-end. Model: Haiku or Sonnet. Thinking tokens: zero.
Tier 2: Background. Thinking conditional.
Async jobs. Report generation, ticket classification at scale, data enrichment, summarization. Default thinking off. Turn on only for the subset of jobs where offline A/B measured a real quality win. Cap at 4,000 tokens. Model: Sonnet with Opus escalation only on measured need.
Tier 3: Deliberative. Thinking on with a budget.
Analysis work that a human operator will review before it ships: security findings, architectural recommendations, editorial drafts, tool-orchestrated research. Opus with extended thinking, budget set per task per measurement. This is the narrow tier where the feature earns its cost. It is also the tier with the lowest call volume.
Call-volume distribution on a typical engagement: tier 1 is roughly 80 percent of volume, tier 2 is 18 percent, tier 3 is 2 percent. The thinking budget is concentrated in the 2 percent. That is the design. Anything else is the default-on trap wearing a costume.
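Expressed as code, the whole policy is a routing table plus one rule for tier 2. The model identifiers and the tier-3 default budget below are placeholders; the real values come from the per-task budget table and your own measurements.

```python
TIER_POLICY = {
    "interactive":  {"model": "claude-haiku",  "thinking_budget": 0},      # tier 1: thinking off
    "background":   {"model": "claude-sonnet", "thinking_budget": 0,       # tier 2: off by default
                     "budget_if_measured_win": 4_000},
    "deliberative": {"model": "claude-opus",   "thinking_budget": 5_000},  # tier 3: per-task, placeholder default
}

def route(tier: str, measured_win: bool = False) -> dict:
    """Return the model and thinking budget for a call. Tier 2 gets a budget
    only when an offline A/B showed a real quality win for that job type."""
    policy = dict(TIER_POLICY[tier])
    win_budget = policy.pop("budget_if_measured_win", 0)
    if tier == "background" and measured_win:
        policy["thinking_budget"] = win_budget
    return policy
```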
Design the tiers first. Set the budgets second. Measure the quality delta third. Do not turn the parameter on because the task feels hard. Turn it on because you have a scorer that told you it helps.