Claude Extended Thinking API Guide

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS · Join 50+ builders · More at zovo.one

Extended thinking gives Claude a dedicated reasoning step before it responds, which improves accuracy on complex tasks like math, coding, and multi-step analysis. It is supported on recent models (Claude Sonnet 3.7 and the Claude 4 family onward), not on older Claude 3 models.

Quick Fix

Enable extended thinking with one parameter:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What is 127 * 389?"}]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Answer: {block.text}")

What You Need

  1. The anthropic Python SDK (pip install anthropic) or the TypeScript SDK (@anthropic-ai/sdk)
  2. An API key in the ANTHROPIC_API_KEY environment variable
  3. A model that supports extended thinking (the examples below use claude-sonnet-4-6)

Full Solution

Basic Extended Thinking

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Explain why the sky is blue, step by step."}]
)

for block in response.content:
    if block.type == "thinking":
        print("=== Thinking ===")
        print(block.thinking)
        print("=== End Thinking ===\n")
    elif block.type == "text":
        print(block.text)

TypeScript Example

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 16000,
  thinking: { type: "enabled", budget_tokens: 10000 },
  messages: [{ role: "user", content: "Solve: What is 127 * 389?" }]
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("Thinking:", block.thinking);
  } else if (block.type === "text") {
    console.log("Answer:", block.text);
  }
}

Display Options

How thinking content is returned depends on the model generation, not on a request parameter:

# Claude 4 models return summarized thinking by default: the thinking
# blocks contain a summary of the full reasoning process, but you are
# charged for the full thinking tokens, not the summary tokens.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)

# Occasionally the API returns redacted_thinking blocks instead, where
# the reasoning is encrypted for safety. Pass them back unmodified in
# multi-turn conversations -- they preserve reasoning continuity even
# though you cannot read them.

budget_tokens Rules

The budget_tokens parameter sets the maximum number of tokens Claude may spend on reasoning. It must be at least 1024 and strictly less than max_tokens:

# Good: 1024 <= budget_tokens < max_tokens
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)

# Maximum output on Opus 4.6: 128k tokens. Requests with large
# max_tokens values should use streaming or the Batch API to avoid
# network timeouts.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=128000,
    thinking={"type": "enabled", "budget_tokens": 100000},
    messages=[...]
)
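These constraints are easy to enforce with a small guard before sending a request. validate_thinking_budget below is a hypothetical helper for illustration, not part of the SDK:

```python
def validate_thinking_budget(budget_tokens: int, max_tokens: int) -> int:
    """Enforce the extended-thinking budget rules.

    budget_tokens must be at least 1024 and strictly less than
    max_tokens, so there is room left for the final answer.
    """
    if budget_tokens < 1024:
        raise ValueError(f"budget_tokens must be >= 1024, got {budget_tokens}")
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens ({budget_tokens}) must be less than "
            f"max_tokens ({max_tokens})"
        )
    return budget_tokens
```

Running this check client-side turns a round-trip API error into an immediate, local one.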

Extended Thinking with Tool Use

Thinking works with tools but only with tool_choice: auto or none:

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "calculator",
        "description": "Perform arithmetic calculations",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression to evaluate"}
            },
            "required": ["expression"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tool_choice={"type": "auto"},  # Only "auto" or "none" with thinking
    tools=tools,
    messages=[{"role": "user", "content": "Calculate the compound interest on $10,000 at 5% for 10 years"}]
)

Claude 4 models also support interleaved thinking between tool calls (opt in with the interleaved-thinking-2025-05-14 beta header): Claude can think before deciding to call a tool, receive the result, think again, and then respond.
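Completing the tool loop means executing the tool locally and sending the result back while keeping every assistant block, thinking included, intact. A minimal sketch; run_calculator and build_tool_result_turns are hypothetical helpers for the calculator tool defined above, not SDK functions:

```python
import ast
import operator

# Safe arithmetic evaluator standing in for the "calculator" tool.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def run_calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression}")
    return walk(ast.parse(expression, mode="eval"))

def build_tool_result_turns(assistant_content, tool_use_id, result):
    """Build the next two conversation turns: the assistant's blocks
    passed back unmodified (thinking blocks included), then the tool
    result in a user turn."""
    return [
        {"role": "assistant", "content": assistant_content},
        {
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": str(result),
            }],
        },
    ]
```

After appending these turns to messages, call client.messages.create again with the same thinking configuration to get the final answer.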

Multi-Turn Thinking Continuity

Pass thinking blocks back unmodified in multi-turn conversations to maintain reasoning continuity:

# First turn
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Analyze the pros and cons of remote work"}]
)

# Build multi-turn messages -- include ALL content blocks
messages = [
    {"role": "user", "content": "Analyze the pros and cons of remote work"},
    {"role": "assistant", "content": response1.content},  # Includes thinking blocks
    {"role": "user", "content": "Now focus on the productivity aspect"}
]

# Second turn
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=messages
)

Streaming Extended Thinking

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Solve a complex problem..."}]
) as stream:
    # Thinking and answer text arrive as separate delta types
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
    message = stream.get_final_message()

Output Token Limits by Model

Model               Max Output Tokens   Max with Batch API
Claude Opus 4.6     128,000             300,000 (beta)
Claude Sonnet 4.6   64,000              300,000 (beta)
Claude Haiku 4.5    64,000              300,000 (beta)
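Keeping these ceilings in one place lets you cap max_tokens before a request ever leaves your code. The values below are hardcoded from the table above, and max_output_tokens / clamp_max_tokens are hypothetical helpers, not SDK functions:

```python
# Output ceilings from the table above (standard API, not batch).
MODEL_MAX_OUTPUT = {
    "claude-opus-4-6": 128_000,
    "claude-sonnet-4-6": 64_000,
    "claude-haiku-4-5": 64_000,
}

def max_output_tokens(model: str) -> int:
    """Look up a model's output ceiling, failing loudly on unknown names."""
    try:
        return MODEL_MAX_OUTPUT[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}") from None

def clamp_max_tokens(model: str, requested: int) -> int:
    """Cap a requested max_tokens at the model's documented limit."""
    return min(requested, max_output_tokens(model))
```

Failing on unknown model names (rather than silently defaulting) keeps the table honest when new models ship with different limits.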

Caching with Extended Thinking

Changing thinking parameters invalidates cached messages, but system prompts and tools remain cached:

# Attach cache_control to the system prompt block; this cache
# survives thinking parameter changes
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[
        {
            "type": "text",
            "text": "Large system prompt...",
            "cache_control": {"type": "ephemeral"},  # This stays cached
        }
    ],
    messages=[...]  # Re-processed if thinking params change
)

Prevention

  1. Set budget_tokens appropriately: Simple questions need 1024-4096. Complex reasoning needs 10000+. Always keep it less than max_tokens.
  2. Use auto or none for tool_choice: any and specific tool names are not supported with thinking.
  3. Never modify thinking blocks: Return them exactly as received in multi-turn conversations.
  4. Use 1-hour cache: For workloads with extended thinking, the 1-hour cache TTL avoids frequent cache invalidation from thinking parameter changes.
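Rules 1-3 can be folded into a single request builder so every call starts from a valid configuration. thinking_request_kwargs is a hypothetical helper sketch, not part of the SDK:

```python
def thinking_request_kwargs(model, messages, max_tokens=16000,
                            budget_tokens=10000, tools=None):
    """Build kwargs for client.messages.create with extended thinking,
    enforcing the budget and tool_choice rules above."""
    # Rule 1: keep the budget in [1024, max_tokens)
    if not 1024 <= budget_tokens < max_tokens:
        raise ValueError("budget_tokens must be in [1024, max_tokens)")
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        # Rule 3: callers pass prior thinking blocks back unmodified
        "messages": messages,
    }
    if tools:
        kwargs["tools"] = tools
        # Rule 2: only "auto" (or "none") works with thinking enabled
        kwargs["tool_choice"] = {"type": "auto"}
    return kwargs
```

Then a call is just client.messages.create(**thinking_request_kwargs(...)), and misconfigured budgets fail locally instead of as API errors.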