Claude Code Token Usage by Task Type: Benchmarks (2026)
Not all Claude Code tasks consume tokens equally. A code review uses roughly 20% fewer tokens than generating new code from scratch. A refactor burns 50% more. These multipliers matter when you are projecting monthly costs or deciding which tasks to automate first. The Token Estimator applies these multipliers automatically, but understanding the underlying data helps you plan better.
This benchmark data comes from 300+ Claude Code sessions across TypeScript, Python, and Rust codebases, normalized against a baseline of “generate a new function with tests” (1.0x multiplier).
Task Type Multipliers
Each multiplier represents total token consumption relative to a standard code generation task of equivalent scope:
| Task Type | Multiplier | Avg Input Tokens | Avg Output Tokens | Total (Median) | Why |
|---|---|---|---|---|---|
| Code generation | 1.0x | 25,000 | 8,000 | 33,000 | Baseline – reads context, writes code |
| Code review | 0.8x | 22,000 | 4,500 | 26,500 | Reads more, writes less (comments only) |
| Documentation | 0.6x | 15,000 | 5,000 | 20,000 | Minimal file reads, structured output |
| Test writing | 1.2x | 30,000 | 10,000 | 40,000 | Reads implementation + writes test code |
| Debugging | 1.3x | 35,000 | 8,000 | 43,000 | Exploratory reads, multiple hypotheses |
| Refactoring | 1.5x | 40,000 | 12,000 | 52,000 | Reads many files, writes many edits |
| Architecture | 1.8x | 50,000 | 15,000 | 65,000 | Broad context, multi-file output |
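To turn a multiplier into a quick estimate, multiply the 33,000-token baseline by the task's factor. Here is a minimal sketch using the medians from the table above (the multipliers are rounded, so results land close to, not exactly on, the table's median totals):
# Total-token multipliers and baseline from the benchmark table above
MULTIPLIERS = {
    "code_generation": 1.0,
    "code_review": 0.8,
    "documentation": 0.6,
    "test_writing": 1.2,
    "debugging": 1.3,
    "refactoring": 1.5,
    "architecture": 1.8,
}
BASELINE_TOKENS = 33_000  # median total for a 1.0x task (~100-line scope)

def estimate_tokens(task_type: str) -> int:
    """Rough median token estimate for a task of ~100-line scope."""
    return round(MULTIPLIERS[task_type] * BASELINE_TOKENS)

print(estimate_tokens("refactoring"))  # 49500 – near the 52,000 table median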
Methodology
The benchmarks follow a consistent protocol:
# Each benchmark session:
# 1. Fresh Claude Code session (no accumulated context)
# 2. Single task prompt (no follow-up corrections)
# 3. Task scoped to ~100 lines of affected code
# 4. Measured on Sonnet model (default)
# Token counts captured from session summary
# Normalized to 100-line scope for comparison
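The benchmarks do not spell out the normalization step, but assuming simple linear scaling to a 100-line scope (that linear assumption is mine, not from the benchmark data), it can be sketched like this:
def normalize_to_100_lines(raw_tokens: int, lines_affected: int) -> float:
    """Scale a session's token count to the benchmark's 100-line scope.
    Assumes token usage scales roughly linearly with affected lines."""
    return raw_tokens * (100 / lines_affected)

# e.g. a 150-line refactor that used 78,000 tokens normalizes to 52,000
print(normalize_to_100_lines(78_000, 150))  # 52000.0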
Tasks that required follow-up corrections were measured separately. On average, a correction round adds 0.3x to the multiplier – meaning a code generation task that needs one fix becomes 1.3x total.
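As a rule of thumb from the paragraph above, the effective multiplier grows by roughly 0.3x per correction round; a tiny sketch:
def effective_multiplier(base: float, correction_rounds: int = 0) -> float:
    """Task multiplier plus ~0.3x per follow-up correction round (benchmark average)."""
    return base + 0.3 * correction_rounds

# Code generation (1.0x) with one fix ≈ 1.3x, i.e. roughly 43,000 tokens
print(effective_multiplier(1.0, correction_rounds=1))  # 1.3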
Breakdown by Task Type
Code Generation (1.0x Baseline)
Standard code generation reads 3-5 context files and produces new implementation plus tests. Token split is roughly 75% input / 25% output.
# Typical prompt that hits 1.0x baseline:
# "Create a rate limiter middleware for Express that supports
# per-route limits stored in Redis. Include unit tests."
# Claude Code will:
# - Read existing middleware files (~3,000 tokens)
# - Read route definitions (~2,000 tokens)
# - Read test patterns (~2,000 tokens)
# - Generate implementation (~4,000 tokens)
# - Generate tests (~4,000 tokens)
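# Note: the file reads above total ~7,000 tokens; the rest of the ~25,000-token
# average input is likely tool-call overhead and conversation framing,
# as itemized in the refactoring breakdown below.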
Debugging (1.3x)
Debugging consumes more input tokens because Claude Code explores multiple files searching for the root cause. It reads stack traces, related modules, and test failures before narrowing down.
# Debugging prompt example:
# "The /api/users endpoint returns 500 when the email
# contains a plus sign. Find and fix the bug."
# Claude Code exploration pattern:
# - Read error logs / stack trace (~2,000 tokens)
# - Read route handler (~2,000 tokens)
# - Read validation logic (~2,000 tokens)
# - Read email parsing utility (~1,500 tokens)
# - Read related tests (~2,000 tokens)
# - Hypothesis testing via Bash (~3,000 tokens)
# - Write fix (~1,500 tokens)
# - Write regression test (~2,000 tokens)
Refactoring (1.5x)
Refactoring is one of the most input-heavy task types. Claude Code must understand the full dependency graph before making changes, then edit multiple files while maintaining consistency.
// Refactoring prompt example:
// "Extract the payment processing logic from OrderService
// into a separate PaymentService. Update all callers."
// Token breakdown for a typical refactor:
// Input: 40,000 tokens
// - Source file reads: 25,000 (8-12 files)
// - Tool call overhead: 5,000
// - Conversation framing: 10,000
// Output: 12,000 tokens
// - New service file: 4,000
// - Modified callers: 6,000
// - Updated tests: 2,000
Code Review (0.8x)
Reviews consume fewer output tokens because Claude Code writes comments rather than code. The input side remains substantial since it needs to read the full changeset plus surrounding context.
# Review prompt:
# "Review the changes in src/auth/ for security issues,
# error handling gaps, and performance concerns."
# Output is concise: findings + recommendations
# No code generation means roughly 40% fewer output tokens
Model Impact on Multipliers
The multipliers hold across models, but absolute token counts shift:
| Model | Output Verbosity | Cost per 33K Session |
|---|---|---|
| Haiku | 0.7x output tokens | $0.02 |
| Sonnet | 1.0x (baseline) | $0.12 |
| Opus | 1.3x output tokens | $0.55 |
Opus produces more thorough output (longer explanations, more edge cases handled), which increases the output token portion. Use the Model Selector to choose based on task requirements.
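In practice only the output portion of a session scales with verbosity, so the 25,000-input / 8,000-output baseline shifts like this (a sketch using the table's factors; real sessions vary):
VERBOSITY = {"haiku": 0.7, "sonnet": 1.0, "opus": 1.3}

def session_tokens(input_tokens: int, output_tokens: int, model: str) -> int:
    """Adjust only the output portion by the model's verbosity factor."""
    return input_tokens + round(output_tokens * VERBOSITY[model])

for model in VERBOSITY:
    print(model, session_tokens(25_000, 8_000, model))
# haiku 30600, sonnet 33000, opus 35400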
Applying Multipliers to Budget Planning
Combine task multipliers with your weekly task mix to project monthly spend:
# Weekly task breakdown example:
tasks = {
"code_gen": {"count": 15, "multiplier": 1.0},
"debugging": {"count": 8, "multiplier": 1.3},
"review": {"count": 10, "multiplier": 0.8},
"refactor": {"count": 3, "multiplier": 1.5},
"tests": {"count": 5, "multiplier": 1.2},
"docs": {"count": 4, "multiplier": 0.6},
}
base_tokens = 33_000 # median for 1.0x task
weekly_tokens = sum(
t["count"] * t["multiplier"] * base_tokens
for t in tasks.values()
)
monthly_tokens = weekly_tokens * 4.3
# Sonnet pricing: $3/MTok input, $15/MTok output (75/25 split)
input_cost = (monthly_tokens * 0.75 / 1_000_000) * 3
output_cost = (monthly_tokens * 0.25 / 1_000_000) * 15
monthly_cost = input_cost + output_cost
print(f"Weekly tokens: {weekly_tokens:,.0f}")
print(f"Monthly tokens: {monthly_tokens:,.0f}")
print(f"Monthly cost: ${monthly_cost:.2f}")
# Weekly tokens: 1,527,900
# Monthly tokens: 6,569,970
# Monthly cost: $39.42
Try It Yourself
Run the Token Estimator to calculate costs for your specific task mix. Input your typical weekly breakdown of code gen, debugging, review, and refactoring tasks. The estimator applies these benchmarked multipliers and returns a monthly projection broken down by task type.
Frequently Asked Questions
Do these multipliers change with codebase size?
The multipliers stay consistent, but the base token count scales with codebase size. A refactor in a 10,000-line codebase at 1.5x might use 52,000 tokens. The same refactor scope in a 100,000-line codebase could use 80,000 tokens at the same 1.5x multiplier – because the base is higher due to more context files. Use the Token Estimator to adjust for your codebase size.
Why is documentation the cheapest task type?
Documentation tasks read fewer files (often just the public API surface) and produce structured, predictable output. Claude Code does not need to explore dependencies or test interactions. The 0.6x multiplier reflects 40% less total token usage compared to code generation.
How do multi-step tasks affect the multiplier?
Multi-step tasks (e.g., "add a feature and write tests") combine multipliers. A feature (1.0x) plus tests (1.2x) does not simply add to 2.2x – there is shared context, so the combined multiplier is typically 1.6-1.8x (one way to model this is sketched after this FAQ). The Cost Calculator handles compound tasks.
Are these benchmarks for Claude Code CLI or API?
These benchmarks are from Claude Code CLI sessions. API usage with the same models follows similar patterns, but without Claude Code's automatic file reading and tool usage. Raw API calls tend to use fewer input tokens since you control exactly what context gets sent. See Best Practices for API-specific optimization.
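One simple way to model the shared-context effect described above is a flat discount on the summed multipliers; the 25% figure below is only an assumption chosen to land in the 1.6-1.8x range, not a measured value:
def compound_multiplier(parts: list[float], shared_context_discount: float = 0.25) -> float:
    """Combine task multipliers, discounting for context shared between steps.
    The discount is an assumed value tuned to match the 1.6-1.8x range above."""
    return sum(parts) * (1 - shared_context_discount)

# Feature (1.0x) + tests (1.2x): 2.2x naive, ~1.65x after the shared-context discount
print(round(compound_multiplier([1.0, 1.2]), 2))  # 1.65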
Related Guides
- Token Estimator – Calculate costs using these benchmarked multipliers
- Claude Code Cost Calculator – Convert token projections to dollar amounts
- Model Selector – Match model choice to task type for optimal cost
- Cost Optimization Strategies – Reduce per-task token consumption
- Best Practices – General guidelines for efficient Claude Code usage