How Context Window Size Drives Claude API Bills
A Claude Opus 4.7 request with 500K tokens of context costs $2.50 in input alone. The same request with 30K tokens of relevant context costs $0.15. Context size — not model selection — is the single largest determinant of your API bill, often accounting for 70-90% of total spend.
The Setup
Developers frequently debate which Claude model to use without examining how much context they send per request. Switching from Opus to Sonnet saves 40%. Reducing context from 500K to 50K tokens saves 90%. The context lever is more powerful.
Claude’s token-based pricing means you pay for every token in the prompt, including system messages, conversation history, attached documents, and tool definitions. In long-running sessions, context accumulates silently until a single interaction costs more than the first ten interactions combined.
This guide shows how to audit your context consumption, identify the biggest context consumers, and implement controls that keep context costs manageable.
The Math
Context size impact on a single Opus 4.7 request (input at $5/MTok):
| Context | Input Cost | With 2K Output | Total |
|---|---|---|---|
| 5K tokens | $0.025 | $0.050 | $0.075 |
| 50K tokens | $0.250 | $0.050 | $0.300 |
| 200K tokens | $1.000 | $0.050 | $1.050 |
| 500K tokens | $2.500 | $0.050 | $2.550 |
| 1M tokens | $5.000 | $0.050 | $5.050 |
The output cost ($0.050 for 2K tokens at $25/MTok) is constant. The input cost varies by 200x.
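The table's figures fall out of a simple linear formula. A minimal sketch, assuming the Opus rates used above ($5/MTok input, $25/MTok output):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 5.0, output_rate: float = 25.0) -> float:
    """Dollar cost of one request at the given $/MTok rates (Opus rates assumed)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 500K context vs 5K context, both producing 2K output
print(f"${request_cost(500_000, 2_000):.2f}")  # $2.55
print(f"${request_cost(5_000, 2_000):.3f}")    # $0.075
```

Because output length is roughly constant for most request types, the input term dominates, which is why the context lever moves the bill so much.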
Monthly impact (100 requests/day, 30 days):
| Context | Monthly Input Cost | vs 50K Baseline |
|---|---|---|
| 50K | $750 | Baseline |
| 100K | $1,500 | +$750 |
| 200K | $3,000 | +$2,250 |
| 500K | $7,500 | +$6,750 |
Reducing average context from 200K to 50K saves $2,250/month — equivalent to switching from Opus to Haiku.
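The monthly numbers are just the per-request input cost scaled by volume. A quick sketch, assuming the same $5/MTok Opus input rate, 100 requests/day, and a 30-day month:

```python
def monthly_input_cost(avg_context_tokens: int, requests_per_day: int = 100,
                       days: int = 30, input_rate: float = 5.0) -> float:
    """Monthly input spend in dollars at a $/MTok rate."""
    return avg_context_tokens * input_rate / 1_000_000 * requests_per_day * days

print(monthly_input_cost(50_000))   # 750.0
print(monthly_input_cost(200_000))  # 3000.0
# monthly saving from cutting average context 200K -> 50K
print(monthly_input_cost(200_000) - monthly_input_cost(50_000))  # 2250.0
```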
The Technique
Step 1: Audit Context Size Per Request
```python
import anthropic
import json
from collections import defaultdict

client = anthropic.Anthropic()
context_log = defaultdict(list)

def logged_request(
    model: str,
    system: str,
    messages: list,
    request_type: str = "unknown",
    max_tokens: int = 4096,
    **kwargs,
) -> dict:
    """Make a request and log context size for analysis."""
    # Pre-count tokens before sending the request
    count_kwargs = {"model": model, "messages": messages}
    if system:
        count_kwargs["system"] = system
    count = client.messages.count_tokens(**count_kwargs)

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=messages,
        **kwargs,
    )

    # Input $/MTok by model; default to the Sonnet rate for unknown models
    rates = {"claude-opus-4-7": 5.0, "claude-sonnet-4-6": 3.0, "claude-haiku-4-5-20251001": 1.0}
    input_rate = rates.get(model, 3.0)
    input_cost = count.input_tokens * input_rate / 1_000_000

    entry = {
        "request_type": request_type,
        "input_tokens": count.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "input_cost": input_cost,
        "model": model,
    }
    context_log[request_type].append(entry)
    return {
        "content": response.content[0].text,
        **entry,
    }

def generate_context_report() -> str:
    """Generate a cost report broken down by request type."""
    report = []
    for req_type, entries in context_log.items():
        avg_input = sum(e["input_tokens"] for e in entries) / len(entries)
        total_cost = sum(e["input_cost"] for e in entries)
        report.append({
            "type": req_type,
            "count": len(entries),
            "avg_input_tokens": int(avg_input),
            "total_input_cost": total_cost,
        })
    # Sort on the numeric cost, then format for display
    report.sort(key=lambda x: x["total_input_cost"], reverse=True)
    for row in report:
        row["total_input_cost"] = f"${row['total_input_cost']:.4f}"
    return json.dumps(report, indent=2)
```
Step 2: Identify and Reduce Context Bloat
```python
def analyze_context_composition(messages: list) -> dict:
    """Break down where context tokens come from."""
    composition = {
        "code_blocks": 0,
        "error_traces": 0,
        "conversation": 0,
    }
    for msg in messages:
        content = msg["content"]
        in_code_block = False
        for line in content.split("\n"):
            if line.strip().startswith("```"):
                in_code_block = not in_code_block
                continue
            tokens_estimate = len(line.split()) * 1.5  # rough words-to-tokens estimate
            if in_code_block:
                composition["code_blocks"] += tokens_estimate
            elif any(sig in line for sig in ["Traceback", "Error:", "Warning:", "at "]):
                composition["error_traces"] += tokens_estimate
            else:
                composition["conversation"] += tokens_estimate
    total = sum(composition.values())
    percentages = {k: f"{v / max(total, 1) * 100:.1f}%" for k, v in composition.items()}
    return {"tokens": composition, "percentages": percentages, "total_estimated": int(total)}

# Example: Analyze a session's context
session_messages = [
    {"role": "user", "content": "Fix the bug in auth.py"},
    {"role": "assistant", "content": "Looking at auth.py...\n```python\ndef login(user, pw):\n    # ... 50 lines of code\n```\nI found the issue."},
    {"role": "user", "content": "That gave an error:\nTraceback (most recent call last):\n  File 'auth.py', line 42\n  ... 20 lines of traceback"},
]
analysis = analyze_context_composition(session_messages)
print(json.dumps(analysis, indent=2))
```
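When the breakdown shows error traces dominating, one simple mitigation is to truncate long tracebacks before they enter the conversation, keeping the head and tail where the useful frames usually live. A minimal sketch; the default line counts are arbitrary choices, not recommendations:

```python
def truncate_trace(trace: str, head: int = 5, tail: int = 10) -> str:
    """Keep the first `head` and last `tail` lines of a long traceback."""
    lines = trace.splitlines()
    if len(lines) <= head + tail:
        return trace  # already short enough, send as-is
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:])
```

The same pattern works for any repetitive bulk content: long logs, large diffs, or dumped data structures.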
Step 3: Implement Context Size Controls
```python
MAX_CONTEXT = {
    "code_review": 50_000,
    "chat": 20_000,
    "codebase_analysis": 200_000,
    "quick_question": 5_000,
}

def context_controlled_request(
    request_type: str,
    system: str,
    messages: list,
    model: str = "claude-sonnet-4-6",
) -> dict:
    """Enforce context size limits per request type."""
    context_limit = MAX_CONTEXT.get(request_type, 50_000)
    count = client.messages.count_tokens(
        model=model, system=system, messages=messages,
    )
    if count.input_tokens > context_limit:
        # Auto-prune: keep the system prompt plus the last 4 messages
        # (in production, ensure the pruned list still starts with a user turn)
        pruned_messages = messages[-4:]
        count = client.messages.count_tokens(
            model=model, system=system, messages=pruned_messages,
        )
        if count.input_tokens > context_limit:
            return {"error": f"Context ({count.input_tokens}) exceeds limit ({context_limit}) even after pruning"}
        messages = pruned_messages
    response = client.messages.create(
        model=model, max_tokens=4096,
        system=system, messages=messages,
    )
    return {
        "content": response.content[0].text,
        "context_tokens": count.input_tokens,
        "limit": context_limit,
        "utilization": f"{count.input_tokens / context_limit * 100:.1f}%",
    }
```
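Keeping only the last few messages can discard the original task statement. One common refinement is to keep the first message plus as many of the most recent turns as fit. A sketch of that strategy, using a crude 4-characters-per-token estimate in place of the API token counter (both helpers are hypothetical, not part of the SDK):

```python
def estimate_tokens(text: str) -> int:
    """Crude offline estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def prune_keep_ends(messages: list, limit: int) -> list:
    """Keep the first message (the task statement) and as many of the
    most recent messages as fit under the token limit."""
    if not messages:
        return messages
    first, rest = messages[0], messages[1:]
    budget = limit - estimate_tokens(first["content"])
    kept = []
    for msg in reversed(rest):  # walk backwards from the newest message
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [first] + list(reversed(kept))
```

Verify the final size with `count_tokens` before sending, since the character heuristic can be off by 20% or more on code-heavy content.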
The Tradeoffs
Reducing context can remove information Claude needs to give a complete answer. The risk is higher for tasks that require cross-referencing multiple parts of a large document or codebase.
Context size controls may frustrate users who expect Claude to “just know” everything in their project. Communicate context limits clearly and provide escape hatches for tasks that genuinely need large context.
The most expensive requests are often the most valuable — a $5.00 full-codebase analysis that finds a critical bug saves far more than it costs. Focus optimization on repetitive, high-volume requests, not occasional deep analysis.
Implementation Checklist
- Add context size logging to every API request
- Generate a weekly context size report by request type
- Identify the top 3 request types by total context token spend
- Set context size limits per request type
- Implement auto-pruning for requests that exceed limits
- Review context budgets monthly as workload changes
Measuring Impact
The primary metric is average context size per request, tracked weekly. Secondary: total monthly input token spend. Target reducing average context by 50% without increasing quality complaints by more than 5%. Plot context size distribution — if you have a long tail of 200K+ requests, those outliers may represent a disproportionate share of your budget.
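The outlier check is easy to compute: given per-request context sizes from your logs, find what share of input spend comes from requests above a threshold. A minimal sketch; since cost is proportional to input tokens at a fixed rate, the rate cancels out:

```python
def outlier_spend_share(context_sizes: list, threshold: int = 200_000) -> float:
    """Fraction of total input-token spend from requests above `threshold` tokens."""
    total = sum(context_sizes)
    if total == 0:
        return 0.0
    return sum(t for t in context_sizes if t > threshold) / total

# 9 modest requests and one 500K outlier
sizes = [20_000] * 9 + [500_000]
print(f"{outlier_spend_share(sizes):.1%}")  # 73.5%
```

If a handful of requests carry most of the spend, target those request types first rather than shaving tokens off everything uniformly.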
Related Guides
- Why Is Claude Code Expensive — the original analysis of context-driven costs
- Claude Code Context Window Management Guide — tactical context management strategies
- Why Does Anthropic Limit Claude Code Context Window — understanding context limits