Optimal Context Size for Cost-Efficient Claude
The cost-optimal context size for most Claude API tasks is 20,000-50,000 tokens. At this range, Opus 4.7 costs $0.10-$0.25 per request compared to $5.00 for a full 1M context — a 95% reduction. Most question-answering, code review, and document analysis tasks produce equivalent quality at 50K context versus 500K.
The Setup
Filling more context does not always improve results. Claude’s attention is not uniform across the context window — information at the beginning and end of the context gets more attention than content in the middle. After a certain threshold, adding more context adds noise rather than signal.
The optimal context size balances three factors: answer quality (more relevant context helps), cost (more tokens cost more), and latency (larger contexts take longer to process). This guide identifies the sweet spot for common task types.
The Math
Cost per request by context size on Opus 4.7:
| Context Size | Input Cost | Relative to 50K | Quality Impact |
|---|---|---|---|
| 10K tokens | $0.05 | 0.2x | Some tasks too small |
| 20K tokens | $0.10 | 0.4x | Good for focused tasks |
| 50K tokens | $0.25 | 1.0x (baseline) | Sweet spot for most tasks |
| 100K tokens | $0.50 | 2.0x | Marginal quality gain |
| 200K tokens | $1.00 | 4.0x | Diminishing returns |
| 500K tokens | $2.50 | 10.0x | Rarely justified |
| 1M tokens | $5.00 | 20.0x | Full codebase analysis only |
Monthly cost at 500 requests/day:
| Context Size | Monthly Cost | vs 50K Target |
|---|---|---|
| 50K (optimal) | $3,750 | Baseline |
| 100K | $7,500 | +$3,750 |
| 200K | $15,000 | +$11,250 |
| 500K | $37,500 | +$33,750 |
Moving from 200K average to 50K average saves $11,250/month.
The Technique
Task-Specific Context Budgets
import anthropic
client = anthropic.Anthropic()
# Optimal context sizes by task type (in tokens)
CONTEXT_BUDGETS = {
"classification": 5_000, # Label + short text
"entity_extraction": 10_000, # Document + extraction rules
"code_review": 30_000, # Changed files + nearby code
"summarization": 50_000, # Full document for summary
"question_answering": 20_000, # Retrieved passages + question
"code_generation": 30_000, # Specs + existing code patterns
"codebase_analysis": 200_000, # Multiple related files
"document_comparison": 100_000, # Two full documents
}
def budget_request(
task_type: str,
system: str,
content: str,
model: str = "claude-sonnet-4-6",
max_tokens: int = 4096,
) -> dict:
"""Send request with task-appropriate context budget."""
budget = CONTEXT_BUDGETS.get(task_type, 50_000)
# Estimate current content size
content_tokens_est = len(content.split()) * 2
if content_tokens_est > budget:
# Truncate to budget (smarter strategies below)
target_chars = budget * 4 # rough tokens to chars
content = content[:target_chars]
was_truncated = True
else:
was_truncated = False
response = client.messages.create(
model=model,
max_tokens=max_tokens,
system=system,
messages=[{"role": "user", "content": content}],
)
rates = {"claude-opus-4-7": 5.0, "claude-sonnet-4-6": 3.0, "claude-haiku-4-5-20251001": 1.0}
rate = rates.get(model, 3.0)
cost = response.usage.input_tokens * rate / 1_000_000
return {
"task_type": task_type,
"budget": budget,
"actual_tokens": response.usage.input_tokens,
"cost": f"${cost:.4f}",
"was_truncated": was_truncated,
"content": response.content[0].text,
}
Smart Truncation That Preserves Quality
def smart_truncate(content: str, budget_tokens: int, content_type: str = "code") -> str:
"""Truncate content intelligently based on type."""
target_chars = budget_tokens * 4
if len(content) <= target_chars:
return content
if content_type == "code":
return _truncate_code(content, target_chars)
elif content_type == "document":
return _truncate_document(content, target_chars)
else:
return content[:target_chars]
def _truncate_code(content: str, target_chars: int) -> str:
"""Keep imports, class/function signatures, and recent code."""
lines = content.split("\n")
# Priority 1: Imports and signatures (always keep)
priority_lines = []
body_lines = []
for line in lines:
stripped = line.strip()
if (stripped.startswith(("import ", "from ", "class ", "def ", "async def ",
"function ", "export ", "const ", "interface "))
or stripped.startswith("#") and len(stripped) < 100):
priority_lines.append(line)
else:
body_lines.append(line)
priority_text = "\n".join(priority_lines)
remaining_budget = target_chars - len(priority_text)
if remaining_budget > 0:
# Add body from the end (most recent code is most relevant)
body_text = "\n".join(body_lines)
if len(body_text) > remaining_budget:
body_text = "...\n" + body_text[-remaining_budget:]
return priority_text + "\n\n" + body_text
return priority_text[:target_chars]
def _truncate_document(content: str, target_chars: int) -> str:
"""Keep beginning (intro/summary) and end (conclusion)."""
half = target_chars // 2
return content[:half] + "\n\n[...content truncated...]\n\n" + content[-half:]
Finding Your Optimal Context Size Experimentally
def find_optimal_context(
full_content: str,
test_questions: list,
ground_truth: list,
model: str = "claude-sonnet-4-6",
context_sizes: list = None,
) -> dict:
"""Test different context sizes to find the optimal tradeoff."""
if context_sizes is None:
context_sizes = [5000, 10000, 20000, 50000, 100000, 200000]
results = []
for size in context_sizes:
truncated = smart_truncate(full_content, size, "document")
correct = 0
total_cost = 0
for question, truth in zip(test_questions, ground_truth):
response = client.messages.create(
model=model,
max_tokens=500,
system="Answer concisely based on the provided context.",
messages=[{"role": "user", "content": f"{truncated}\n\nQ: {question}"}],
)
if truth.lower() in response.content[0].text.lower():
correct += 1
total_cost += response.usage.input_tokens * 3.0 / 1_000_000
accuracy = correct / len(test_questions) * 100
avg_cost = total_cost / len(test_questions)
results.append({
"context_tokens": size,
"accuracy": f"{accuracy:.1f}%",
"avg_cost_per_query": f"${avg_cost:.4f}",
"cost_per_accuracy_point": f"${avg_cost / max(accuracy, 1):.6f}",
})
return results
The Tradeoffs
The 20-50K sweet spot applies to single-topic tasks. Multi-topic analysis, cross-referencing, and document comparison genuinely benefit from larger context. Do not force-fit these tasks into a small context budget.
Smart truncation can cut important information. Always test truncated prompts against your full-context baseline on representative queries before deploying.
Some tasks have hard context minimums. A 100-page legal contract analysis cannot be done in 20K tokens. Know your task requirements and set budgets accordingly.
Implementation Checklist
- Categorize your API requests by task type
- Set initial context budgets per category using the table above
- Implement smart truncation for each content type
- Run accuracy tests at budget versus full context on 50 queries
- Adjust budgets based on accuracy results
- Deploy and monitor quality and cost weekly
Measuring Impact
Plot a cost-accuracy curve for your workload: x-axis is context size, y-axis is answer quality. The optimal point is where the curve starts flattening — adding more context costs more but quality barely improves. For most Q&A and review tasks, this inflection point falls between 20K and 50K tokens.
Related Guides
- Why Is Claude Code Expensive — the economics of context-driven costs
- Claude Code Context Window Management Guide — managing context in practice
- Why Does Anthropic Limit Claude Code Context Window — design philosophy behind context limits