Optimal Context Size for Cost-Efficient Claude

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

The cost-optimal context size for most Claude API tasks is 20,000-50,000 tokens. At this range, Opus 4.7 costs $0.10-$0.25 per request compared to $5.00 for a full 1M context — a 95% reduction. Most question-answering, code review, and document analysis tasks produce equivalent quality at 50K context versus 500K.

The Setup

Filling more context does not always improve results. Claude’s attention is not uniform across the context window — information at the beginning and end of the context gets more attention than content in the middle. After a certain threshold, adding more context adds noise rather than signal.

The optimal context size balances three factors: answer quality (more relevant context helps), cost (more tokens cost more), and latency (larger contexts take longer to process). This guide identifies the sweet spot for common task types.

The Math

Cost per request by context size on Opus 4.7:

Context Size Input Cost Relative to 50K Quality Impact
10K tokens $0.05 0.2x Some tasks too small
20K tokens $0.10 0.4x Good for focused tasks
50K tokens $0.25 1.0x (baseline) Sweet spot for most tasks
100K tokens $0.50 2.0x Marginal quality gain
200K tokens $1.00 4.0x Diminishing returns
500K tokens $2.50 10.0x Rarely justified
1M tokens $5.00 20.0x Full codebase analysis only

Monthly cost at 500 requests/day:

Context Size Monthly Cost vs 50K Target
50K (optimal) $3,750 Baseline
100K $7,500 +$3,750
200K $15,000 +$11,250
500K $37,500 +$33,750

Moving from 200K average to 50K average saves $11,250/month.

The Technique

Task-Specific Context Budgets

import anthropic

client = anthropic.Anthropic()

# Optimal context sizes by task type (in tokens)
CONTEXT_BUDGETS = {
    "classification": 5_000,       # Label + short text
    "entity_extraction": 10_000,   # Document + extraction rules
    "code_review": 30_000,         # Changed files + nearby code
    "summarization": 50_000,       # Full document for summary
    "question_answering": 20_000,  # Retrieved passages + question
    "code_generation": 30_000,     # Specs + existing code patterns
    "codebase_analysis": 200_000,  # Multiple related files
    "document_comparison": 100_000, # Two full documents
}

def budget_request(
    task_type: str,
    system: str,
    content: str,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 4096,
) -> dict:
    """Send request with task-appropriate context budget."""
    budget = CONTEXT_BUDGETS.get(task_type, 50_000)

    # Estimate current content size
    content_tokens_est = len(content.split()) * 2

    if content_tokens_est > budget:
        # Truncate to budget (smarter strategies below)
        target_chars = budget * 4  # rough tokens to chars
        content = content[:target_chars]
        was_truncated = True
    else:
        was_truncated = False

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": content}],
    )

    rates = {"claude-opus-4-7": 5.0, "claude-sonnet-4-6": 3.0, "claude-haiku-4-5-20251001": 1.0}
    rate = rates.get(model, 3.0)
    cost = response.usage.input_tokens * rate / 1_000_000

    return {
        "task_type": task_type,
        "budget": budget,
        "actual_tokens": response.usage.input_tokens,
        "cost": f"${cost:.4f}",
        "was_truncated": was_truncated,
        "content": response.content[0].text,
    }

Smart Truncation That Preserves Quality

def smart_truncate(content: str, budget_tokens: int, content_type: str = "code") -> str:
    """Truncate content intelligently based on type."""
    target_chars = budget_tokens * 4

    if len(content) <= target_chars:
        return content

    if content_type == "code":
        return _truncate_code(content, target_chars)
    elif content_type == "document":
        return _truncate_document(content, target_chars)
    else:
        return content[:target_chars]

def _truncate_code(content: str, target_chars: int) -> str:
    """Keep imports, class/function signatures, and recent code."""
    lines = content.split("\n")

    # Priority 1: Imports and signatures (always keep)
    priority_lines = []
    body_lines = []

    for line in lines:
        stripped = line.strip()
        if (stripped.startswith(("import ", "from ", "class ", "def ", "async def ",
                                 "function ", "export ", "const ", "interface "))
            or stripped.startswith("#") and len(stripped) < 100):
            priority_lines.append(line)
        else:
            body_lines.append(line)

    priority_text = "\n".join(priority_lines)
    remaining_budget = target_chars - len(priority_text)

    if remaining_budget > 0:
        # Add body from the end (most recent code is most relevant)
        body_text = "\n".join(body_lines)
        if len(body_text) > remaining_budget:
            body_text = "...\n" + body_text[-remaining_budget:]
        return priority_text + "\n\n" + body_text

    return priority_text[:target_chars]

def _truncate_document(content: str, target_chars: int) -> str:
    """Keep beginning (intro/summary) and end (conclusion)."""
    half = target_chars // 2
    return content[:half] + "\n\n[...content truncated...]\n\n" + content[-half:]

Finding Your Optimal Context Size Experimentally

def find_optimal_context(
    full_content: str,
    test_questions: list,
    ground_truth: list,
    model: str = "claude-sonnet-4-6",
    context_sizes: list = None,
) -> dict:
    """Test different context sizes to find the optimal tradeoff."""
    if context_sizes is None:
        context_sizes = [5000, 10000, 20000, 50000, 100000, 200000]

    results = []
    for size in context_sizes:
        truncated = smart_truncate(full_content, size, "document")

        correct = 0
        total_cost = 0
        for question, truth in zip(test_questions, ground_truth):
            response = client.messages.create(
                model=model,
                max_tokens=500,
                system="Answer concisely based on the provided context.",
                messages=[{"role": "user", "content": f"{truncated}\n\nQ: {question}"}],
            )

            if truth.lower() in response.content[0].text.lower():
                correct += 1
            total_cost += response.usage.input_tokens * 3.0 / 1_000_000

        accuracy = correct / len(test_questions) * 100
        avg_cost = total_cost / len(test_questions)

        results.append({
            "context_tokens": size,
            "accuracy": f"{accuracy:.1f}%",
            "avg_cost_per_query": f"${avg_cost:.4f}",
            "cost_per_accuracy_point": f"${avg_cost / max(accuracy, 1):.6f}",
        })

    return results

The Tradeoffs

The 20-50K sweet spot applies to single-topic tasks. Multi-topic analysis, cross-referencing, and document comparison genuinely benefit from larger context. Do not force-fit these tasks into a small context budget.

Smart truncation can cut important information. Always test truncated prompts against your full-context baseline on representative queries before deploying.

Some tasks have hard context minimums. A 100-page legal contract analysis cannot be done in 20K tokens. Know your task requirements and set budgets accordingly.

Implementation Checklist

  1. Categorize your API requests by task type
  2. Set initial context budgets per category using the table above
  3. Implement smart truncation for each content type
  4. Run accuracy tests at budget versus full context on 50 queries
  5. Adjust budgets based on accuracy results
  6. Deploy and monitor quality and cost weekly

Measuring Impact

Plot a cost-accuracy curve for your workload: x-axis is context size, y-axis is answer quality. The optimal point is where the curve starts flattening — adding more context costs more but quality barely improves. For most Q&A and review tasks, this inflection point falls between 20K and 50K tokens.