How Context Window Size Drives Claude API Bills

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

A Claude Opus 4.7 request with 500K tokens of context costs $2.50 in input alone. The same request with 30K tokens of relevant context costs $0.15. Context size — not model selection — is the single largest determinant of your API bill, often accounting for 70-90% of total spend.

The Setup

Developers frequently debate which Claude model to use without examining how much context they send per request. Switching from Opus to Sonnet saves 40%. Reducing context from 500K to 50K tokens saves 90%. The context lever is more powerful.

Claude’s token-based pricing means you pay for every token in the prompt, including system messages, conversation history, attached documents, and tool definitions. On long-running sessions, context accumulates silently until a single interaction costs more than the first 10 combined.

This guide shows how to audit your context consumption, identify the biggest context consumers, and implement controls that keep context costs manageable.

The Math

Context size impact on a single Opus 4.7 request:

Context Input Cost With 2K Output Total
5K tokens $0.025 $0.050 $0.075
50K tokens $0.250 $0.050 $0.300
200K tokens $1.000 $0.050 $1.050
500K tokens $2.500 $0.050 $2.550
1M tokens $5.000 $0.050 $5.050

The output cost ($0.050 for 2K tokens at $25/MTok) is constant. The input cost varies by 200x.

Monthly impact (100 requests/day):

Context Monthly Input Cost vs 50K Baseline
50K $750 Baseline
100K $1,500 +$750
200K $3,000 +$2,250
500K $7,500 +$6,750

Reducing average context from 200K to 50K saves $2,250/month — equivalent to switching from Opus to Haiku.

The Technique

Step 1: Audit Context Size Per Request

import anthropic
import json
from collections import defaultdict

client = anthropic.Anthropic()

context_log = defaultdict(list)

def logged_request(
    model: str,
    system: str,
    messages: list,
    request_type: str = "unknown",
    max_tokens: int = 4096,
    **kwargs,
) -> dict:
    """Make a request and log context size for analysis."""
    # Pre-count tokens
    count_kwargs = {"model": model, "messages": messages}
    if system:
        count_kwargs["system"] = system
    count = client.messages.count_tokens(**count_kwargs)

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=messages,
        **kwargs,
    )

    rates = {"claude-opus-4-7": 5.0, "claude-sonnet-4-6": 3.0, "claude-haiku-4-5-20251001": 1.0}
    input_rate = rates.get(model, 3.0)
    input_cost = count.input_tokens * input_rate / 1_000_000

    entry = {
        "request_type": request_type,
        "input_tokens": count.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "input_cost": input_cost,
        "model": model,
    }
    context_log[request_type].append(entry)

    return {
        "content": response.content[0].text,
        **entry,
    }

def generate_context_report() -> str:
    """Generate a cost report broken down by request type."""
    report = []
    for req_type, entries in context_log.items():
        avg_input = sum(e["input_tokens"] for e in entries) / len(entries)
        total_cost = sum(e["input_cost"] for e in entries)
        report.append({
            "type": req_type,
            "count": len(entries),
            "avg_input_tokens": int(avg_input),
            "total_input_cost": f"${total_cost:.4f}",
        })

    report.sort(key=lambda x: float(x["total_input_cost"].replace("$", "")), reverse=True)
    return json.dumps(report, indent=2)

Step 2: Identify and Reduce Context Bloat

def analyze_context_composition(messages: list) -> dict:
    """Break down where context tokens come from."""
    composition = {
        "code_blocks": 0,
        "error_traces": 0,
        "conversation": 0,
        "other": 0,
    }

    for msg in messages:
        content = msg["content"]
        lines = content.split("\n")

        in_code_block = False
        for line in lines:
            if line.strip().startswith("```"):
                in_code_block = not in_code_block
                continue

            words = len(line.split())
            tokens_estimate = words * 1.5  # rough estimate

            if in_code_block:
                composition["code_blocks"] += tokens_estimate
            elif any(sig in line for sig in ["Traceback", "Error:", "Warning:", "at "]):
                composition["error_traces"] += tokens_estimate
            else:
                composition["conversation"] += tokens_estimate

    total = sum(composition.values())
    percentages = {k: f"{v/max(total,1)*100:.1f}%" for k, v in composition.items()}

    return {"tokens": composition, "percentages": percentages, "total_estimated": int(total)}

# Example: Analyze a session's context
session_messages = [
    {"role": "user", "content": "Fix the bug in auth.py"},
    {"role": "assistant", "content": "Looking at auth.py...\n```python\ndef login(user, pw):\n    # ... 50 lines of code\n```\nI found the issue."},
    {"role": "user", "content": "That gave an error:\nTraceback (most recent call last):\n  File 'auth.py', line 42\n  ... 20 lines of traceback"},
]
analysis = analyze_context_composition(session_messages)
print(json.dumps(analysis, indent=2))

Step 3: Implement Context Size Controls

MAX_CONTEXT = {
    "code_review": 50_000,
    "chat": 20_000,
    "codebase_analysis": 200_000,
    "quick_question": 5_000,
}

def context_controlled_request(
    request_type: str,
    system: str,
    messages: list,
    model: str = "claude-sonnet-4-6",
) -> dict:
    """Enforce context size limits per request type."""
    max_tokens = MAX_CONTEXT.get(request_type, 50_000)

    count = client.messages.count_tokens(
        model=model, system=system, messages=messages,
    )

    if count.input_tokens > max_tokens:
        # Auto-prune: keep system + last 4 messages
        pruned_messages = messages[-4:]
        count = client.messages.count_tokens(
            model=model, system=system, messages=pruned_messages,
        )

        if count.input_tokens > max_tokens:
            return {"error": f"Context ({count.input_tokens}) exceeds limit ({max_tokens}) even after pruning"}

        messages = pruned_messages

    response = client.messages.create(
        model=model, max_tokens=4096,
        system=system, messages=messages,
    )

    return {
        "content": response.content[0].text,
        "context_tokens": count.input_tokens,
        "limit": max_tokens,
        "utilization": f"{count.input_tokens/max_tokens*100:.1f}%",
    }

The Tradeoffs

Reducing context can remove information Claude needs to give a complete answer. The risk is higher for tasks that require cross-referencing multiple parts of a large document or codebase.

Context size controls may frustrate users who expect Claude to “just know” everything in their project. Communicate context limits clearly and provide escape hatches for tasks that genuinely need large context.

The most expensive requests are often the most valuable — a $5.00 full-codebase analysis that finds a critical bug saves far more than it costs. Focus optimization on repetitive, high-volume requests, not occasional deep analysis.

Implementation Checklist

  1. Add context size logging to every API request
  2. Generate a weekly context size report by request type
  3. Identify the top 3 request types by total context token spend
  4. Set context size limits per request type
  5. Implement auto-pruning for requests that exceed limits
  6. Review context budgets monthly as workload changes

Measuring Impact

The primary metric is average context size per request, tracked weekly. Secondary: total monthly input token spend. Target reducing average context by 50% without increasing quality complaints by more than 5%. Plot context size distribution — if you have a long tail of 200K+ requests, those outliers may represent a disproportionate share of your budget.