Smart Context Pruning for Claude API Savings

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

A Claude Opus 4.7 session with 200K tokens of accumulated context costs $1.00 per interaction. Pruning that context to 80K tokens — by removing diagnostic noise, deduplicating code blocks, and summarizing old turns — drops the cost to $0.40 per interaction. At 100 interactions per day, that saves $1,800/month.

The Setup

Context pruning removes tokens that do not contribute to Claude’s ability to generate a useful response. In a typical Claude Code session, 40-60% of accumulated context consists of superseded code versions, resolved error traces, build logs, and redundant conversation turns.

Pruning differs from truncation. Truncation blindly removes content from the beginning or end. Pruning selectively removes low-value content while preserving high-value content regardless of its position in the conversation.

This guide covers four pruning strategies ordered by impact, with code implementations and cost calculations for each.

The Math

Before pruning (200K context, Opus 4.7):

After Strategy 1 — Noise removal (removes 30%):

After Strategy 2 — Code deduplication (removes additional 15%):

After Strategy 3 — History summarization (removes additional 15%):

After Strategy 4 — Output truncation (removes additional 10%):

Total savings: $2,100/month (70%)

Scale to 500 interactions/day: savings of $10,500/month.

The Technique

Strategy 1: Remove Diagnostic Noise (Highest Impact)

Build output, deprecation warnings, and verbose error traces accumulate quickly and provide no value once the issue is resolved.

import re
from typing import list

NOISE_PATTERNS = [
    r"npm WARN .*",
    r"warning:.*deprecated.*",
    r"^\s*at\s+\w+.*\(.*:\d+:\d+\)$",  # Stack trace frames
    r"^\s*at\s+Object\.<anonymous>.*",
    r"Compiling\s+\w+.*",
    r"Building\s+.*",
    r"^\s*\d+\s+(verbose|silly|http)\s+.*",  # npm verbose logs
    r"DEPRECATION WARNING.*",
    r"^\s*~+$",  # Underline decorations
]

compiled_patterns = [re.compile(p) for p in NOISE_PATTERNS]

def remove_noise(content: str) -> tuple:
    """Remove diagnostic noise lines from content."""
    lines = content.split("\n")
    clean_lines = []
    removed_count = 0

    for line in lines:
        is_noise = any(p.search(line) for p in compiled_patterns)
        if is_noise:
            removed_count += 1
        else:
            clean_lines.append(line)

    if removed_count > 0:
        clean_lines.append(f"[{removed_count} diagnostic lines removed]")

    return "\n".join(clean_lines), removed_count

def prune_messages_noise(messages: list) -> list:
    """Apply noise removal to all messages."""
    total_removed = 0
    cleaned = []
    for msg in messages:
        clean_content, removed = remove_noise(msg["content"])
        total_removed += removed
        cleaned.append({**msg, "content": clean_content})
    return cleaned, total_removed

Strategy 2: Deduplicate Code Blocks

When Claude iterates on code, the context accumulates multiple versions of the same file. Only the latest version matters.

import hashlib

def deduplicate_code_blocks(messages: list) -> list:
    """Keep only the latest version of each code block."""
    # First pass: find all code blocks and their positions
    code_versions = {}  # filename -> list of (msg_index, content_hash, content)

    for i, msg in enumerate(messages):
        content = msg["content"]
        # Extract file references and code blocks
        blocks = re.findall(r"(?:File:\s*`?([^`\n]+)`?\n)?```[\w]*\n(.*?)```", content, re.DOTALL)

        for filename, code in blocks:
            if not filename:
                filename = f"unnamed_block_{hashlib.md5(code[:100].encode()).hexdigest()[:6]}"
            code_hash = hashlib.md5(code.strip().encode()).hexdigest()

            if filename not in code_versions:
                code_versions[filename] = []
            code_versions[filename].append({"msg_idx": i, "hash": code_hash})

    # Identify outdated versions (all except the last occurrence of each file)
    outdated_hashes = set()
    for filename, versions in code_versions.items():
        if len(versions) > 1:
            for v in versions[:-1]:  # All except last
                outdated_hashes.add((v["msg_idx"], v["hash"]))

    # Second pass: replace outdated code blocks with references
    cleaned = []
    tokens_saved = 0
    for i, msg in enumerate(messages):
        content = msg["content"]
        blocks = re.findall(r"```[\w]*\n(.*?)```", content, re.DOTALL)

        for code in blocks:
            code_hash = hashlib.md5(code.strip().encode()).hexdigest()
            if (i, code_hash) in outdated_hashes:
                placeholder = "[superseded code version removed]"
                content = content.replace(f"```\n{code}```", placeholder, 1)
                content = content.replace(f"```python\n{code}```", placeholder, 1)
                tokens_saved += len(code) // 4

        cleaned.append({**msg, "content": content})

    return cleaned, tokens_saved

Strategy 3: Summarize Old Conversation History

import anthropic

client = anthropic.Anthropic()

def summarize_old_turns(
    messages: list,
    keep_recent: int = 8,
    model: str = "claude-haiku-4-5-20251001",
) -> list:
    """Replace old conversation turns with a concise summary."""
    if len(messages) <= keep_recent:
        return messages

    old_messages = messages[:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Build summary of old messages
    old_text = "\n".join(
        f"{m['role']}: {m['content'][:500]}" for m in old_messages
    )

    summary_resp = client.messages.create(
        model=model,  # Use Haiku for cheap summarization
        max_tokens=800,
        system="Summarize this conversation concisely. Preserve: decisions made, files changed, current task state, unresolved items. Max 250 words.",
        messages=[{"role": "user", "content": old_text}],
    )

    summary = summary_resp.content[0].text

    return [
        {"role": "user", "content": f"[Summary of previous {len(old_messages)} messages: {summary}]"},
        {"role": "assistant", "content": "Understood. Continuing with this context."},
    ] + recent_messages

Strategy 4: Truncate Long Tool Outputs

def truncate_tool_outputs(messages: list, max_output_lines: int = 40) -> list:
    """Truncate verbose tool outputs while keeping key information."""
    cleaned = []
    for msg in messages:
        content = msg["content"]
        lines = content.split("\n")

        if len(lines) > max_output_lines:
            # Keep first 15 and last 15 lines, add truncation marker
            head = lines[:15]
            tail = lines[-15:]
            truncated = len(lines) - 30
            content = "\n".join(head + [f"\n[...{truncated} lines truncated...]\n"] + tail)

        cleaned.append({**msg, "content": content})
    return cleaned

Complete Pruning Pipeline

def full_prune_pipeline(messages: list) -> dict:
    """Apply all pruning strategies in order of impact."""
    original_tokens = sum(len(m["content"]) // 4 for m in messages)

    # Strategy 1: Noise removal
    messages, noise_removed = prune_messages_noise(messages)

    # Strategy 2: Code deduplication
    messages, code_tokens_saved = deduplicate_code_blocks(messages)

    # Strategy 3: Truncate outputs
    messages = truncate_tool_outputs(messages)

    # Strategy 4: Summarize old turns
    messages = summarize_old_turns(messages)

    final_tokens = sum(len(m["content"]) // 4 for m in messages)
    reduction = (1 - final_tokens / max(original_tokens, 1)) * 100

    return {
        "messages": messages,
        "original_tokens": original_tokens,
        "final_tokens": final_tokens,
        "reduction": f"{reduction:.1f}%",
        "opus_savings_per_request": f"${(original_tokens - final_tokens) * 5 / 1_000_000:.4f}",
    }

The Tradeoffs

Pruning is irreversible within a session. If Claude later needs information that was pruned, you must re-provide it. Mitigate by pruning conservatively (start with noise removal only) and expanding pruning strategies gradually.

Summarization via Haiku adds cost ($0.001-$0.01 per summary) and latency (0.5-2 seconds). This is negligible compared to the savings but worth tracking.

Aggressive code deduplication can remove intermediate versions that showed Claude’s reasoning. If debugging Claude’s approach, keep more versions.

Implementation Checklist

  1. Implement noise removal first (zero risk, highest impact)
  2. Add code deduplication for coding-heavy sessions
  3. Implement output truncation at 40 lines
  4. Add history summarization for sessions longer than 15 turns
  5. Monitor context size before and after pruning
  6. Track quality impact on response usefulness
  7. Tune thresholds based on your specific workflow

Measuring Impact

Measure average context size per interaction with and without pruning. Target a 50-70% reduction. Calculate monthly cost savings using the formula: (original_tokens - pruned_tokens) * interactions * rate / 1,000,000 * 30. Track the re-explanation rate (how often you need to re-provide pruned information) — keep this under 5%.