Smart Context Pruning for Claude API Savings
A Claude Opus 4.7 session with 200K tokens of accumulated context costs $1.00 per interaction. Pruning that context to 80K tokens — by removing diagnostic noise, deduplicating code blocks, and summarizing old turns — drops the cost to $0.40 per interaction. At 100 interactions per day, that saves $1,800/month.
The Setup
Context pruning removes tokens that do not contribute to Claude’s ability to generate a useful response. In a typical Claude Code session, 40-60% of accumulated context consists of superseded code versions, resolved error traces, build logs, and redundant conversation turns.
Pruning differs from truncation. Truncation blindly removes content from the beginning or end. Pruning selectively removes low-value content while preserving high-value content regardless of its position in the conversation.
This guide covers four pruning strategies ordered by impact, with code implementations and cost calculations for each.
The Math
Before pruning (200K context, Opus 4.7):
- Per interaction: 200K * $5.00/MTok = $1.00
- 100 interactions/day: $100/day -> $3,000/month
After Strategy 1 — Noise removal (removes 30%):
- Context: 140K tokens -> $0.70/interaction
- Monthly: $2,100 (30% savings)
After Strategy 2 — Code deduplication (removes additional 15%):
- Context: 110K tokens -> $0.55/interaction
- Monthly: $1,650 (45% savings)
After Strategy 3 — History summarization (removes additional 15%):
- Context: 80K tokens -> $0.40/interaction
- Monthly: $1,200 (60% savings)
After Strategy 4 — Output truncation (removes additional 10%):
- Context: 60K tokens -> $0.30/interaction
- Monthly: $900 (70% savings)
Total savings: $2,100/month (70%)
Scale to 500 interactions/day: savings of $10,500/month.
The Technique
Strategy 1: Remove Diagnostic Noise (Highest Impact)
Build output, deprecation warnings, and verbose error traces accumulate quickly and provide no value once the issue is resolved.
import re
from typing import list
NOISE_PATTERNS = [
r"npm WARN .*",
r"warning:.*deprecated.*",
r"^\s*at\s+\w+.*\(.*:\d+:\d+\)$", # Stack trace frames
r"^\s*at\s+Object\.<anonymous>.*",
r"Compiling\s+\w+.*",
r"Building\s+.*",
r"^\s*\d+\s+(verbose|silly|http)\s+.*", # npm verbose logs
r"DEPRECATION WARNING.*",
r"^\s*~+$", # Underline decorations
]
compiled_patterns = [re.compile(p) for p in NOISE_PATTERNS]
def remove_noise(content: str) -> tuple:
"""Remove diagnostic noise lines from content."""
lines = content.split("\n")
clean_lines = []
removed_count = 0
for line in lines:
is_noise = any(p.search(line) for p in compiled_patterns)
if is_noise:
removed_count += 1
else:
clean_lines.append(line)
if removed_count > 0:
clean_lines.append(f"[{removed_count} diagnostic lines removed]")
return "\n".join(clean_lines), removed_count
def prune_messages_noise(messages: list) -> list:
"""Apply noise removal to all messages."""
total_removed = 0
cleaned = []
for msg in messages:
clean_content, removed = remove_noise(msg["content"])
total_removed += removed
cleaned.append({**msg, "content": clean_content})
return cleaned, total_removed
Strategy 2: Deduplicate Code Blocks
When Claude iterates on code, the context accumulates multiple versions of the same file. Only the latest version matters.
import hashlib
def deduplicate_code_blocks(messages: list) -> list:
"""Keep only the latest version of each code block."""
# First pass: find all code blocks and their positions
code_versions = {} # filename -> list of (msg_index, content_hash, content)
for i, msg in enumerate(messages):
content = msg["content"]
# Extract file references and code blocks
blocks = re.findall(r"(?:File:\s*`?([^`\n]+)`?\n)?```[\w]*\n(.*?)```", content, re.DOTALL)
for filename, code in blocks:
if not filename:
filename = f"unnamed_block_{hashlib.md5(code[:100].encode()).hexdigest()[:6]}"
code_hash = hashlib.md5(code.strip().encode()).hexdigest()
if filename not in code_versions:
code_versions[filename] = []
code_versions[filename].append({"msg_idx": i, "hash": code_hash})
# Identify outdated versions (all except the last occurrence of each file)
outdated_hashes = set()
for filename, versions in code_versions.items():
if len(versions) > 1:
for v in versions[:-1]: # All except last
outdated_hashes.add((v["msg_idx"], v["hash"]))
# Second pass: replace outdated code blocks with references
cleaned = []
tokens_saved = 0
for i, msg in enumerate(messages):
content = msg["content"]
blocks = re.findall(r"```[\w]*\n(.*?)```", content, re.DOTALL)
for code in blocks:
code_hash = hashlib.md5(code.strip().encode()).hexdigest()
if (i, code_hash) in outdated_hashes:
placeholder = "[superseded code version removed]"
content = content.replace(f"```\n{code}```", placeholder, 1)
content = content.replace(f"```python\n{code}```", placeholder, 1)
tokens_saved += len(code) // 4
cleaned.append({**msg, "content": content})
return cleaned, tokens_saved
Strategy 3: Summarize Old Conversation History
import anthropic
client = anthropic.Anthropic()
def summarize_old_turns(
messages: list,
keep_recent: int = 8,
model: str = "claude-haiku-4-5-20251001",
) -> list:
"""Replace old conversation turns with a concise summary."""
if len(messages) <= keep_recent:
return messages
old_messages = messages[:-keep_recent]
recent_messages = messages[-keep_recent:]
# Build summary of old messages
old_text = "\n".join(
f"{m['role']}: {m['content'][:500]}" for m in old_messages
)
summary_resp = client.messages.create(
model=model, # Use Haiku for cheap summarization
max_tokens=800,
system="Summarize this conversation concisely. Preserve: decisions made, files changed, current task state, unresolved items. Max 250 words.",
messages=[{"role": "user", "content": old_text}],
)
summary = summary_resp.content[0].text
return [
{"role": "user", "content": f"[Summary of previous {len(old_messages)} messages: {summary}]"},
{"role": "assistant", "content": "Understood. Continuing with this context."},
] + recent_messages
Strategy 4: Truncate Long Tool Outputs
def truncate_tool_outputs(messages: list, max_output_lines: int = 40) -> list:
"""Truncate verbose tool outputs while keeping key information."""
cleaned = []
for msg in messages:
content = msg["content"]
lines = content.split("\n")
if len(lines) > max_output_lines:
# Keep first 15 and last 15 lines, add truncation marker
head = lines[:15]
tail = lines[-15:]
truncated = len(lines) - 30
content = "\n".join(head + [f"\n[...{truncated} lines truncated...]\n"] + tail)
cleaned.append({**msg, "content": content})
return cleaned
Complete Pruning Pipeline
def full_prune_pipeline(messages: list) -> dict:
"""Apply all pruning strategies in order of impact."""
original_tokens = sum(len(m["content"]) // 4 for m in messages)
# Strategy 1: Noise removal
messages, noise_removed = prune_messages_noise(messages)
# Strategy 2: Code deduplication
messages, code_tokens_saved = deduplicate_code_blocks(messages)
# Strategy 3: Truncate outputs
messages = truncate_tool_outputs(messages)
# Strategy 4: Summarize old turns
messages = summarize_old_turns(messages)
final_tokens = sum(len(m["content"]) // 4 for m in messages)
reduction = (1 - final_tokens / max(original_tokens, 1)) * 100
return {
"messages": messages,
"original_tokens": original_tokens,
"final_tokens": final_tokens,
"reduction": f"{reduction:.1f}%",
"opus_savings_per_request": f"${(original_tokens - final_tokens) * 5 / 1_000_000:.4f}",
}
The Tradeoffs
Pruning is irreversible within a session. If Claude later needs information that was pruned, you must re-provide it. Mitigate by pruning conservatively (start with noise removal only) and expanding pruning strategies gradually.
Summarization via Haiku adds cost ($0.001-$0.01 per summary) and latency (0.5-2 seconds). This is negligible compared to the savings but worth tracking.
Aggressive code deduplication can remove intermediate versions that showed Claude’s reasoning. If debugging Claude’s approach, keep more versions.
Implementation Checklist
- Implement noise removal first (zero risk, highest impact)
- Add code deduplication for coding-heavy sessions
- Implement output truncation at 40 lines
- Add history summarization for sessions longer than 15 turns
- Monitor context size before and after pruning
- Track quality impact on response usefulness
- Tune thresholds based on your specific workflow
Measuring Impact
Measure average context size per interaction with and without pruning. Target a 50-70% reduction. Calculate monthly cost savings using the formula: (original_tokens - pruned_tokens) * interactions * rate / 1,000,000 * 30. Track the re-explanation rate (how often you need to re-provide pruned information) — keep this under 5%.
Related Guides
- Why Is Claude Code Expensive — context as the primary cost driver
- Claude Code Context Window Management Guide — complementary management strategies
- Why Does Anthropic Limit Claude Code Context Window — understanding context design decisions