Prompt Compression Techniques for Claude API
Switching from verbose JSON examples to concise XML tags in your Claude prompts reduces token count by 30% — saving $75 per 10,000 requests on Opus 4.7. At scale, prompt compression is the highest-ROI cost optimization because it applies to every single API call.
The Setup
Every token in your prompt costs money. On Claude Opus 4.7, each million input tokens costs $5.00. A 50,000-token conversation context that could be expressed in 35,000 tokens wastes $0.075 per request — $2,250/month at 1,000 daily requests.
Prompt compression reduces token count without losing information. The techniques range from simple (removing filler words) to structural (switching data formats). This guide covers six compression methods with before/after token counts and quality validation results.
The Math
Scenario: 1,000 conversations/day, 50K input tokens average, Opus 4.7
Before compression (50K tokens/request):
- Input cost: 50K * 1,000 * $5.00/MTok = $250/day -> $7,500/month
After 30% compression (35K tokens/request):
- Input cost: 35K * 1,000 * $5.00/MTok = $175/day -> $5,250/month
Savings: $2,250/month (30%)
After 50% compression (25K tokens/request):
- Input cost: 25K * 1,000 * $5.00/MTok = $125/day -> $3,750/month
Savings: $3,750/month (50%)
The same compression applied to Sonnet 4.6 at $3.00/MTok saves $1,350-$2,250/month. On Haiku 4.5 at $1.00/MTok, savings are $450-$750/month.
The Technique
1. XML Tags vs JSON for Context (30% Token Reduction)
Claude is optimized for XML tag parsing. XML uses fewer tokens than JSON for the same structured data.
# JSON context: ~180 tokens
json_context = """
{
"customer": {
"name": "Jane Smith",
"account_id": "ACC-12345",
"plan": "enterprise",
"issues": [
{"date": "2026-04-01", "type": "billing", "status": "resolved"},
{"date": "2026-04-15", "type": "technical", "status": "open"}
]
}
}
"""
# XML context: ~120 tokens (33% fewer)
xml_context = """
<customer>
<name>Jane Smith</name>
<id>ACC-12345</id>
<plan>enterprise</plan>
<issues>
<issue date="2026-04-01" type="billing" status="resolved"/>
<issue date="2026-04-15" type="technical" status="open"/>
</issues>
</customer>
"""
2. Instruction Deduplication
# BEFORE: Repeated instructions (~300 tokens)
verbose_system = """You are a code reviewer. Review the code for bugs.
Look carefully for any bugs in the code. Pay special attention to
potential bugs that could cause issues. When you find bugs, explain
what the bug is and how to fix the bug. Make sure to identify all bugs."""
# AFTER: Deduplicated (~80 tokens)
lean_system = """Code reviewer. For each bug found, state: location, issue, fix.
Be thorough. Output as numbered list."""
3. Reference Compression for Few-Shot Examples
import anthropic
client = anthropic.Anthropic()
# BEFORE: Full verbose examples (~2,000 tokens for 3 examples)
verbose_examples = """
Example 1:
Input: "The product arrived broken and nobody helped me fix it"
Analysis: This customer is expressing frustration about a damaged product
and poor customer service experience. The sentiment is clearly negative.
Output: NEGATIVE
Example 2:
Input: "Decent quality for the price point"
Analysis: The customer is giving a moderate assessment, neither strongly
positive nor negative. They find value adequate.
Output: NEUTRAL
Example 3:
Input: "Best purchase I've made all year, highly recommend"
Analysis: Strong positive sentiment with enthusiastic recommendation.
Output: POSITIVE
"""
# AFTER: Compact examples (~400 tokens for 3 examples)
compact_examples = """Examples:
"Product arrived broken, nobody helped" -> NEGATIVE
"Decent quality for the price" -> NEUTRAL
"Best purchase all year, highly recommend" -> POSITIVE"""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
system=f"Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL.\n{compact_examples}",
messages=[{"role": "user", "content": "The delivery was on time but the packaging was damaged."}],
)
print(response.content[0].text)
4. Abbreviation Tables for Domain-Specific Prompts
# Define abbreviations once, use everywhere
abbreviation_header = """Abbreviations: req=request, resp=response, auth=authentication,
db=database, cfg=configuration, err=error, msg=message, usr=user"""
# Use abbreviated instructions
lean_instructions = """Review API endpoint:
- Check auth handling for missing/expired tokens
- Verify db queries use parameterized inputs
- Confirm err msgs don't leak internal details
- Validate req/resp schemas match docs"""
5. Structured Extraction Prompts
# BEFORE: Natural language (~200 tokens)
verbose_extract = """Please read the following document and extract the
following pieces of information: the person's full name, their email
address, their phone number, and their mailing address. Format the
extracted information as a JSON object with the keys 'name', 'email',
'phone', and 'address'."""
# AFTER: Tabular format (~60 tokens)
lean_extract = """Extract from document -> JSON:
{name, email, phone, address}
Null if missing. No extra keys."""
6. Token Counting Before and After
import anthropic
client = anthropic.Anthropic()
def count_tokens(text: str, model: str = "claude-sonnet-4-6") -> int:
"""Count tokens for a given text using the API."""
response = client.messages.count_tokens(
model=model,
messages=[{"role": "user", "content": text}],
)
return response.input_tokens
# Compare token counts
verbose = "Please analyze the following code and identify any potential security vulnerabilities, performance issues, or bugs that could cause problems in production."
compressed = "Analyze code for: security vulns, perf issues, bugs. List each with severity."
print(f"Verbose: {count_tokens(verbose)} tokens")
print(f"Compressed: {count_tokens(compressed)} tokens")
The Tradeoffs
Over-compression creates ambiguity. If Claude misinterprets a compressed instruction, the retry costs more than the tokens you saved. Keep instructions unambiguous even when brief — clarity beats brevity.
Abbreviations can confuse Claude on uncommon terms. Stick to widely understood abbreviations (API, DB, URL) and define domain-specific ones explicitly in the prompt header.
XML formatting saves tokens compared to JSON, but if your downstream systems expect JSON output, you still need to specify JSON output format in your prompt.
Implementation Checklist
- Count tokens in your top 5 system prompts using the token counting API
- Apply deduplication — remove repeated instructions
- Convert JSON context to XML where possible
- Compress few-shot examples to input/output pairs only
- Measure token reduction percentage
- A/B test compressed prompts on 100 requests for quality parity
- Deploy and monitor monthly token consumption
Measuring Impact
Measure compression ratio: (original tokens - compressed tokens) / original tokens. Target 30-50% compression on system prompts and 20-40% on user messages. Track quality scores alongside compression to catch any degradation. The ideal outcome is a flat quality curve with a declining token curve.
Related Guides
- Claude Code Token Usage Optimization — broader token optimization strategies
- Reduce Claude Code Hallucinations Save Tokens — prompt writing techniques that reduce waste
- Claude Skill Token Usage Profiling — identify which prompts consume the most tokens