System Prompt Optimization to Cut Claude Costs
Your system prompt is the most expensive text in your Claude API pipeline because it ships with every single request. A 2,000-token system prompt across 10,000 daily Opus 4.7 requests costs $100/day in system prompt tokens alone. Compressing it to 500 tokens saves $75/day — $2,250/month — with no change to output quality.
The Setup
System prompts define Claude’s behavior for a session or request type. Developers write them once and rarely revisit them. Over time, they accumulate redundant instructions, verbose examples, and unused constraints.
Because the system prompt is included in every request’s input tokens, even small reductions multiply across your entire request volume. A 100-token reduction saves $0.50 per 1,000 requests on Opus 4.7 — scale that to 300,000 requests/month and you save $150/month from removing just 100 tokens.
This guide shows systematic methods to shrink system prompts while preserving (or improving) output quality.
The Math
Before: 2,000-token system prompt on Opus 4.7
- Cost per request (system prompt only): 2,000 * $5.00/MTok = $0.010
- At 10,000 requests/day: $100/day -> $3,000/month
- At 50,000 requests/day: $500/day -> $15,000/month
After: 500-token system prompt
- Cost per request: 500 * $5.00/MTok = $0.0025
- At 10,000 requests/day: $25/day -> $750/month
- At 50,000 requests/day: $125/day -> $3,750/month
Savings at 10K/day: $2,250/month (75%) Savings at 50K/day: $11,250/month (75%)
Combined with prompt caching (cache read at $0.50/MTok = 90% off):
- 500-token cached system prompt: $0.00025 per request
- At 10,000/day: $2.50/day -> $75/month
- Total savings vs uncached 2K prompt: $2,925/month (97.5%)
The Technique
Step 1: Audit Your Current System Prompt
import anthropic
client = anthropic.Anthropic()
CURRENT_SYSTEM = """
You are a highly skilled and experienced software engineer assistant.
Your primary role is to help developers write high-quality code. You should
always follow best practices for the programming language being used. When
reviewing code, you should look for potential bugs, security vulnerabilities,
performance issues, and code style problems. You should provide clear and
actionable feedback. Always explain why something is a problem, not just
what the problem is.
When generating code, follow these rules:
- Use meaningful variable names
- Add comments for complex logic
- Handle errors appropriately
- Write testable code
- Follow the DRY principle
- Keep functions small and focused
Output formatting:
- Use markdown code blocks for code
- Use bullet points for lists of issues
- Start with the most critical issues first
- Include line numbers when referencing existing code
"""
# Count tokens
count = client.messages.count_tokens(
model="claude-sonnet-4-6",
system=CURRENT_SYSTEM,
messages=[{"role": "user", "content": "test"}],
)
print(f"Current system prompt tokens: {count.input_tokens}")
Step 2: Apply Compression Rules
# Rule 1: Remove adjectives and filler ("highly skilled", "primary role")
# Rule 2: Merge redundant instructions
# Rule 3: Use shorthand for well-known concepts
# Rule 4: Remove instructions Claude follows by default
OPTIMIZED_SYSTEM = """Senior code reviewer. For each issue found:
- Line number, severity (critical/warning/info), description, fix
Critical first. Use markdown code blocks for code suggestions."""
# This captures the same behavior in ~40 tokens vs ~250
count_opt = client.messages.count_tokens(
model="claude-sonnet-4-6",
system=OPTIMIZED_SYSTEM,
messages=[{"role": "user", "content": "test"}],
)
print(f"Optimized system prompt tokens: {count_opt.input_tokens}")
Step 3: Validate Quality Parity
import json
test_cases = [
"Review this Python function:\n\ndef get_user(id):\n conn = sqlite3.connect('db.sqlite')\n result = conn.execute(f'SELECT * FROM users WHERE id = {id}')\n return result.fetchone()",
"Review this:\n\ndef divide(a, b):\n return a / b",
"Review this:\n\ndef process(data):\n for item in data:\n try:\n result = transform(item)\n except:\n pass",
]
def compare_quality(system_a: str, system_b: str, test_cases: list, model: str = "claude-sonnet-4-6") -> dict:
"""Compare output quality between two system prompts."""
results = []
for case in test_cases:
resp_a = client.messages.create(
model=model, max_tokens=1024, system=system_a,
messages=[{"role": "user", "content": case}],
)
resp_b = client.messages.create(
model=model, max_tokens=1024, system=system_b,
messages=[{"role": "user", "content": case}],
)
results.append({
"input": case[:50] + "...",
"original_len": len(resp_a.content[0].text),
"optimized_len": len(resp_b.content[0].text),
"original_tokens": resp_a.usage.output_tokens,
"optimized_tokens": resp_b.usage.output_tokens,
})
total_orig = sum(r["original_tokens"] for r in results)
total_opt = sum(r["optimized_tokens"] for r in results)
return {
"test_cases": len(results),
"original_output_tokens": total_orig,
"optimized_output_tokens": total_opt,
"output_reduction": f"{(1 - total_opt/total_orig)*100:.1f}%",
}
report = compare_quality(CURRENT_SYSTEM, OPTIMIZED_SYSTEM, test_cases)
print(json.dumps(report, indent=2))
Step 4: Add Caching for Maximum Savings
# Cache the optimized system prompt to save an additional 90%
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[{
"type": "text",
"text": OPTIMIZED_SYSTEM,
"cache_control": {"type": "ephemeral"},
}],
messages=[{"role": "user", "content": "Review this code: ..."}],
)
# First request: cache write at $6.25/MTok
# Subsequent requests: cache read at $0.50/MTok (90% savings)
The Tradeoffs
The biggest risk is removing instructions that seemed redundant but actually influenced Claude’s behavior. Always validate with a test suite before deploying compressed prompts. Some behaviors (like output formatting preferences) are not default and must remain explicit.
Prompt caching requires minimum 4,096 cacheable tokens for Opus 4.7 and Haiku 4.5. A 500-token system prompt alone may not meet this minimum. Combine it with other cacheable context (few-shot examples, reference documents) to reach the threshold.
The 5-minute cache TTL means you need at least one request every 5 minutes to keep the cache warm. Low-traffic endpoints may not benefit from caching.
Implementation Checklist
- Copy your current system prompt and count its tokens
- Remove adjectives, filler phrases, and polite language
- Merge overlapping instructions into single statements
- Remove instructions that describe Claude’s default behavior
- Run quality comparison on 50+ test cases
- Deploy the compressed prompt with caching enabled
- Monitor output quality for one week before considering it stable
Measuring Impact
The primary metric is system prompt token count before and after. Multiply the delta by your daily request volume and model input rate to get daily savings. Secondary metric: output quality score on a held-out test set — this should remain flat or improve (shorter prompts often produce better results because Claude has less conflicting guidance). Track cache hit rate separately if using prompt caching.
Related Guides
- Claude Code Token Usage Optimization — broader token optimization beyond system prompts
- Reduce Claude Code Hallucinations Save Tokens — better prompts reduce both tokens and errors
- Claude Skill Token Usage Profiling — profile which prompts cost the most