Before: 2,000-token system prompt on Opus 4.7 Cost per request (system prompt only): 2,000 * $5.00/MTok = $0.010 At 10,000 requests/day: $100/day -> $3,000/month At 50,000 requests/day: $500/day -> $15,000/month After: 500-token system prompt Cost per request: 500 * $5.00/MTok =...

System Prompt Optimization to Cut (2026)

Last updated: April 19, 2026

Your system prompt is the most expensive text in your Claude API pipeline because it ships with every single request. A 2,000-token system prompt across 10,000 daily Opus 4.7 requests costs $100/day in system prompt tokens alone. Compressing it to 500 tokens saves $75/day — $2,250/month — with no change to output quality.

The Setup

System prompts define Claude’s behavior for a session or request type. Developers write them once and rarely revisit them. Over time, they accumulate redundant instructions, verbose examples, and unused constraints.

Because the system prompt is included in every request’s input tokens, even small reductions multiply across your entire request volume. A 100-token reduction saves $0.50 per 1,000 requests on Opus 4.7 — scale that to 300,000 requests/month and you save $150/month from removing just 100 tokens.

This guide shows systematic methods to shrink system prompts while preserving (or improving) output quality.

The Math

Before: 2,000-token system prompt on Opus 4.7

Cost per request (system prompt only): 2,000 * $5.00/MTok = $0.010
At 10,000 requests/day: $100/day -> $3,000/month
At 50,000 requests/day: $500/day -> $15,000/month

After: 500-token system prompt

Cost per request: 500 * $5.00/MTok = $0.0025
At 10,000 requests/day: $25/day -> $750/month
At 50,000 requests/day: $125/day -> $3,750/month

Savings at 10K/day: $2,250/month (75%) Savings at 50K/day: $11,250/month (75%)

Combined with prompt caching (cache read at $0.50/MTok = 90% off):

500-token cached system prompt: $0.00025 per request
At 10,000/day: $2.50/day -> $75/month
Total savings vs uncached 2K prompt: $2,925/month (97.5%)

The Technique

Step 1: Audit Your Current System Prompt

import anthropic
client = anthropic.Anthropic()
CURRENT_SYSTEM = """
You are a highly skilled and experienced software engineer assistant.
Your primary role is to help developers write high-quality code. You should
always follow best practices for the programming language being used. When
reviewing code, you should look for potential bugs, security vulnerabilities,
performance issues, and code style problems. You should provide clear and
actionable feedback. Always explain why something is a problem, not just
what the problem is.
When generating code, follow these rules:
- Use meaningful variable names
- Add comments for complex logic
- Handle errors appropriately
- Write testable code
- Follow the DRY principle
- Keep functions small and focused
Output formatting:
- Use markdown code blocks for code
- Use bullet points for lists of issues
- Start with the most critical issues first
- Include line numbers when referencing existing code
"""
# Count tokens
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=CURRENT_SYSTEM,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Current system prompt tokens: {count.input_tokens}")

Step 2: Apply Compression Rules

# Rule 1: Remove adjectives and filler ("highly skilled", "primary role")
# Rule 2: Merge redundant instructions
# Rule 3: Use shorthand for well-known concepts
# Rule 4: Remove instructions Claude follows by default

OPTIMIZED_SYSTEM = """Senior code reviewer. For each issue found:
- Line number, severity (critical/warning/info), description, fix
Critical first. Use markdown code blocks for code suggestions."""
# This captures the same behavior in ~40 tokens vs ~250
count_opt = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=OPTIMIZED_SYSTEM,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Optimized system prompt tokens: {count_opt.input_tokens}")

Step 3: Validate Quality Parity

import json
test_cases = [
    "Review this Python function:\n\ndef get_user(id):\n    conn = sqlite3.connect('db.sqlite')\n    result = conn.execute(f'SELECT * FROM users WHERE id = {id}')\n    return result.fetchone()",
    "Review this:\n\ndef divide(a, b):\n    return a / b",
    "Review this:\n\ndef process(data):\n    for item in data:\n        try:\n            result = transform(item)\n        except:\n            pass",
]
def compare_quality(system_a: str, system_b: str, test_cases: list, model: str = "claude-sonnet-4-6") -> dict:
    """Compare output quality between two system prompts."""
    results = []
    for case in test_cases:
        resp_a = client.messages.create(
            model=model, max_tokens=1024, system=system_a,
            messages=[{"role": "user", "content": case}],
        )
        resp_b = client.messages.create(
            model=model, max_tokens=1024, system=system_b,
            messages=[{"role": "user", "content": case}],
        )
        results.append({
            "input": case[:50] + "...",
            "original_len": len(resp_a.content[0].text),
            "optimized_len": len(resp_b.content[0].text),
            "original_tokens": resp_a.usage.output_tokens,
            "optimized_tokens": resp_b.usage.output_tokens,
        })
    total_orig = sum(r["original_tokens"] for r in results)
    total_opt = sum(r["optimized_tokens"] for r in results)
    return {
        "test_cases": len(results),
        "original_output_tokens": total_orig,
        "optimized_output_tokens": total_opt,
        "output_reduction": f"{(1 - total_opt/total_orig)*100:.1f}%",
    }
report = compare_quality(CURRENT_SYSTEM, OPTIMIZED_SYSTEM, test_cases)
print(json.dumps(report, indent=2))

Step 4: Add Caching for Maximum Savings

# Cache the optimized system prompt to save an additional 90%
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": OPTIMIZED_SYSTEM,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Review this code: ..."}],
)
# First request: cache write at $6.25/MTok
# Subsequent requests: cache read at $0.50/MTok (90% savings)

The Tradeoffs

The biggest risk is removing instructions that seemed redundant but actually influenced Claude’s behavior. Always validate with a test suite before deploying compressed prompts. Some behaviors (like output formatting preferences) are not default and must remain explicit.

Prompt caching requires minimum 4,096 cacheable tokens for Opus 4.7 and Haiku 4.5. A 500-token system prompt alone may not meet this minimum. Combine it with other cacheable context (few-shot examples, reference documents) to reach the threshold.

The 5-minute cache TTL means you need at least one request every 5 minutes to keep the cache warm. Low-traffic endpoints may not benefit from caching.

Implementation Checklist

Copy your current system prompt and count its tokens
Remove adjectives, filler phrases, and polite language
Merge overlapping instructions into single statements
Remove instructions that describe Claude’s default behavior
Run quality comparison on 50+ test cases
Deploy the compressed prompt with caching enabled
Monitor output quality for one week before considering it stable

Measuring Impact

The primary metric is system prompt token count before and after. Multiply the delta by your daily request volume and model input rate to get daily savings. Secondary metric: output quality score on a held-out test set — this should remain flat or improve (shorter prompts often produce better results because Claude has less conflicting guidance). Track cache hit rate separately if using prompt caching.

Which model? → Take the 5-question quiz in our Model Selector.

Estimate tokens → Calculate your usage with our Token Estimator.

Try it: Estimate your monthly spend with our Cost Calculator.

Claude Code Token Usage Optimization — broader token optimization beyond system prompts
Reduce Claude Code Hallucinations Save Tokens — better prompts reduce both tokens and errors
Claude Skill Token Usage Profiling — profile which prompts cost the most