System Prompt Optimization to Cut Claude Costs

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Your system prompt is the most expensive text in your Claude API pipeline because it ships with every single request. A 2,000-token system prompt across 10,000 daily Opus 4.7 requests costs $100/day in system prompt tokens alone. Compressing it to 500 tokens saves $75/day — $2,250/month — with no change to output quality.

The Setup

System prompts define Claude’s behavior for a session or request type. Developers write them once and rarely revisit them. Over time, they accumulate redundant instructions, verbose examples, and unused constraints.

Because the system prompt is included in every request’s input tokens, even small reductions multiply across your entire request volume. A 100-token reduction saves $0.50 per 1,000 requests on Opus 4.7 — scale that to 300,000 requests/month and you save $150/month from removing just 100 tokens.

This guide shows systematic methods to shrink system prompts while preserving (or improving) output quality.

The Math

Before: 2,000-token system prompt on Opus 4.7

After: 500-token system prompt

Savings at 10K/day: $2,250/month (75%) Savings at 50K/day: $11,250/month (75%)

Combined with prompt caching (cache read at $0.50/MTok = 90% off):

The Technique

Step 1: Audit Your Current System Prompt

import anthropic

client = anthropic.Anthropic()

CURRENT_SYSTEM = """
You are a highly skilled and experienced software engineer assistant.
Your primary role is to help developers write high-quality code. You should
always follow best practices for the programming language being used. When
reviewing code, you should look for potential bugs, security vulnerabilities,
performance issues, and code style problems. You should provide clear and
actionable feedback. Always explain why something is a problem, not just
what the problem is.

When generating code, follow these rules:
- Use meaningful variable names
- Add comments for complex logic
- Handle errors appropriately
- Write testable code
- Follow the DRY principle
- Keep functions small and focused

Output formatting:
- Use markdown code blocks for code
- Use bullet points for lists of issues
- Start with the most critical issues first
- Include line numbers when referencing existing code
"""

# Count tokens
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=CURRENT_SYSTEM,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Current system prompt tokens: {count.input_tokens}")

Step 2: Apply Compression Rules

# Rule 1: Remove adjectives and filler ("highly skilled", "primary role")
# Rule 2: Merge redundant instructions
# Rule 3: Use shorthand for well-known concepts
# Rule 4: Remove instructions Claude follows by default

OPTIMIZED_SYSTEM = """Senior code reviewer. For each issue found:
- Line number, severity (critical/warning/info), description, fix
Critical first. Use markdown code blocks for code suggestions."""

# This captures the same behavior in ~40 tokens vs ~250
count_opt = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=OPTIMIZED_SYSTEM,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Optimized system prompt tokens: {count_opt.input_tokens}")

Step 3: Validate Quality Parity

import json

test_cases = [
    "Review this Python function:\n\ndef get_user(id):\n    conn = sqlite3.connect('db.sqlite')\n    result = conn.execute(f'SELECT * FROM users WHERE id = {id}')\n    return result.fetchone()",
    "Review this:\n\ndef divide(a, b):\n    return a / b",
    "Review this:\n\ndef process(data):\n    for item in data:\n        try:\n            result = transform(item)\n        except:\n            pass",
]

def compare_quality(system_a: str, system_b: str, test_cases: list, model: str = "claude-sonnet-4-6") -> dict:
    """Compare output quality between two system prompts."""
    results = []
    for case in test_cases:
        resp_a = client.messages.create(
            model=model, max_tokens=1024, system=system_a,
            messages=[{"role": "user", "content": case}],
        )
        resp_b = client.messages.create(
            model=model, max_tokens=1024, system=system_b,
            messages=[{"role": "user", "content": case}],
        )
        results.append({
            "input": case[:50] + "...",
            "original_len": len(resp_a.content[0].text),
            "optimized_len": len(resp_b.content[0].text),
            "original_tokens": resp_a.usage.output_tokens,
            "optimized_tokens": resp_b.usage.output_tokens,
        })

    total_orig = sum(r["original_tokens"] for r in results)
    total_opt = sum(r["optimized_tokens"] for r in results)
    return {
        "test_cases": len(results),
        "original_output_tokens": total_orig,
        "optimized_output_tokens": total_opt,
        "output_reduction": f"{(1 - total_opt/total_orig)*100:.1f}%",
    }

report = compare_quality(CURRENT_SYSTEM, OPTIMIZED_SYSTEM, test_cases)
print(json.dumps(report, indent=2))

Step 4: Add Caching for Maximum Savings

# Cache the optimized system prompt to save an additional 90%
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": OPTIMIZED_SYSTEM,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Review this code: ..."}],
)
# First request: cache write at $6.25/MTok
# Subsequent requests: cache read at $0.50/MTok (90% savings)

The Tradeoffs

The biggest risk is removing instructions that seemed redundant but actually influenced Claude’s behavior. Always validate with a test suite before deploying compressed prompts. Some behaviors (like output formatting preferences) are not default and must remain explicit.

Prompt caching requires minimum 4,096 cacheable tokens for Opus 4.7 and Haiku 4.5. A 500-token system prompt alone may not meet this minimum. Combine it with other cacheable context (few-shot examples, reference documents) to reach the threshold.

The 5-minute cache TTL means you need at least one request every 5 minutes to keep the cache warm. Low-traffic endpoints may not benefit from caching.

Implementation Checklist

  1. Copy your current system prompt and count its tokens
  2. Remove adjectives, filler phrases, and polite language
  3. Merge overlapping instructions into single statements
  4. Remove instructions that describe Claude’s default behavior
  5. Run quality comparison on 50+ test cases
  6. Deploy the compressed prompt with caching enabled
  7. Monitor output quality for one week before considering it stable

Measuring Impact

The primary metric is system prompt token count before and after. Multiply the delta by your daily request volume and model input rate to get daily savings. Secondary metric: output quality score on a held-out test set — this should remain flat or improve (shorter prompts often produce better results because Claude has less conflicting guidance). Track cache hit rate separately if using prompt caching.