Claude Prompt Caching Implementation Tutorial

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Adding prompt caching to a Claude API integration takes roughly 10 lines of code and saves up to 90% on input token costs. A customer support bot spending $4,500/month on Sonnet 4.6 input tokens drops to $455/month after enabling caching on its 50,000-token system prompt.

The Setup

You have an existing Claude API integration processing customer queries. Each request sends a large system prompt (product docs, tone guidelines, response templates) plus the customer message. The system prompt is identical across all requests, but you are paying full input price for it every time.

Your current monthly bill: $4,500 for input tokens alone. After this tutorial, you will pay $455 – a $4,045 monthly reduction. The entire implementation takes under 30 minutes.

The Math

Before caching (Sonnet 4.6, 50K system prompt, 1,000 calls/day):

  50,000 tokens × 1,000 calls × 30 days = 1.5B input tokens/month
  1.5B tokens × $3.00/MTok base input price = $4,500/month

After caching:

  Cache writes bill at 1.25× base ($3.75/MTok); cache reads bill at 0.1× base ($0.30/MTok)
  Steady traffic (one request every ~90 seconds) keeps the 5-minute TTL refreshed, so figure roughly one cache write per day: 1.5M tokens/month × $3.75/MTok ≈ $5/month
  The remaining ~1.5B tokens/month are cache reads: × $0.30/MTok ≈ $450/month
  Total: ≈ $455/month

With concentrated traffic (all 1,000 requests within a few 5-minute windows):

  Every request after the first in each window is a guaranteed cache hit, so bursty traffic does at least as well; the cost floor stays near the $450/month read price.

Savings: $4,045/month (90%)
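The arithmetic above can be reproduced in a few lines. The prices ($3.00/MTok base, 1.25× for writes, 0.1× for reads) and the one-write-per-day assumption are the ones implied by the scenario; treat this as a back-of-envelope model, not a billing guarantee.

```python
# Back-of-envelope cost model for the scenario above.
# Assumptions: $3.00/MTok base input price, cache writes at 1.25x,
# cache reads at 0.1x, ~1 cache write per day (steady traffic keeps
# the 5-minute TTL refreshed).
PROMPT_TOKENS = 50_000
CALLS_PER_DAY = 1_000
DAYS = 30
BASE_PRICE = 3.00 / 1_000_000   # $ per token
WRITE_PRICE = BASE_PRICE * 1.25
READ_PRICE = BASE_PRICE * 0.10

def monthly_cost_uncached() -> float:
    """Every request pays full input price for the 50K prompt."""
    return PROMPT_TOKENS * CALLS_PER_DAY * DAYS * BASE_PRICE

def monthly_cost_cached(writes_per_day: int = 1) -> float:
    """A few writes per month; everything else is a cheap read."""
    writes = PROMPT_TOKENS * writes_per_day * DAYS * WRITE_PRICE
    reads = PROMPT_TOKENS * (CALLS_PER_DAY * DAYS - writes_per_day * DAYS) * READ_PRICE
    return writes + reads

if __name__ == "__main__":
    before = monthly_cost_uncached()
    after = monthly_cost_cached()
    print(f"Before: ${before:,.0f}  After: ${after:,.0f}  Saved: ${before - after:,.0f}")
```

Running it reproduces the headline numbers: $4,500 before, roughly $455 after.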

The Technique

Here is a complete implementation, from zero caching to fully cached, in four steps.

Step 1: Identify cacheable content. Separate your static system prompt from dynamic content.

# BEFORE: No caching
import anthropic

client = anthropic.Anthropic()

def handle_query_no_cache(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system="You are a support agent for Acme Corp...",  # 50K tokens
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

Step 2: Convert system prompt to structured format with cache breakpoint.

# AFTER: With caching
# Load the static prompt once at import time, not on every request
SYSTEM_PROMPT = open("system_prompt.txt").read()

def handle_query_cached(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

The critical change: system goes from a plain string to a list of content blocks, and the block containing your static prompt gets cache_control added.
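If several call sites need the cached format, a small helper keeps the block structure in one place. The function name build_cached_system is my own, not part of the SDK:

```python
def build_cached_system(static_prompt: str) -> list[dict]:
    """Wrap a static prompt in the content-block format the API
    expects, with a cache breakpoint on the block."""
    return [
        {
            "type": "text",
            "text": static_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
```

Pass the result straight to messages.create as system=build_cached_system(static_prompt); the request body is identical to writing the block by hand.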

Step 3: Add monitoring to verify caching is active.

import logging

logger = logging.getLogger("cache_monitor")

def handle_query_monitored(user_message: str) -> str:
    system_prompt = open("system_prompt.txt").read()

    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0)
    cache_write = getattr(usage, "cache_creation_input_tokens", 0)

    if cache_write > 0:
        logger.info(f"CACHE WRITE: {cache_write} tokens")
    elif cache_read > 0:
        logger.info(f"CACHE READ: {cache_read} tokens (saving 90%)")
    else:
        logger.warning("NO CACHING: check prompt size vs minimum threshold")

    return response.content[0].text
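The three logging branches can be factored into a pure function, which makes the cache-status logic unit-testable without hitting the API. classify_cache_status is a hypothetical helper name:

```python
def classify_cache_status(cache_write: int, cache_read: int) -> str:
    """Mirror the monitoring branches: a write means the cache was
    just created, a read means a hit, neither means caching is
    inactive (often a below-minimum prompt)."""
    if cache_write > 0:
        return "write"
    if cache_read > 0:
        return "read"
    return "none"
```

On a healthy deployment you should see "write" on the first request, then "read" on every request that lands within the TTL.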

Step 4: Add multiple breakpoints for layered content.

def handle_query_multi_cache(
    user_message: str,
    product_docs: str,      # 30K tokens, updated weekly
    response_templates: str  # 20K tokens, updated monthly
) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": response_templates,
                "cache_control": {"type": "ephemeral"}  # BP 1: most stable
            },
            {
                "type": "text",
                "text": product_docs,
                "cache_control": {"type": "ephemeral"}  # BP 2: semi-stable
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text
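Breakpoint order matters because caching is prefix-based: a change to any block invalidates that block and everything after it, but not the blocks before it. That is why the most stable content (templates) comes first. A rough sketch of the cost difference, using the same illustrative Sonnet prices as The Math section:

```python
# Cost of a cache re-write after a content update, with two
# breakpoints: templates (20K, stable) before docs (30K, weekly).
WRITE_PRICE = 3.00 * 1.25 / 1_000_000  # $ per token, illustrative

def rewrite_cost(invalidated_tokens: int) -> float:
    """Tokens that must be re-written when a block changes: the
    changed block plus everything after it (prefix caching)."""
    return invalidated_tokens * WRITE_PRICE

# Weekly docs update: the 20K templates prefix survives,
# only the 30K docs block is re-written.
docs_update = rewrite_cost(30_000)
# If docs came first, the same update would invalidate all 50K.
bad_order = rewrite_cost(50_000)
```

The per-update difference is small here, but the principle generalizes: order blocks from most stable to most volatile so updates invalidate as little of the prefix as possible.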

Verify your token counts meet minimums before deploying:

# Quick token count check
python3 -c "
text = open('system_prompt.txt').read()
# Rough estimate: 1 token ~= 4 characters for English text
est_tokens = len(text) / 4
model_mins = {'Opus 4.7': 4096, 'Sonnet 4.6': 1024, 'Haiku 4.5': 4096}
print(f'Estimated tokens: {est_tokens:.0f}')
for m, t in model_mins.items():
    print(f'  {m}: {\"OK\" if est_tokens >= t else \"BELOW MINIMUM\"} (need {t})')
"

The Tradeoffs

This implementation has known limitations:

Minimum cacheable token counts vary by model and can cause silent failures. Opus 4.7 and Haiku 4.5 require at least 4,096 tokens before a cache breakpoint. Sonnet 4.6 requires 1,024 tokens. If your system prompt falls below these thresholds, the cache_control parameter is silently ignored – no error is raised, no cache is created, and you pay full input price on every request. Always verify caching is active by checking that cache_read_input_tokens > 0 on the second request.
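A defensive check at startup catches this silent-failure case before it costs money. The 4-characters-per-token estimate is rough, and the Opus/Haiku model IDs below are placeholders (only the Sonnet ID appears in this tutorial); for exact counts, use the API's token-counting support rather than this heuristic.

```python
# Per-model minimum cacheable token counts, from the tradeoffs above.
# Only the Sonnet model ID is taken from this tutorial; the Opus and
# Haiku IDs are placeholders.
CACHE_MINIMUMS = {
    "claude-sonnet-4-6-20250929": 1024,
    "claude-opus-4-7": 4096,
    "claude-haiku-4-5": 4096,
}

def meets_cache_minimum(text: str, model: str) -> bool:
    """Rough pre-flight check: estimate tokens at ~4 chars each and
    compare against the model's minimum cacheable prefix length."""
    est_tokens = len(text) / 4
    minimum = CACHE_MINIMUMS.get(model, 4096)  # unknown model: assume strict
    return est_tokens >= minimum
```

Wire this into deployment (or a unit test) so a trimmed system prompt fails loudly instead of silently disabling the cache.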

Implementation Checklist

  1. Identify your largest static prompt content (system prompts, docs, examples)
  2. Confirm content exceeds minimum token threshold (1,024 for Sonnet 4.6, 4,096 for Opus 4.7)
  3. Convert system parameter from string to structured content block format
  4. Add cache_control: {"type": "ephemeral"} to each static block
  5. Deploy monitoring that logs cache_read_input_tokens and cache_creation_input_tokens
  6. Verify cache reads are occurring on the second request onward
  7. Compare daily API cost before and after for one week

Measuring Impact

Track these metrics starting from day one: