Claude Prompt Caching Implementation Tutorial

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Adding prompt caching to a Claude API integration takes roughly 10 lines of code and saves up to 90% on input token costs. A customer support bot spending $4,500/month on Sonnet 4.6 input tokens drops to $455/month after enabling caching on its 50,000-token system prompt.

The Setup

You have an existing Claude API integration processing customer queries. Each request sends a large system prompt (product docs, tone guidelines, response templates) plus the customer message. The system prompt is identical across all requests, but you are paying full input price for it every time.

Your current monthly bill: $4,500 for input tokens alone. After this tutorial, you will pay $455 – a $4,045 monthly reduction. The entire implementation takes under 30 minutes.

The Math

Before caching (Sonnet 4.6, 50K system prompt, 1,000 calls/day):

  50,000 tokens × 1,000 calls × 30 days = 1.5B input tokens/month
  1.5B tokens × $3.00/MTok base input price = $4,500/month

After caching:

  Cache writes bill at 1.25× base ($3.75/MTok); cache reads bill at 0.1× base ($0.30/MTok)
  Steady traffic (one request every ~90 seconds) keeps the 5-minute TTL refreshed, so figure roughly one cache write per day: 1.5M tokens/month × $3.75/MTok ≈ $5/month
  The remaining ~1.5B tokens/month are cache reads: × $0.30/MTok ≈ $450/month
  Total: ≈ $455/month

With concentrated traffic (all 1,000 requests within a few 5-minute windows):

  Every request after the first in each window is a guaranteed cache hit, so bursty traffic does at least as well; the cost floor stays near the $450/month read price.

Savings: $4,045/month (90%)
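The arithmetic above can be reproduced in a few lines. The prices ($3.00/MTok base, 1.25× for writes, 0.1× for reads) and the one-write-per-day assumption are the ones implied by the scenario; treat this as a back-of-envelope model, not a billing guarantee.

```python
# Back-of-envelope cost model for the scenario above.
# Assumptions: $3.00/MTok base input price, cache writes at 1.25x,
# cache reads at 0.1x, ~1 cache write per day (steady traffic keeps
# the 5-minute TTL refreshed).
PROMPT_TOKENS = 50_000
CALLS_PER_DAY = 1_000
DAYS = 30
BASE_PRICE = 3.00 / 1_000_000   # $ per token
WRITE_PRICE = BASE_PRICE * 1.25
READ_PRICE = BASE_PRICE * 0.10

def monthly_cost_uncached() -> float:
    """Every request pays full input price for the 50K prompt."""
    return PROMPT_TOKENS * CALLS_PER_DAY * DAYS * BASE_PRICE

def monthly_cost_cached(writes_per_day: int = 1) -> float:
    """A few writes per month; everything else is a cheap read."""
    writes = PROMPT_TOKENS * writes_per_day * DAYS * WRITE_PRICE
    reads = PROMPT_TOKENS * (CALLS_PER_DAY * DAYS - writes_per_day * DAYS) * READ_PRICE
    return writes + reads

if __name__ == "__main__":
    before = monthly_cost_uncached()
    after = monthly_cost_cached()
    print(f"Before: ${before:,.0f}  After: ${after:,.0f}  Saved: ${before - after:,.0f}")
```

Running it reproduces the headline numbers: $4,500 before, roughly $455 after.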

The Technique

Here is a complete implementation, from zero caching to fully cached, in four steps.

Step 1: Identify cacheable content. Separate your static system prompt from dynamic content.

# BEFORE: No caching
import anthropic

client = anthropic.Anthropic()

def handle_query_no_cache(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system="You are a support agent for Acme Corp...",  # 50K tokens
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

Step 2: Convert system prompt to structured format with cache breakpoint.

# AFTER: With caching
# Load the static prompt once at import time, not on every request
SYSTEM_PROMPT = open("system_prompt.txt").read()

def handle_query_cached(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

The critical change: system goes from a plain string to a list of content blocks, and the block containing your static prompt gets cache_control added.
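If several call sites need the cached format, a small helper keeps the block structure in one place. The function name build_cached_system is my own, not part of the SDK:

```python
def build_cached_system(static_prompt: str) -> list[dict]:
    """Wrap a static prompt in the content-block format the API
    expects, with a cache breakpoint on the block."""
    return [
        {
            "type": "text",
            "text": static_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
```

Pass the result straight to messages.create as system=build_cached_system(static_prompt); the request body is identical to writing the block by hand.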

Step 3: Add monitoring to verify caching is active.

import logging

logger = logging.getLogger("cache_monitor")

def handle_query_monitored(user_message: str) -> str:
    system_prompt = open("system_prompt.txt").read()

    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0)
    cache_write = getattr(usage, "cache_creation_input_tokens", 0)

    if cache_write > 0:
        logger.info(f"CACHE WRITE: {cache_write} tokens")
    elif cache_read > 0:
        logger.info(f"CACHE READ: {cache_read} tokens (saving 90%)")
    else:
        logger.warning("NO CACHING: check prompt size vs minimum threshold")

    return response.content[0].text
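The three logging branches can be factored into a pure function, which makes the cache-status logic unit-testable without hitting the API. classify_cache_status is a hypothetical helper name:

```python
def classify_cache_status(cache_write: int, cache_read: int) -> str:
    """Mirror the monitoring branches: a write means the cache was
    just created, a read means a hit, neither means caching is
    inactive (often a below-minimum prompt)."""
    if cache_write > 0:
        return "write"
    if cache_read > 0:
        return "read"
    return "none"
```

On a healthy deployment you should see "write" on the first request, then "read" on every request that lands within the TTL.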

Step 4: Add multiple breakpoints for layered content.

def handle_query_multi_cache(
    user_message: str,
    product_docs: str,      # 30K tokens, updated weekly
    response_templates: str  # 20K tokens, updated monthly
) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": response_templates,
                "cache_control": {"type": "ephemeral"}  # BP 1: most stable
            },
            {
                "type": "text",
                "text": product_docs,
                "cache_control": {"type": "ephemeral"}  # BP 2: semi-stable
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text
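Breakpoint order matters because caching is prefix-based: a change to any block invalidates that block and everything after it, but not the blocks before it. That is why the most stable content (templates) comes first. A rough sketch of the cost difference, using the same illustrative Sonnet prices as The Math section:

```python
# Cost of a cache re-write after a content update, with two
# breakpoints: templates (20K, stable) before docs (30K, weekly).
WRITE_PRICE = 3.00 * 1.25 / 1_000_000  # $ per token, illustrative

def rewrite_cost(invalidated_tokens: int) -> float:
    """Tokens that must be re-written when a block changes: the
    changed block plus everything after it (prefix caching)."""
    return invalidated_tokens * WRITE_PRICE

# Weekly docs update: the 20K templates prefix survives,
# only the 30K docs block is re-written.
docs_update = rewrite_cost(30_000)
# If docs came first, the same update would invalidate all 50K.
bad_order = rewrite_cost(50_000)
```

The per-update difference is small here, but the principle generalizes: order blocks from most stable to most volatile so updates invalidate as little of the prefix as possible.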

Verify your token counts meet minimums before deploying:

# Quick token count check
python3 -c "
text = open('system_prompt.txt').read()
# Rough estimate: 1 token ~= 4 characters for English text
est_tokens = len(text) / 4
model_mins = {'Opus 4.7': 4096, 'Sonnet 4.6': 1024, 'Haiku 4.5': 4096}
print(f'Estimated tokens: {est_tokens:.0f}')
for m, t in model_mins.items():
    print(f'  {m}: {\"OK\" if est_tokens >= t else \"BELOW MINIMUM\"} (need {t})')
"

The Tradeoffs

This implementation has known limitations:

Minimum cacheable token counts vary by model and can cause silent failures. Opus 4.7 and Haiku 4.5 require at least 4,096 tokens before a cache breakpoint. Sonnet 4.6 requires 1,024 tokens. If your system prompt falls below these thresholds, the cache_control parameter is silently ignored – no error is raised, no cache is created, and you pay full input price on every request. Always verify caching is active by checking that cache_read_input_tokens > 0 on the second request.
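A defensive check at startup catches this silent-failure case before it costs money. The 4-characters-per-token estimate is rough, and the Opus/Haiku model IDs below are placeholders (only the Sonnet ID appears in this tutorial); for exact counts, use the API's token-counting support rather than this heuristic.

```python
# Per-model minimum cacheable token counts, from the tradeoffs above.
# Only the Sonnet model ID is taken from this tutorial; the Opus and
# Haiku IDs are placeholders.
CACHE_MINIMUMS = {
    "claude-sonnet-4-6-20250929": 1024,
    "claude-opus-4-7": 4096,
    "claude-haiku-4-5": 4096,
}

def meets_cache_minimum(text: str, model: str) -> bool:
    """Rough pre-flight check: estimate tokens at ~4 chars each and
    compare against the model's minimum cacheable prefix length."""
    est_tokens = len(text) / 4
    minimum = CACHE_MINIMUMS.get(model, 4096)  # unknown model: assume strict
    return est_tokens >= minimum
```

Wire this into deployment (or a unit test) so a trimmed system prompt fails loudly instead of silently disabling the cache.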

Implementation Checklist

  1. Identify your largest static prompt content (system prompts, docs, examples)
  2. Confirm content exceeds minimum token threshold (1,024 for Sonnet 4.6, 4,096 for Opus 4.7)
  3. Convert system parameter from string to structured content block format
  4. Add cache_control: {"type": "ephemeral"} to each static block
  5. Deploy monitoring that logs cache_read_input_tokens and cache_creation_input_tokens
  6. Verify cache reads are occurring on the second request onward
  7. Compare daily API cost before and after for one week

Measuring Impact

Track these metrics starting from day one: