Claude Prompt Caching Implementation Tutorial
Adding prompt caching to a Claude API integration takes roughly 10 lines of code and saves up to 90% on input token costs. A customer support bot spending $4,500/month on Sonnet 4.6 input tokens drops to $455/month after enabling caching on its 50,000-token system prompt.
The Setup
You have an existing Claude API integration processing customer queries. Each request sends a large system prompt (product docs, tone guidelines, response templates) plus the customer message. The system prompt is identical across all requests, but you are paying full input price for it every time.
Your current monthly bill: $4,500 for input tokens alone. After this tutorial, you will pay $455 – a $4,045 monthly reduction. The entire implementation takes under 30 minutes.
The Math
Before caching (Sonnet 4.6, 50K system prompt, 1,000 calls/day):
- 1,000 x 50,000 tokens x $3.00/MTok = $150.00/day
- Monthly: $4,500.00
After caching:
- Daily cache writes: ~288 (one per 5-minute window across 24 hours)
- Cache write cost: 288 x 50,000 x $3.75/MTok = $54.00/day
- Cache reads: ~712 x 50,000 x $0.30/MTok = $10.68/day
- Uncached user messages: 1,000 x 500 avg tokens x $3.00/MTok = $1.50/day
- Total: $66.18/day = $1,985/month
With concentrated traffic (all 1,000 requests within a few 5-minute windows):
- ~3 cache writes + 997 cache reads
- Total: $15.17/day = $455/month
Savings: $4,045/month (90%)
The Technique
Here is a complete implementation, from zero caching to fully cached, in four steps.
Step 1: Identify cacheable content. Separate your static system prompt from dynamic content.
# BEFORE: No caching
import anthropic
client = anthropic.Anthropic()
def handle_query_no_cache(user_message: str) -> str:
response = client.messages.create(
model="claude-sonnet-4-6-20250929",
max_tokens=4096,
system="You are a support agent for Acme Corp...", # 50K tokens
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
Step 2: Convert system prompt to structured format with cache breakpoint.
# AFTER: With caching
def handle_query_cached(user_message: str) -> str:
system_prompt = open("system_prompt.txt").read() # Load once
response = client.messages.create(
model="claude-sonnet-4-6-20250929",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
The critical change: system goes from a plain string to a list of content blocks, and the block containing your static prompt gets cache_control added.
Step 3: Add monitoring to verify caching is active.
import logging
logger = logging.getLogger("cache_monitor")
def handle_query_monitored(user_message: str) -> str:
system_prompt = open("system_prompt.txt").read()
response = client.messages.create(
model="claude-sonnet-4-6-20250929",
max_tokens=4096,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": user_message}]
)
usage = response.usage
cache_read = getattr(usage, "cache_read_input_tokens", 0)
cache_write = getattr(usage, "cache_creation_input_tokens", 0)
if cache_write > 0:
logger.info(f"CACHE WRITE: {cache_write} tokens")
elif cache_read > 0:
logger.info(f"CACHE READ: {cache_read} tokens (saving 90%)")
else:
logger.warning("NO CACHING: check prompt size vs minimum threshold")
return response.content[0].text
Step 4: Add multiple breakpoints for layered content.
def handle_query_multi_cache(
user_message: str,
product_docs: str, # 30K tokens, updated weekly
response_templates: str # 20K tokens, updated monthly
) -> str:
response = client.messages.create(
model="claude-sonnet-4-6-20250929",
max_tokens=4096,
system=[
{
"type": "text",
"text": response_templates,
"cache_control": {"type": "ephemeral"} # BP 1: most stable
},
{
"type": "text",
"text": product_docs,
"cache_control": {"type": "ephemeral"} # BP 2: semi-stable
}
],
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
Verify your token counts meet minimums before deploying:
# Quick token count check
python3 -c "
text = open('system_prompt.txt').read()
# Rough estimate: 1 token ~= 4 characters for English text
est_tokens = len(text) / 4
model_mins = {'Opus 4.7': 4096, 'Sonnet 4.6': 1024, 'Haiku 4.5': 4096}
print(f'Estimated tokens: {est_tokens:.0f}')
for m, t in model_mins.items():
print(f' {m}: {\"OK\" if est_tokens >= t else \"BELOW MINIMUM\"} (need {t})')
"
The Tradeoffs
This implementation has known limitations:
- First-request latency: The initial cache write adds a small latency overhead (typically <100ms) compared to uncached requests. All subsequent reads are faster.
- Prompt immutability: Any change to the system prompt text invalidates the cache. Schedule prompt updates during low-traffic periods to minimize write costs.
- No SDK-level caching: The Anthropic SDK does not cache locally. Every request goes to the API, where server-side caching applies. Network latency is unchanged.
- Token counting complexity: The response
usageobject reports cache reads and writes separately. Your billing dashboard may lag behind real-time usage.
Minimum cacheable token counts vary by model and can cause silent failures. Opus 4.7 and Haiku 4.5 require at least 4,096 tokens before a cache breakpoint. Sonnet 4.6 requires 1,024 tokens. If your system prompt falls below these thresholds, the cache_control parameter is silently ignored – no error is raised, no cache is created, and you pay full input price on every request. Always verify caching is active by checking that cache_read_input_tokens > 0 on the second request.
Implementation Checklist
- Identify your largest static prompt content (system prompts, docs, examples)
- Confirm content exceeds minimum token threshold (1,024 for Sonnet 4.6, 4,096 for Opus 4.7)
- Convert
systemparameter from string to structured content block format - Add
cache_control: {"type": "ephemeral"}to each static block - Deploy monitoring that logs
cache_read_input_tokensandcache_creation_input_tokens - Verify cache reads are occurring on the second request onward
- Compare daily API cost before and after for one week
Measuring Impact
Track these metrics starting from day one:
- Cache activation rate: Percentage of requests showing
cache_read_input_tokens > 0(target: 95%+) - Daily input cost: Should drop 80-90% within the first 24 hours of enabling caching
- Cost per request: Track the moving average. Expected: $0.015 per request (Sonnet, 50K cached) vs $0.15 without caching
- Set up a weekly cost report comparing pre-caching baseline to current spend