Claude Prompt Caching API Guide
Prompt caching lets you cache repeated portions of your prompts so the API reads them from cache instead of reprocessing them on every request. Cache reads cost 10% of the base input token price. This guide shows you how to implement it.
Quick Fix
Add one line to enable automatic caching:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Enable automatic caching
    system="Your large system prompt here...",
    messages=[{"role": "user", "content": "Question?"}]
)
```
What You Need
- Anthropic Python or TypeScript SDK
- A prompt with cacheable content that exceeds the model’s minimum token threshold
Full Solution
Method 1: Automatic Caching (Recommended)
Add `cache_control` at the top level. The system automatically places the breakpoint on the last cacheable block:
```python
import anthropic

client = anthropic.Anthropic()

# Large system prompt that benefits from caching
SYSTEM_PROMPT = """You are a legal expert. Here is the full text of the contract...
[Insert 5,000+ tokens of contract text here]
"""

# First request: creates the cache
response1 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "What are the termination clauses?"}]
)
print(f"Cache created: {response1.usage.cache_creation_input_tokens} tokens")

# Second request: reads from cache (10% of base input price)
response2 = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "What is the payment schedule?"}]
)
print(f"Cache hit: {response2.usage.cache_read_input_tokens} tokens")
```
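To see the savings concretely, here is a rough cost sketch. The 1.25x write and 0.1x read multipliers come from the pricing described in this guide; the $3/MTok base price is an illustrative figure, not a quote:

```python
def uncached_cost(prompt_tokens: int, requests: int, base_price_per_mtok: float) -> float:
    """Cost of resending the same prompt on every request with no caching."""
    return prompt_tokens * requests * base_price_per_mtok / 1_000_000

def cached_cost(prompt_tokens: int, requests: int, base_price_per_mtok: float) -> float:
    """Cost with caching: one write at 1.25x, then reads at 0.1x."""
    per_token = base_price_per_mtok / 1_000_000
    write = prompt_tokens * per_token * 1.25                   # first request writes the cache
    reads = prompt_tokens * per_token * 0.10 * (requests - 1)  # later requests read it
    return write + reads

# 100k-token contract, 10 questions, $3/MTok base input price (example figure)
print(f"Uncached: ${uncached_cost(100_000, 10, 3.0):.2f}")
print(f"Cached:   ${cached_cost(100_000, 10, 3.0):.2f}")
```

With those numbers the cached run costs roughly a fifth of the uncached one, and the gap widens as the number of follow-up questions grows.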
Method 2: Explicit Cache Breakpoints
Place `cache_control` on specific content blocks for precise control. You can define up to 4 breakpoints per request:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal expert...[large prompt]...",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[Large document to analyze]...",
                    "cache_control": {"type": "ephemeral"}  # Breakpoint 2
                }
            ]
        }
    ]
)
```
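If you assemble content block lists dynamically, a small helper keeps the breakpoint on the last block. This is our own sketch, not part of the SDK:

```python
def with_cache_breakpoint(blocks: list[dict]) -> list[dict]:
    """Return a copy of `blocks` with a cache breakpoint on the final block."""
    if not blocks:
        return blocks
    marked = [dict(b) for b in blocks]  # shallow copies; leave the input untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked

# Everything up to and including the marked block becomes cacheable
system = with_cache_breakpoint([
    {"type": "text", "text": "You are a legal expert..."},
    {"type": "text", "text": "[Contract text]..."},
])
```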
Cache Lifetime Options
Choose between 5-minute and 1-hour cache durations:
```python
# Default: 5-minute cache (1.25x base input price to write, 0.1x to read)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="...",
    messages=[...]
)

# 1-hour cache (2x base input price to write, 0.1x to read)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral", "ttl": "1h"},
    system="...",
    messages=[...]
)
```
Use 1-hour caching for batch processing or extended thinking tasks where the same context is used across many requests over a longer period.
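A rough way to compare the two TTLs: if your requests arrive more than 5 minutes apart, the 5-minute cache expires each time and every request pays the 1.25x write price, while the 1-hour cache pays 2x once and 0.1x thereafter. A sketch, with costs expressed in units of the base input price per token:

```python
def five_min_cost(requests: int) -> float:
    """Requests spaced >5 min apart: the 5-minute cache expires, so every request rewrites it."""
    return 1.25 * requests

def one_hour_cost(requests: int) -> float:
    """One 2x write, then 0.1x reads for the rest of the hour."""
    return 2.0 + 0.10 * (requests - 1)

# Even at 2 requests per hour the 1-hour cache already wins:
# five_min_cost(2) = 2.5 vs one_hour_cost(2) = 2.1
```

If your requests land less than 5 minutes apart, the refreshing 5-minute TTL keeps the cache alive anyway, and the cheaper 1.25x write makes the default the better choice.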
Minimum Token Thresholds
Your cached content must meet these minimums for the cache to activate:
| Model | Minimum Tokens |
|---|---|
| Claude Opus 4.6, Opus 4.5, Haiku 4.5 | 4,096 |
| Claude Sonnet 4.6, Haiku 3.5 | 2,048 |
| Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4 | 1,024 |
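Before relying on a cache hit, check your prompt length against the table. A minimal sketch: the threshold map mirrors the table above for a few models, and the token count is taken as a plain number (the SDK's token-counting endpoint is one way to obtain it):

```python
# Minimum cacheable prompt sizes, mirroring the table above (subset of models)
MIN_CACHE_TOKENS = {
    "claude-opus-4-6": 4096,
    "claude-sonnet-4-6": 2048,
    "claude-sonnet-4-5": 1024,
}

def will_cache(model: str, prompt_tokens: int) -> bool:
    """True if the prompt is long enough for this model's cache to activate."""
    minimum = MIN_CACHE_TOKENS.get(model)
    if minimum is None:
        raise ValueError(f"Unknown model: {model}")
    return prompt_tokens >= minimum

print(will_cache("claude-sonnet-4-6", 3000))  # True: above the 2,048 minimum
print(will_cache("claude-opus-4-6", 3000))    # False: below the 4,096 minimum
```

A prompt below the threshold is not an error; the request simply runs without caching, which is why monitoring the usage fields (below) matters.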
What Can Be Cached
- Tool definitions
- System messages
- Text messages
- Images and documents (PDFs)
- Tool use and tool result blocks
What cannot be cached: thinking blocks (directly), sub-content blocks, and empty text blocks.
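Tool definitions take a breakpoint the same way text blocks do: marking the last tool caches the entire tool list before it. A sketch with a made-up tool (the tool name and schema are illustrative, not a real API):

```python
# A cache breakpoint on the LAST tool caches every tool definition up to it
tools = [
    {
        "name": "search_contracts",  # hypothetical tool, for illustration only
        "description": "Search the contract database by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "cache_control": {"type": "ephemeral"},
    }
]

# Then pass it as usual:
# response = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
#                                   tools=tools, messages=[...])
```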
TypeScript Example
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  cache_control: { type: "ephemeral" },
  system: "Your large system prompt...",
  messages: [{ role: "user", content: "Question?" }]
});

console.log(`Cache created: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
```
Monitor Cache Hit Rates
Always check the usage fields to verify caching is working:
```python
response = client.messages.create(...)

usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write: {usage.cache_creation_input_tokens}")
print(f"Cache read: {usage.cache_read_input_tokens}")

if usage.cache_read_input_tokens > 0:
    total_input = usage.input_tokens + usage.cache_read_input_tokens
    hit_rate = usage.cache_read_input_tokens / total_input * 100
    print(f"Cache hit rate: {hit_rate:.1f}%")
```
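In production you might fold that check into a helper and log it per request. One sketch, exercised here with a stand-in usage object so the output shape is visible without calling the API:

```python
from types import SimpleNamespace

def cache_summary(usage) -> dict:
    """Summarize a response's usage fields, including the cache hit rate."""
    total = (usage.input_tokens + usage.cache_creation_input_tokens
             + usage.cache_read_input_tokens)
    hit_rate = usage.cache_read_input_tokens / total * 100 if total else 0.0
    return {
        "uncached": usage.input_tokens,
        "cache_write": usage.cache_creation_input_tokens,
        "cache_read": usage.cache_read_input_tokens,
        "hit_rate_pct": round(hit_rate, 1),
    }

# Stand-in for response.usage, to show the shape of the output
fake = SimpleNamespace(input_tokens=50, cache_creation_input_tokens=0,
                       cache_read_input_tokens=4950)
print(cache_summary(fake))  # hit_rate_pct: 99.0
```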
Cache Invalidation Rules
Understanding what invalidates the cache prevents unexpected cache misses:
- Tool definitions change: Invalidates ALL cached content.
- Thinking parameters change: Invalidates cached messages (but not system prompts or tools).
- Content before breakpoint changes: Any change in content before the cache breakpoint invalidates the cache.
- Cache expires: 5 minutes (default) or 1 hour. The TTL refreshes on each cache read.
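Since any change before the breakpoint causes a miss, a cheap client-side diagnostic is to fingerprint the prefix you expect to be cached and warn when it drifts between requests. This is our own sketch, not an SDK feature:

```python
import hashlib

def prefix_fingerprint(system_prompt: str, tools_json: str = "") -> str:
    """Hash everything that sits before the cache breakpoint."""
    return hashlib.sha256((tools_json + system_prompt).encode()).hexdigest()[:12]

last = prefix_fingerprint("You are a legal expert...")
now = prefix_fingerprint("You are a legal expert...")    # unchanged -> cache can hit
changed = prefix_fingerprint("You are a legal expert!")  # any edit -> guaranteed miss

print(now == last)      # True
print(changed == last)  # False
```

If the fingerprint changes between requests that you expected to share a cache, you have found the invalidation before paying for it.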
Prevention
- Start with automatic caching: It handles breakpoint placement for you. Switch to explicit breakpoints only when you need fine-grained control.
- Cache stable content first: System prompts and tool definitions are ideal candidates because they rarely change.
- Monitor every response: Check `cache_creation_input_tokens` and `cache_read_input_tokens` to verify caching is active.
- Keep tools stable: Define tools once and reuse the same definitions across all requests.
Related Guides
- Claude Prompt Caching Not Working – troubleshoot when caching silently fails.
- Claude Prompt Caching Pricing Guide – calculate your cost savings.
- Claude API Error 429 rate_limit_error Fix – cache-read tokens do not count toward ITPM limits.
- Claude Extended Thinking API Guide – use caching with extended thinking.
- Claude Python SDK Getting Started – basic SDK setup before implementing caching.