Claude Cache Minimum Token Requirements 2026
Claude prompt caching silently fails if your content falls below the minimum token threshold. Opus 4.7 requires 4,096 tokens. Sonnet 4.6 requires 1,024 tokens. Below these limits, your cache_control breakpoints are ignored, and you pay the full $5.00/MTok or $3.00/MTok on every request with zero savings.
The Setup
You have a 3,000-token system prompt running on Opus 4.7 with 500 requests per day. You add cache_control breakpoints expecting to save 90% on input costs. But 3,000 tokens is below the 4,096-token minimum for Opus. The caching never activates. You pay full price on all 500 requests – $7.50/day instead of the $0.88 you expected.
The fix is either to pad the prompt above the threshold (adding useful context, not filler) or to switch to Sonnet 4.6, which has a 1,024-token minimum. On Sonnet, the same 3,000-token prompt caches successfully, and 500 requests/day drops from $4.50 to $0.53.
The Math
Opus 4.7, 3,000-token prompt (below 4,096 minimum), 500 requests/day:
Attempted with caching (but below threshold – caching silently disabled):
- 500 x 3,000 tokens x $5.00/MTok = $7.50/day = $225/month
Same prompt on Sonnet 4.6 (above 1,024 minimum – caching works):
- 1 write: 3,000 x $3.75/MTok = $0.011
- 499 reads: 499 x 3,000 x $0.30/MTok = $0.449
- Total: $0.46/day = $13.80/month
Savings from choosing the right model: $211.20/month
Full minimum token requirements table:
| Model | Minimum Cacheable Tokens |
|---|---|
| Opus 4.7 | 4,096 |
| Opus 4.6 | 4,096 |
| Opus 4.5 | 4,096 |
| Opus 4.1 / Opus 4 | 1,024 |
| Sonnet 4.6 / 4.5 / 3.7 | 1,024 |
| Haiku 4.5 | 4,096 |
| Haiku 3.5 | 2,048 |
| Haiku 3 | 2,048 |
The Technique
The API does not return an error when your content is below the minimum. It simply processes the request without caching. You must verify caching is active by checking the response usage fields.
import anthropic
client = anthropic.Anthropic()
def verify_caching(model: str, system_text: str) -> dict:
"""Test whether caching activates for a given prompt and model."""
# First request -- should trigger cache write
response = client.messages.create(
model=model,
max_tokens=50,
system=[
{
"type": "text",
"text": system_text,
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Hello."}]
)
usage = response.usage
cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
cache_read = getattr(usage, 'cache_read_input_tokens', 0)
uncached = usage.input_tokens
status = "ACTIVE" if cache_write > 0 else "INACTIVE"
return {
"model": model,
"cache_status": status,
"cache_write_tokens": cache_write,
"uncached_tokens": uncached,
"prompt_tokens_estimated": len(system_text.split()) * 1.3
}
# Test the same prompt on two models
prompt = open("system_prompt.txt").read()
result_opus = verify_caching("claude-opus-4-7-20250415", prompt)
result_sonnet = verify_caching("claude-sonnet-4-6-20250929", prompt)
print(f"Opus: {result_opus['cache_status']} "
f"(~{result_opus['prompt_tokens_estimated']:.0f} tokens)")
print(f"Sonnet: {result_sonnet['cache_status']} "
f"(~{result_sonnet['prompt_tokens_estimated']:.0f} tokens)")
If your prompt falls below the threshold, here are three options ranked by effectiveness:
Option 1: Add useful context to exceed the threshold. If your system prompt is 3,000 tokens on Opus, add 1,100+ tokens of relevant few-shot examples, domain terminology, or formatting instructions. This improves output quality while enabling caching.
Option 2: Switch models. Move from Opus 4.7 (4,096 minimum) to Sonnet 4.6 (1,024 minimum). For many workloads, Sonnet produces comparable quality at lower base pricing ($3.00 vs $5.00/MTok) and activates caching on smaller prompts.
Option 3: Combine multiple cacheable blocks. If you have a 2,000-token system prompt and a 3,000-token few-shot block, combine them under a single breakpoint. The 5,000-token combined block exceeds the 4,096 threshold for Opus.
# Count tokens in your system prompt using the Anthropic tokenizer
python3 -c "
import anthropic
client = anthropic.Anthropic()
text = open('system_prompt.txt').read()
count = client.count_tokens(text)
thresholds = {'opus-4.7': 4096, 'sonnet-4.6': 1024, 'haiku-4.5': 4096}
print(f'Token count: {count}')
for model, min_t in thresholds.items():
status = 'OK' if count >= min_t else 'BELOW MINIMUM'
print(f' {model}: {status} (need {min_t})')
"
The Tradeoffs
Working around minimum token requirements involves compromises:
- Padding prompts risks quality: Adding filler content to reach 4,096 tokens can confuse the model. Only add contextually relevant material.
- Downgrading models risks accuracy: Sonnet 4.6 has a lower threshold but may produce lower quality output on complex reasoning tasks. Test quality before switching.
- Combining blocks reduces flexibility: Merging two content types under one breakpoint means both get invalidated when either changes, increasing write costs.
- Token counting varies: The new tokenizer in Opus 4.7 may use up to 35% more tokens for the same text compared to older models. Content that falls below the threshold on Sonnet 4.5 might exceed it on Opus 4.7.
Implementation Checklist
- Count tokens in every cacheable content block using the Anthropic tokenizer
- Compare counts against the minimum threshold for your target model
- If below threshold: add useful context, switch models, or combine blocks
- Deploy with monitoring on
cache_creation_input_tokens - If
cache_creation_input_tokensis 0 on the first request, caching is not activating - Re-verify after any prompt changes that might reduce token count
Measuring Impact
After resolving threshold issues:
- First-request verification:
cache_creation_input_tokens > 0confirms caching is active - Daily cache write count: Should match the number of unique prompts, not the number of total requests
- Cost comparison: If you padded your prompt, compare the slight increase in per-token cost against the 90% cache read savings
- Run a weekly audit to catch any prompt updates that push content below the minimum threshold