Claude Cache Minimum Token Requirements 2026

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Claude prompt caching silently fails if your content falls below the minimum token threshold. Opus 4.7 requires 4,096 tokens. Sonnet 4.6 requires 1,024 tokens. Below these limits, your cache_control breakpoints are ignored, and you pay the full $5.00/MTok or $3.00/MTok on every request with zero savings.

The Setup

You have a 3,000-token system prompt running on Opus 4.7 with 500 requests per day. You add cache_control breakpoints expecting to save 90% on input costs. But 3,000 tokens is below the 4,096-token minimum for Opus. The caching never activates. You pay full price on all 500 requests – $7.50/day instead of the $0.88 you expected.

The fix is either to pad the prompt above the threshold (adding useful context, not filler) or to switch to Sonnet 4.6, which has a 1,024-token minimum. On Sonnet, the same 3,000-token prompt caches successfully, and 500 requests/day drops from $4.50 to $0.53.

The Math

Opus 4.7, 3,000-token prompt (below 4,096 minimum), 500 requests/day:

Attempted with caching (but below threshold – caching silently disabled):

Same prompt on Sonnet 4.6 (above 1,024 minimum – caching works):

Savings from choosing the right model: $211.20/month

Full minimum token requirements table:

Model Minimum Cacheable Tokens
Opus 4.7 4,096
Opus 4.6 4,096
Opus 4.5 4,096
Opus 4.1 / Opus 4 1,024
Sonnet 4.6 / 4.5 / 3.7 1,024
Haiku 4.5 4,096
Haiku 3.5 2,048
Haiku 3 2,048

The Technique

The API does not return an error when your content is below the minimum. It simply processes the request without caching. You must verify caching is active by checking the response usage fields.

import anthropic

client = anthropic.Anthropic()

def verify_caching(model: str, system_text: str) -> dict:
    """Test whether caching activates for a given prompt and model."""

    # First request -- should trigger cache write
    response = client.messages.create(
        model=model,
        max_tokens=50,
        system=[
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": "Hello."}]
    )

    usage = response.usage
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    uncached = usage.input_tokens

    status = "ACTIVE" if cache_write > 0 else "INACTIVE"
    return {
        "model": model,
        "cache_status": status,
        "cache_write_tokens": cache_write,
        "uncached_tokens": uncached,
        "prompt_tokens_estimated": len(system_text.split()) * 1.3
    }

# Test the same prompt on two models
prompt = open("system_prompt.txt").read()

result_opus = verify_caching("claude-opus-4-7-20250415", prompt)
result_sonnet = verify_caching("claude-sonnet-4-6-20250929", prompt)

print(f"Opus: {result_opus['cache_status']} "
      f"(~{result_opus['prompt_tokens_estimated']:.0f} tokens)")
print(f"Sonnet: {result_sonnet['cache_status']} "
      f"(~{result_sonnet['prompt_tokens_estimated']:.0f} tokens)")

If your prompt falls below the threshold, here are three options ranked by effectiveness:

Option 1: Add useful context to exceed the threshold. If your system prompt is 3,000 tokens on Opus, add 1,100+ tokens of relevant few-shot examples, domain terminology, or formatting instructions. This improves output quality while enabling caching.

Option 2: Switch models. Move from Opus 4.7 (4,096 minimum) to Sonnet 4.6 (1,024 minimum). For many workloads, Sonnet produces comparable quality at lower base pricing ($3.00 vs $5.00/MTok) and activates caching on smaller prompts.

Option 3: Combine multiple cacheable blocks. If you have a 2,000-token system prompt and a 3,000-token few-shot block, combine them under a single breakpoint. The 5,000-token combined block exceeds the 4,096 threshold for Opus.

# Count tokens in your system prompt using the Anthropic tokenizer
python3 -c "
import anthropic
client = anthropic.Anthropic()
text = open('system_prompt.txt').read()
count = client.count_tokens(text)
thresholds = {'opus-4.7': 4096, 'sonnet-4.6': 1024, 'haiku-4.5': 4096}
print(f'Token count: {count}')
for model, min_t in thresholds.items():
    status = 'OK' if count >= min_t else 'BELOW MINIMUM'
    print(f'  {model}: {status} (need {min_t})')
"

The Tradeoffs

Working around minimum token requirements involves compromises:

Implementation Checklist

  1. Count tokens in every cacheable content block using the Anthropic tokenizer
  2. Compare counts against the minimum threshold for your target model
  3. If below threshold: add useful context, switch models, or combine blocks
  4. Deploy with monitoring on cache_creation_input_tokens
  5. If cache_creation_input_tokens is 0 on the first request, caching is not activating
  6. Re-verify after any prompt changes that might reduce token count

Measuring Impact

After resolving threshold issues: