Claude Prompt Caching Not Working Fix

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Prompt caching can cut your Claude API costs by up to 90% on repeated content, but it silently does nothing if your cached content is too short or your breakpoints are misconfigured. This guide covers every reason caching fails.

The Error

Prompt caching does not produce an error when it fails. Instead, your usage response shows zero cached tokens:

{
  "usage": {
    "input_tokens": 5000,
    "output_tokens": 200,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0
  }
}

If cache_creation_input_tokens and cache_read_input_tokens are both 0, caching is not working.

Quick Fix

  1. Verify your cached content meets the minimum token threshold for your model.
  2. Use automatic caching (cache_control at top level) for the simplest setup.
  3. Check that tool definitions have not changed between requests (this invalidates all caches).

What Causes This

Prompt caching fails silently for several reasons:

Full Solution

The simplest approach – add cache_control at the top level and the system places the breakpoint automatically on the last cacheable block:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are a legal expert. Here is the full text of the contract...[4000+ tokens of content]",
    messages=[{"role": "user", "content": "Summarize the key terms."}]
)

# Check cache usage
print(f"Cache created: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")

Use Explicit Breakpoints

Place cache_control on specific content blocks. You can use up to 4 breakpoints per request:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal expert specializing in contract law...[large system prompt]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What are the termination clauses?"}
    ]
)

Minimum Cacheable Length by Model

Your cached content must meet these minimums or caching silently does nothing:

Model Minimum Tokens
Claude Opus 4.6 4,096
Claude Opus 4.5 4,096
Claude Haiku 4.5 4,096
Claude Sonnet 4.6 2,048
Claude Haiku 3.5 2,048
Claude Sonnet 4.5 1,024
Claude Opus 4.1 1,024
Claude Opus 4 1,024
Claude Sonnet 4 1,024

Avoid Cache Invalidation

Tool definition changes invalidate ALL cached content. Keep your tools stable across requests:

# Define tools ONCE and reuse
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
]

# Both requests will share the cache because tools are identical
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    tools=TOOLS,
    messages=[{"role": "user", "content": "What's the weather in NYC?"}]
)

response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    tools=TOOLS,  # Same tools -- cache hit!
    messages=[{"role": "user", "content": "What's the weather in LA?"}]
)

Use 1-Hour Cache for Extended Workflows

For batch processing or extended thinking tasks, use the 1-hour TTL to improve hit rates:

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    cache_control={"type": "ephemeral", "ttl": "1h"},
    system="...[large system prompt]...",
    messages=[{"role": "user", "content": "Analyze this data"}]
)

The 1-hour cache costs 2x the base input price to write but 0.1x to read, making it worthwhile for workloads with many reads.

Prevention

  1. Monitor cache hit rates: Always check cache_creation_input_tokens and cache_read_input_tokens in every response.
  2. Keep tools stable: Define tools once and pass the same definition across all requests.
  3. Place cached content first: System prompts and tool definitions at the start of the prompt are the best cache candidates.
  4. Right-size your model: If your system prompt is 2,000 tokens, use Sonnet 4.5 or 4 (1,024 minimum) rather than Opus 4.6 (4,096 minimum).