When NOT to Use Claude Prompt Caching
Prompt caching is not free. Every cache write costs 1.25x the standard input price on 5-minute TTL and 2.0x on 1-hour TTL. If your cache never gets read, you pay that premium with zero return. A system that writes cache entries without reading them spends 25% more than one with no caching at all.
The Setup
You run a content generation service where each request has a unique system prompt tailored to the specific brand, voice, and product being written about. No two requests share the same prompt. You enabled prompt caching across all endpoints expecting savings.
Instead of saving money, you added 25% to every input cost. On Opus 4.7, your effective input price went from $5.00/MTok to $6.25/MTok – the cache write premium – because every request triggered a write and none triggered a read. At 1,000 requests/day with 10,000-token prompts, that is an extra $12.50/day or $375/month in pure waste.
The Math
Scenario: Unique prompts, no cache reuse, Opus 4.7, 10K tokens, 1,000 requests/day:
Without caching:
- 1,000 x 10,000 tokens x $5.00/MTok = $50.00/day
With caching enabled (100% write, 0% read):
- 1,000 cache writes x 10,000 tokens x $6.25/MTok = $62.50/day
Extra cost: $12.50/day ($375/month) for zero benefit
The break-even threshold requires at least 1 read per write for 5-minute cache (2 total requests per unique prompt). If your hit rate is below 50% (more writes than reads), you are losing money.
1-hour cache on the same workload is worse:
- 1,000 writes x 10,000 tokens x $10.00/MTok = $100.00/day
- Extra cost: $50.00/day ($1,500/month) vs no caching
The Technique
Before enabling caching, audit your prompt reuse patterns. Here is a diagnostic script that analyzes your request logs:
import json
import hashlib
from collections import Counter
def analyze_cache_potential(log_file: str) -> dict:
"""Analyze API request logs to determine caching viability."""
prompt_hashes = []
with open(log_file) as f:
for line in f:
req = json.loads(line)
system_text = req.get("system", "")
if isinstance(system_text, list):
system_text = " ".join(
b["text"] for b in system_text if b.get("type") == "text"
)
prompt_hash = hashlib.sha256(system_text.encode()).hexdigest()[:16]
prompt_hashes.append(prompt_hash)
total_requests = len(prompt_hashes)
unique_prompts = len(set(prompt_hashes))
frequency = Counter(prompt_hashes)
# Prompts used more than once (cacheable)
reusable = sum(1 for count in frequency.values() if count > 1)
reusable_requests = sum(
count for count in frequency.values() if count > 1
)
reuse_ratio = reusable_requests / total_requests if total_requests > 0 else 0
return {
"total_requests": total_requests,
"unique_prompts": unique_prompts,
"reusable_prompts": reusable,
"reuse_ratio": f"{reuse_ratio:.1%}",
"recommendation": (
"ENABLE caching" if reuse_ratio > 0.5
else "DO NOT cache" if reuse_ratio < 0.2
else "SELECTIVE caching (cache only reused prompts)"
)
}
result = analyze_cache_potential("api_requests.jsonl")
print(json.dumps(result, indent=2))
Five scenarios where caching costs more than it saves:
1. Unique prompts per request. Personalized content generation, dynamic RAG prompts with different retrieved chunks, or A/B testing multiple prompt variants. Every request triggers a write with no reads.
2. Low-frequency endpoints. An internal tool used 3 times per day with 15-minute gaps between requests. The 5-minute cache expires before the next request. Each request is a write.
3. Rapidly iterating prompts. During prompt engineering, you change the system prompt every few requests. Each version gets cached and immediately abandoned. Switch caching off during development.
4. Small prompts below the minimum threshold. A 500-token system prompt on Opus 4.7 (minimum 4,096) silently skips caching. The cache_control annotation is ignored, but you pay no penalty in this case. The risk is false confidence – you think caching is active when it is not.
5. Output-dominated costs. If your workload generates 10x more output tokens than input tokens, caching the input saves 90% of a small fraction of your bill. A request with 5K input and 50K output on Opus 4.7 costs $0.025 input + $1.25 output. Caching saves at most $0.0225 – 1.8% of the total.
The Tradeoffs
The decision to skip caching has its own costs:
- Missed savings on partial reuse: Some endpoints have mixed traffic – 60% shared prompts, 40% unique. Selective caching (only on the shared prompts) captures savings without the write penalty on unique ones.
- Future workload changes: A product that starts with unique prompts may standardize over time. Revisit the caching decision quarterly.
- Latency benefit lost: Cache reads are faster than full prompt processing. Skipping caching means consistently slower responses, even if the cost difference is minimal.
- Competitor comparison: Other providers do not offer this level of caching granularity. Not using Claude’s caching means not using one of its key cost advantages.
Implementation Checklist
- Run the prompt reuse analysis script on one week of production request logs
- Calculate your prompt reuse ratio (target: above 50% for caching to be worthwhile)
- If reuse ratio is below 20%, do not enable caching
- If reuse ratio is 20-50%, enable caching only on endpoints with high-reuse prompts
- Disable caching on development and testing endpoints where prompts change frequently
- Re-evaluate every month as traffic patterns shift
Measuring Impact
If you suspect caching is costing you money:
- Cache write-to-read ratio: Pull from response usage data. A ratio above 1:1 (more writes than reads) confirms caching is a net cost.
- A/B billing test: Run identical traffic for 48 hours with caching enabled and disabled. Compare total input costs.
- Per-endpoint analysis: Some endpoints may benefit from caching while others do not. Break down cache metrics by endpoint to selectively disable.
- Check your Anthropic dashboard for total cache_creation vs cache_read token volumes