Smart Model Selection Saves 80% on Claude API
A pipeline running 1,000 daily requests on Claude Opus 4.7 costs $75/day. Routing those same requests to Haiku 4.5 where possible, adding batch processing for non-urgent work, and caching repeated system prompts drops the daily cost to $11.25 — an 85% reduction to $337.50/month from $2,250/month.
The Setup
Most Claude API cost reduction guides cover one technique at a time. This guide stacks three techniques together — model routing, batch processing, and prompt caching — for maximum compound savings.
The key insight: these techniques multiply rather than add. Batch processing gives 50% off. Cache reads give 90% off input tokens. Haiku costs 80% less than Opus. Combined, the cheapest path (Haiku batch with cached inputs) costs 95% less than the most expensive path (Opus standard without caching).
Target: teams spending $1,000-$10,000/month on Claude API who want to maintain quality while cutting costs by 70-85%.
The Math
Baseline: 1,000 requests/day, 5K input + 2K output tokens, 2K of input is a repeated system prompt. All Opus 4.7, no caching, no batching.
Before (Opus, standard, no cache):
- Input: 5M * $5.00/MTok = $25.00
- Output: 2M * $25.00/MTok = $50.00
- Daily: $75.00 -> $2,250/month
Layer 1 — Model routing (70% to Haiku):
- Haiku: 700 * (5K * $1 + 2K * $5) / 1M = $10.50
- Opus: 300 * (5K * $5 + 2K * $25) / 1M = $22.50
- Daily: $33.00 -> $990/month (56% savings)
Layer 2 — Add batch processing for 50% of Haiku requests:
- Haiku batch: 350 * (5K * $0.50 + 2K * $2.50) / 1M = $2.625
- Haiku standard: 350 * $0.015 = $5.25
- Opus standard: $22.50
- Daily: $30.375 -> $911/month (60% savings)
Layer 3 — Cache the 2K system prompt:
- Cached input saves 90% on 2K tokens per request
- Haiku cache read: $0.10/MTok (vs $1.00 uncached)
- 1,000 requests * 2K cached * $0.10/MTok = $0.20/day (vs $2.00 uncached)
- Net savings: $1.80/day
- Final daily: $28.575 -> $857/month (62% total savings)
Aggressive version (90% Haiku, 80% batched, all cached):
- $11.25/day -> $337.50/month (85% total savings)
The Technique
Implement all three optimization layers in a single API wrapper:
import anthropic
from typing import Optional
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a helpful assistant. Follow instructions precisely.
Provide structured, concise responses. Use the specified output format."""
# Pricing rates per million tokens
RATES = {
"claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.5},
"claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3},
"claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0, "cache_read": 0.1},
}
def optimized_request(
prompt: str,
task_type: str = "general",
urgent: bool = True,
system: str = SYSTEM_PROMPT,
max_tokens: int = 2048,
) -> dict:
"""Make a cost-optimized Claude API request.
Applies three optimization layers:
1. Model routing based on task complexity
2. Prompt caching for repeated system prompts
3. Batch collection for non-urgent requests
"""
# Layer 1: Model selection
simple_tasks = {"classify", "extract", "detect", "format", "label"}
if task_type in simple_tasks:
model = "claude-haiku-4-5-20251001"
elif task_type in {"summarize", "draft", "review", "explain"}:
model = "claude-sonnet-4-6"
else:
model = "claude-opus-4-7"
# Layer 2: Prompt caching (cache the system prompt)
system_with_cache = [
{
"type": "text",
"text": system,
"cache_control": {"type": "ephemeral"},
}
]
# Layer 3: Batch collection (non-urgent requests)
if not urgent:
return collect_for_batch(model, prompt, system, max_tokens)
response = client.messages.create(
model=model,
max_tokens=max_tokens,
system=system_with_cache,
messages=[{"role": "user", "content": prompt}],
)
rates = RATES[model]
cost = (
response.usage.input_tokens * rates["input"] / 1_000_000
+ response.usage.output_tokens * rates["output"] / 1_000_000
)
# Check if cache was used
cache_stats = {}
if hasattr(response.usage, "cache_read_input_tokens"):
cached = response.usage.cache_read_input_tokens
cache_savings = cached * (rates["input"] - rates["cache_read"]) / 1_000_000
cache_stats = {"cached_tokens": cached, "cache_savings": cache_savings}
return {
"model": model,
"content": response.content[0].text,
"cost": round(cost, 6),
**cache_stats,
}
_batch_queue = []
def collect_for_batch(model: str, prompt: str, system: str, max_tokens: int) -> dict:
"""Queue a request for batch processing (50% discount)."""
_batch_queue.append({
"model": model,
"prompt": prompt,
"system": system,
"max_tokens": max_tokens,
})
return {"status": "queued", "queue_size": len(_batch_queue)}
def submit_batch():
"""Submit collected requests as a batch for 50% discount."""
if not _batch_queue:
return {"status": "empty"}
requests = []
for i, req in enumerate(_batch_queue):
requests.append({
"custom_id": f"req_{i}",
"params": {
"model": req["model"],
"max_tokens": req["max_tokens"],
"system": req["system"],
"messages": [{"role": "user", "content": req["prompt"]}],
},
})
batch = client.messages.batches.create(requests=requests)
_batch_queue.clear()
return {"batch_id": batch.id, "request_count": len(requests)}
# Usage
result = optimized_request(
prompt="Classify: 'The server returned a 500 error' -> technical or billing?",
task_type="classify",
urgent=True,
)
print(f"Model: {result['model']} | Cost: ${result['cost']:.6f}")
# Non-urgent request queued for batch
result = optimized_request(
prompt="Extract all dates from: 'Meeting on Jan 5, deadline March 15'",
task_type="extract",
urgent=False,
)
print(f"Status: {result['status']} | Queue: {result['queue_size']}")
The Tradeoffs
Stacking optimizations adds complexity. Prompt caching requires minimum 4,096 cacheable tokens for Opus 4.7 and Haiku 4.5 (1,024 for Sonnet 4.6). Short system prompts may not meet this threshold.
Batch processing introduces latency — most batches complete within 1 hour, with a 24-hour maximum. Any request requiring real-time response cannot be batched.
The 5-minute cache TTL means you need consistent request flow to keep the cache warm. Sporadic usage patterns will see frequent cache misses, reducing the 90% cache discount to near zero effective savings.
Implementation Checklist
- Start with model routing alone (easiest, biggest impact)
- Add prompt caching for your most common system prompt
- Identify non-urgent request types eligible for batching
- Implement batch collection with a 30-minute submission window
- Monitor cost reduction per layer to verify each technique contributes
- Tune routing rules based on quality feedback
- Review combined savings monthly
Measuring Impact
Track the contribution of each optimization layer separately. Target savings: model routing 40-60%, caching 10-20% additional, batching 10-25% additional. Total target: 60-85% reduction from baseline. The best metric is cost per useful output — divide total monthly spend by the number of requests that produced acceptable quality results.
Related Guides
- Why Is Claude Code Expensive — the pricing dynamics that make stacked optimization effective
- Claude Code Token Usage Optimization — token reduction techniques that complement model routing
- Claude Code Monthly Cost Breakdown — baseline cost data for planning your optimization strategy