Smart Model Selection Saves 80% on Claude API

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS · Join 50+ builders · More at zovo.one

A pipeline running 1,000 daily requests on Claude Opus 4.7 costs $75/day. Routing those same requests to Haiku 4.5 where possible, adding batch processing for non-urgent work, and caching repeated system prompts drops the daily cost to $11.25 — an 85% reduction to $337.50/month from $2,250/month.

The Setup

Most Claude API cost reduction guides cover one technique at a time. This guide stacks three techniques together — model routing, batch processing, and prompt caching — for maximum compound savings.

The key insight: these techniques multiply rather than add. Batch processing gives 50% off. Cache reads give 90% off input tokens. Haiku costs 80% less than Opus. Combined, the cheapest path (Haiku batch with cached inputs) costs 95% less than the most expensive path (Opus standard without caching).
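The multiplication can be checked with a few lines of arithmetic. A sketch using this guide's illustrative rates: with the baseline request shape (5K in / 2K out, 2K cacheable), the compound reduction for the cheapest path lands near 91%; caching a larger share of the input pushes it toward the 95% quoted for the extreme case.

```python
# Illustrative per-million-token rates from this guide: (input, output)
PRICES = {"opus": (5.0, 25.0), "haiku": (1.0, 5.0)}
BATCH_DISCOUNT = 0.5   # batch processing halves every token's price
CACHE_FACTOR = 0.10    # cache reads cost 10% of the normal input rate

def cost(model, fresh_in, cached_in, out, batched=False):
    """Dollar cost of one request: fresh input + cached input + output tokens."""
    in_rate, out_rate = PRICES[model]
    mult = BATCH_DISCOUNT if batched else 1.0
    return mult * (fresh_in * in_rate + cached_in * in_rate * CACHE_FACTOR
                   + out * out_rate) / 1e6

# Most expensive path: Opus, standard tier, nothing cached (5K in / 2K out)
worst = cost("opus", 5000, 0, 2000)
# Cheapest path: Haiku, batched, with the 2K system prompt served from cache
best = cost("haiku", 3000, 2000, 2000, batched=True)
print(f"reduction: {1 - best / worst:.0%}")   # ~91% for this request shape
```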

Target: teams spending $1,000-$10,000/month on Claude API who want to maintain quality while cutting costs by 70-85%.

The Math

Baseline: 1,000 requests/day, 5K input + 2K output tokens, 2K of input is a repeated system prompt. All Opus 4.7, no caching, no batching.

Before (Opus, standard, no cache):
5M input tokens × $5/M + 2M output tokens × $25/M = $75/day ($2,250/month).

Layer 1 — Model routing (70% to Haiku):
700 Haiku requests cost $10.50; the remaining 300 Opus requests cost $22.50. Total: $33/day (56% saved).

Layer 2 — Add batch processing for 50% of Haiku requests:
350 batched Haiku requests run at half price ($2.63 instead of $5.25). Total: ≈$30.38/day (59% saved).

Layer 3 — Cache the 2K system prompt:
2K of every request's input bills at cache-read rates (90% off). Total: ≈$26.73/day (64% saved).

Aggressive version (90% Haiku, 80% batched, all cached):
Total: ≈$11/day, the 85% reduction from the headline (figures rounded; the one-time cache-write premium is ignored).
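The layered costs above can be reproduced in a few lines (a sketch using this guide's illustrative rates; batch-tier cache reads are assumed to receive both discounts, and the one-time cache-write premium is ignored):

```python
# Illustrative per-million-token rates: (input, output, cache_read)
PRICING = {"opus": (5.0, 25.0, 0.50), "haiku": (1.0, 5.0, 0.10)}

def daily(model, n_requests, fresh_in, cached_in, out, batched=False):
    """Daily cost in dollars for n_requests on one model/tier."""
    in_rate, out_rate, cache_rate = PRICING[model]
    mult = 0.5 if batched else 1.0   # batch tier halves every token's price
    per_request = mult * (fresh_in * in_rate + cached_in * cache_rate
                          + out * out_rate) / 1e6
    return n_requests * per_request

baseline = daily("opus", 1000, 5000, 0, 2000)
layer1 = daily("haiku", 700, 5000, 0, 2000) + daily("opus", 300, 5000, 0, 2000)
layer2 = (daily("haiku", 350, 5000, 0, 2000)
          + daily("haiku", 350, 5000, 0, 2000, batched=True)
          + daily("opus", 300, 5000, 0, 2000))
layer3 = (daily("haiku", 350, 3000, 2000, 2000)
          + daily("haiku", 350, 3000, 2000, 2000, batched=True)
          + daily("opus", 300, 3000, 2000, 2000))
aggressive = (daily("haiku", 180, 3000, 2000, 2000)
              + daily("haiku", 720, 3000, 2000, 2000, batched=True)
              + daily("opus", 20, 3000, 2000, 2000)
              + daily("opus", 80, 3000, 2000, 2000, batched=True))
for name, value in [("baseline", baseline), ("layer1", layer1),
                    ("layer2", layer2), ("layer3", layer3),
                    ("aggressive", aggressive)]:
    print(f"{name}: ${value:.2f}/day")
```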

The Technique

Implement all three optimization layers in a single API wrapper:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant. Follow instructions precisely.
Provide structured, concise responses. Use the specified output format."""
# NOTE: a system prompt this short is below the minimum cacheable length
# (see The Tradeoffs); in production the cached block should be your full
# multi-thousand-token system prompt.

# Pricing rates per million tokens
RATES = {
    "claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.5},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3},
    "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0, "cache_read": 0.1},
}

def optimized_request(
    prompt: str,
    task_type: str = "general",
    urgent: bool = True,
    system: str = SYSTEM_PROMPT,
    max_tokens: int = 2048,
) -> dict:
    """Make a cost-optimized Claude API request.

    Applies three optimization layers:
    1. Model routing based on task complexity
    2. Prompt caching for repeated system prompts
    3. Batch collection for non-urgent requests
    """
    # Layer 1: Model selection
    simple_tasks = {"classify", "extract", "detect", "format", "label"}
    if task_type in simple_tasks:
        model = "claude-haiku-4-5-20251001"
    elif task_type in {"summarize", "draft", "review", "explain"}:
        model = "claude-sonnet-4-6"
    else:
        model = "claude-opus-4-7"

    # Layer 3: Batch collection (non-urgent requests skip the live call)
    if not urgent:
        return collect_for_batch(model, prompt, system, max_tokens)

    # Layer 2: Prompt caching (mark the system prompt as cacheable)
    system_with_cache = [
        {
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }
    ]

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system_with_cache,
        messages=[{"role": "user", "content": prompt}],
    )

    rates = RATES[model]
    usage = response.usage
    # usage.input_tokens excludes cache reads, which are billed separately
    # at the cache-read rate (cache-write premium omitted for brevity)
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    cost = (
        usage.input_tokens * rates["input"]
        + cached * rates["cache_read"]
        + usage.output_tokens * rates["output"]
    ) / 1_000_000

    cache_stats = {}
    if cached:
        cache_savings = cached * (rates["input"] - rates["cache_read"]) / 1_000_000
        cache_stats = {"cached_tokens": cached, "cache_savings": round(cache_savings, 6)}

    return {
        "model": model,
        "content": response.content[0].text,
        "cost": round(cost, 6),
        **cache_stats,
    }

_batch_queue = []

def collect_for_batch(model: str, prompt: str, system: str, max_tokens: int) -> dict:
    """Queue a request for batch processing (50% discount)."""
    _batch_queue.append({
        "model": model,
        "prompt": prompt,
        "system": system,
        "max_tokens": max_tokens,
    })
    return {"status": "queued", "queue_size": len(_batch_queue)}

def submit_batch():
    """Submit collected requests as a batch for 50% discount."""
    if not _batch_queue:
        return {"status": "empty"}

    requests = []
    for i, req in enumerate(_batch_queue):
        requests.append({
            "custom_id": f"req_{i}",
            "params": {
                "model": req["model"],
                "max_tokens": req["max_tokens"],
                "system": req["system"],
                "messages": [{"role": "user", "content": req["prompt"]}],
            },
        })

    batch = client.messages.batches.create(requests=requests)
    _batch_queue.clear()
    return {"batch_id": batch.id, "request_count": len(requests)}

# Usage
result = optimized_request(
    prompt="Classify: 'The server returned a 500 error' -> technical or billing?",
    task_type="classify",
    urgent=True,
)
print(f"Model: {result['model']} | Cost: ${result['cost']:.6f}")

# Non-urgent request queued for batch
result = optimized_request(
    prompt="Extract all dates from: 'Meeting on Jan 5, deadline March 15'",
    task_type="extract",
    urgent=False,
)
print(f"Status: {result['status']} | Queue: {result['queue_size']}")

The Tradeoffs

Stacking optimizations adds complexity. Prompt caching requires a minimum of 4,096 cacheable tokens on Opus 4.7 and Haiku 4.5 (1,024 on Sonnet 4.6); a shorter system prompt is silently left uncached and bills at the full input rate.

Batch processing introduces latency — most batches complete within 1 hour, with a 24-hour maximum. Any request requiring real-time response cannot be batched.

The 5-minute cache TTL means you need consistent request flow to keep the cache warm. Sporadic usage patterns will see frequent cache misses, reducing the 90% cache discount to near zero effective savings.
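The warm-cache requirement can be quantified. Creating a cache entry costs a 25% premium over the normal input rate (standard pricing for the 5-minute ephemeral cache), while each read saves 90% of it, so a single hit inside the TTL already pays for the write. A sketch with this guide's Haiku rates:

```python
HAIKU_INPUT = 1.0                   # $/M tokens, standard input rate
CACHE_WRITE = 1.25 * HAIKU_INPUT    # 25% premium to create a cache entry
CACHE_READ = 0.10 * HAIKU_INPUT     # 90% discount on cache reads

# Extra cost of one cache write vs. a plain input pass, per M cached tokens
write_premium = CACHE_WRITE - HAIKU_INPUT       # 0.25
# Saving per subsequent read within the 5-minute TTL, per M cached tokens
saving_per_read = HAIKU_INPUT - CACHE_READ      # 0.90

break_even_reads = write_premium / saving_per_read
print(f"{break_even_reads:.2f}")    # 0.28: one hit inside the TTL already wins
```

The failure mode runs the other way: traffic so sporadic that every request re-writes the cache and never reads it turns the feature into a steady 25% surcharge on the cached span.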

Implementation Checklist

  1. Start with model routing alone (easiest, biggest impact)
  2. Add prompt caching for your most common system prompt
  3. Identify non-urgent request types eligible for batching
  4. Implement batch collection with a 30-minute submission window
  5. Monitor cost reduction per layer to verify each technique contributes
  6. Tune routing rules based on quality feedback
  7. Review combined savings monthly
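Step 4's submission window can be sketched with a timer that flushes the queue when the window expires or a size cap is hit, whichever comes first (`BatchWindow` is an illustrative name, not part of the SDK; wire the callback to submit_batch-style code):

```python
import threading

class BatchWindow:
    """Collect requests; flush after a time window or once a size cap is hit."""

    def __init__(self, submit, window_seconds=1800, max_size=1000):
        self._submit = submit           # callback that receives the queued list
        self._window = window_seconds   # 1800s = the 30-minute window
        self._max = max_size
        self._queue = []
        self._lock = threading.Lock()
        self._timer = None

    def add(self, request):
        flush_now = False
        with self._lock:
            self._queue.append(request)
            if self._timer is None:     # first request opens the window
                self._timer = threading.Timer(self._window, self.flush)
                self._timer.daemon = True
                self._timer.start()
            flush_now = len(self._queue) >= self._max
        if flush_now:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._queue = self._queue, []
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
        if batch:
            self._submit(batch)
```

The lock keeps add and the timer's flush from racing; cancelling the timer on flush means a size-cap flush also resets the window.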

Measuring Impact

Track the contribution of each optimization layer separately. Target savings: model routing 40-60%, caching 10-20% additional, batching 10-25% additional. Total target: 60-85% reduction from baseline. The best metric is cost per useful output — divide total monthly spend by the number of requests that produced acceptable quality results.
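The metric is a one-liner; what matters is feeding it an honest acceptance rate (a sketch; the figures below are illustrative, not from this guide's baseline):

```python
def cost_per_useful_output(monthly_spend, total_requests, acceptable_rate):
    """Dollars per request that produced an acceptable-quality result.

    acceptable_rate is the fraction of requests passing your quality bar;
    a cheap model that fails more often can lose to a pricier one here.
    """
    useful = total_requests * acceptable_rate
    return monthly_spend / useful

# e.g. $340/month, 30K requests, 92% acceptable after routing
print(round(cost_per_useful_output(340, 30_000, 0.92), 4))  # 0.0123
```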