Smart Model Selection Saves 80% on Claude API

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS · Join 50+ builders · More at zovo.one

A pipeline running 1,000 daily requests on Claude Opus 4.7 costs $75/day. Routing those same requests to Haiku 4.5 where possible, adding batch processing for non-urgent work, and caching repeated system prompts drops the daily cost to $11.25 — an 85% reduction to $337.50/month from $2,250/month.

The Setup

Most Claude API cost reduction guides cover one technique at a time. This guide stacks three techniques together — model routing, batch processing, and prompt caching — for maximum compound savings.

The key insight: these techniques multiply rather than add. Batch processing gives 50% off. Cache reads give 90% off input tokens. Haiku costs 80% less than Opus. Combined, the cheapest path (Haiku batch with cached inputs) costs 95% less than the most expensive path (Opus standard without caching).
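The multiplication can be checked with a few lines of arithmetic. A sketch using this guide's illustrative rates: with the baseline request shape (5K in / 2K out, 2K cacheable), the compound reduction for the cheapest path lands near 91%; caching a larger share of the input pushes it toward the 95% quoted for the extreme case.

```python
# Illustrative per-million-token rates from this guide: (input, output)
PRICES = {"opus": (5.0, 25.0), "haiku": (1.0, 5.0)}
BATCH_DISCOUNT = 0.5   # batch processing halves every token's price
CACHE_FACTOR = 0.10    # cache reads cost 10% of the normal input rate

def cost(model, fresh_in, cached_in, out, batched=False):
    """Dollar cost of one request: fresh input + cached input + output tokens."""
    in_rate, out_rate = PRICES[model]
    mult = BATCH_DISCOUNT if batched else 1.0
    return mult * (fresh_in * in_rate + cached_in * in_rate * CACHE_FACTOR
                   + out * out_rate) / 1e6

# Most expensive path: Opus, standard tier, nothing cached (5K in / 2K out)
worst = cost("opus", 5000, 0, 2000)
# Cheapest path: Haiku, batched, with the 2K system prompt served from cache
best = cost("haiku", 3000, 2000, 2000, batched=True)
print(f"reduction: {1 - best / worst:.0%}")   # ~91% for this request shape
```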

Target: teams spending $1,000-$10,000/month on Claude API who want to maintain quality while cutting costs by 70-85%.

The Math

Baseline: 1,000 requests/day, 5K input + 2K output tokens, 2K of input is a repeated system prompt. All Opus 4.7, no caching, no batching.

Before (Opus, standard, no cache):
5M input tokens × $5/M + 2M output tokens × $25/M = $75/day ($2,250/month).

Layer 1 — Model routing (70% to Haiku):
700 Haiku requests cost $10.50; the remaining 300 Opus requests cost $22.50. Total: $33/day (56% saved).

Layer 2 — Add batch processing for 50% of Haiku requests:
350 batched Haiku requests run at half price ($2.63 instead of $5.25). Total: ≈$30.38/day (59% saved).

Layer 3 — Cache the 2K system prompt:
2K of every request's input bills at cache-read rates (90% off). Total: ≈$26.73/day (64% saved).

Aggressive version (90% Haiku, 80% batched, all cached):
Total: ≈$11/day, the 85% reduction from the headline (figures rounded; the one-time cache-write premium is ignored).
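The layered costs above can be reproduced in a few lines (a sketch using this guide's illustrative rates; batch-tier cache reads are assumed to receive both discounts, and the one-time cache-write premium is ignored):

```python
# Illustrative per-million-token rates: (input, output, cache_read)
PRICING = {"opus": (5.0, 25.0, 0.50), "haiku": (1.0, 5.0, 0.10)}

def daily(model, n_requests, fresh_in, cached_in, out, batched=False):
    """Daily cost in dollars for n_requests on one model/tier."""
    in_rate, out_rate, cache_rate = PRICING[model]
    mult = 0.5 if batched else 1.0   # batch tier halves every token's price
    per_request = mult * (fresh_in * in_rate + cached_in * cache_rate
                          + out * out_rate) / 1e6
    return n_requests * per_request

baseline = daily("opus", 1000, 5000, 0, 2000)
layer1 = daily("haiku", 700, 5000, 0, 2000) + daily("opus", 300, 5000, 0, 2000)
layer2 = (daily("haiku", 350, 5000, 0, 2000)
          + daily("haiku", 350, 5000, 0, 2000, batched=True)
          + daily("opus", 300, 5000, 0, 2000))
layer3 = (daily("haiku", 350, 3000, 2000, 2000)
          + daily("haiku", 350, 3000, 2000, 2000, batched=True)
          + daily("opus", 300, 3000, 2000, 2000))
aggressive = (daily("haiku", 180, 3000, 2000, 2000)
              + daily("haiku", 720, 3000, 2000, 2000, batched=True)
              + daily("opus", 20, 3000, 2000, 2000)
              + daily("opus", 80, 3000, 2000, 2000, batched=True))
for name, value in [("baseline", baseline), ("layer1", layer1),
                    ("layer2", layer2), ("layer3", layer3),
                    ("aggressive", aggressive)]:
    print(f"{name}: ${value:.2f}/day")
```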

The Technique

Implement all three optimization layers in a single API wrapper:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant. Follow instructions precisely.
Provide structured, concise responses. Use the specified output format."""
# NOTE: a system prompt this short is below the minimum cacheable length
# (see The Tradeoffs); in production the cached block should be your full
# multi-thousand-token system prompt.

# Pricing rates per million tokens
RATES = {
    "claude-opus-4-7": {"input": 5.0, "output": 25.0, "cache_read": 0.5},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3},
    "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0, "cache_read": 0.1},
}

def optimized_request(
    prompt: str,
    task_type: str = "general",
    urgent: bool = True,
    system: str = SYSTEM_PROMPT,
    max_tokens: int = 2048,
) -> dict:
    """Make a cost-optimized Claude API request.

    Applies three optimization layers:
    1. Model routing based on task complexity
    2. Prompt caching for repeated system prompts
    3. Batch collection for non-urgent requests
    """
    # Layer 1: Model selection
    simple_tasks = {"classify", "extract", "detect", "format", "label"}
    if task_type in simple_tasks:
        model = "claude-haiku-4-5-20251001"
    elif task_type in {"summarize", "draft", "review", "explain"}:
        model = "claude-sonnet-4-6"
    else:
        model = "claude-opus-4-7"

    # Layer 3: Batch collection (non-urgent requests skip the live call)
    if not urgent:
        return collect_for_batch(model, prompt, system, max_tokens)

    # Layer 2: Prompt caching (mark the system prompt as cacheable)
    system_with_cache = [
        {
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }
    ]

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system_with_cache,
        messages=[{"role": "user", "content": prompt}],
    )

    rates = RATES[model]
    usage = response.usage
    # usage.input_tokens excludes cache reads, which are billed separately
    # at the cache-read rate (cache-write premium omitted for brevity)
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    cost = (
        usage.input_tokens * rates["input"]
        + cached * rates["cache_read"]
        + usage.output_tokens * rates["output"]
    ) / 1_000_000

    cache_stats = {}
    if cached:
        cache_savings = cached * (rates["input"] - rates["cache_read"]) / 1_000_000
        cache_stats = {"cached_tokens": cached, "cache_savings": round(cache_savings, 6)}

    return {
        "model": model,
        "content": response.content[0].text,
        "cost": round(cost, 6),
        **cache_stats,
    }

_batch_queue = []

def collect_for_batch(model: str, prompt: str, system: str, max_tokens: int) -> dict:
    """Queue a request for batch processing (50% discount)."""
    _batch_queue.append({
        "model": model,
        "prompt": prompt,
        "system": system,
        "max_tokens": max_tokens,
    })
    return {"status": "queued", "queue_size": len(_batch_queue)}

def submit_batch():
    """Submit collected requests as a batch for 50% discount."""
    if not _batch_queue:
        return {"status": "empty"}

    requests = []
    for i, req in enumerate(_batch_queue):
        requests.append({
            "custom_id": f"req_{i}",
            "params": {
                "model": req["model"],
                "max_tokens": req["max_tokens"],
                "system": req["system"],
                "messages": [{"role": "user", "content": req["prompt"]}],
            },
        })

    batch = client.messages.batches.create(requests=requests)
    _batch_queue.clear()
    return {"batch_id": batch.id, "request_count": len(requests)}

# Usage
result = optimized_request(
    prompt="Classify: 'The server returned a 500 error' -> technical or billing?",
    task_type="classify",
    urgent=True,
)
print(f"Model: {result['model']} | Cost: ${result['cost']:.6f}")

# Non-urgent request queued for batch
result = optimized_request(
    prompt="Extract all dates from: 'Meeting on Jan 5, deadline March 15'",
    task_type="extract",
    urgent=False,
)
print(f"Status: {result['status']} | Queue: {result['queue_size']}")

The Tradeoffs

Stacking optimizations adds complexity. Prompt caching requires a minimum of 4,096 cacheable tokens on Opus 4.7 and Haiku 4.5 (1,024 on Sonnet 4.6); a shorter system prompt is silently left uncached and bills at the full input rate.

Batch processing introduces latency — most batches complete within 1 hour, with a 24-hour maximum. Any request requiring real-time response cannot be batched.

The 5-minute cache TTL means you need consistent request flow to keep the cache warm. Sporadic usage patterns will see frequent cache misses, reducing the 90% cache discount to near zero effective savings.
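The warm-cache requirement can be quantified. Creating a cache entry costs a 25% premium over the normal input rate (standard pricing for the 5-minute ephemeral cache), while each read saves 90% of it, so a single hit inside the TTL already pays for the write. A sketch with this guide's Haiku rates:

```python
HAIKU_INPUT = 1.0                   # $/M tokens, standard input rate
CACHE_WRITE = 1.25 * HAIKU_INPUT    # 25% premium to create a cache entry
CACHE_READ = 0.10 * HAIKU_INPUT     # 90% discount on cache reads

# Extra cost of one cache write vs. a plain input pass, per M cached tokens
write_premium = CACHE_WRITE - HAIKU_INPUT       # 0.25
# Saving per subsequent read within the 5-minute TTL, per M cached tokens
saving_per_read = HAIKU_INPUT - CACHE_READ      # 0.90

break_even_reads = write_premium / saving_per_read
print(f"{break_even_reads:.2f}")    # 0.28: one hit inside the TTL already wins
```

The failure mode runs the other way: traffic so sporadic that every request re-writes the cache and never reads it turns the feature into a steady 25% surcharge on the cached span.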

Implementation Checklist

  1. Start with model routing alone (easiest, biggest impact)
  2. Add prompt caching for your most common system prompt
  3. Identify non-urgent request types eligible for batching
  4. Implement batch collection with a 30-minute submission window
  5. Monitor cost reduction per layer to verify each technique contributes
  6. Tune routing rules based on quality feedback
  7. Review combined savings monthly
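Step 4's submission window can be sketched with a timer that flushes the queue when the window expires or a size cap is hit, whichever comes first (`BatchWindow` is an illustrative name, not part of the SDK; wire the callback to submit_batch-style code):

```python
import threading

class BatchWindow:
    """Collect requests; flush after a time window or once a size cap is hit."""

    def __init__(self, submit, window_seconds=1800, max_size=1000):
        self._submit = submit           # callback that receives the queued list
        self._window = window_seconds   # 1800s = the 30-minute window
        self._max = max_size
        self._queue = []
        self._lock = threading.Lock()
        self._timer = None

    def add(self, request):
        flush_now = False
        with self._lock:
            self._queue.append(request)
            if self._timer is None:     # first request opens the window
                self._timer = threading.Timer(self._window, self.flush)
                self._timer.daemon = True
                self._timer.start()
            flush_now = len(self._queue) >= self._max
        if flush_now:
            self.flush()

    def flush(self):
        with self._lock:
            batch, self._queue = self._queue, []
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
        if batch:
            self._submit(batch)
```

The lock keeps add and the timer's flush from racing; cancelling the timer on flush means a size-cap flush also resets the window.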

Measuring Impact

Track the contribution of each optimization layer separately. Target savings: model routing 40-60%, caching 10-20% additional, batching 10-25% additional. Total target: 60-85% reduction from baseline. The best metric is cost per useful output — divide total monthly spend by the number of requests that produced acceptable quality results.
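The metric is a one-liner; what matters is feeding it an honest acceptance rate (a sketch; the figures below are illustrative, not from this guide's baseline):

```python
def cost_per_useful_output(monthly_spend, total_requests, acceptable_rate):
    """Dollars per request that produced an acceptable-quality result.

    acceptable_rate is the fraction of requests passing your quality bar;
    a cheap model that fails more often can lose to a pricier one here.
    """
    useful = total_requests * acceptable_rate
    return monthly_spend / useful

# e.g. $340/month, 30K requests, 92% acceptable after routing
print(round(cost_per_useful_output(340, 30_000, 0.92), 4))  # 0.0123
```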