When to Use Claude Batch vs Real-Time API
The Claude Batch API saves 50% on every token but takes up to 1 hour to return results. The real-time API costs twice as much but responds in seconds. Choosing wrong either wastes $562.50/week on unnecessary premium pricing or delays time-sensitive responses by an hour. Here is how to decide.
The Setup
You manage three Claude-powered features: a live chat assistant (needs sub-second responses), a nightly content pipeline (generates 500 articles), and a code review system (reviews PRs within 2 hours of submission).
The chat assistant must use real-time API – there is no alternative. The content pipeline is a clear batch candidate. The code review sits in the gray zone: 2 hours of acceptable latency versus 1 hour of batch processing time.
Current spend: $3,200/month across all three. Migrating the right workloads to batch saves $1,100/month without degrading any user experience.
The Math
Three workloads, Sonnet 4.6:
Live chat (real-time only):
- 200K requests/month x 3K input + 1K output avg
- Input: 600M x $3.00/MTok = $1,800
- Output: 200M x $15.00/MTok = $3,000
- Total: $4,800/month (cannot batch)
Content pipeline (batch candidate):
- 15,000 articles/month x 5K input + 3K output
- Standard: $225 input + $675 output = $900/month
- Batch: $112.50 + $337.50 = $450/month
- Savings: $450/month
Code review (batch candidate):
- 3,000 PRs/month x 20K input + 5K output
- Standard: $180 input + $225 output = $405/month
- Batch: $90 + $112.50 = $202.50/month
- Savings: $202.50/month
Total monthly savings from batch migration: $652.50
The Technique
Use this decision matrix to classify each workload:
def should_use_batch(
acceptable_latency_minutes: int,
requests_per_day: int,
avg_input_tokens: int,
avg_output_tokens: int,
model: str = "claude-sonnet-4-6-20250929"
) -> dict:
"""Determine whether to use batch or real-time API."""
# Pricing lookup (standard vs batch)
prices = {
"claude-opus-4-7-20250415": {
"std_in": 5.00, "std_out": 25.00,
"batch_in": 2.50, "batch_out": 12.50
},
"claude-sonnet-4-6-20250929": {
"std_in": 3.00, "std_out": 15.00,
"batch_in": 1.50, "batch_out": 7.50
},
"claude-haiku-4-5-20251001": {
"std_in": 1.00, "std_out": 5.00,
"batch_in": 0.50, "batch_out": 2.50
}
}
p = prices[model]
daily_tokens_in = requests_per_day * avg_input_tokens
daily_tokens_out = requests_per_day * avg_output_tokens
std_daily = (daily_tokens_in * p["std_in"] +
daily_tokens_out * p["std_out"]) / 1e6
batch_daily = (daily_tokens_in * p["batch_in"] +
daily_tokens_out * p["batch_out"]) / 1e6
savings_daily = std_daily - batch_daily
savings_monthly = savings_daily * 30
# Decision logic
can_batch = acceptable_latency_minutes >= 60
worth_batching = savings_monthly > 10 # Minimum $10/mo savings
recommendation = "BATCH" if (can_batch and worth_batching) else "REAL-TIME"
return {
"recommendation": recommendation,
"real_time_monthly": f"${std_daily * 30:.2f}",
"batch_monthly": f"${batch_daily * 30:.2f}",
"monthly_savings": f"${savings_monthly:.2f}",
"reason": (
f"Latency allows batch ({acceptable_latency_minutes}min > 60min) "
f"and saves ${savings_monthly:.2f}/month"
if recommendation == "BATCH"
else f"Latency too tight ({acceptable_latency_minutes}min)"
if not can_batch
else f"Savings too small (${savings_monthly:.2f}/month)"
)
}
# Evaluate three workloads
workloads = [
{"name": "Live chat", "latency": 1, "rpd": 6667, "inp": 3000, "out": 1000},
{"name": "Content gen", "latency": 1440, "rpd": 500, "inp": 5000, "out": 3000},
{"name": "Code review", "latency": 120, "rpd": 100, "inp": 20000, "out": 5000},
]
for w in workloads:
result = should_use_batch(w["latency"], w["rpd"], w["inp"], w["out"])
print(f"{w['name']}: {result['recommendation']} "
f"(saves {result['monthly_savings']}/mo)")
For workloads in the gray zone (latency between 30-120 minutes), consider a hybrid approach:
# Hybrid: use real-time for urgent requests, batch for the rest
python3 -c "
# Simulate priority-based routing
import json
requests = [json.loads(l) for l in open('daily_requests.jsonl')]
urgent = [r for r in requests if r.get('priority') == 'high']
normal = [r for r in requests if r.get('priority') != 'high']
pct_urgent = len(urgent) / len(requests) * 100
pct_normal = len(normal) / len(requests) * 100
print(f'Urgent (real-time): {len(urgent)} ({pct_urgent:.0f}%)')
print(f'Normal (batch): {len(normal)} ({pct_normal:.0f}%)')
print(f'Savings: {pct_normal * 0.5:.0f}% of total cost moved to 50% discount')
"
Typical routing rules:
- User-facing, interactive: Always real-time
- Background processing, reporting: Always batch
- Internal tools with flexible SLA: Batch during off-hours, real-time during business hours
- CI/CD integration: Batch if PR review SLA exceeds 1 hour, real-time otherwise
The Tradeoffs
Batch migration introduces operational complexity:
- Queue management: Batch requests are fire-and-forget. You need polling or webhook infrastructure to handle completion.
- Error handling changes: Real-time errors are synchronous. Batch errors appear in results after processing, requiring different retry patterns.
- Capacity planning: Batches compete for the same API capacity. Submitting 100K requests during peak hours may extend processing time beyond the typical 1 hour.
- Testing difficulty: Developing against the batch API is slower because every test requires polling. Use real-time API for development and batch for production.
Implementation Checklist
- List all Claude API workloads with their latency requirements
- Run the decision function above for each workload
- Migrate clear batch candidates first (latency > 60 minutes)
- Build polling infrastructure for batch result retrieval
- Implement priority-based routing for gray-zone workloads
- Monitor batch processing times for 2 weeks before migrating additional workloads
- Set up cost tracking per workload to verify 50% savings
Measuring Impact
Track per-workload metrics after migration:
- Cost per request by mode: Real-time vs batch for the same workload type. Should show exactly 50% reduction.
- Batch completion time p50/p95: Median and 95th percentile processing time. Alert if p95 exceeds 2 hours.
- Migration coverage: Percentage of total API spend routed through batch. Target: 40-60% for a typical mixed workload.
- Review monthly billing to confirm projected savings are materializing