How to Save 50% on Every Claude API Call
Every Claude API call that does not need an immediate response is eligible for a 50% discount. Sonnet 4.6 drops from $3.00/$15.00 to $1.50/$7.50 per million tokens. Haiku 4.5 drops from $1.00/$5.00 to $0.50/$2.50. The mechanism is the Batch API, and the only trade-off is latency.
The Setup
You run a SaaS product with three backend services powered by Claude. Service A generates weekly email digests (runs Sunday night). Service B scores lead quality from form submissions (checked hourly by sales). Service C moderates user-generated content (needs results within 30 seconds).
Service A and Service B can tolerate batch latency. Service C cannot. By moving A and B to the Batch API, you save 50% on two-thirds of your Claude spend. Service C stays on real-time at full price.
Current monthly bill: $4,200. After migrating A and B to batch: $2,800. Monthly savings: $1,400.
The Math
Three services, Sonnet 4.6:
Service A (email digests, 20K input + 5K output, 10,000 requests/month):
- Standard: (200M x $3/MTok) + (50M x $15/MTok) = $600 + $750 = $1,350
- Batch: $300 + $375 = $675
Service B (lead scoring, 3K input + 500 output, 50,000 requests/month):
- Standard: (150M x $3/MTok) + (25M x $15/MTok) = $450 + $375 = $825
- Batch: $225 + $187.50 = $412.50
Service C (moderation, stays real-time):
- $2,025/month (unchanged)
Total savings: $1,087.50/month from services A and B
The Technique
The migration from real-time to batch requires three changes: request format, submission method, and result retrieval.
Step 1: Adapt your request format.
# BEFORE: Real-time API call
import anthropic
client = anthropic.Anthropic()
def score_lead_realtime(form_data: dict) -> str:
response = client.messages.create(
model="claude-sonnet-4-6-20250929",
max_tokens=512,
system="Score this lead 1-100 based on form data.",
messages=[{"role": "user", "content": str(form_data)}]
)
return response.content[0].text
# AFTER: Batch-ready format
def prepare_lead_batch(leads: list[dict]) -> list[dict]:
return [
{
"custom_id": f"lead-{lead['id']}",
"params": {
"model": "claude-sonnet-4-6-20250929",
"max_tokens": 512,
"system": "Score this lead 1-100 based on form data.",
"messages": [
{"role": "user", "content": str(lead["form_data"])}
]
}
}
for lead in leads
]
Step 2: Submit and retrieve.
import time
def run_batch_pipeline(leads: list[dict]) -> dict:
"""Full batch pipeline: submit, wait, process."""
requests = prepare_lead_batch(leads)
# Submit
batch = client.batches.create(requests=requests)
print(f"Batch {batch.id}: {len(requests)} leads submitted")
# Wait
while True:
status = client.batches.retrieve(batch.id)
if status.processing_status == "ended":
break
time.sleep(30)
# Retrieve
results = {}
for result in client.batches.results(batch.id):
if result.result.type == "succeeded":
results[result.custom_id] = result.result.message.content[0].text
else:
results[result.custom_id] = f"ERROR: {result.result.error.message}"
return results
Step 3: Schedule batch runs to match your SLA.
# Cron job: run lead scoring batch every hour
# crontab -e
# 0 * * * * /usr/bin/python3 /app/batch_lead_scoring.py >> /var/log/batch.log 2>&1
# batch_lead_scoring.py
python3 -c "
import json
# Collect leads from the last hour
# leads = fetch_unscored_leads(since='1 hour ago')
# results = run_batch_pipeline(leads)
# update_lead_scores(results)
print('Batch lead scoring: would process pending leads hourly')
print('Each batch saves 50% vs real-time API calls')
"
The key insight is that most backend services already have natural batching intervals. Email digests run on a schedule. Lead scoring can wait for hourly batches. Report generation is triggered by cron jobs. Map each service to its natural cadence and batch accordingly.
Services that seem real-time but actually have hidden latency tolerance:
- Webhook processing: Many webhooks retry, giving you minutes of tolerance
- Queue workers: Already async – batch the queue contents periodically
- Analytics pipelines: Usually tolerant of hours of delay
- Search indexing: Near-real-time is fine; batch every 15 minutes
The Tradeoffs
Moving to batch changes your system architecture:
- Error handling shifts from synchronous to asynchronous. Real-time errors surface immediately. Batch errors appear in results, requiring a separate error-processing pipeline.
- Monitoring becomes more complex. You need to track batch submission rates, processing times, and success rates as separate metrics from your application health checks.
- Debugging is harder. A bad prompt in a batch produces 10,000 bad results, not one. Test prompts thoroughly in real-time mode before batch migration.
- No streaming. If your application shows progressive output to users, batch cannot replicate that experience.
Batch limits are worth knowing: each batch accepts up to 100,000 requests with a maximum size of 256 MB. Batches that are not completed within 24 hours expire automatically. Results remain available for 29 days after batch creation. For Opus 4.7, Opus 4.6, and Sonnet 4.6, batch mode supports extended output up to 300K tokens per response when the output-300k-2026-03-24 beta header is included. This is particularly useful for large code generation or document synthesis tasks where both the 50% discount and the extended output window deliver value simultaneously.
Implementation Checklist
- Audit all Claude API call sites in your codebase
- Classify each as real-time-required or batch-eligible
- For batch-eligible endpoints, determine the natural batching interval (hourly, daily, on-demand)
- Convert request format to include
custom_idper request - Implement batch submission and polling in a background worker
- Build result processing that maps
custom_idback to your application data - Monitor batch costs for one billing cycle and verify the 50% reduction
Measuring Impact
Track the migration’s financial impact:
- Before/after monthly spend: Compare Anthropic invoices from the month before and after batch migration
- Batch coverage ratio: Percentage of total API requests routed through batch. Higher is better (target: 50-70% for most SaaS products).
- Per-request cost by mode: Should show exactly 2x difference between real-time and batch for the same model
- Set up a weekly cost report breaking down spend by real-time vs batch to prevent regression