Three services, Sonnet 4.6: Service A (email digests, 20K input + 5K output, 10,000 requests/month): Standard: (200M x $3/MTok) + (50M x $15/MTok) = $600 + $750 = $1,350 Batch: $300 + $375 = $675 Service B (lead scoring, 3K input + 500 output, 50,000 requests/month): Standard: (150M x ...

Moving to batch changes your system architecture: Error handling shifts from synchronous to asynchronous. Real-time errors surface immediately. Batch errors appear in results, requiring a separate error-processing pipeline. Monitoring becomes more complex. You need to track batch submission...

How to Save 50% on Every Claude API (2026)

Last updated: April 19, 2026

Every Claude API call that does not need an immediate response is eligible for a 50% discount. Sonnet 4.6 drops from $3.00/$15.00 to $1.50/$7.50 per million tokens. Haiku 4.5 drops from $1.00/$5.00 to $0.50/$2.50. The mechanism is the Batch API, and the only trade-off is latency.

The Setup

You run a SaaS product with three backend services powered by Claude. Service A generates weekly email digests (runs Sunday night). Service B scores lead quality from form submissions (checked hourly by sales). Service C moderates user-generated content (needs results within 30 seconds).

Service A and Service B can tolerate batch latency. Service C cannot. By moving A and B to the Batch API, you save 50% on two-thirds of your Claude spend. Service C stays on real-time at full price.

Current monthly bill: $4,200. After migrating A and B to batch: $2,800. Monthly savings: $1,400.

The Math

Three services, Sonnet 4.6:

Service A (email digests, 20K input + 5K output, 10,000 requests/month):

Standard: (200M x $3/MTok) + (50M x $15/MTok) = $600 + $750 = $1,350
Batch: $300 + $375 = $675

Service B (lead scoring, 3K input + 500 output, 50,000 requests/month):

Standard: (150M x $3/MTok) + (25M x $15/MTok) = $450 + $375 = $825
Batch: $225 + $187.50 = $412.50

Service C (moderation, stays real-time):

$2,025/month (unchanged)

Total savings: $1,087.50/month from services A and B

The Technique

The migration from real-time to batch requires three changes: request format, submission method, and result retrieval.

Step 1: Adapt your request format.

# BEFORE: Real-time API call
import anthropic
client = anthropic.Anthropic()
def score_lead_realtime(form_data: dict) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=512,
        system="Score this lead 1-100 based on form data.",
        messages=[{"role": "user", "content": str(form_data)}]
    )
    return response.content[0].text
# AFTER: Batch-ready format
def prepare_lead_batch(leads: list[dict]) -> list[dict]:
    return [
        {
            "custom_id": f"lead-{lead['id']}",
            "params": {
                "model": "claude-sonnet-4-6-20250929",
                "max_tokens": 512,
                "system": "Score this lead 1-100 based on form data.",
                "messages": [
                    {"role": "user", "content": str(lead["form_data"])}
                ]
            }
        }
        for lead in leads
    ]

Step 2: Submit and retrieve.

import time
def run_batch_pipeline(leads: list[dict]) -> dict:
    """Full batch pipeline: submit, wait, process."""
    requests = prepare_lead_batch(leads)
    # Submit
    batch = client.batches.create(requests=requests)
    print(f"Batch {batch.id}: {len(requests)} leads submitted")
    # Wait
    while True:
        status = client.batches.retrieve(batch.id)
        if status.processing_status == "ended":
            break
        time.sleep(30)
    # Retrieve
    results = {}
    for result in client.batches.results(batch.id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
        else:
            results[result.custom_id] = f"ERROR: {result.result.error.message}"
    return results

Step 3: Schedule batch runs to match your SLA.

# Cron job: run lead scoring batch every hour
# crontab -e
# 0 * * * * /usr/bin/python3 /app/batch_lead_scoring.py >> /var/log/batch.log 2>&1
# batch_lead_scoring.py
python3 -c "
import json
# Collect leads from the last hour
# leads = fetch_unscored_leads(since='1 hour ago')
# results = run_batch_pipeline(leads)
# update_lead_scores(results)
print('Batch lead scoring: would process pending leads hourly')
print('Each batch saves 50% vs real-time API calls')
"

The key insight is that most backend services already have natural batching intervals. Email digests run on a schedule. Lead scoring can wait for hourly batches. Report generation is triggered by cron jobs. Map each service to its natural cadence and batch accordingly.

Services that seem real-time but actually have hidden latency tolerance:

Webhook processing: Many webhooks retry, giving you minutes of tolerance
Queue workers: Already async – batch the queue contents periodically
Analytics pipelines: Usually tolerant of hours of delay
Search indexing: Near-real-time is fine; batch every 15 minutes

The Tradeoffs

Moving to batch changes your system architecture:

Error handling shifts from synchronous to asynchronous. Real-time errors surface immediately. Batch errors appear in results, requiring a separate error-processing pipeline.
Monitoring becomes more complex. You need to track batch submission rates, processing times, and success rates as separate metrics from your application health checks.
Debugging is harder. A bad prompt in a batch produces 10,000 bad results, not one. Test prompts thoroughly in real-time mode before batch migration.
No streaming. If your application shows progressive output to users, batch cannot replicate that experience.

Batch limits are worth knowing: each batch accepts up to 100,000 requests with a maximum size of 256 MB. Batches that are not completed within 24 hours expire automatically. Results remain available for 29 days after batch creation. For Opus 4.7, Opus 4.6, and Sonnet 4.6, batch mode supports extended output up to 300K tokens per response when the output-300k-2026-03-24 beta header is included. This is particularly useful for large code generation or document synthesis tasks where both the 50% discount and the extended output window deliver value simultaneously.

Implementation Checklist

Audit all Claude API call sites in your codebase
Classify each as real-time-required or batch-eligible
For batch-eligible endpoints, determine the natural batching interval (hourly, daily, on-demand)
Convert request format to include custom_id per request
Implement batch submission and polling in a background worker
Build result processing that maps custom_id back to your application data
Monitor batch costs for one billing cycle and verify the 50% reduction

Measuring Impact

Track the migration’s financial impact:

Before/after monthly spend: Compare Anthropic invoices from the month before and after batch migration
Batch coverage ratio: Percentage of total API requests routed through batch. Higher is better (target: 50-70% for most SaaS products).
Per-request cost by mode: Should show exactly 2x difference between real-time and batch for the same model
Set up a weekly cost report breaking down spend by real-time vs batch to prevent regression

Which model? → Take the 5-question quiz in our Model Selector.

Try it: Estimate your monthly spend with our Cost Calculator.