How to Save 50% on Every Claude API Call

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS Join 50+ builders · More at zovo.one

Every Claude API call that does not need an immediate response is eligible for a 50% discount. Sonnet 4.6 drops from $3.00/$15.00 to $1.50/$7.50 per million tokens. Haiku 4.5 drops from $1.00/$5.00 to $0.50/$2.50. The mechanism is the Batch API, and the only trade-off is latency.

The Setup

You run a SaaS product with three backend services powered by Claude. Service A generates weekly email digests (runs Sunday night). Service B scores lead quality from form submissions (checked hourly by sales). Service C moderates user-generated content (needs results within 30 seconds).

Service A and Service B can tolerate batch latency. Service C cannot. By moving A and B to the Batch API, you save 50% on two-thirds of your Claude spend. Service C stays on real-time at full price.

Current monthly bill: $4,200. After migrating A and B to batch: $2,800. Monthly savings: $1,400.

The Math

Three services, Sonnet 4.6:

Service A (email digests, 20K input + 5K output, 10,000 requests/month):

Service B (lead scoring, 3K input + 500 output, 50,000 requests/month):

Service C (moderation, stays real-time):

Total savings: $1,087.50/month from services A and B

The Technique

The migration from real-time to batch requires three changes: request format, submission method, and result retrieval.

Step 1: Adapt your request format.

# BEFORE: Real-time API call
import anthropic
client = anthropic.Anthropic()

def score_lead_realtime(form_data: dict) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6-20250929",
        max_tokens=512,
        system="Score this lead 1-100 based on form data.",
        messages=[{"role": "user", "content": str(form_data)}]
    )
    return response.content[0].text

# AFTER: Batch-ready format
def prepare_lead_batch(leads: list[dict]) -> list[dict]:
    return [
        {
            "custom_id": f"lead-{lead['id']}",
            "params": {
                "model": "claude-sonnet-4-6-20250929",
                "max_tokens": 512,
                "system": "Score this lead 1-100 based on form data.",
                "messages": [
                    {"role": "user", "content": str(lead["form_data"])}
                ]
            }
        }
        for lead in leads
    ]

Step 2: Submit and retrieve.

import time

def run_batch_pipeline(leads: list[dict]) -> dict:
    """Full batch pipeline: submit, wait, process."""
    requests = prepare_lead_batch(leads)

    # Submit
    batch = client.batches.create(requests=requests)
    print(f"Batch {batch.id}: {len(requests)} leads submitted")

    # Wait
    while True:
        status = client.batches.retrieve(batch.id)
        if status.processing_status == "ended":
            break
        time.sleep(30)

    # Retrieve
    results = {}
    for result in client.batches.results(batch.id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
        else:
            results[result.custom_id] = f"ERROR: {result.result.error.message}"

    return results

Step 3: Schedule batch runs to match your SLA.

# Cron job: run lead scoring batch every hour
# crontab -e
# 0 * * * * /usr/bin/python3 /app/batch_lead_scoring.py >> /var/log/batch.log 2>&1

# batch_lead_scoring.py
python3 -c "
import json

# Collect leads from the last hour
# leads = fetch_unscored_leads(since='1 hour ago')
# results = run_batch_pipeline(leads)
# update_lead_scores(results)
print('Batch lead scoring: would process pending leads hourly')
print('Each batch saves 50% vs real-time API calls')
"

The key insight is that most backend services already have natural batching intervals. Email digests run on a schedule. Lead scoring can wait for hourly batches. Report generation is triggered by cron jobs. Map each service to its natural cadence and batch accordingly.

Services that seem real-time but actually have hidden latency tolerance:

The Tradeoffs

Moving to batch changes your system architecture:

Batch limits are worth knowing: each batch accepts up to 100,000 requests with a maximum size of 256 MB. Batches that are not completed within 24 hours expire automatically. Results remain available for 29 days after batch creation. For Opus 4.7, Opus 4.6, and Sonnet 4.6, batch mode supports extended output up to 300K tokens per response when the output-300k-2026-03-24 beta header is included. This is particularly useful for large code generation or document synthesis tasks where both the 50% discount and the extended output window deliver value simultaneously.

Implementation Checklist

  1. Audit all Claude API call sites in your codebase
  2. Classify each as real-time-required or batch-eligible
  3. For batch-eligible endpoints, determine the natural batching interval (hourly, daily, on-demand)
  4. Convert request format to include custom_id per request
  5. Implement batch submission and polling in a background worker
  6. Build result processing that maps custom_id back to your application data
  7. Monitor batch costs for one billing cycle and verify the 50% reduction

Measuring Impact

Track the migration’s financial impact: