Claude API Error 429 rate_limit_error Fix

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS · More at zovo.one

When you hit the Claude API rate limit, the API returns a 429 status code with a rate_limit_error type. This guide explains exactly why it happens and how to handle it in your code.

The Error

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Your account has hit a rate limit."
  },
  "request_id": "req_018EeWyXxfu5pfWkrYcMdjWG"
}
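To detect this error shape programmatically, here is a minimal sketch (the helper name is my own; in practice you would check the 429 status code first, this just matches the JSON body above):

```python
import json

def is_rate_limit_error(body: str) -> bool:
    """Return True if an error response body matches the
    rate_limit_error shape shown above."""
    payload = json.loads(body)
    return (
        payload.get("type") == "error"
        and payload.get("error", {}).get("type") == "rate_limit_error"
    )
```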

Quick Fix

  1. Check the retry-after response header for how long to wait.
  2. Enable the SDK’s built-in retry mechanism (on by default with 2 retries).
  3. Monitor rate limit headers to throttle before hitting the limit.
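If you implement step 1 by hand rather than relying on the SDK, the delay logic might look like this (the function name and defaults are illustrative):

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to sleep before retry number `attempt` (0-indexed).
    Honor the server's retry-after header when present; otherwise
    use capped exponential backoff with jitter."""
    if retry_after is not None:
        return float(retry_after)
    # Jitter in [0.5, 1.0] x the exponential delay to avoid
    # many clients retrying in lockstep.
    return min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)
```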

What Causes This

The Claude API enforces three types of rate limits per model: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM).

Limits scale with your spend tier. At Tier 4, for example, Claude Opus 4.x allows 4,000 RPM, 2,000,000 ITPM, and 400,000 OTPM. The Opus 4.x rate limit is shared across Opus 4.6, 4.5, 4.1, and 4.

The API uses a token bucket algorithm: capacity is replenished continuously rather than reset at fixed intervals. You can also see 429 errors from acceleration limits when your organization's usage ramps up sharply.
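Because the server-side model is a continuously refilling bucket, you can mirror it client-side to smooth bursts before they ever reach the API. A minimal sketch (the class and its parameters are my own, not part of any SDK):

```python
import time

class TokenBucket:
    """Client-side throttle mirroring the API's model: capacity
    refills continuously at `rate` units per second, up to `capacity`."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1.0):
        """Spend `n` units if available; return False to signal 'wait'."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

You might run one bucket for requests (refilling at RPM/60 per second) and another for tokens.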

Full Solution

Python SDK with Built-in Retries

The Python SDK retries 429 errors automatically with exponential backoff (2 retries by default):

import anthropic

# Default: 2 retries with exponential backoff
client = anthropic.Anthropic()

# Increase retries for high-throughput workloads
client = anthropic.Anthropic(max_retries=5)

# Override retries per request
message = client.with_options(max_retries=5).messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)
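Once the SDK's retries are exhausted, the final anthropic.RateLimitError still surfaces to your code. One pattern is a generic outer wrapper; this sketch keeps the SDK out so the control flow stands alone (is_rate_limited and delay_for are hypothetical hooks you would wire to anthropic.RateLimitError and the retry-after header):

```python
import time

def call_with_retries(fn, is_rate_limited, delay_for, max_attempts=3):
    """Run `fn`, retrying only failures that `is_rate_limited`
    classifies as 429s, sleeping `delay_for(exc, attempt)` between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(delay_for(exc, attempt))
```

With the Python SDK you might pass is_rate_limited=lambda e: isinstance(e, anthropic.RateLimitError).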

TypeScript SDK with Built-in Retries

import Anthropic from "@anthropic-ai/sdk";

// Default: 2 retries with exponential backoff
const client = new Anthropic();

// Increase retries
const client2 = new Anthropic({ maxRetries: 5 });

Monitor Rate Limit Headers

The API returns rate limit status in response headers. Check these before you hit the limit:

import anthropic

client = anthropic.Anthropic()
response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

# Check remaining capacity
remaining_requests = response.headers.get("anthropic-ratelimit-requests-remaining")
remaining_tokens = response.headers.get("anthropic-ratelimit-tokens-remaining")
reset_time = response.headers.get("anthropic-ratelimit-requests-reset")
retry_after = response.headers.get("retry-after")

print(f"Requests remaining: {remaining_requests}")
print(f"Tokens remaining: {remaining_tokens}")
print(f"Reset at: {reset_time}")
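Those header values can drive a simple pre-flight check, so you slow down before the API forces you to (the thresholds here are illustrative):

```python
def should_throttle(headers, min_requests=5, min_tokens=10_000):
    """Return True when remaining capacity, per the
    anthropic-ratelimit-*-remaining headers, is running low."""
    remaining_requests = int(headers.get("anthropic-ratelimit-requests-remaining", 0))
    remaining_tokens = int(headers.get("anthropic-ratelimit-tokens-remaining", 0))
    return remaining_requests < min_requests or remaining_tokens < min_tokens
```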

Rate Limit Headers Reference

| Header | Description |
| --- | --- |
| anthropic-ratelimit-requests-limit | Maximum requests per period |
| anthropic-ratelimit-requests-remaining | Requests left in the current window |
| anthropic-ratelimit-requests-reset | When the request limit resets |
| anthropic-ratelimit-tokens-limit | Maximum tokens per period |
| anthropic-ratelimit-tokens-remaining | Tokens left in the current window |
| anthropic-ratelimit-tokens-reset | When the token limit resets |
| retry-after | Seconds to wait before retrying |

Prevention

  1. Use prompt caching: Cache-read input tokens do NOT count towards ITPM limits. With an 80% cache hit rate and a 2M ITPM limit, your effective throughput reaches 10M tokens per minute.
  2. Use the Batch API: For non-time-sensitive workloads, the Message Batches API has separate, higher rate limits (500,000 batch requests in queue at Tier 4) and costs 50% less.
  3. Upgrade your spend tier: Higher tiers get higher rate limits. Tier 1 starts at $5 credit purchase; Tier 4 requires $400.
  4. Set workspace limits: Configure per-workspace limits to prevent one service from consuming your entire organization’s quota.
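The prompt-caching arithmetic in point 1 can be checked with a back-of-the-envelope helper (my own, not an official formula):

```python
def effective_itpm(itpm_limit, cache_hit_rate):
    """Cache-read input tokens don't count toward ITPM, so only the
    uncached fraction of traffic consumes the limit."""
    return itpm_limit / (1.0 - cache_hit_rate)
```

At a 2,000,000 ITPM limit and an 80% cache hit rate this gives roughly 10,000,000 effective input tokens per minute, matching the figure in point 1.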