Fix: Claude API Error 429 Rate Limit
The Error
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit reached. Please try again later."
  }
}
HTTP status code: 429 Too Many Requests. The response includes a retry-after header giving the number of seconds to wait before retrying.
Quick Fix
- Check the retry-after response header and wait the specified duration
- Enable SDK automatic retries (both SDKs retry 429s by default with 2 retries)
- Reduce request frequency or switch to the Message Batches API for bulk workloads
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  maxRetries: 5, // Default is 2
});
What Causes This
Anthropic enforces rate limits at multiple levels using a token bucket algorithm where capacity is continuously replenished rather than reset at fixed intervals.
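To make "continuously replenished" concrete, here is a minimal sketch of a token bucket. It illustrates the general algorithm only and is not Anthropic's implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for this example.

```python
import time


class TokenBucket:
    """Toy token bucket: capacity refills continuously at a fixed rate."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock          # injectable so the bucket can be tested without sleeping
        self.tokens = capacity      # the bucket starts full
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens in proportion to elapsed time, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now

    def try_consume(self, amount: float = 1.0) -> bool:
        """Deduct `amount` and return True if enough tokens are available."""
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False
```

With a capacity of 60 and a refill rate of 1 token per second, a burst of 60 calls drains the bucket, and one more call becomes possible roughly one second later rather than at the top of the next minute.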
Types of rate limits:
- Requests Per Minute (RPM) – a hard cap on the number of API calls per minute
- Input Tokens Per Minute (ITPM) – limit on input tokens processed per minute
- Output Tokens Per Minute (OTPM) – limit on output tokens generated per minute
Rate limits at Tier 4 (highest self-serve tier):
| Model | RPM | ITPM | OTPM |
|---|---|---|---|
| Opus 4.x | 4,000 | 2,000,000 | 400,000 |
| Sonnet 4.x | 4,000 | 2,000,000 | 400,000 |
| Haiku 4.5 | 4,000 | 4,000,000 | 800,000 |
Opus 4.x rate limits are shared across Opus 4.6, 4.5, 4.1, and 4. Sonnet 4.x rate limits are shared across Sonnet 4.6, 4.5, and 4.
Cache-aware ITPM: Only uncached input tokens count towards ITPM for most models. With an 80% cache hit rate and a 2M ITPM limit, effective throughput reaches 10M tokens per minute.
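The arithmetic behind that figure can be sketched as follows, assuming cached tokens do not count toward ITPM at all (which the text above says holds for most models, not all). The function name is hypothetical.

```python
def effective_input_tokens_per_minute(itpm_limit: int, cache_hit_rate: float) -> float:
    """Total input tokens per minute when only uncached tokens count toward ITPM.

    Assumption: cached tokens are entirely exempt from the limit, so the
    uncached share (1 - cache_hit_rate) of the total must fit within itpm_limit.
    """
    if not 0 <= cache_hit_rate < 1:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return itpm_limit / (1 - cache_hit_rate)


# 80% cache hits on a 2M ITPM limit: 2,000,000 / 0.2
print(round(effective_input_tokens_per_minute(2_000_000, 0.8)))
```

For the figures above, the uncached 20% of traffic consumes the whole 2M budget, so the total is 2,000,000 / 0.2 = 10,000,000 tokens per minute.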
You may also see 429 errors from acceleration limits – sharp increases in usage can trigger rate limiting even if you are within your tier’s steady-state limits.
Full Solution
Option 1: SDK Automatic Retry
Both official SDKs include built-in retry with exponential backoff. The default is 2 retries for connection errors, 408, 409, 429, and 500+ status codes.
// TypeScript SDK
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  maxRetries: 5, // Default is 2
});

# Python SDK
import anthropic

client = anthropic.Anthropic(
    max_retries=5,  # Default is 2
)

# Or per-request override:
client.with_options(max_retries=5).messages.create(...)
Option 2: Manual Retry with Backoff
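import Anthropic from "@anthropic-ai/sdk";

// Disable the SDK's built-in retries so they do not stack with the manual loop.
const client = new Anthropic({ maxRetries: 0 });

// Retry fn on 429 with exponential backoff plus up to 1 s of random jitter.
async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelay = 1000 // milliseconds
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (
        error instanceof Anthropic.RateLimitError &&
        attempt < maxRetries - 1
      ) {
        const delay = baseDelay * 2 ** attempt + Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay));
        continue;
      }
      throw error; // Not a rate limit error, or out of attempts
    }
  }
  // Only reached when maxRetries is 0 or negative.
  throw new Error("callWithBackoff: maxRetries must be at least 1");
}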
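The loop above uses a fixed exponential schedule and never looks at the retry-after header. One way to combine the two is sketched below in Python: prefer the server's hint when it is present and numeric, otherwise fall back to capped exponential backoff with jitter. The function name, the 60-second cap, and the injectable `jitter` are assumptions for illustration. With the Python SDK, the header value can typically be read from `error.response.headers.get("retry-after")` on a caught `anthropic.RateLimitError`, though that wiring is left out here.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    jitter=random.random,
) -> float:
    """Seconds to sleep before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Trust the server's hint when it parses as a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Unparseable header: fall through to exponential backoff.
    # Exponential growth capped at `cap`, plus up to one second of jitter.
    return min(cap, base * 2 ** attempt) + jitter()
```

The jitter spreads out retries from clients that were throttled at the same moment, so they do not all return in the same instant.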
Option 3: Use the Message Batches API
For workloads that can tolerate asynchronous processing, the Message Batches API offers 50% cost savings and separate rate limits. Most batches complete within 1 hour.
import anthropic

client = anthropic.Anthropic()

# Placeholder inputs for illustration; replace with your own prompts.
prompts = ["Summarize document A.", "Summarize document B."]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",  # Used to match results to requests
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
)

# Check batch status later
result = client.messages.batches.retrieve(batch.id)
A single batch can contain up to 100,000 requests or 256 MB, whichever is reached first. Results are available for 29 days after creation.
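Because batches finish asynchronously, the status check usually ends up in a polling loop. The sketch below assumes the Python SDK's batch object exposes `processing_status` (with `"ended"` as the terminal value) and that `client.messages.batches.results(...)` yields entries carrying `custom_id` and `result.type`; verify both against the SDK version you use. The helper names and the 60-second poll interval are hypothetical.

```python
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0, sleep=time.sleep):
    """Poll until the batch reports processing_status == "ended", then return it."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        sleep(poll_seconds)  # Batches can take up to an hour or more; poll sparingly.


def collect_result_types(client, batch_id: str) -> dict:
    """Map each custom_id to its result type (for example "succeeded" or "errored")."""
    return {
        entry.custom_id: entry.result.type
        for entry in client.messages.batches.results(batch_id)
    }
```

A caller would pass the `client` and `batch.id` from the example above, for instance `wait_for_batch(client, batch.id)` followed by `collect_result_types(client, batch.id)`.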
Option 4: Monitor Rate Limit Headers
The API returns these headers with every response:
- anthropic-ratelimit-requests-limit / anthropic-ratelimit-requests-remaining / anthropic-ratelimit-requests-reset
- anthropic-ratelimit-tokens-limit / anthropic-ratelimit-tokens-remaining / anthropic-ratelimit-tokens-reset
- anthropic-ratelimit-input-tokens-limit / anthropic-ratelimit-input-tokens-remaining / anthropic-ratelimit-input-tokens-reset
- anthropic-ratelimit-output-tokens-limit / anthropic-ratelimit-output-tokens-remaining / anthropic-ratelimit-output-tokens-reset
- retry-after
Monitor the *-remaining headers and throttle your requests before they reach zero.
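As a rough sketch of that idea, the helpers below take the response headers as a plain mapping and report when any budget drops below a threshold. The function names and the 10% threshold are assumptions. In the Python SDK the headers can typically be obtained via `client.messages.with_raw_response.create(...)`, whose result exposes `.headers`; note that a plain `dict` is case-sensitive, so the keys here must be lowercase.

```python
from typing import Mapping, Optional


def remaining_fraction(headers: Mapping[str, str], kind: str = "requests") -> Optional[float]:
    """Fraction of the `kind` budget still available, or None if it cannot be determined."""
    prefix = f"anthropic-ratelimit-{kind}"
    try:
        limit = float(headers[f"{prefix}-limit"])
        remaining = float(headers[f"{prefix}-remaining"])
    except (KeyError, ValueError):
        return None  # Header missing or not numeric.
    if limit <= 0:
        return None
    return remaining / limit


def should_slow_down(headers: Mapping[str, str], threshold: float = 0.1) -> bool:
    """True if any of the request, input-token, or output-token budgets is below `threshold`."""
    kinds = ("requests", "input-tokens", "output-tokens")
    fractions = [remaining_fraction(headers, kind) for kind in kinds]
    return any(f is not None and f < threshold for f in fractions)
```

When `should_slow_down` returns True, a client might pause new requests or lower its concurrency until the next response shows the budget recovering.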
Prevention
- Use prompt caching – for most models, cached input tokens do not count toward ITPM limits, which can raise effective throughput several times over
- Use the Message Batches API for non-urgent workloads (50% cost savings and separate rate limits)
- Implement client-side throttling to prevent bursts that trigger RPM limits
- Monitor rate limit headers and adjust throughput before hitting limits
- Avoid sharp usage increases which can trigger acceleration limits even within your tier
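Client-side throttling from the list above can be as simple as spacing calls evenly. The sketch below paces calls at a target requests-per-minute rate; the class name `RequestPacer` and its injectable `clock` and `sleep` are assumptions for illustration, and the class is not thread-safe as written.

```python
import time


class RequestPacer:
    """Spaces calls at least 60 / rpm seconds apart to smooth out bursts."""

    def __init__(self, rpm: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()  # The first call may proceed immediately.

    def wait(self) -> float:
        """Sleep until the next free slot and return the number of seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay:
            self.sleep(delay)
        # Schedule the following slot one interval after this one.
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return delay
```

Calling `pacer.wait()` immediately before each API request keeps the client at or below the chosen rate. Setting that rate somewhat below the tier's RPM limit leaves headroom for retries.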