How to Reduce Claude API Token Usage by 50%
A production Claude Opus 4.7 pipeline processing 10,000 requests per day with a 2,000-token system prompt spends $100/day on system prompt tokens alone. Compressing that prompt to 500 tokens cuts daily input costs by $75 — saving $2,250/month on a single optimization.
The Setup
Token usage directly determines your Claude API bill. Every input token costs $5.00 per million on Opus 4.7, $3.00 on Sonnet 4.6, or $1.00 on Haiku 4.5. Every output token costs 5x the input rate.
Most developers write prompts for clarity, not efficiency. Verbose instructions, unused tool definitions, and bloated system prompts silently inflate every request. This guide covers five concrete techniques to cut token usage by 50% without reducing output quality.
Target: developers spending $500-$5,000/month on Claude API who have not optimized their prompts for token efficiency.
The Math
Typical unoptimized request on Opus 4.7:
- System prompt: 2,000 tokens
- Tool definitions: 1,500 tokens (3 tools)
- User message: 3,000 tokens
- Total input: 6,500 tokens -> $0.0325/request
After optimization:
- System prompt: 500 tokens (75% reduction)
- Tool definitions: 500 tokens (1 tool, only when needed)
- User message: 2,500 tokens (structured formatting)
- Total input: 3,500 tokens -> $0.0175/request
Savings per request: $0.015 (46%)
At 10,000 requests/day: $150/day saved -> $4,500/month
Output optimization (constrain max tokens, request concise responses):
- Before: average 1,500 output tokens -> $0.0375
- After: average 800 output tokens -> $0.020
- Additional savings: $0.0175/request -> $5,250/month
Combined savings: $9,750/month (both input and output)
The Technique
1. Compress System Prompts
# BEFORE: 2,000+ tokens
VERBOSE_SYSTEM = """
You are a helpful AI assistant specialized in analyzing customer support tickets.
Your role is to carefully read each ticket, understand the customer's issue,
and provide a classification that our support team can use to route the ticket
to the appropriate department. You should consider the following categories
when classifying tickets: billing issues, technical problems, account access,
feature requests, and general inquiries. Please be thorough in your analysis
and provide a confidence score along with your classification. It is important
that you respond in a structured format so that our automated systems can
parse your response correctly. Always include the category, confidence score,
and a brief explanation of why you chose that category.
"""
# AFTER: 500 tokens — same behavior, fewer tokens
LEAN_SYSTEM = """Classify support tickets into: billing, technical, access, feature_request, general.
Output format: {"category": "...", "confidence": 0.0-1.0, "reason": "one sentence"}
Output only valid JSON. No preamble."""
2. Remove Unused Tool Definitions
Each tool definition adds tokens to every request. The bash tool adds 245 tokens, the text editor adds 700 tokens, and computer use adds 735 tokens per call.
import anthropic
client = anthropic.Anthropic()
# BAD: Including tools you don't need
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
tools=[
{"name": "search_db", "description": "Search the database", "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}}},
{"name": "send_email", "description": "Send an email", "input_schema": {"type": "object", "properties": {"to": {"type": "string"}, "body": {"type": "string"}}}},
{"name": "update_record", "description": "Update a record", "input_schema": {"type": "object", "properties": {"id": {"type": "string"}, "data": {"type": "object"}}}},
],
messages=[{"role": "user", "content": "What is the capital of France?"}],
)
# Wasted ~1,000+ tokens on tool definitions for a simple question
# GOOD: Only include tools when the task needs them
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": "What is the capital of France?"}],
)
3. Structure User Messages for Token Efficiency
# BEFORE: Narrative style (~800 tokens)
verbose_prompt = """
I have a JSON file that contains customer data. Each customer has a name,
email, phone number, and subscription status. I need you to go through
the data and find all customers whose subscription has expired. For each
expired customer, I need their name and email address. The data is as follows:
[paste large JSON block here]
"""
# AFTER: Structured style (~400 tokens)
lean_prompt = """Extract expired subscriptions from this data.
Return: name, email for each.
Output: JSON array.
Data:
[paste JSON block]"""
4. Constrain Output Length
# Request concise output explicitly
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=200, # Hard cap prevents runaway output
system="Respond in under 100 words. No preamble. No summary.",
messages=[{"role": "user", "content": "Explain what a mutex does in concurrent programming."}],
)
5. Use Abbreviations in Structured Prompts
# BEFORE: 150 tokens
system = "When analyzing code, check for: security vulnerabilities, performance issues, maintainability concerns, error handling gaps, and documentation completeness."
# AFTER: 50 tokens
system = "Code analysis checklist: security, perf, maintainability, error handling, docs."
The Tradeoffs
Aggressive prompt compression can reduce output quality. If you compress too far, Claude may misinterpret instructions or produce less accurate results. Always A/B test compressed prompts against originals on 50+ representative inputs.
Setting max_tokens too low causes truncated responses. Start by measuring your actual output token distribution, then set max_tokens to the 95th percentile — this prevents waste without cutting off legitimate responses.
Removing tool definitions when tools might be needed causes failures. Only prune tools when you are certain the request type never requires them.
Implementation Checklist
- Measure current token usage per request type (input and output separately)
- Compress your top 3 system prompts by removing redundant instructions
- Audit tool definitions — remove any not used by the current request type
- Add max_tokens constraints based on actual output length distribution
- A/B test compressed prompts against originals for quality
- Deploy optimized prompts and monitor token usage reduction
- Repeat monthly as your prompt library evolves
Measuring Impact
Compare average tokens per request before and after optimization. Track four metrics: average input tokens per request (target: 40-60% reduction), average output tokens per request (target: 20-40% reduction), total monthly token consumption, and quality score on a held-out test set (target: less than 3% degradation). Use the Anthropic usage dashboard to verify token counts are dropping as expected.
Related Guides
- Claude Code Token Usage Optimization — comprehensive token optimization strategies
- Reduce Claude Code Hallucinations Save Tokens — write better prompts that use fewer tokens
- Claude Skill Token Usage Profiling — measure per-skill token consumption