RAG vs Context Stuffing: Claude Cost Analysis
Stuffing 200K tokens of documentation into every Claude Opus 4.7 request costs $1.00 per query. A RAG pipeline that retrieves only the relevant 20K tokens costs $0.10 per query — 90% less. At 1,000 queries per day, that is the difference between $30,000/month and $3,000/month.
The Setup
Context stuffing means loading an entire document, codebase, or knowledge base into Claude’s context window with every request. It is simple to implement and works well because Claude can find relevant information anywhere in its context. But it is expensive because you pay for every token, whether Claude needs it or not.
RAG (Retrieval-Augmented Generation) retrieves only the relevant sections before sending them to Claude. It adds infrastructure complexity but cuts context tokens by 80-95%, delivering massive cost savings.
This guide compares both approaches head-to-head with real cost calculations and helps you decide which is right for your workload.
The Math
Scenario: Q&A system over a 200K-token knowledge base, 1,000 queries/day
Context Stuffing (200K tokens per request, Opus 4.7):
- Input: 200K * $5.00/MTok = $1.00/query
- Output (avg 500 tokens): 500 * $25/MTok = $0.0125/query
- Per query: $1.0125
- Daily: $1,012.50
- Monthly: $30,375
RAG Pipeline (20K relevant tokens per request, Opus 4.7):
- Embedding cost: ~$0.001/query (negligible)
- Vector search: ~$0.0005/query (negligible)
- Input: 20K * $5.00/MTok = $0.10/query
- Output (avg 500 tokens): $0.0125/query
- Per query: $0.1135
- Daily: $113.50
- Monthly: $3,405
Savings: $26,970/month (89%)
Context Stuffing with Caching (best case):
- Cache write: 200K * $6.25/MTok = $1.25 (one-time per 5 minutes)
- Cache reads: 200K * $0.50/MTok = $0.10/query
- Per query (after first): $0.1125
- Monthly (1K/day, steady traffic): $3,375
Cached context stuffing matches RAG pricing — but only with consistent traffic within the 5-minute cache TTL.
The Technique
Option 1: Simple RAG Pipeline
import anthropic
import numpy as np
from typing import list
client = anthropic.Anthropic()
class SimpleRAG:
def __init__(self, chunks: list, model: str = "claude-sonnet-4-6"):
self.chunks = chunks
self.model = model
# In production, use a proper embedding model and vector DB
self.chunk_keywords = [set(c.lower().split()) for c in chunks]
def retrieve(self, query: str, top_k: int = 3) -> list:
"""Simple keyword-based retrieval (replace with embeddings in production)."""
query_words = set(query.lower().split())
scores = []
for i, keywords in enumerate(self.chunk_keywords):
overlap = len(query_words & keywords)
scores.append((overlap, i))
scores.sort(reverse=True)
return [self.chunks[i] for _, i in scores[:top_k]]
def query(self, question: str) -> dict:
"""Retrieve relevant context and query Claude."""
relevant_chunks = self.retrieve(question)
context = "\n---\n".join(relevant_chunks)
response = client.messages.create(
model=self.model,
max_tokens=1024,
system="Answer based on the provided context. If the context does not contain the answer, say so.",
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}",
}],
)
input_cost = response.usage.input_tokens * 3.0 / 1_000_000 # Sonnet rate
output_cost = response.usage.output_tokens * 15.0 / 1_000_000
return {
"answer": response.content[0].text,
"chunks_retrieved": len(relevant_chunks),
"input_tokens": response.usage.input_tokens,
"cost": f"${input_cost + output_cost:.4f}",
}
# Usage
knowledge_base = [
"Refund policy: Full refund within 30 days of purchase. Partial refund within 60 days.",
"Shipping: Standard 5-7 days, Express 2-3 days. Free shipping over $50.",
"Returns: Email [email protected] with order number. Include photos of damaged items.",
# ... hundreds more chunks
]
rag = SimpleRAG(knowledge_base)
result = rag.query("What is the refund policy?")
print(f"Answer: {result['answer']}")
print(f"Tokens: {result['input_tokens']} | Cost: {result['cost']}")
Option 2: Context Stuffing with Caching
def cached_context_query(
full_context: str,
question: str,
model: str = "claude-opus-4-7",
) -> dict:
"""Use full context with caching to reduce repeat-read costs."""
response = client.messages.create(
model=model,
max_tokens=1024,
system=[{
"type": "text",
"text": f"You are a knowledge base assistant. Answer questions using this context:\n\n{full_context}",
"cache_control": {"type": "ephemeral"},
}],
messages=[{"role": "user", "content": question}],
)
# Check cache performance
cache_read = getattr(response.usage, "cache_read_input_tokens", 0)
cache_write = getattr(response.usage, "cache_creation_input_tokens", 0)
return {
"answer": response.content[0].text,
"total_input_tokens": response.usage.input_tokens,
"cache_read_tokens": cache_read,
"cache_write_tokens": cache_write,
"was_cached": cache_read > 0,
}
Decision Framework
def recommend_approach(
knowledge_base_tokens: int,
queries_per_day: int,
queries_per_5min: float,
model: str = "claude-opus-4-7",
) -> dict:
"""Recommend RAG vs context stuffing based on your workload."""
rates = {
"claude-opus-4-7": {"input": 5.0, "cache_write": 6.25, "cache_read": 0.5},
"claude-sonnet-4-6": {"input": 3.0, "cache_write": 3.75, "cache_read": 0.3},
}
r = rates[model]
# Context stuffing cost (no cache)
stuff_cost = knowledge_base_tokens * r["input"] / 1_000_000 * queries_per_day * 30
# Context stuffing with cache
cache_writes_per_day = queries_per_day / max(queries_per_5min, 1) * (5/1440)
cache_write_cost = cache_writes_per_day * knowledge_base_tokens * r["cache_write"] / 1_000_000
cache_read_cost = queries_per_day * knowledge_base_tokens * r["cache_read"] / 1_000_000
cached_stuff_cost = (cache_write_cost + cache_read_cost) * 30
# RAG cost (assume 10% of KB retrieved per query)
rag_tokens = knowledge_base_tokens * 0.1
rag_cost = rag_tokens * r["input"] / 1_000_000 * queries_per_day * 30
cheapest = min(
("context_stuffing", stuff_cost),
("cached_stuffing", cached_stuff_cost),
("rag", rag_cost),
key=lambda x: x[1],
)
return {
"context_stuffing_monthly": f"${stuff_cost:.2f}",
"cached_stuffing_monthly": f"${cached_stuff_cost:.2f}",
"rag_monthly": f"${rag_cost:.2f}",
"recommendation": cheapest[0],
"cheapest_cost": f"${cheapest[1]:.2f}",
}
rec = recommend_approach(200_000, 1000, 10)
print(f"Recommendation: {rec['recommendation']} at {rec['cheapest_cost']}/month")
The Tradeoffs
RAG adds infrastructure complexity: embedding model, vector database, chunking logic, and retrieval tuning. For small knowledge bases (under 50K tokens), context stuffing with caching is simpler and similarly priced.
Context stuffing guarantees Claude sees all information. RAG may miss relevant context if the retrieval step fails to find it. For high-stakes applications (legal, medical), context stuffing or RAG with high recall settings is safer.
Cached context stuffing requires steady query traffic. If queries arrive less than once every 5 minutes, the cache expires and you pay full write cost repeatedly — worse than RAG.
Implementation Checklist
- Measure your knowledge base size in tokens
- Count your average daily query volume
- Calculate query frequency (queries per 5-minute window)
- Run the decision framework to compare costs
- If RAG wins: implement chunking and retrieval pipeline
- If cached stuffing wins: implement caching with ephemeral cache_control
- Monitor costs weekly and re-evaluate quarterly
Measuring Impact
Track cost per query across your chosen approach. For RAG: monitor retrieval recall (percentage of queries answered correctly) alongside cost. For cached stuffing: monitor cache hit rate (target above 90%). If cache hit rate drops below 70%, your traffic pattern may be too sparse for caching to help, and RAG becomes the better option. Compare monthly API spend before and after switching approaches.
Related Guides
- Why Is Claude Code Expensive — context size as the primary cost driver
- Claude Code Context Window Management Guide — managing context for cost efficiency
- Why Does Anthropic Limit Claude Code Context Window — why context limits exist