Automatic vs Manual Cache Breakpoints Guide

Written by Michael Lip · Solo founder of Zovo

Claude’s prompt caching gives you up to 4 manual breakpoints per request. Each breakpoint defines a prefix boundary where cached content ends and fresh content begins. Placed correctly, breakpoints can turn a $5.00/MTok input cost into $0.50/MTok for the cached portion of the prompt. Placed badly, they waste the 1.25x write premium on content that never gets reused.

The Setup

You are building a legal document review system. Each request includes three layers of context: a 15,000-token system prompt with instructions and formatting rules, a 40,000-token reference document (the contract being reviewed), and a 5,000-token conversation history that changes with every turn.

Without breakpoints, the entire 60,000 tokens get processed at full price every time. With two well-placed breakpoints – one after the system prompt, one after the reference document – you cache 55,000 tokens and only pay full price for the 5,000 dynamic tokens.

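On Opus 4.7, a cache hit drops per-request input cost from $0.30 to about $0.053, an 82% saving on a single API call. The first request for each document costs more than that, because it pays the 1.25x premium to write the cache.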

The Math

Legal review system, Opus 4.7, 50 queries per document:

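Without caching: 50 requests × 60,000 tokens × $5.00/MTok = $15.00

With 2 manual breakpoints (system + document cached): 1 cache-write request at $0.369 (55,000 tokens × $6.25/MTok + 5,000 tokens × $5.00/MTok) + 49 cache-hit requests at $0.0525 each (55,000 tokens × $0.50/MTok + 5,000 tokens × $5.00/MTok) = $2.94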

Savings: $12.06 per document (80%)

At 100 documents per month: $1,206 saved.
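The savings figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the prices quoted in this article, that the first request for a document pays the cache write, and that the other 49 requests all arrive while the cache is still warm. The function names are illustrative only.

```python
# Prices in $ per million tokens, taken from the figures quoted in this article.
PRICE_INPUT = 5.00
PRICE_CACHE_WRITE = PRICE_INPUT * 1.25  # 1.25x write premium
PRICE_CACHE_READ = 0.50


def request_cost(cached_tokens: int, fresh_tokens: int, cache_hit: bool) -> float:
    """Input cost of one request, in dollars."""
    cached_price = PRICE_CACHE_READ if cache_hit else PRICE_CACHE_WRITE
    return (cached_tokens * cached_price + fresh_tokens * PRICE_INPUT) / 1_000_000


def document_cost(queries: int, cached_tokens: int, fresh_tokens: int) -> float:
    """Assumes one cache write followed by (queries - 1) cache hits."""
    first = request_cost(cached_tokens, fresh_tokens, cache_hit=False)
    rest = request_cost(cached_tokens, fresh_tokens, cache_hit=True)
    return first + (queries - 1) * rest


# No breakpoints: all 60,000 tokens are billed at the normal input price.
uncached = 50 * request_cost(0, 60_000, cache_hit=True)
# Two breakpoints: 55,000 tokens cached, 5,000 tokens fresh.
cached = document_cost(50, 55_000, 5_000)

print(f"Without caching: ${uncached:.2f}")
print(f"With caching:    ${cached:.2f}")
print(f"Savings:         ${uncached - cached:.2f} ({(uncached - cached) / uncached:.0%})")
```

If requests are spaced far enough apart that the cache expires between them, more of them pay the write price and the savings shrink accordingly.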

Compare single breakpoint vs two breakpoints:

The Technique

Manual breakpoints use cache_control annotations on content blocks. Each breakpoint caches the whole request prefix up to and including the annotated block, in the order tools, then system, then messages.

import anthropic

client = anthropic.Anthropic()

def review_document(
    instructions: str,    # 15K tokens, stable across all documents
    document: str,        # 40K tokens, stable per document
    conversation: list,   # 5K tokens, changes every turn
    query: str
) -> str:
    """Legal review with two cache breakpoints."""

    response = client.messages.create(
        model="claude-opus-4-7-20250415",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": instructions,
                "cache_control": {"type": "ephemeral"}  # Breakpoint 1
            }
        ],
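        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Reference document:\n{document}",
                        "cache_control": {"type": "ephemeral"}  # Breakpoint 2
                    }
                ]
            },
            # Prior turns come after breakpoint 2, so they are billed at the
            # normal input rate on every request.
            *conversation,
            {"role": "user", "content": query}
        ]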
    )

    usage = response.usage
    cached = usage.cache_read_input_tokens
    written = usage.cache_creation_input_tokens
    fresh = usage.input_tokens

    print(f"Cached: {cached}, Written: {written}, Fresh: {fresh}")
    return response.content[0].text
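If the conversation history should be cached as well, one option is to mark the last prior turn with its own breakpoint before sending the request. The helper below is a hypothetical sketch, not part of the Anthropic SDK. It assumes each turn is a dict with "role" and "content" keys, where "content" is either a string or a list of content blocks.

```python
import copy


def with_history_breakpoint(conversation: list) -> list:
    """Return a copy of `conversation` whose final turn ends with a cache breakpoint."""
    if not conversation:
        return []
    turns = copy.deepcopy(conversation)  # leave the caller's history untouched
    last = turns[-1]
    content = last["content"]
    if isinstance(content, str):
        # cache_control attaches to a content block, so wrap plain strings.
        content = [{"type": "text", "text": content}]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return turns


history = [
    {"role": "user", "content": "Summarize clause 4."},
    {"role": "assistant", "content": "Clause 4 covers termination."},
]
marked = with_history_breakpoint(history)
print(marked[-1]["content"])
```

Together with the system prompt and document breakpoints this uses three of the four available slots. The history prefix grows every turn, so each request writes the newest turns to the cache and reads the rest.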

Breakpoint placement rules:

  1. Most stable content first. System prompts rarely change and should be the first breakpoint. Documents change per session. Conversation history changes per turn.

  2. Respect minimum token thresholds. On Opus 4.7, each cached prefix must contain at least 4,096 tokens. A 2,000-token system prompt alone cannot be cached on Opus – but a 2,000-token system prompt combined with a 3,000-token document can be, under a single breakpoint.

  3. Four breakpoints maximum. You get 4 per request. Typical allocation:
    • Breakpoint 1: System prompt
    • Breakpoint 2: Reference data / documents
    • Breakpoint 3: Few-shot examples
    • Breakpoint 4: Conversation history prefix
  4. Order matters. Caching is prefix-based. Content after the last breakpoint is always processed at full input price. Structure your content from most-stable to least-stable.
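
The four rules above can be turned into a small planning check that runs before any request is sent. The sketch below is illustrative: the Layer class and plan_breakpoints function are hypothetical names, token counts are supplied by the caller, and the 4,096-token minimum is the Opus figure quoted in rule 2.

```python
from dataclasses import dataclass

MAX_BREAKPOINTS = 4  # per-request limit from rule 3


@dataclass
class Layer:
    name: str
    tokens: int
    cacheable: bool = True  # False for per-turn content such as the new query


def plan_breakpoints(layers: list, min_tokens: int) -> list:
    """Return the names of layers that should end with a cache_control marker.

    `layers` must already be ordered from most stable to least stable.
    """
    breakpoints = []
    prefix_tokens = 0
    for layer in layers:
        if not layer.cacheable:
            break  # everything from the first dynamic layer onward is uncached
        prefix_tokens += layer.tokens
        # A breakpoint only helps once the whole prefix reaches the minimum.
        if prefix_tokens >= min_tokens:
            breakpoints.append(layer.name)
    if len(breakpoints) > MAX_BREAKPOINTS:
        raise ValueError(
            f"{len(breakpoints)} breakpoints planned, limit is {MAX_BREAKPOINTS}; "
            "consolidate adjacent layers"
        )
    return breakpoints


# The small-prompt case from rule 2: only the combined prefix is cacheable.
small = [Layer("system", 2_000), Layer("document", 3_000), Layer("query", 50, False)]
print(plan_breakpoints(small, min_tokens=4_096))

# The legal review setup: both stable layers get a breakpoint.
legal = [Layer("system", 15_000), Layer("document", 40_000), Layer("history", 5_000, False)]
print(plan_breakpoints(legal, min_tokens=4_096))
```

Raising an error when more than four layers qualify is a design choice in this sketch. Merging the two layers that change together most often would be an equally reasonable policy.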

# Verify your breakpoint placement is working
# Check cache metrics from a test request
# JSON-encode the prompt first so quotes and newlines in the file cannot break the payload
SYSTEM_JSON=$(python3 -c 'import json, sys; print(json.dumps(sys.stdin.read()))' < system_prompt.txt)

curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6-20250929",
    "max_tokens": 100,
    "system": [{"type":"text","text":'"$SYSTEM_JSON"',"cache_control":{"type":"ephemeral"}}],
    "messages": [{"role":"user","content":"test"}]
  }' | python3 -c "
import json, sys
r = json.load(sys.stdin)
u = r['usage']
# If both cache numbers are 0, the prompt is probably below the model's minimum cacheable length
print(f'Cache write: {u.get(\"cache_creation_input_tokens\", 0)}')
print(f'Cache read: {u.get(\"cache_read_input_tokens\", 0)}')
print(f'Uncached: {u[\"input_tokens\"]}')
"

The Tradeoffs

Manual breakpoints require careful management:

Implementation Checklist

  1. Map your prompt content into layers ordered by stability (most stable first)
  2. Verify each layer exceeds the minimum token threshold for your model
  3. Place breakpoints at the end of each stable layer (max 4)
  4. Test with a single request and verify cache_creation_input_tokens matches expected sizes
  5. Run 10 sequential requests and confirm cache_read_input_tokens on requests 2-10
  6. Monitor breakpoint efficiency: if any breakpoint shows more writes than reads, consolidate it
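
Steps 4 and 5 of the checklist can be automated once the usage fields from each test request have been collected. The function below is a rough sketch under stated assumptions: check_cache_run is a hypothetical name, each usage record is a plain dict with the same field names the API returns, the cache is cold before the first request, and a 5% tolerance on the written size is an arbitrary choice.

```python
def check_cache_run(usages: list, expected_prefix_tokens: int, tolerance: float = 0.05) -> list:
    """Return a list of problems found in a sequence of usage records."""
    if not usages:
        return ["no requests recorded"]

    problems = []
    first, rest = usages[0], usages[1:]

    # Step 4: the first request should write roughly the whole cached prefix.
    written = first.get("cache_creation_input_tokens", 0)
    if abs(written - expected_prefix_tokens) > tolerance * expected_prefix_tokens:
        problems.append(
            f"request 1 wrote {written} tokens, expected about {expected_prefix_tokens}"
        )

    # Step 5: every later request should read from the cache.
    for number, usage in enumerate(rest, start=2):
        if usage.get("cache_read_input_tokens", 0) == 0:
            problems.append(f"request {number} read nothing from cache")

    return problems


# Simulated run: one write, one hit, then one miss.
run = [
    {"cache_creation_input_tokens": 55_000, "cache_read_input_tokens": 0, "input_tokens": 5_000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 55_000, "input_tokens": 5_000},
    {"cache_creation_input_tokens": 55_000, "cache_read_input_tokens": 0, "input_tokens": 5_000},
]
for problem in check_cache_run(run, expected_prefix_tokens=55_000):
    print(problem)
```

A miss on a later request usually means the prefix changed between calls or the cache expired before the next request arrived.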

Measuring Impact

Measure breakpoint effectiveness with per-breakpoint metrics:
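
The usage fields shown earlier are request-level totals, so one way to approximate per-breakpoint figures is to group requests under a label chosen by the caller, such as a document ID. The sketch below assumes that approach. CacheStats and its method names are hypothetical, usage records are plain dicts, and the multipliers come from the prices quoted in this article (1.25x for writes, $0.50 against $5.00 for reads).

```python
from collections import defaultdict

WRITE_MULTIPLIER = 1.25  # cache write premium
READ_MULTIPLIER = 0.10   # $0.50/MTok cached read vs $5.00/MTok input


class CacheStats:
    """Accumulates token counts per label and derives simple efficiency figures."""

    def __init__(self):
        self._totals = defaultdict(lambda: {"read": 0, "written": 0, "fresh": 0})

    def record(self, label: str, usage: dict) -> None:
        totals = self._totals[label]
        totals["read"] += usage.get("cache_read_input_tokens", 0)
        totals["written"] += usage.get("cache_creation_input_tokens", 0)
        totals["fresh"] += usage.get("input_tokens", 0)

    def report(self, label: str) -> dict:
        t = self._totals[label]
        total = t["read"] + t["written"] + t["fresh"]
        if total == 0:
            return {"hit_ratio": 0.0, "cost_ratio": 1.0, "more_writes_than_reads": False}
        billed = (
            t["read"] * READ_MULTIPLIER
            + t["written"] * WRITE_MULTIPLIER
            + t["fresh"]
        )
        return {
            "hit_ratio": t["read"] / total,   # share of input tokens served from cache
            "cost_ratio": billed / total,     # 1.0 means no better than uncached
            "more_writes_than_reads": t["written"] > t["read"],
        }


stats = CacheStats()
stats.record("contract-001", {"cache_creation_input_tokens": 55_000, "input_tokens": 5_000})
stats.record("contract-001", {"cache_read_input_tokens": 55_000, "input_tokens": 5_000})
print(stats.report("contract-001"))
```

A cost_ratio above 1.0, or more_writes_than_reads staying true over many requests, suggests the breakpoint is paying the write premium without enough reuse.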