If you send the same system prompt or large context repeatedly (e.g., in a chatbot or pipeline), prompt caching can cut your input token costs by up to 90%.
How it works:
- Cache write (first request): Costs 1.25x the base input price (5-min TTL) or 2x (1-hour TTL)
- Cache read (subsequent requests): Costs only 0.1x the base input price
- The cache breaks even after just one read for 5-minute caching, or two reads for 1-hour caching
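Those break-even points follow directly from the multipliers. A quick sketch, with prices normalized so the base input price is 1.0:

```python
# Normalized costs: standard input = 1.0, per the multipliers above.
WRITE_5MIN, WRITE_1HR, READ = 1.25, 2.0, 0.10

def cached_cost(write_mult: float, reads: int) -> float:
    """One cache write followed by `reads` cache reads."""
    return write_mult + reads * READ

def uncached_cost(requests: int) -> float:
    """The same requests sent without caching."""
    return float(requests)

# 5-minute cache: already cheaper after a single read (2 requests total).
assert cached_cost(WRITE_5MIN, 1) < uncached_cost(2)  # 1.35 < 2.0
# 1-hour cache: still behind after one read, ahead after two.
assert cached_cost(WRITE_1HR, 1) > uncached_cost(2)   # 2.10 > 2.0
assert cached_cost(WRITE_1HR, 2) < uncached_cost(3)   # 2.20 < 3.0
```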
Example with Claude Sonnet 4 ($3/MTok base):
| Operation | Cost per MTok |
|---|---|
| Standard input | $3.00 |
| 5-min cache write | $3.75 |
| 1-hour cache write | $6.00 |
| Cache read | $0.30 |
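To put the table in dollar terms, here is a hypothetical workload (the 10K-token prompt size and 100-request volume are illustrative assumptions, not from Anthropic's docs): a shared prefix reused across 100 requests, all landing within the 5-minute window.

```python
BASE, WRITE_5MIN, READ = 3.00, 3.75, 0.30  # $/MTok, from the table above
prompt_mtok = 10_000 / 1_000_000           # hypothetical 10K-token shared prefix
requests = 100

without_cache = requests * prompt_mtok * BASE
with_cache = prompt_mtok * (WRITE_5MIN + (requests - 1) * READ)

print(f"without cache: ${without_cache:.4f}")  # $3.0000
print(f"with cache:    ${with_cache:.4f}")     # $0.3345
print(f"savings:       {1 - with_cache / without_cache:.1%}")
```

The savings approach the full 90% as the number of reads grows, since the one-time write premium is amortized away.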
Implementation:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal contract analyzer...",  # your long system prompt
            "cache_control": {"type": "ephemeral"},  # enables 5-min caching
        }
    ],
    messages=[{"role": "user", "content": "Analyze this contract..."}],
)
```
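For the 1-hour TTL, the request shape is the same with an added `ttl` field on `cache_control`. The sketch below only builds the parameter dict (no API call is made), assuming the `ttl` field format from Anthropic's current API:

```python
# A sketch of the 1-hour variant: same structure, plus "ttl": "1h".
# Pass the dict to client.messages.create(**params) to send it.
params = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a legal contract analyzer...",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour TTL
        }
    ],
    "messages": [{"role": "user", "content": "Analyze this contract..."}],
}

assert params["system"][0]["cache_control"]["ttl"] == "1h"
```

On the response, `response.usage` reports `cache_creation_input_tokens` on the first call and `cache_read_input_tokens` on subsequent calls, which is how you can confirm the cache is actually being hit.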
Best candidates for caching: system prompts, few-shot examples, reference documentation, and any large context that stays identical across requests (the cached prefix must match exactly). Combine with the Batch API (an additional 50% off) for maximum savings.
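Assuming the two discounts stack, as combining them implies, a batched cache read costs roughly 0.05x the base input price:

```python
BASE = 3.00             # $/MTok, Claude Sonnet 4 standard input
CACHE_READ_MULT = 0.10  # cache read = 0.1x base
BATCH_MULT = 0.50       # Batch API = 50% off

batched_cache_read = BASE * CACHE_READ_MULT * BATCH_MULT
print(f"${batched_cache_read:.2f}/MTok")  # $0.15/MTok, ~95% below the $3.00 base
```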