When would you use Prompt caching?

Multi-turn Conversational Agents: A customer support chatbot maintains a large system prompt with product documentation across dozens of user turns. Without caching, the full prompt is re-billed on every message. With caching, the static documentation is written once and read cheaply on each subsequent turn, making long conversations economically viable.

When would you use Prompt caching?

Large Document Processing: A legal team needs to run multiple queries against a 50,000-token contract document — checking clauses, extracting dates, identifying obligations. Caching the document means it is ingested once and queried repeatedly at 10% of the input token cost per query.

When would you use Prompt caching?

AI-Assisted Coding and Repository Q&A: A developer tool loads an extensive CLAUDE.md repository guide, API documentation, and coding style rules into the system prompt. By caching this context, every code completion or architecture question reads from cache rather than re-ingesting the full codebase on each keystroke.

← ContentsClaude API · advanced

Prompt caching

Prompt caching is an API-level optimization that stores the computed internal state (Key-Value cache) of a prompt prefix on Anthropic's infrastructure. When a subsequent API request begins with that exact same prefix, Claude retrieves the pre-computed state rather than reprocessing the text from scratch. This eliminates redundant computation for static content like system prompts, large documents, tool definitions, and few-shot examples. Developers enable caching by placing a `cache_control` parameter on content blocks in their API requests. Everything from the start of the payload up to and including the marked block gets cached. By default, the cache persists for 5 minutes and refreshes automatically on each cache hit at no additional cost. A 1-hour cache duration is also available at a higher write price. The primary benefits are cost reduction (cached reads cost 10% of base input token price) and latency reduction (less data to process means faster time-to-first-token). This makes long-context workflows — like document analysis, multi-turn conversations, and agentic loops — significantly cheaper and faster to run.

🎧 Listen to this as a podcast episode

When you’d use it

◆Multi-turn Conversational Agents — A customer support chatbot maintains a large system prompt with product documentation across dozens of user turns. Without caching, the full prompt is re-billed on every message. With caching, the static documentation is written once and read cheaply on each subsequent turn, making long conversations economically viable.
◆Large Document Processing — A legal team needs to run multiple queries against a 50,000-token contract document — checking clauses, extracting dates, identifying obligations. Caching the document means it is ingested once and queried repeatedly at 10% of the input token cost per query.
◆AI-Assisted Coding and Repository Q&A — A developer tool loads an extensive CLAUDE.md repository guide, API documentation, and coding style rules into the system prompt. By caching this context, every code completion or architecture question reads from cache rather than re-ingesting the full codebase on each keystroke.
◆Few-Shot Learning with Extensive Examples — A specialized domain classifier includes 20+ high-quality labeled examples in the prompt to maximize accuracy. Caching these examples means the model benefits from rich in-context learning without paying full input token costs on every inference call.
◆Agentic Tool-Use Loops — An autonomous agent cycles through dozens of tool calls (web search, code execution, file read/write) in a single run. By caching the complex JSON schemas for all tool definitions and system instructions at the start, each iteration of the thought-action loop avoids re-reading the agent's operational parameters.

What changed recently

◆2025-07 — Automatic caching launched for the Messages API, allowing a top-level cache_control field that automatically manages cache breakpoints as conversations grow, without requiring manual placement on individual content blocks.
◆2025-07 — 1-hour cache TTL option introduced as an alternative to the default 5-minute TTL, billed at 2x base input token price for writes (vs 1.25x for 5-minute writes).
◆2025-07 — Automatic caching made available on Azure AI Foundry (preview). AWS Bedrock and Google Vertex AI support explicit cache_control on content blocks but not automatic top-level caching.
◆2025-05 — Claude Opus 4 and Claude Opus 4.1 added to supported models list.

This is the short version

The full chapter has three worked examples, the common pitfalls, and the workflow that makes it pay — plus the other 84 features, kept current.

Get Claude Master — $97 →