Token counting & cost optimization

Token counting is a pre-flight utility in Anthropic's API that calculates how many input tokens a payload will consume before you send it to Claude for generation. You call a dedicated endpoint with the same message structure you plan to send (system prompt, conversation history, tool definitions, images, and PDFs) and receive the total input token count as a single integer. No generation occurs, and the call is free.

This matters because Claude's API pricing is token-based, and costs can grow quickly with large system prompts, long conversation histories, or complex tool schemas. By counting tokens ahead of time, you can enforce budget limits, decide which model to route a request to, detect context-window overflows before they cause errors, and iteratively trim prompts to hit a target size.

Token counting is supported across all active Claude models and all API usage tiers. It has its own rate limit (requests per minute), which does not share a bucket with your message-creation rate limit.

The count returned is an estimate: Anthropic may add a small number of tokens internally for system optimizations, but you are never billed for those additions. For precise billing figures, check the usage object returned after generation or your Claude Console usage reports.
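A minimal sketch of the pre-flight call described above, assuming the official `anthropic` Python SDK is installed and `ANTHROPIC_API_KEY` is set. The helper names (`make_sdk_counter`, `within_budget`) and the budget value are hypothetical and chosen for illustration; only `client.messages.count_tokens(...)` and its `input_tokens` field come from the SDK.

```python
from typing import Callable

# A counter takes a request payload (keyword arguments for the API call)
# and returns the number of input tokens that payload would consume.
TokenCounter = Callable[[dict], int]


def make_sdk_counter() -> TokenCounter:
    """Build a counter backed by the Anthropic Python SDK.

    Assumes the `anthropic` package is installed and an API key is configured.
    """
    import anthropic  # imported lazily so the rest of the sketch runs without it

    client = anthropic.Anthropic()

    def count(payload: dict) -> int:
        # The payload mirrors what you would send for generation:
        # model, messages, and optionally system and tools.
        # (max_tokens is not part of a counting request.)
        return client.messages.count_tokens(**payload).input_tokens

    return count


def within_budget(payload: dict, counter: TokenCounter, max_input_tokens: int) -> bool:
    """Return True if the payload fits a per-request input-token budget.

    The budget is a hypothetical application-level limit, not an API setting.
    """
    return counter(payload) <= max_input_tokens
```

In use, you might build a payload such as `{"model": "<model-id>", "system": "...", "messages": [...]}`, call `within_budget(payload, make_sdk_counter(), 50_000)`, and only proceed to generation when it returns `True`. Passing the counter in as a parameter keeps the budget logic testable without network access.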

When you’d use it

◆Budget guardrail before expensive generation — A batch-processing pipeline sends thousands of documents to Claude Opus each night. Before each request, the system counts tokens and aborts any payload that exceeds a per-request token budget, logging it for human review instead of silently running up a large bill.
◆Intelligent model routing by payload size — An API gateway receives user queries of wildly varying lengths. It counts tokens for each incoming payload and routes short queries (under 2,000 tokens) to Claude Haiku for speed and low cost, medium queries to Claude Sonnet, and only routes very large or complex payloads to Claude Opus.
◆Dynamic context-window pruning in chatbots — A customer-support chatbot accumulates conversation history. Before each turn, it counts the full history's tokens. If the count exceeds a threshold, it trims the oldest message pairs one at a time, recounting each time, until the payload fits within both the context window and the per-turn cost budget.
◆Prompt optimization and A/B testing — A developer has several candidate system prompts that describe the same behavior with different levels of verbosity. They count tokens for each candidate and choose the most concise one that stays under a target token count, reducing baseline input cost on every production call.
◆Tool schema overhead auditing — An agent has access to 20 tools with complex JSON schemas. Before deploying, an engineer counts tokens with and without each tool definition to measure per-tool overhead, then removes or simplifies schemas for tools that are rarely called, reducing fixed per-request costs.
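The routing and pruning scenarios above can be sketched as two small functions. This is an illustration under stated assumptions, not a reference implementation: the 2,000-token cutoff comes from the routing scenario, while the medium cutoff (`medium_max`), the tier labels, and the function names are hypothetical. The `counter` argument stands in for whatever performs the token count, for example a wrapper around the token counting endpoint.

```python
from typing import Callable, List

# Takes a list of message dicts and returns their input-token count.
MessageCounter = Callable[[List[dict]], int]


def pick_model_tier(token_count: int, small_max: int = 2_000,
                    medium_max: int = 20_000) -> str:
    """Map a token count to a model tier label.

    The labels are placeholders, not API model IDs; medium_max is an
    assumed threshold that you would tune for your own workload.
    """
    if token_count < small_max:
        return "haiku"
    if token_count < medium_max:
        return "sonnet"
    return "opus"


def prune_history(messages: List[dict], counter: MessageCounter,
                  max_tokens: int) -> List[dict]:
    """Drop the oldest user/assistant pair until the history fits.

    Recounts after every removal. Stops once at most two messages remain,
    so the result can still exceed max_tokens; callers should check.
    """
    pruned = list(messages)  # leave the caller's list untouched
    while len(pruned) > 2 and counter(pruned) > max_tokens:
        pruned = pruned[2:]  # remove the oldest pair
    return pruned
```

Dropping messages in pairs keeps the history starting with a user turn, provided it started that way. Each recount is a separate call against the token counting rate limit, so a long history may be cheaper to prune in larger steps.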

What changed recently

◆2025-10-31 — Prompt cache read tokens no longer count against the Input Tokens Per Minute (ITPM) rate limit for Claude 3.7 Sonnet on the Anthropic API, allowing higher effective throughput for workloads that rely heavily on prompt caching.
◆2025-10 — Prompt caching was simplified: setting a cache breakpoint now causes Claude to automatically read from the longest previously cached prefix. Manual tracking of which cached segments to specify is no longer required.
◆2026-02-05 — Claude Opus 4.8 was released and added to the set of models supported by the token counting endpoint. Pricing starts at $5 per million input tokens and $25 per million output tokens, with up to 90% savings via prompt caching and 50% savings via batch processing.
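To turn a token count into a dollar figure, the arithmetic is a per-million-token multiplication. The sketch below uses the prices quoted in the last item ($5 per million input tokens, $25 per million output tokens, 50% off for batch processing) as defaults. The function name is hypothetical, and prompt-caching discounts are left out because the "up to 90%" figure depends on how much of the prompt is actually read from cache.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float = 5.0,
                      output_price_per_mtok: float = 25.0,
                      batch: bool = False) -> float:
    """Rough cost estimate in USD from token counts.

    Default prices are the figures quoted in the text; check current
    pricing before relying on them. Caching discounts are not modelled.
    """
    cost = (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
    if batch:
        cost *= 0.5  # batch processing discount as quoted in the text
    return cost
```

With these defaults, a request with 1,000,000 input tokens and 200,000 output tokens comes to about $10.00, or $5.00 with batch processing. Output tokens are not known before generation, so a pre-flight estimate has to assume an output size, for example the request's maximum output length as an upper bound.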
