
How to Optimize Claude API Costs with Token Counting

Use Claude's free token counting endpoint to calculate input tokens before generation, enabling smart model routing, budget enforcement, and prompt optimization that can reduce API costs by 50% or more.

The count_tokens API lets you make cost-aware decisions before expensive generation calls by reporting the input token count for any payload structure.

What is Claude's Token Counting API?

Token counting is a pre-flight utility provided by Anthropic's API that lets you calculate how many input tokens a payload will consume before sending it to Claude for generation. You call a dedicated endpoint with the same message structure you plan to send (including system prompts, conversation history, tool definitions, images, and PDFs) and receive a response whose input_tokens field holds the total input token count.

This matters because Claude's API pricing is token-based, and costs can grow quickly with large system prompts, long conversation histories, or complex tool schemas. The service is completely free to use and has its own separate rate limit that doesn't share a bucket with your message-creation rate limit.

How Do You Count Tokens Before Generation?

The basic process is to build the same request you'd use for generation, but send it to the count_tokens endpoint instead:

import anthropic
client = anthropic.Anthropic()

response = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    system="You are a helpful assistant.",
    messages=[{
        "role": "user",
        "content": "What is the capital of France?"
    }]
)
print(f"Input tokens: {response.input_tokens}")

This example returns approximately 25 input tokens. You can then multiply this by the model's per-token price to estimate costs before committing to generation. The count returned is an estimate: a small number of tokens may be added internally by Anthropic for system optimizations, but you are never billed for those additions.
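The multiplication described above can be sketched as a small helper. The price in the example is a placeholder chosen for illustration, not a published rate, so substitute the current per-million-token input price for your model. The function name is hypothetical.

```python
def estimate_input_cost(input_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Estimate the input-side cost of one request in US dollars."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens * usd_per_million_input_tokens / 1_000_000


# Assumption: a hypothetical price of $3.00 per million input tokens.
cost = estimate_input_cost(25, 3.00)
print(f"Estimated input cost: ${cost:.6f}")
```

This covers input tokens only. Output tokens are priced separately and are not known until generation completes.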

What Are the Most Effective Cost Optimization Strategies?

The most impactful optimization strategies using token counting include:

Intelligent Model Routing

Route requests to cheaper models based on payload size, which serves as a rough proxy for complexity. A production system can count tokens and route short queries under 1,500 tokens to Claude Haiku for speed and low cost, medium queries under 8,000 tokens to Claude Sonnet, and only very large payloads to Claude Opus:

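def route_by_tokens(client, messages, system_prompt=""):
    # Omit `system` entirely when no prompt is supplied.
    extra = {"system": system_prompt} if system_prompt else {}

    # Count once against a reference model. The result is an estimate and may
    # differ slightly under the tokenizer of the model chosen below.
    count = client.messages.count_tokens(
        model="claude-sonnet-4-5-20250929",
        messages=messages,
        **extra
    ).input_tokens

    # Illustrative thresholds; tune them against your own traffic.
    if count < 1500:
        model = "claude-haiku-4-5-20251001"
    elif count < 8000:
        model = "claude-sonnet-4-5-20250929"
    else:
        model = "claude-opus-4-8"

    return client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
        **extra
    )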

Budget Enforcement

Implement hard token limits to prevent runaway costs. A batch-processing pipeline can count tokens and abort any payload that exceeds a per-request token budget, logging it for human review instead of silently running up a large bill.
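A minimal sketch of that budget gate follows. The function name and the injected count_fn callable are assumptions made for illustration. In a real pipeline, count_fn might wrap client.messages.count_tokens(**payload).input_tokens. Here it is left abstract so the logic can be exercised without network calls.

```python
import logging
from typing import Callable

logger = logging.getLogger("token_budget")


def partition_by_budget(
    payloads: list[dict],
    count_fn: Callable[[dict], int],
    max_input_tokens: int,
) -> tuple[list[dict], list[tuple[dict, int]]]:
    """Split payloads into those within budget and those held for review."""
    accepted: list[dict] = []
    rejected: list[tuple[dict, int]] = []
    for payload in payloads:
        tokens = count_fn(payload)
        if tokens > max_input_tokens:
            # Log instead of sending, so a human can inspect the payload later.
            logger.warning(
                "Payload over budget: %d > %d tokens", tokens, max_input_tokens
            )
            rejected.append((payload, tokens))
        else:
            accepted.append(payload)
    return accepted, rejected
```

Only the accepted list goes on to generation. The rejected list keeps each payload's measured count for the review queue.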

Dynamic Context Pruning

Customer-support chatbots can accumulate conversation history and use token counting to trim the oldest message pairs when approaching context window or cost limits, ensuring conversations stay within budget while maintaining recent context.
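The trimming loop might look like the sketch below. It assumes the history alternates user and assistant turns and starts with a user turn. The function name and the injected count_fn are hypothetical. In practice count_fn could call client.messages.count_tokens with the candidate message list, which means each loop iteration costs one counting request.

```python
from typing import Callable


def prune_history(
    messages: list[dict],
    count_fn: Callable[[list[dict]], int],
    max_input_tokens: int,
) -> list[dict]:
    """Drop the oldest user/assistant pairs until the history fits the budget."""
    pruned = list(messages)  # work on a copy; the caller's list is untouched
    # Stop while at least one message remains, even if it is still over budget.
    while len(pruned) > 2 and count_fn(pruned) > max_input_tokens:
        # Remove two at a time so the history still begins with a user turn.
        del pruned[:2]
    return pruned
```

If the result is still over budget after pruning, the caller should fall back to the budget check rather than send the request.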

How Do You Optimize Tool Schema Overhead?

Tool definitions can add hundreds of tokens per tool to every request. Use token counting to audit this overhead by measuring payloads with and without each tool definition:

# Count without tools
base_count = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=messages
).input_tokens

# Count with all tools
with_tools_count = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=messages,
    tools=all_tools
).input_tokens

overhead = with_tools_count - base_count
print(f"Tool overhead: {overhead} tokens")

This lets you remove or simplify schemas for tools that are rarely called, reducing fixed per-request costs. An agent with access to 20 tools can measure per-tool overhead and optimize which tools to include by default versus lazy-load on demand.
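To get a per-tool figure rather than a single total, one option is a leave-one-out measurement: count with all tools, then with each tool removed in turn. The sketch below assumes each tool definition is a dict with a "name" key, and it takes a hypothetical count_fn that returns the input token count for a given tool list. Enabling tools at all may add some fixed overhead beyond the schemas themselves, and leave-one-out keeps that fixed part out of the per-tool numbers whenever at least two tools are present.

```python
from typing import Callable


def marginal_tool_overhead(
    tools: list[dict],
    count_fn: Callable[[list[dict]], int],
) -> dict[str, int]:
    """Return {tool name: tokens saved by removing that tool from the full set}."""
    total = count_fn(tools)
    overhead: dict[str, int] = {}
    for i, tool in enumerate(tools):
        without = tools[:i] + tools[i + 1:]  # full set minus this one tool
        overhead[tool["name"]] = total - count_fn(without)
    return overhead
```

This makes one counting request per tool plus one for the full set, so for a 20-tool agent it is best run as an occasional audit rather than on every request.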

When Should You Use Token Counting vs. Alternatives?

Token counting excels for pre-generation decisions like routing, budget enforcement, and prompt optimization. Use it when you need lightweight, free pre-flight checks before committing to expensive generation calls.

However, for exact billing figures including cache hits and misses, inspect the usage object in the messages.create response instead. Prompt caching can reduce effective cost per request by up to 90% and latency by up to 85%, but it only activates at generation time — token counting does not trigger or reflect caching benefits.
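As a rough illustration of reading the usage object, the sketch below turns its token fields into a dollar figure. It reads the input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens fields. The prices and the cache multipliers are parameters you must supply from current pricing. The defaults shown are assumptions for illustration, and the function name is hypothetical.

```python
from types import SimpleNamespace


def billed_cost(
    usage,
    usd_per_million_input: float,
    usd_per_million_output: float,
    cache_write_multiplier: float = 1.25,  # assumption: verify against current pricing
    cache_read_multiplier: float = 0.10,   # assumption: verify against current pricing
) -> float:
    """Estimate the cost in USD of one response from its usage fields."""
    uncached = getattr(usage, "input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    output = getattr(usage, "output_tokens", 0) or 0

    # Express cache writes and reads as equivalent full-price input tokens.
    weighted_input = (
        uncached
        + cache_write * cache_write_multiplier
        + cache_read * cache_read_multiplier
    )
    return (
        weighted_input * usd_per_million_input
        + output * usd_per_million_output
    ) / 1_000_000


# Stand-in for response.usage, with hypothetical prices of $3 and $15 per million.
example = SimpleNamespace(
    input_tokens=1000,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=10_000,
    output_tokens=500,
)
print(f"${billed_cost(example, 3.00, 15.00):.4f}")
```

Comparing this figure with the pre-flight estimate shows how much caching actually saved on a given request.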

Avoid third-party tokenizer libraries like tiktoken, which are designed for other models and will produce inaccurate counts for Claude. Character or word counting can provide rough estimates (approximately 4 characters per token for English text), but this breaks down with code, non-Latin scripts, and structured data.

What Are Common Pitfalls to Avoid?

The most critical mistake is treating token counts as exact billing figures. The count_tokens endpoint returns an estimate, and Anthropic may add tokens internally for system optimizations. Use counts for budgeting and routing decisions, but check the usage object in actual generation responses for precise billing.

Always include the same tools array in count_tokens that you'll use in messages.create. Tool schemas add significant overhead that's easy to forget when counting tokens for budget planning.

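When using extended thinking with budget_tokens, remember that thinking tokens are billed at output rates, which can be expensive. The count_tokens endpoint reports input tokens only, so it cannot tell you how many thinking tokens a request will consume. Instead, compare candidate budget_tokens values (2,000 vs. 8,000 vs. 16,000) by pricing each one at the output rate as a rough upper estimate, then confirm against the usage object of real responses before committing to a budget size.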
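One way to compare the budget sizes mentioned above is simple arithmetic that prices each budget as if it were fully used. This is a sketch under stated assumptions: the output price is a placeholder, the function name is hypothetical, and actual thinking usage can differ from the configured budget.

```python
def thinking_cost_estimate(budget_tokens: int, usd_per_million_output: float) -> float:
    """Price a thinking budget at the output rate, as if fully consumed."""
    if budget_tokens < 0:
        raise ValueError("budget_tokens must be non-negative")
    return budget_tokens * usd_per_million_output / 1_000_000


# Assumption: a hypothetical price of $15.00 per million output tokens.
for budget in (2_000, 8_000, 16_000):
    estimate = thinking_cost_estimate(budget, 15.00)
    print(f"budget_tokens={budget}: about ${estimate:.2f} per request")
```

Adding the input-side estimate from count_tokens to this figure gives a rough per-request planning number, excluding the visible answer tokens.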

Finally, ensure you use the exact same model identifier in both count_tokens and messages.create calls, as tokenization can vary slightly between model generations.

Is Token Counting Worth the Implementation Effort?

For any production system processing significant volume, token counting pays for itself quickly. For example, if 70% of queries are simple enough for Haiku, routing them there instead of to Opus cuts the cost of those requests by roughly the ratio between the two models' prices, which can be several-fold or more. Budget enforcement prevents surprise bills from oversized payloads, and prompt optimization helps you find the most concise system prompts that maintain quality.
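To see what routing a share of traffic means for the total bill, a back-of-the-envelope calculation helps. The sketch below assumes every request carries roughly the same token volume and uses a hypothetical 10x price gap between the cheap and expensive models. Neither the function name nor the prices come from published figures.

```python
def overall_savings(
    fraction_routed: float,
    cheap_price: float,
    expensive_price: float,
) -> float:
    """Fraction of total spend saved by routing part of the traffic to a cheaper model."""
    if not 0.0 <= fraction_routed <= 1.0:
        raise ValueError("fraction_routed must be between 0 and 1")
    if expensive_price <= 0:
        raise ValueError("expensive_price must be positive")
    # Average price paid per request after routing.
    blended = (
        fraction_routed * cheap_price
        + (1.0 - fraction_routed) * expensive_price
    )
    return 1.0 - blended / expensive_price


# 70% of traffic routed to a model that costs one tenth as much (assumed ratio).
print(f"Overall savings: {overall_savings(0.70, 1.0, 10.0):.0%}")
```

Under these assumptions a 10x saving on 70% of requests works out to about 63% off the total, because the remaining 30% still pays the higher price.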

The implementation overhead is minimal since token counting uses the same message structure as generation. Most teams see immediate ROI from basic model routing, with additional savings from budget controls and tool schema optimization as systems mature.

Token counting is supported across all active Claude models and all API usage tiers, making it accessible regardless of your current plan. The separate rate limit ensures pre-flight counting won't interfere with your generation throughput, though high-frequency counting in tight loops can exhaust this limit independently.
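One way to avoid exhausting the counting rate limit in tight loops is to memoize counts for identical payloads. The sketch below is a hypothetical wrapper, not part of any SDK. It assumes the payload is a JSON-serializable dict and that count_fn performs the actual counting request.

```python
import hashlib
import json
from typing import Callable


class CachedTokenCounter:
    """Memoize token counts so identical payloads trigger only one request."""

    def __init__(self, count_fn: Callable[[dict], int]) -> None:
        self._count_fn = count_fn
        self._cache: dict[str, int] = {}

    def count(self, payload: dict) -> int:
        # Canonical JSON (sorted keys) so equal payloads hash identically.
        serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(serialized).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._count_fn(payload)
        return self._cache[key]
```

This pays off most when a large system prompt or tool set is counted repeatedly with unchanged content. Payloads that differ on every call gain nothing from the cache.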

Frequently asked questions

Does token counting cost money?

No, the count_tokens endpoint is completely free. No generation occurs and you're only charged when you make actual messages.create calls.

How accurate are token count estimates?

Very accurate for budgeting purposes. Anthropic may add a small number of tokens internally for system optimizations, but you're never billed for those additions.

Can I count tokens for images and PDFs?

Yes, token counting supports the exact same payload structure as generation, including images, PDFs, tool definitions, and conversation history.

Do cached prompts affect token counting?

Token counting reports raw unoptimized input size. Prompt caching only activates during actual generation, so use raw counts for worst-case budgeting.


Independent product. Not affiliated with or endorsed by Anthropic. "Claude" is a trademark of Anthropic, used here only to describe the subject of this guide.

CLAUDEMASTER
An independent publication.
© 2026 Claude Master — All rights reserved.