Rate limits & error handling

Rate limits are enforced caps on how many API requests and tokens an organization can process within a given time window. Anthropic measures them across three dimensions: Requests Per Minute (RPM), Input Tokens Per Minute (ITPM), and Output Tokens Per Minute (OTPM). Limits are applied at the organization level, with optional workspace-level sub-limits configurable by administrators. The API uses a token bucket algorithm, meaning capacity refills continuously rather than resetting at fixed clock intervals.

When any limit is exceeded, the API returns an HTTP 429 error with a machine-readable JSON body identifying which limit was breached, plus a `retry-after` header specifying how many seconds to wait before retrying. Other relevant error codes include 500 (api_error), 504 (timeout_error), and 529 (overloaded_error). For streaming responses delivered over SSE, errors can surface after an initial 200 OK, so error handling must cover the entire stream lifetime, not just the connection phase.

Organizations advance through usage tiers (Tier 1 through Tier 4) automatically as cumulative spend increases, unlocking higher RPM, ITPM, and OTPM ceilings at each level. Prompt caching provides an effective throughput multiplier because, on most models, only uncached input tokens count toward the ITPM limit. Specialized endpoints such as the Message Batches API, Fast Mode, and Claude Managed Agents maintain separate rate limit pools distinct from the standard Messages API.
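To make the token bucket behavior described above concrete, here is a minimal client-side sketch in Python. It is a toy model for reasoning about continuous refill, not Anthropic's server-side implementation; the `TokenBucket` class, its field names, and the 60 RPM figure are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Toy token bucket (hypothetical helper): capacity refills continuously."""

    capacity: float            # maximum number of tokens the bucket can hold
    refill_per_second: float   # tokens added back per second
    tokens: float = field(init=False, default=0.0)
    last_refill: float = field(init=False, default=0.0)  # seconds

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # the bucket starts full

    def _refill(self, now: float) -> None:
        # Add tokens in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when rate limited."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def seconds_until(self, now: float, cost: float = 1.0) -> float:
        """How long to wait before `cost` tokens will be available."""
        self._refill(now)
        return max(0.0, (cost - self.tokens) / self.refill_per_second)


# Assumed example: 60 requests per minute, i.e. one token per second.
bucket = TokenBucket(capacity=60, refill_per_second=1.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(61)]
```

In this model a burst of 60 requests at time zero succeeds and the 61st is rejected, but one second later a single request is allowed again. That is the practical difference from a fixed window, where the caller would wait for the next minute boundary.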

When you’d use it

◆Preventing accidental overspend in development — A solo developer building a prototype wants to avoid unexpectedly exhausting their Tier 1 allocation. They monitor the Console Limits page and set a workspace-level cap well below the organization ceiling so experimental scripts cannot consume production quota.
◆Graceful degradation in a customer-facing chatbot — A customer service bot occasionally hits RPM limits during traffic spikes. The application catches 429 responses, reads the retry-after header, queues the request, and shows the user a 'one moment…' message rather than surfacing a raw error.
◆High-throughput document processing pipeline — A legal tech company processes thousands of contracts per day. Engineers use the Message Batches API (50% cost discount) and implement exponential backoff on 429 and 529 errors to saturate throughput without manual intervention.
◆Multi-tenant SaaS with isolated quotas per customer — A platform serving multiple enterprise clients creates one Console workspace per client and assigns each a workspace-level rate limit. This prevents one noisy tenant from exhausting shared organizational quota and gives per-tenant usage visibility.
◆Autonomous agent loop with pause_turn handling — A coding assistant agent uses Claude to run multi-step tool calls. The orchestration layer watches for a pause_turn stop reason, appends the partial assistant turn to the messages array with required tool results, and resubmits—allowing long server-side loops to resume without losing state.
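The retry behavior in the chatbot and pipeline scenarios above can be sketched as a small Python helper. This is one possible approach under stated assumptions, not a prescribed client: `retry_delay`, `call_with_retries`, and the `send` callable (which returns a status code, a header mapping, and a body) are hypothetical names, and the choice to retry only 429 and 529 follows the scenarios above. The official SDKs may already retry for you.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE_STATUS = {429, 529}  # assumption: retry only rate-limit and overload errors


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no retry applies."""
    if status not in RETRYABLE_STATUS:
        return None
    # Prefer the server's hint; header names are matched case-insensitively.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is not None:
        try:
            return max(0.0, float(raw))
        except ValueError:
            pass  # unparseable header: fall back to computed backoff
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(
    send: Callable[[], Tuple[int, Mapping[str, str], str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        delay = retry_delay(status, headers, attempt)
        if delay is None:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)
    return status, body  # still failing after the final attempt
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting. In a user-facing application, the same delay value can drive a "one moment…" message while the request sits in a queue.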

What changed recently

◆2026-05 — Rate Limits API released, allowing administrators to programmatically query RPM, ITPM, and OTPM limits for their organization and workspaces via GET /v1/organizations/rate_limits using an Admin API key.
◆2026-05 — Rate limit charts launched in the Console Usage page, providing two time-series visualizations showing headroom across rate limit dimensions and caching rates.
◆2026-03 — API rate limits raised considerably for Claude Opus models. Dedicated 1M-context rate limits for all supported models were removed; standard account limits now apply across all context lengths.
◆2026-02 — Fast Mode (research preview) extended to support Claude Opus 4.7 (model: claude-opus-4-7, beta header: fast-mode-2026-02-01). Fast Mode uses dedicated rate limit pools separate from standard Opus limits.
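For the Rate Limits API entry above, a request might be built along these lines in Python. Only the path `/v1/organizations/rate_limits` and the need for an Admin API key come from the changelog; the `x-api-key` and `anthropic-version` headers are assumed to follow the convention of other Anthropic endpoints, the function names are hypothetical, and the response schema is deliberately not assumed.

```python
import json
import urllib.request
from typing import Any

API_BASE = "https://api.anthropic.com"
RATE_LIMITS_PATH = "/v1/organizations/rate_limits"  # path as stated in the changelog


def build_rate_limits_request(admin_api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for organization rate limits."""
    return urllib.request.Request(
        API_BASE + RATE_LIMITS_PATH,
        method="GET",
        headers={
            "x-api-key": admin_api_key,         # assumption: Admin API key goes here
            "anthropic-version": "2023-06-01",  # assumption: standard version header
        },
    )


def fetch_rate_limits(admin_api_key: str, timeout: float = 10.0) -> Any:
    """Send the request and return the decoded JSON body without interpreting it."""
    request = build_rate_limits_request(admin_api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending makes it easy to inspect the URL and headers before any network call, and leaves parsing of the returned limits to code written against the actual response.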
