When would you use Streaming?

Real-time chat interface: A customer support web app displays Claude's reply word-by-word as it is generated, giving users immediate feedback rather than a loading spinner that lasts several seconds.

When would you use Streaming?

Long-document generation without timeouts: A legal drafting tool requests a 50,000-token contract analysis. Without streaming, the HTTP connection would time out at most infrastructure layers before the response arrives. Streaming keeps the connection alive via continuous SSE events until the full document is ready.

When would you use Streaming?

Live code completion in an IDE plugin: A developer assistant streams a multi-function code suggestion into the editor in real time, letting the developer see and interrupt the direction of the suggestion before it completes.

← ContentsClaude API · intermediate

Streaming

Streaming is a method of receiving Claude API responses incrementally as the model generates them, rather than waiting for the complete response before delivery. It uses the Server-Sent Events (SSE) protocol: when you set `"stream": true` in your API request, the server sends a continuous sequence of small event chunks over a persistent HTTP connection, allowing your application to process and display output in real time. This approach has two primary practical benefits. First, it dramatically lowers perceived latency — users see the first tokens of a response almost instantly instead of staring at a spinner for several seconds while the model finishes generating. Second, it prevents HTTP gateway timeouts (e.g., 504 errors) for long responses that would exceed the default timeout window of many infrastructure components if delivered as a single synchronous response. Streaming is available through the raw HTTP API (by setting `"stream": true`) and through all official Anthropic SDKs (Python, TypeScript, PHP), which wrap the raw SSE handling in convenient context managers and iterators. The feature works with all Claude models and all API usage tiers, and is compatible with tool use, prompt caching, and extended thinking.

🎧 Listen to this as a podcast episode

When you’d use it

◆Real-time chat interface — A customer support web app displays Claude's reply word-by-word as it is generated, giving users immediate feedback rather than a loading spinner that lasts several seconds.
◆Long-document generation without timeouts — A legal drafting tool requests a 50,000-token contract analysis. Without streaming, the HTTP connection would time out at most infrastructure layers before the response arrives. Streaming keeps the connection alive via continuous SSE events until the full document is ready.
◆Live code completion in an IDE plugin — A developer assistant streams a multi-function code suggestion into the editor in real time, letting the developer see and interrupt the direction of the suggestion before it completes.
◆Agentic task progress visibility — A data pipeline agent streams its reasoning and intermediate tool calls to a monitoring dashboard, so operators can watch each step (database query, data transformation, report generation) as it happens rather than receiving a single opaque final result.
◆Extended thinking with partial visibility — A research assistant uses extended thinking mode with streaming so that the model's summarized reasoning chain appears in a side panel as it develops, before the final answer is rendered — giving users insight into how conclusions were reached.

What changed recently

◆2025-05 — Fine-grained tool streaming became generally available on all models and platforms. The beta header `fine-grained-tool-streaming-2025-05-14` is no longer required. Tool input parameters now stream incrementally via `input_json_delta` events rather than being buffered and delivered as a single block.
◆2025 — Stream recovery logic improved for Claude 4.6-generation and later models, enabling interrupted streams to be resumed via instructional prompting (appending partial response as an assistant turn and requesting continuation) without breaking the context window.
◆2025 — Extended thinking streaming support added `thinking_delta` events to the SSE stream. The `display` field for extended thinking blocks allows setting `thinking.display: 'omitted'` to suppress thinking content from the stream for lower latency, while preserving the signature required for multi-turn continuity.

This is the short version

The full chapter has three worked examples, the common pitfalls, and the workflow that makes it pay — plus the other 84 features, kept current.

Get Claude Master — $97 →