Streaming response

Bottom line
Streaming response is the technique of pushing each generated token from an LLM down to the client the moment it exists, instead of waiting for the whole completion. The transport is almost always Server-Sent Events. Users see words appear immediately, so a six-second answer feels like a two-second one.

What streaming response actually is

A streaming response is the incremental delivery of tokens from a large language model as they are generated, rather than buffering the full completion server-side and returning it in one payload. The model still produces tokens one at a time internally. Streaming just removes the buffer between "token exists on the server" and "token reaches the user's screen."

The distinction matters because LLM generation is slow by web standards. A 600-token answer at 80 tokens per second takes roughly 7.5 seconds end to end. Without streaming, the user stares at a spinner for the full 7.5 seconds. With streaming, the first word lands in 200 to 500 milliseconds and the rest flows in like a person typing. Total time is identical. Perceived time is not.
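
The arithmetic is easy to check. This is a minimal sketch: the function names are made up for illustration, and the 300 ms first-token figure is an assumed value picked from the 200 to 500 ms range above, not a measurement.

```typescript
// Time the model needs to generate the whole answer, in milliseconds.
function generationTimeMs(tokens: number, tokensPerSecond: number): number {
  return (tokens / tokensPerSecond) * 1000;
}

// Time until the user sees any text at all.
// Buffered: the full generation time. Streamed: only the time to first token.
function waitBeforeFirstTextMs(
  tokens: number,
  tokensPerSecond: number,
  timeToFirstTokenMs: number, // assumed input, measure it in your own stack
  streamed: boolean,
): number {
  return streamed
    ? timeToFirstTokenMs
    : generationTimeMs(tokens, tokensPerSecond);
}

const buffered = waitBeforeFirstTextMs(600, 80, 300, false);
const streamed = waitBeforeFirstTextMs(600, 80, 300, true);
console.log({ buffered, streamed }); // { buffered: 7500, streamed: 300 }
```

The generation time is the same in both cases. Only the wait before the first visible word changes.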

This is purely a delivery concern. Streaming does not change what the model generates, what it costs, or how the KV cache behaves. It changes when the bytes reach the browser.

How SSE delivers tokens to the browser

The dominant transport is Server-Sent Events (SSE), an HTTP-based protocol defined in the WHATWG HTML Living Standard. The MIME type is text/event-stream, and the body is a sequence of UTF-8 lines in which each event is terminated by a blank line. A short stream looks like:

data: {"delta": "Hello"}

data: {"delta": " world"}

data: [DONE]

Each data: line carries a small JSON delta with a fragment of generated content; the closing [DONE] is a plain-text sentinel that some providers send to mark the end of the stream, not JSON. The browser consumes the stream with the built-in EventSource API or, more commonly in modern AI apps, a fetch call with a streaming ReadableStream body that the client parses manually. The manual approach is preferred because EventSource cannot send custom headers (no auth tokens) and cannot use anything other than GET.
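
The manual parsing step is smaller than it sounds. Below is a minimal sketch of an incremental parser, assuming the stream only uses data: fields with LF or CRLF line endings. The class name SseParser is hypothetical. Other SSE fields (event:, id:, retry:), comment lines, and bare-CR line endings are deliberately ignored here, so this is not a full implementation of the spec.

```typescript
// Minimal incremental SSE parser: feed it decoded text chunks in arrival
// order, get back the data payload of every event completed so far.
class SseParser {
  private buffer = ""; // text received but not yet terminated by a newline
  private dataLines: string[] = []; // data: lines of the event being built

  push(chunk: string): string[] {
    this.buffer += chunk;
    const events: string[] = [];
    let newlineIndex: number;
    while ((newlineIndex = this.buffer.indexOf("\n")) !== -1) {
      let line = this.buffer.slice(0, newlineIndex);
      this.buffer = this.buffer.slice(newlineIndex + 1);
      if (line.endsWith("\r")) line = line.slice(0, -1); // tolerate CRLF

      if (line === "") {
        // A blank line ends the current event.
        if (this.dataLines.length > 0) {
          events.push(this.dataLines.join("\n"));
          this.dataLines = [];
        }
      } else if (line.startsWith("data:")) {
        let value = line.slice("data:".length);
        if (value.startsWith(" ")) value = value.slice(1); // one optional space
        this.dataLines.push(value);
      }
      // Any other field or comment line is skipped in this sketch.
    }
    return events;
  }
}

// Network chunks can split an event anywhere, even mid-JSON.
const parser = new SseParser();
console.log(parser.push('data: {"delta": "Hel')); // nothing complete yet
console.log(parser.push('lo"}\n\ndata: [DONE]\n\n')); // two complete payloads
```

In a browser, the chunks would typically come from the fetch response body decoded through a TextDecoderStream, so that multi-byte UTF-8 characters split across network chunks are reassembled before the parser sees them.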

Why SSE instead of WebSockets? SSE is one-way, server to client, riding on plain HTTP. That is exactly what LLM streaming needs. WebSockets are bidirectional, require a protocol upgrade, and bring more operational complexity to gateways and CDNs. For a one-way firehose of tokens, SSE is the boring correct answer. A few providers expose chunked HTTP without the SSE framing, but the wire shape is similar.

Provider conventions vary. OpenAI's Responses API emits typed events like response.output_text.delta carrying a delta string. Anthropic's Messages API emits a structured sequence: message_start, then content_block_start, then a run of content_block_delta events (each holding a text_delta or input_json_delta), then content_block_stop and message_stop. Different framing, same idea: deltas first, terminators last.
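
To make "different framing, same idea" concrete, here is a sketch that pulls visible text out of both event families. The types are simplified subsets written for this example: real events carry more fields and more event types, and the two helper function names are hypothetical.

```typescript
// Simplified subset of OpenAI Responses API stream events.
type OpenAiResponsesEvent =
  | { type: "response.output_text.delta"; delta: string }
  | { type: "response.completed" };

// Simplified subset of Anthropic Messages API stream events.
type AnthropicEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number }
  | {
      type: "content_block_delta";
      index: number;
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | { type: "content_block_stop"; index: number }
  | { type: "message_stop" };

// Returns the text fragment carried by an event, or "" if it carries none.
function textFromOpenAi(event: OpenAiResponsesEvent): string {
  return event.type === "response.output_text.delta" ? event.delta : "";
}

function textFromAnthropic(event: AnthropicEvent): string {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    return event.delta.text;
  }
  return ""; // start/stop markers and tool-input deltas carry no visible text
}

const anthropicStream: AnthropicEvent[] = [
  { type: "message_start" },
  { type: "content_block_start", index: 0 },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Hello" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: " world" } },
  { type: "content_block_stop", index: 0 },
  { type: "message_stop" },
];
console.log(anthropicStream.map(textFromAnthropic).join("")); // Hello world
```

Normalizing both providers to "a string fragment or nothing" at the edge keeps the rest of the widget code provider-agnostic.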

Why streaming matters for AI chatbot UX

The metric that matters is TTFT, time to first token. Total latency is what your accountant cares about. TTFT is what the user feels. The usual rule of thumb from perceived-performance research: under one second feels instant, under three seconds keeps users engaged, and beyond five seconds without visible progress feels broken. Streaming converts a five-to-ten-second wait into a sub-second TTFT plus a progressive reveal, and the progressive reveal itself signals "the system is working" in a way no spinner can match.

There is a second-order benefit: users start reading immediately and frequently get what they need before the full answer arrives. A FAQ answer might be useful after the first sentence. The user closes the widget, the request completes in the background, and total perceived task time drops further.

ChatRaj streams every response into the widget via SSE. First-token latency is consistently under two seconds, even on responses that take eight seconds total to finish writing. The widget renders Markdown progressively, so code blocks, lists, and headings appear formatted as they stream in rather than rearranging once the response completes.

This is the technique that separates AI chatbots that feel modern from ones that feel like 2015-era request-response forms. It is also, in our experience, the single highest-leverage UX change for a chatbot project. Streaming is roughly half the perceived-quality gap between a hobby build and a production widget.

Streaming gotchas: error handling, abort signals, parsing

Naive streaming tutorials cover the happy path and skip the parts that bite you in production.

Errors mid-stream. A provider can send a hundred token deltas and then emit an error event because the request hit a rate limit or a content filter. Your client cannot treat the partial output as a complete message. It needs to either show the partial text with an inline error indicator or roll back the message bubble. Either way, the JSON parser around your event stream has to distinguish "delta event" from "error event" and route them to different code paths.
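
One way to keep those code paths separate is a small state reducer for the message bubble. This is a sketch under assumptions: StreamEvent is a hypothetical normalized shape you would map your provider's events onto, and the names reduceMessage and closeStream are invented for this example.

```typescript
// Hypothetical normalized event shape, not any provider's wire format.
type StreamEvent =
  | { kind: "delta"; text: string }
  | { kind: "error"; message: string }
  | { kind: "done" };

type MessageState = {
  text: string;
  status: "streaming" | "complete" | "failed";
  error?: string;
};

const initialState: MessageState = { text: "", status: "streaming" };

function reduceMessage(state: MessageState, event: StreamEvent): MessageState {
  // Once a terminator has been seen, late events are ignored.
  if (state.status !== "streaming") return state;
  switch (event.kind) {
    case "delta":
      return { ...state, text: state.text + event.text };
    case "error":
      // Keep the partial text so the UI can show it with an inline error.
      return { ...state, status: "failed", error: event.message };
    case "done":
      return { ...state, status: "complete" };
  }
}

// Call when the connection closes. A stream that ends without an explicit
// terminator is treated as failed, never as a complete message.
function closeStream(state: MessageState): MessageState {
  return state.status === "streaming"
    ? { ...state, status: "failed", error: "stream ended unexpectedly" }
    : state;
}

const events: StreamEvent[] = [
  { kind: "delta", text: "Hello" },
  { kind: "delta", text: " wor" },
  { kind: "error", message: "rate limited" },
];
console.log(events.reduce(reduceMessage, initialState));
```

Whether the UI then shows the partial text with an error badge or rolls the bubble back is a rendering decision on status === "failed". The reducer only guarantees that a partial message is never marked complete.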

Aborting cleanly. Users close widgets, navigate away, or type a new question before the current one finishes. Without an abort path, your server keeps streaming tokens nobody will read, and you keep paying for them. The right pattern is an AbortController on the browser's fetch, with the AbortSignal forwarded to the upstream provider client. Both OpenAI and Anthropic SDKs accept abort signals and will cancel the underlying request, stopping the token meter.
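
On the browser side, the bookkeeping can be sketched as a tiny helper that owns the current AbortController. The class name InFlightRequest is hypothetical, and the /api/chat endpoint in the comment is an assumed URL, not a real ChatRaj route.

```typescript
// Tracks the single in-flight request. Starting a new one, or closing the
// widget, aborts whatever was still streaming.
class InFlightRequest {
  private controller: AbortController | null = null;

  // Abort the previous request (if any) and return a signal for the new one.
  begin(): AbortSignal {
    this.cancel();
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Safe to call at any time, including when nothing is running.
  cancel(): void {
    this.controller?.abort();
    this.controller = null;
  }
}

const inFlight = new InFlightRequest();

const firstSignal = inFlight.begin();
// Pass the signal to fetch so that aborting tears down the connection, e.g.:
//   fetch("/api/chat", { method: "POST", body, signal: firstSignal })

const secondSignal = inFlight.begin(); // user asked a new question
console.log(firstSignal.aborted, secondSignal.aborted); // true false

inFlight.cancel(); // user closed the widget
console.log(secondSignal.aborted); // true
```

Aborting in the browser only closes the client connection. The server still has to notice the disconnect and forward its own abort signal to the provider call, otherwise generation keeps running upstream.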

Partial JSON. When the model is generating structured output or a tool call, each chunk is a fragment of a larger JSON object. A fragment such as {"name": "lookup_or is not valid JSON on its own. You must buffer until you have a structurally complete value, or use a streaming JSON parser that emits partial paths. Anthropic helps here by emitting input_json_delta events with a concatenable partial_json string; OpenAI's tool-call deltas work similarly. Treat each delta as a fragment of a character stream, not as a standalone JSON document.
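
The buffering approach can be sketched in a few lines. ToolInputBuffer is a hypothetical name, the city/days payload is invented for the example, and treating an empty buffer as an empty object is an assumption about how a tool with no arguments should be handled.

```typescript
// Accumulates tool-call argument fragments and parses them only once the
// provider signals that the block is finished.
class ToolInputBuffer {
  private fragments: string[] = [];

  // Call for every fragment, e.g. each partial_json string.
  append(partialJson: string): void {
    this.fragments.push(partialJson);
  }

  // Call once at the end of the block (for Anthropic, on content_block_stop).
  // Throws a SyntaxError if the concatenated text is still not valid JSON.
  finish(): unknown {
    const raw = this.fragments.join("");
    this.fragments = [];
    // Assumption: no fragments at all means "no arguments".
    return raw === "" ? {} : JSON.parse(raw);
  }
}

const toolInput = new ToolInputBuffer();
toolInput.append('{"city": "Par'); // not parseable on its own
toolInput.append('is", "days": 3}');
console.log(toolInput.finish()); // { city: 'Paris', days: 3 }
```

Parsing only at the end is the simplest correct option. A streaming JSON parser is worth the extra dependency only if the UI needs to show arguments while they are still arriving.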

Backpressure and reconnection. SSE includes a Last-Event-ID mechanism for resuming a dropped connection, but most LLM streams are too short-lived to bother. Just fail fast and let the user retry. Do make sure intermediate proxies (load balancers, CDNs, mobile carrier proxies) are not buffering your event stream; setting X-Accel-Buffering: no (a header nginx honors) and disabling response compression on the streaming endpoint usually fixes that.
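
On the server side, the response setup might look like the sketch below. The helper names are hypothetical. Cache-Control: no-cache is a common companion header that this article does not otherwise cover, so treat it as an assumption, and remember that X-Accel-Buffering only affects proxies that understand it.

```typescript
// Response headers for a streaming endpoint, framework-agnostic.
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache", // assumption: keep caches out of the path
    "X-Accel-Buffering": "no", // asks nginx not to buffer the response
  };
}

// Serializes one event. JSON.stringify never emits a raw newline, so the
// payload always fits on a single data: line; the blank line ends the event.
// The payload must be JSON-serializable.
function formatSseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

console.log(sseHeaders());
console.log(JSON.stringify(formatSseEvent({ delta: "Hello" })));
```

Write each formatted event to the response as soon as the upstream delta arrives and flush if your framework buffers writes. Batching events server-side undoes the whole point.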

Streaming is one of those features that looks like ten lines of code in a tutorial and turns into a small subsystem in production. Worth every minute of it.

FAQ

Common streaming response questions

Users perceive responses as much faster when they see incremental output. Total latency is identical, but time to first token drops from several seconds to a few hundred milliseconds, which is the metric users actually feel.
