ModelTerms

Inference · beginner

Streaming (LLM Responses) (SSE streaming, token streaming)

Streaming returns tokens to the client as they're generated rather than holding the full response until completion. It is typically implemented over Server-Sent Events (SSE) or WebSocket, and it is what makes chat UIs feel fast.

Explanation

Without streaming, a 500-token response means waiting ~5-10 seconds in silence before anything appears. With streaming, the user sees the first token within ~500 ms (the time to first token, TTFT) and then a steady flow of text at roughly typing speed. The total wall-clock time is about the same, but the perceived latency is dramatically lower.
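The arithmetic behind those numbers can be sketched in a few lines. This is a minimal illustration, assuming a constant per-token delay; the function name and the 15 ms per-token figure are hypothetical, chosen only so the total lands in the 5-10 second range quoted above.

```python
def latency_profile(n_tokens: int, ttft_s: float, tpot_s: float) -> dict:
    """Compare when the user first sees text with and without streaming."""
    # Total generation time: the first token, then one gap per remaining token.
    total_s = ttft_s + (n_tokens - 1) * tpot_s
    return {
        "total_s": total_s,                 # same in both modes
        "first_text_streaming_s": ttft_s,   # streaming shows the first token at once
        "first_text_blocking_s": total_s,   # blocking shows nothing until the end
    }


# Assumed figures: 500 tokens, 0.5 s to first token, 15 ms between tokens.
profile = latency_profile(n_tokens=500, ttft_s=0.5, tpot_s=0.015)
print(profile)
```

With these assumed inputs the total is about 8 seconds in both modes; only the moment the first text appears changes.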

OpenAI, Anthropic, and Google all expose SSE-style streaming endpoints. Most SDKs (Python, JS) enable it with a single parameter. The client iterates over the stream; each chunk carries either a small piece of generated text (often a token or a fragment of one) or a special event message.
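To make the client side concrete, here is a minimal sketch of parsing an SSE stream by hand, without any provider SDK. The `event:` and `data:` fields, comment lines starting with `:`, and the blank line that ends an event are part of the SSE format; the event names `delta` and `done` and the `{"text": ...}` payload shape are assumptions for illustration and differ between providers.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from the raw lines of an SSE stream."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch it if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical stream: event names and payload shape are illustrative only.
sample = [
    'event: delta\n', 'data: {"text": "Hel"}\n', '\n',
    ': keep-alive\n',
    'event: delta\n', 'data: {"text": "lo"}\n', '\n',
    'event: done\n', 'data: {}\n', '\n',
]

text = "".join(
    json.loads(data)["text"]
    for event, data in parse_sse(sample)
    if event == "delta"
)
print(text)
```

In a real client the lines would come from an open HTTP response rather than a list, and each delta would be rendered as soon as it is parsed instead of being joined at the end.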

Practical caveats: streaming complicates structured output (parsers need streaming JSON support), tool calling (tool_use events arrive interleaved), and error handling (errors mid-stream need different recovery than failed initial requests).
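Two of these caveats can be shown together in a small sketch: a structured (JSON) response is only parseable once every fragment has arrived, and a connection that drops mid-stream leaves the client holding partial text. The function and generator names below are hypothetical, and `ConnectionError` stands in for whatever exception a real HTTP client raises.

```python
import json
from typing import Any, Iterable, Optional, Tuple


def consume_json(fragments: Iterable[str]) -> Tuple[str, Optional[Any]]:
    """Buffer streamed fragments; return (raw_text, parsed_or_None)."""
    buffer = []
    try:
        for fragment in fragments:
            buffer.append(fragment)
    except ConnectionError:
        # Mid-stream failure: keep what arrived, but do not try to parse it.
        return "".join(buffer), None
    raw = "".join(buffer)
    return raw, json.loads(raw)  # safe to parse only after the stream completes


def complete_stream():
    yield '{"city": '
    yield '"Paris"}'


def flaky_stream():
    yield '{"city": '
    raise ConnectionError("connection dropped mid-stream")


ok_raw, ok_parsed = consume_json(complete_stream())
bad_raw, bad_parsed = consume_json(flaky_stream())
print(ok_parsed, repr(bad_raw), bad_parsed)
```

What to do with the partial text (show it, discard it, or retry the request) is an application decision; the sketch only separates the two outcomes so the caller can choose.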

Examples

  • A ChatGPT-style web app: SSE stream rendering tokens as they arrive, TTFT ~0.6s vs full-wait of ~5s.
  • Anthropic's Messages API with `"stream": true` set in the request: each text delta event carries a fragment that the client appends to the rendered response.

Frequently asked

How is Streaming (LLM Responses) related to Time to First Token?

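Streaming (LLM Responses) and Time to First Token are both inference concepts. Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. Streaming is what exposes TTFT to the user: without it, nothing is shown until the whole response is ready, so TTFT is the main user-perceived latency metric for streaming chat.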

Is Streaming (LLM Responses) considered beginner?

Streaming (LLM Responses) is generally considered beginner-level material in the AI and LLM space.

Time to First Token · Inference

Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. It is the user-perceived latency metric for streaming chat.

Time per Output Token · Inference

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
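Both metrics can be measured on the client from arrival timestamps. The sketch below assumes one timestamp per received token and uses a hypothetical helper name; real streams often deliver several tokens per chunk, in which case the gaps describe chunks rather than tokens.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_sent_s: float, arrivals_s: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from per-token arrival times."""
    # TTFT: delay between sending the request and the first token arriving.
    ttft = arrivals_s[0] - request_sent_s
    # TPOT: mean gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(arrivals_s, arrivals_s[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot


# Assumed timestamps: request sent at t=0, four tokens arriving 20 ms apart.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.52, 0.54, 0.56])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```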

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.
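As a rough illustration of that definition, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) truncation. The function name and default values are assumptions, and production inference engines run this on tensors rather than Python lists.

```python
import math
import random
from typing import Optional, Sequence


def sample_top_p(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index using temperature scaling and top-p truncation."""
    rng = rng or random.Random()
    # Temperature: divide logits before softmax (lower = more peaked).
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw from the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# With one dominant logit and top_p=0.5, only that token survives truncation.
picks = {sample_top_p([5.0, 1.0, 0.0], top_p=0.5, rng=random.Random(0)) for _ in range(20)}
print(picks)
```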

Structured Output · Inference

Structured output constrains an LLM to emit text matching a schema, usually JSON. When the schema is enforced during decoding, the output can be guaranteed to be valid, so your code can parse it without retries.
