Inference · intermediate

Time to First Token (TTFT)

Time to first token (TTFT) is the time from sending a request until the first response token arrives. It is the main user-perceived latency metric for streaming chat.

Published May 31, 2026

Explanation

In a streaming chat UI, TTFT is the wait before any text appears: once the first token arrives, the response starts flowing. It is the latency users notice most; how long the full response takes is secondary as long as tokens keep arriving at a readable pace.
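A minimal sketch of how TTFT could be measured on the client side. The `fake_token_stream` generator is a hypothetical stand-in for a real streaming API (it only sleeps and yields strings), and `measure_ttft` is an illustrative helper name, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def fake_token_stream(tokens: Iterable[str], first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: waits, then yields tokens."""
    for i, tok in enumerate(tokens):
        # A longer wait before the first token mimics prefill; later gaps mimic decoding.
        time.sleep(first_delay if i == 0 else gap)
        yield tok


def measure_ttft(start_stream: Callable[[], Iterator[str]]) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one request."""
    start = time.perf_counter()  # clock starts when the request is sent
    ttft: Optional[float] = None
    count = 0
    for _ in start_stream():
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has arrived
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count


ttft, total, n = measure_ttft(
    lambda: fake_token_stream(["Hel", "lo", "!"], first_delay=0.05, gap=0.01)
)
print(f"TTFT={ttft:.3f}s total={total:.3f}s tokens={n}")
```

With a real client, the same pattern applies: start the timer immediately before issuing the request and record the elapsed time when the first streamed chunk arrives.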

TTFT is usually dominated by prompt processing (the "prefill" pass): the model has to ingest the full prompt before it can produce the first token. Network round-trip and any queueing on the server add to it. For a long-context call (100K tokens of context, short response), TTFT can be many seconds even though generation itself is quick once it starts.
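As a back-of-the-envelope illustration, prefill-dominated TTFT can be approximated as a fixed overhead plus prompt length divided by prefill throughput. The function name and every number below are assumptions chosen for illustration, not measurements of any real model, and the linear model ignores effects such as attention cost growing with context length.

```python
def estimate_ttft(prompt_tokens: int, prefill_tokens_per_s: float, overhead_s: float = 0.0) -> float:
    """Rough model: fixed overhead plus time to prefill the whole prompt."""
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    return overhead_s + prompt_tokens / prefill_tokens_per_s


# Assumed, illustrative numbers only: 10,000 tokens/s prefill, 0.2 s fixed overhead.
short_prompt = estimate_ttft(prompt_tokens=1_000, prefill_tokens_per_s=10_000, overhead_s=0.2)
long_prompt = estimate_ttft(prompt_tokens=100_000, prefill_tokens_per_s=10_000, overhead_s=0.2)

print(f"1K-token prompt:   ~{short_prompt:.1f}s")   # ~0.3s
print(f"100K-token prompt: ~{long_prompt:.1f}s")    # ~10.2s
```

Under these assumed figures, the hundredfold longer prompt moves TTFT from a fraction of a second to roughly ten seconds, regardless of how short the response is.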

Optimization levers: prompt caching (Anthropic, OpenAI, and Gemini all support cached prefixes that skip prefill for the cached portion), shorter system prompts, routing simple requests to smaller models, and serving with prefix-aware engines such as vLLM.
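The idea behind cached prefixes can be sketched with a toy model that counts how many prompt tokens still need prefill when an earlier request shared the same leading tokens. `ToyPrefixCache` is a hypothetical class for illustration; it stores no KV state and does not reflect how any provider or vLLM implements caching (real systems typically cache in blocks, expire entries, and still process at least the final token).

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token IDs that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers earlier prompts and reports how many tokens still need prefill."""

    def __init__(self) -> None:
        self._seen: List[List[int]] = []

    def tokens_to_prefill(self, prompt: Sequence[int]) -> int:
        # Longest prefix shared with any earlier prompt counts as already computed.
        best = max((shared_prefix_len(prompt, old) for old in self._seen), default=0)
        self._seen.append(list(prompt))
        return len(prompt) - best


system_prompt = list(range(1000))        # stand-in token IDs for a long shared prefix
first_call = system_prompt + [5001, 5002]
second_call = system_prompt + [6001]

cache = ToyPrefixCache()
print(cache.tokens_to_prefill(first_call))   # 1002: nothing cached yet
print(cache.tokens_to_prefill(second_call))  # 1: only the new suffix needs prefill
```

Because TTFT scales with the tokens that must be prefilled, keeping the stable part of the prompt (system prompt, documents, tool definitions) at the front maximizes the reusable prefix.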

Examples

Claude with a 50K-token cached prefix: TTFT drops from ~6s to under 1s on subsequent calls reusing the same prefix.
Routing simple FAQ queries to Haiku for ~200ms TTFT instead of Sonnet at ~600ms.

Frequently asked

What is Time to First Token?

Time to first token (TTFT) is the time from sending a request until the first response token arrives. It is the main user-perceived latency metric for streaming chat.

What is an example of time to first token?

Claude with a 50K-token cached prefix: TTFT drops from ~6s to under 1s on subsequent calls reusing the same prefix.

How is Time to First Token related to Time per Output Token?

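Both are inference latency metrics, but they measure different phases of a streamed response. TTFT covers the wait before the first token arrives; time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens, which determines how fast text appears once generation starts. Total response time is roughly TTFT plus TPOT multiplied by the number of remaining tokens.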
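A small sketch of how both metrics could be derived from the same set of token arrival timestamps. The function name and the timestamps are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_sent: float, arrivals: Sequence[float]) -> Tuple[float, float]:
    """Compute (TTFT, TPOT) in seconds from token arrival timestamps."""
    if not arrivals:
        raise ValueError("need at least one token arrival time")
    ttft = arrivals[0] - request_sent
    if len(arrivals) == 1:
        tpot = 0.0  # no inter-token gaps to average
    else:
        # Average gap between consecutive tokens after the first one.
        tpot = (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
    return ttft, tpot


# Hypothetical timestamps (seconds): request at t=0, four tokens arriving afterwards.
arrival_times = [0.60, 0.62, 0.64, 0.66]
ttft, tpot = ttft_and_tpot(0.0, arrival_times)
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")

# Total latency decomposes as TTFT + TPOT * (tokens - 1).
total = ttft + tpot * (len(arrival_times) - 1)
print(f"total={total:.3f}s")
```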

Is Time to First Token considered intermediate?

Time to First Token is generally considered intermediate-level material in the AI and LLM space.

Time per Output Token · Inference

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.

Streaming (LLM Responses) · Inference

Streaming returns tokens to the client as they're generated rather than holding the full response until completion. It is typically implemented over Server-Sent Events (SSE) or WebSocket, and it is what makes chat UIs feel fast.

Prompt Caching · Inference

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute for that prefix. This can cut TTFT and cost by roughly 50-90%, depending on the provider and on how much of the prompt is cached.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.

Sources

Anthropic — Prompt caching