Inference · intermediate
Time to First Token (TTFT)
Time to first token (TTFT) is the elapsed time between sending a request and receiving the first token of the response. It is the main user-perceived latency metric for streaming chat.
Explanation
In a streaming chat UI, TTFT is how long the user waits before the response starts to appear. For users it is the latency that matters most; how long the full response takes is secondary as long as tokens keep flowing at a readable pace.
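TTFT can be measured on the client by timestamping the request and the first streamed chunk. The following is a minimal sketch, assuming the client exposes the response as a Python iterator of text chunks. `measure_stream` and `fake_stream` are hypothetical names used only for illustration, and the sleeps stand in for a real model.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (ttft_seconds, mean_gap_seconds, chunk_count) for one response.

    start_stream is any callable that sends the request and returns an
    iterator of chunks (a hypothetical interface, not a specific SDK).
    """
    t_start = time.perf_counter()
    t_first = None
    t_last = None
    count = 0
    for _chunk in start_stream():
        now = time.perf_counter()
        if t_first is None:
            t_first = now  # first chunk arrived: this marks TTFT
        t_last = now
        count += 1
    if t_first is None:
        return None, None, 0  # the stream produced nothing
    ttft = t_first - t_start
    mean_gap = (t_last - t_first) / (count - 1) if count > 1 else None
    return ttft, mean_gap, count


def fake_stream(prefill_s: float = 0.05, gap_s: float = 0.01, n: int = 5):
    """Stand-in for a model: a slow first chunk, then steady chunks."""
    time.sleep(prefill_s)
    for i in range(n):
        if i > 0:
            time.sleep(gap_s)
        yield f"tok{i} "


ttft, mean_gap, count = measure_stream(fake_stream)
print(f"TTFT {ttft:.3f}s, mean gap {mean_gap:.3f}s over {count} chunks")
```

Streamed chunks do not always map one-to-one to tokens, so the mean gap here only approximates per-token timing.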
For long prompts, TTFT is dominated by prompt processing (the "prefill" pass): the model has to ingest the full prompt before it can produce the first token. Queueing and network round trips add to it as well. For a long-context call (for example, 100K tokens of context and a short response), TTFT can be many seconds even though the response itself is generated quickly.
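A back-of-the-envelope estimate makes the scaling concrete. The sketch below assumes a simple linear model (fixed overhead plus prompt tokens divided by prefill throughput), which ignores the extra attention cost at very long contexts. The throughput and overhead figures are illustrative assumptions, not measurements from any provider.

```python
def estimate_ttft(prompt_tokens: int,
                  prefill_tokens_per_s: float,
                  overhead_s: float = 0.0) -> float:
    """Linear estimate: fixed overhead + prompt tokens / prefill throughput."""
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    return overhead_s + prompt_tokens / prefill_tokens_per_s


# Illustrative numbers only: an assumed 20K tokens/s of prefill throughput
# and 0.2 s of queueing plus network overhead.
short_prompt = estimate_ttft(2_000, 20_000, overhead_s=0.2)
long_prompt = estimate_ttft(100_000, 20_000, overhead_s=0.2)
print(f"2K-token prompt:   ~{short_prompt:.1f}s")
print(f"100K-token prompt: ~{long_prompt:.1f}s")
```

Under these assumptions the long prompt waits seconds for its first token while the short one waits a fraction of a second, regardless of how long either response is.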
Optimization levers: prompt caching (Anthropic, OpenAI, and Gemini all support cached prefixes, so the cached part of the prompt is not reprocessed), shorter system prompts, routing simple requests to smaller models, and serving with prefix-aware engines such as vLLM.
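To see why a cached prefix cuts prefill work, consider a toy model of prefix reuse. This is a sketch only: `ToyPrefixCache` is a hypothetical class, token IDs are plain integers, and it does not describe how any provider or engine implements caching. Production engines are more involved (for example, they typically match in fixed-size blocks and evict old entries).

```python
from typing import List, Sequence, Tuple


class ToyPrefixCache:
    """Remembers previously processed token sequences (toy model only)."""

    def __init__(self) -> None:
        self._seen: List[Tuple[int, ...]] = []

    def tokens_to_prefill(self, tokens: Sequence[int]) -> int:
        """Return how many tokens still need prefill, then remember the prompt."""
        best = 0
        for old in self._seen:
            shared = 0
            for a, b in zip(old, tokens):
                if a != b:
                    break  # prefixes diverge here
                shared += 1
            best = max(best, shared)
        self._seen.append(tuple(tokens))
        return len(tokens) - best


cache = ToyPrefixCache()
system_prompt = list(range(1000))  # stand-in for a long shared prefix

first = cache.tokens_to_prefill(system_prompt + [5001, 5002])
second = cache.tokens_to_prefill(system_prompt + [7001, 7002, 7003])
print(first, second)
```

The first call processes the whole prompt; the second only processes the tokens after the shared prefix. Note that reuse depends on the prefix being identical from the first token, so changing content at the start of the prompt defeats the cache.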
Examples
- Claude with a 50K-token cached prefix: TTFT drops from ~6s to under 1s on subsequent calls reusing the same prefix.
- Routing simple FAQ queries to Haiku for ~200ms TTFT instead of Sonnet at ~600ms.
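The routing idea in the second example can be sketched with a simple heuristic. Everything here is an assumption for illustration: the model labels `small-model` and `large-model` are placeholders, and the word-count threshold and keyword list are arbitrary. Real routers often use a trained classifier instead.

```python
# Phrases that suggest a query needs the larger model (arbitrary examples).
COMPLEX_HINTS = ("explain why", "step by step", "write code", "compare")


def pick_model(query: str, max_simple_words: int = 20) -> str:
    """Toy router: short queries without complexity hints go to the small model."""
    text = query.lower()
    is_short = len(text.split()) <= max_simple_words
    has_hint = any(hint in text for hint in COMPLEX_HINTS)
    return "small-model" if is_short and not has_hint else "large-model"


print(pick_model("What are your opening hours?"))
print(pick_model("Compare these two contracts step by step"))
```

The design trade-off is that a misrouted hard query gets a fast but weaker answer, so the heuristic should err toward the larger model when unsure.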
Frequently asked
How is Time to First Token related to Time per Output Token?
Both are inference latency metrics, and they cover different phases of a streamed response. TTFT measures the wait before the first token appears. Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming, which determines how fast text appears once generation starts.
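The two metrics combine into a simple estimate of end-to-end streaming time: the wait for the first token plus one TPOT interval for each later token. The sketch below assumes a constant TPOT, and the figures are illustrative only.

```python
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Estimate total response time: TTFT + TPOT * (output_tokens - 1)."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


# Illustrative numbers only: 0.6 s TTFT, 20 ms per token, 500 output tokens.
estimate = total_latency(0.6, 0.02, 500)
print(f"~{estimate:.1f}s end to end, of which 0.6s is TTFT")
```

For long responses TPOT dominates the total, while for short responses TTFT does, which is why the two are tracked separately.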