Inference · intermediate
Time per Output Token (TPOT, inter-token latency)
Time per output token (TPOT), also called inter-token latency, is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
Explanation
Once the model produces the first token (after TTFT), TPOT governs the streaming speed. A TPOT of 30 ms means roughly 33 tokens per second, which is well above typical human reading speed. Frontier models commonly hit 50-150 tokens/sec on a single GPU stream; specialized hardware (Groq, Cerebras) pushes well past 500.
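A minimal sketch of how TPOT can be measured on the client side, assuming a wall-clock timestamp is recorded as each streamed token arrives. The function names are illustrative and do not come from any particular library.

```python
from typing import Sequence


def mean_tpot_seconds(arrival_times: Sequence[float]) -> float:
    """Average gap between consecutive token arrival times, in seconds."""
    if len(arrival_times) < 2:
        raise ValueError("need at least two token timestamps")
    # The sum of consecutive gaps telescopes to (last - first),
    # so the mean gap is that span divided by the number of gaps.
    return (arrival_times[-1] - arrival_times[0]) / (len(arrival_times) - 1)


def tokens_per_second(tpot_seconds: float) -> float:
    """Streaming throughput is the reciprocal of TPOT."""
    return 1.0 / tpot_seconds


# Hypothetical timestamps (seconds) for four tokens arriving 30 ms apart.
example_tpot = mean_tpot_seconds([0.00, 0.03, 0.06, 0.09])
example_rate = tokens_per_second(example_tpot)
```

With timestamps spaced 30 ms apart, this gives a TPOT of about 0.03 s and a rate of about 33 tokens per second, matching the figure above. The timestamp of the first token is the starting point, so TTFT is excluded from the average.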
TPOT is dominated by the per-token decode step: one forward pass through the model for each new token. That pass is limited by GPU memory bandwidth more than by FLOPs, because the model weights must be read from memory for every token. Speculative decoding (with a smaller draft model) and quantization both attack this.
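The memory-bandwidth argument can be turned into a back-of-the-envelope floor on TPOT. This is a rough sketch, assuming every weight is read once per generated token at batch size 1. It ignores KV-cache reads, kernel overheads, and multi-GPU sharding. The function name and the bandwidth figure (about 3.35 TB/s) are assumptions chosen for illustration.

```python
def bandwidth_bound_tpot_ms(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Rough lower bound on per-token latency, in milliseconds.

    Assumes each decode step must stream all model weights from memory once.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / mem_bandwidth_bytes_per_s * 1e3


# Illustrative assumptions: 70e9 parameters, ~3.35e12 bytes/s of memory bandwidth.
fp16_floor_ms = bandwidth_bound_tpot_ms(70e9, 2.0, 3.35e12)  # 16-bit weights
int8_floor_ms = bandwidth_bound_tpot_ms(70e9, 1.0, 3.35e12)  # 8-bit weights
```

Under these assumptions the floor is roughly 42 ms with 16-bit weights and roughly 21 ms with 8-bit weights. Halving the bytes per parameter halves the floor, which is why quantization reduces TPOT. Real deployments will differ, since large models are often sharded across several GPUs.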
TTFT + (output_tokens × TPOT) ≈ total response time. For long responses, TPOT dominates; for short responses, TTFT dominates.
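The approximation above can be written as a small helper. This is a sketch with hypothetical names and example numbers, not measurements from any particular system.

```python
def estimate_response_seconds(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate total response time as TTFT + output_tokens * TPOT."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + output_tokens * tpot_s


def decode_share(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Fraction of the estimated total time spent streaming tokens."""
    total = estimate_response_seconds(ttft_s, tpot_s, output_tokens)
    return (output_tokens * tpot_s) / total


# Assumed example values: TTFT = 0.5 s, TPOT = 25 ms.
short_total = estimate_response_seconds(0.5, 0.025, 20)    # short reply
long_total = estimate_response_seconds(0.5, 0.025, 1000)   # long reply
```

With these assumed values, the 20-token reply takes about 1.0 s, split evenly between TTFT and decoding, while the 1000-token reply takes about 25.5 s, almost all of it decoding. Strictly, the first token is already covered by TTFT, so the decode term is closer to (output_tokens - 1) × TPOT; the difference is negligible for all but very short replies.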
Examples
- A 70B model on H100: TPOT ~25ms (~40 tokens/sec).
- A speculatively decoded 70B model with an 8B drafter: TPOT effectively halved.
Frequently asked
What is Time per Output Token?
Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
What is an example of time per output token?
A 70B model on H100: TPOT ~25ms (~40 tokens/sec).
How is Time per Output Token related to Time to First Token?
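Time per Output Token and Time to First Token are both inference latency metrics. Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives; it is the user-perceived latency metric for streaming chat. TPOT governs the streaming speed after that first token. Together they approximate total response time: TTFT + (output_tokens × TPOT).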
Is Time per Output Token considered intermediate?
Time per Output Token is generally considered intermediate-level material in the AI and LLM space.