ModelTerms

Inference · intermediate

Time per Output Token (TPOT, inter-token latency)

Time per output token (TPOT), also called inter-token latency, is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.

Explanation

Once the model produces the first token (after TTFT), TPOT governs the streaming speed. A TPOT of 30 ms means roughly 33 tokens per second, which is faster than most people read. Frontier models commonly reach 50-150 tokens/sec for a single stream on GPUs; specialized hardware (Groq, Cerebras) pushes well past 500.
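As a minimal sketch of how TPOT could be measured on the client side, the snippet below averages the gaps between token arrival times and converts the result to tokens per second. The function name and the timestamps are hypothetical, chosen only to reproduce the 30 ms figure above.

```python
from typing import Sequence


def mean_tpot_ms(arrival_times_s: Sequence[float]) -> float:
    """Average gap between consecutive token arrivals, in milliseconds."""
    if len(arrival_times_s) < 2:
        raise ValueError("need at least two token timestamps")
    # Sum of all gaps equals last arrival minus first arrival.
    span_s = arrival_times_s[-1] - arrival_times_s[0]
    return 1000.0 * span_s / (len(arrival_times_s) - 1)


# Hypothetical arrival times, in seconds since the request was sent.
# The first entry is the TTFT; it is not part of TPOT.
timestamps = [0.40, 0.43, 0.46, 0.49, 0.52]

tpot_ms = mean_tpot_ms(timestamps)
tokens_per_second = 1000.0 / tpot_ms
print(f"TPOT: {tpot_ms:.1f} ms, {tokens_per_second:.1f} tokens/sec")
```

Note that the first timestamp only anchors the measurement: the time before it belongs to TTFT, so it does not affect the average.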

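TPOT is dominated by the per-token decode step: one forward pass through the model for each new token. At small batch sizes that pass is limited by GPU memory bandwidth more than by FLOPs, because the model weights must be read from memory for every token. Speculative decoding (with a small draft model) and quantization both attack this cost.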
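The memory-bandwidth argument can be turned into a rough lower bound: if every weight is read once per token, TPOT cannot be much lower than the weight size divided by the memory bandwidth. The sketch below assumes batch size 1, ignores KV cache reads and kernel overhead, and uses 3.35 TB/s as an assumed peak bandwidth for an H100 SXM; the function name is hypothetical.

```python
def decode_floor_ms(params_billion: float,
                    bytes_per_param: float,
                    bandwidth_tb_s: float) -> float:
    """Rough lower bound on TPOT when decode is memory-bandwidth bound.

    Assumes all weights are streamed from memory once per generated token.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_s = bandwidth_tb_s * 1e12
    return 1000.0 * weight_bytes / bandwidth_bytes_s


# Assumed peak HBM bandwidth for an H100 SXM, in TB/s.
H100_BANDWIDTH_TB_S = 3.35

# 70B parameters at 16-bit, 8-bit, and 4-bit precision.
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    floor = decode_floor_ms(70, bytes_per_param, H100_BANDWIDTH_TB_S)
    print(f"{label}: at least {floor:.1f} ms per token")
```

Under these assumptions the bound is roughly 42 ms at 16-bit, 21 ms at 8-bit, and 10 ms at 4-bit, which is why quantization lowers TPOT directly. Real deployments also split the model across several GPUs, which raises the aggregate bandwidth.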

TTFT + (output_tokens × TPOT) ≈ total response time. For long responses, TPOT dominates; for short responses, TTFT dominates.
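A short sketch of this formula shows how the balance shifts with response length. The TTFT and TPOT values are hypothetical, and the function names are illustrative only.

```python
def total_response_s(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """Approximate total response time: TTFT + output_tokens * TPOT."""
    return ttft_s + output_tokens * tpot_s


def decode_share(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """Fraction of the total response time spent streaming tokens."""
    decode_s = output_tokens * tpot_s
    return decode_s / (ttft_s + decode_s)


# Hypothetical request: 500 ms TTFT, 30 ms TPOT.
TTFT_S = 0.5
TPOT_S = 0.030

for n_tokens in (10, 1000):
    total = total_response_s(TTFT_S, n_tokens, TPOT_S)
    share = decode_share(TTFT_S, n_tokens, TPOT_S)
    print(f"{n_tokens} tokens: {total:.1f} s total, {share:.0%} spent in decode")
```

With these numbers, a 10-token reply spends most of its time waiting for the first token, while a 1000-token reply spends almost all of it in decode.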

Examples

  • A 70B model on H100: TPOT ~25 ms (~40 tokens/sec).
  • A speculatively decoded 70B model with an 8B draft model: effective TPOT roughly halved.
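To see where a figure like "roughly halved" can come from, the sketch below uses the standard expected-speedup estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability; the acceptance rate, draft length, and draft cost ratio are hypothetical values, not measurements from the examples above.

```python
def speculative_speedup(accept_rate: float,
                        draft_len: int,
                        draft_cost_ratio: float) -> float:
    """Expected speedup of speculative decoding over plain decoding.

    accept_rate:      probability that a drafted token is accepted (assumed i.i.d.)
    draft_len:        tokens the draft model proposes per verification step
    draft_cost_ratio: cost of one draft pass relative to one target pass
    """
    if accept_rate >= 1.0:
        expected_tokens = float(draft_len + 1)
    else:
        # Expected tokens produced per target-model verification pass.
        expected_tokens = (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)
    # Cost of one step, in units of a single target-model forward pass.
    step_cost = draft_len * draft_cost_ratio + 1
    return expected_tokens / step_cost


# Hypothetical settings: 70% acceptance, 4 drafted tokens,
# draft model about one tenth the cost of the target model.
speedup = speculative_speedup(accept_rate=0.7, draft_len=4, draft_cost_ratio=0.1)
baseline_tpot_ms = 25.0
print(f"speedup: {speedup:.2f}x, effective TPOT: {baseline_tpot_ms / speedup:.1f} ms")
```

Under these assumptions the speedup is close to 2x. A lower acceptance rate or a more expensive draft model shrinks the gain.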

Frequently asked

How is Time per Output Token related to Time to First Token?

Both are inference latency metrics, and they cover different phases of a request. Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives; TPOT is the delay between each token after that. Together they give the approximate total response time: TTFT + (output_tokens × TPOT).

Is Time per Output Token considered intermediate?

Time per Output Token is generally considered intermediate-level material in the AI and LLM space.

Time to First Token · Inference

Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. It is the user-perceived latency metric for streaming chat.

Streaming (LLM Responses) · Inference

Streaming returns tokens to the client as they're generated rather than holding the full response until completion. It is implemented over Server-Sent Events (SSE) or WebSocket, and it is what makes chat UIs feel fast.

Speculative Decoding · Inference

Speculative decoding speeds up generation by having a small "draft" model propose several tokens, then verifying them in a single batched call to the big model.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.
