Comparison

Time per Output Token vs Time to First Token

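Time per Output Token (TPOT) and Time to First Token (TTFT) are both LLM inference latency metrics, but they measure different phases of a response: TTFT covers the wait before output starts, and TPOT covers the pace of output once it is streaming. Here is a quick side-by-side.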

When you would reach for Time per Output Token

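Time per Output Token comes up when the question is how fast text streams once generation has started: decoding speed, per-stream throughput, or how long a long response takes to finish.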

A 70B model on H100: TPOT ~25ms (~40 tokens/sec).
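The two figures in that example are the same number in different units: tokens per second for a single stream is the reciprocal of TPOT. A minimal sketch of the conversion follows; the function names are illustrative and do not come from any particular library.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (milliseconds) to per-stream tokens/sec."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


def tpot_ms_from_rate(tokens_per_sec: float) -> float:
    """Convert per-stream tokens/sec back to time per output token (milliseconds)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return 1000.0 / tokens_per_sec


if __name__ == "__main__":
    # The example above: ~25 ms per token corresponds to ~40 tokens/sec.
    print(tokens_per_second(25.0))  # 40.0
```

Note that this is the rate seen by one stream; a server batching many requests can have a much higher aggregate throughput at the same TPOT.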

When you would reach for Time to First Token

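Time to First Token comes up when the question is how long a user waits before anything appears: prompt processing (prefill), queueing delay, and the effect of long or cached prompts.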

Claude with a 50K-token cached prefix: TTFT drops from ~6s to under 1s on subsequent calls reusing the same prefix.

Frequently asked

What is the difference between Time per Output Token and Time to First Token?

Time per Output Token: Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.

Time to First Token: Time to first token (TTFT) is how long it takes from sending a request until the first response token arrives. It is the user-perceived latency metric for streaming chat.
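Both definitions can be computed from the same set of timestamps. The sketch below assumes you have recorded the time the request was sent and the arrival time of each streamed token, all in seconds on the same clock; the function name and input shape are hypothetical, not part of any client library.

```python
from typing import Optional, Sequence, Tuple


def ttft_and_tpot(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in seconds from recorded timestamps.

    request_sent: time the request was sent.
    token_times:  arrival time of each streamed token, in order.
    TPOT is None when fewer than two tokens arrived, since there is
    no inter-token gap to average.
    """
    if not token_times:
        raise ValueError("no tokens were received")

    # TTFT: delay from sending the request to the first token.
    ttft = token_times[0] - request_sent

    if len(token_times) < 2:
        return ttft, None

    # TPOT: average gap between consecutive tokens after the first one.
    gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / gaps
    return ttft, tpot


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()` read as each streamed chunk arrives. A chunk may carry more than one token, so per-chunk timing is only an approximation of per-token timing.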

When should I use Time per Output Token vs Time to First Token?

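Time per Output Token is the right concept when you are focused on streaming speed after generation begins, for example when tuning decode throughput or estimating how long a long response will take. Time to First Token applies when you are focused on responsiveness before generation begins, for example with long prompts, prefix caching, or queueing delay.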
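One way to see how the two metrics combine is a rough end-to-end estimate: total response time is approximately TTFT plus one TPOT for each token after the first. This is a simplified model that assumes a constant TPOT; the function name and the input values are made up for illustration.

```python
def estimate_total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Rough end-to-end latency in seconds for a streamed response.

    Assumes the first token arrives after ttft_s and each later token
    arrives tpot_s after the previous one (a constant-TPOT approximation).
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


if __name__ == "__main__":
    # Hypothetical inputs: short answers are dominated by TTFT,
    # long answers by TPOT.
    for n in (1, 20, 1000):
        total = estimate_total_latency(ttft_s=1.0, tpot_s=0.025, output_tokens=n)
        print(f"{n:>5} tokens: ~{total:.2f}s")
```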

Are Time per Output Token and Time to First Token the same thing?

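No. Both are inference latency metrics, but Time to First Token measures the delay before the first token arrives, while Time per Output Token measures the average gap between tokens after that. They are related but describe different phases of the same request.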