Inference · intermediate

Top-k

Top-k sampling restricts each decoding step to the k highest-probability tokens, renormalizes their probabilities, and samples from that set. It is a simpler alternative to top-p.

Published May 29, 2026

Explanation

Pick k (say, 50), and at each step the sampler draws only from the 50 most likely next tokens; every other token gets probability zero. This cuts off the long tail of low-probability tokens, which is where most "weird token" failures come from.
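A minimal sketch of that step, using only the Python standard library. The function name `top_k_sample` and the five-entry `logits` list are hypothetical and chosen for illustration; real decoders run the same logic on tensors over the full vocabulary.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring entries of `logits`."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only, which renormalizes their probabilities.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)                # seeded so repeated runs draw the same samples
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))           # only indices 0, 1 and 2 can ever appear
```

Tokens 3 and 4 are never drawn, however many samples are taken, because they fall outside the top 3.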

Top-k is less adaptive than top-p: it always considers exactly k tokens, regardless of how confident or uncertain the model is at that step. Most modern systems prefer top-p for that reason, though top-k still appears in older codebases and some open-source defaults.
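The difference in adaptivity can be illustrated by counting how many candidates each method keeps. This is a sketch under stated assumptions: the helper names and the two toy distributions are invented for illustration, and the top-p rule used is the usual "smallest set whose cumulative probability reaches p".

```python
def top_k_size(probs, k):
    """Top-k keeps k candidates no matter what the distribution looks like."""
    return min(k, len(probs))

def top_p_size(probs, p):
    """Size of the smallest set of most-likely tokens whose cumulative probability reaches p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

confident = [0.96, 0.01, 0.01, 0.01, 0.01]  # model is nearly sure of one token
uncertain = [0.2, 0.2, 0.2, 0.2, 0.2]       # model has no preference

print(top_k_size(confident, 3), top_k_size(uncertain, 3))      # 3 3
print(top_p_size(confident, 0.9), top_p_size(uncertain, 0.9))  # 1 5
```

With k = 3, top-k keeps two unlikely tokens in the confident case and drops two equally plausible tokens in the uncertain case; top-p shrinks and grows with the distribution.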

Examples

top-k = 50: a common default in Hugging Face generation.
top-k = 1: same as greedy decoding (always pick the top token).
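The k = 1 case can be checked directly by looking at the distribution that is left after truncation. This is an illustrative sketch; `top_k_distribution` and the four-entry `probs` list are hypothetical.

```python
def top_k_distribution(probs, k):
    """Zero out everything outside the k most-likely tokens and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical next-token distribution
greedy = max(range(len(probs)), key=lambda i: probs[i])  # index greedy decoding would pick

one_hot = top_k_distribution(probs, 1)  # all mass on the top token
top_two = top_k_distribution(probs, 2)  # mass split 2:1 between tokens 0 and 1

print(one_hot)  # [1.0, 0.0, 0.0, 0.0]
```

With k = 1 the only token that can be sampled is the one greedy decoding would choose, so the two methods produce the same output.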

Frequently asked

How is Top-k related to Top-p?

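Both are truncation strategies applied to the next-token distribution before sampling. Top-k keeps a fixed number of tokens; top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p, so the size of the set varies from step to step. Common values of p are 0.9 to 0.95.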

Is Top-k considered intermediate?

Top-k is generally considered intermediate-level material in the AI and LLM space.

Top-p · Inference

Top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p. Common values are 0.9-0.95.

Temperature · Inference

Temperature is a generation parameter that controls randomness. 0 is deterministic (always pick the most likely token); higher values produce more diverse, surprising output.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Greedy Decoding · Inference

Greedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.

Sources

Hierarchical Neural Story Generation (arXiv)