Inference · intermediate

Top-p (nucleus sampling)

Top-p (nucleus sampling) restricts token selection to the smallest set of highest-probability tokens whose cumulative probability reaches or exceeds p, then samples from that set. Common values are 0.9 to 0.95.

Published May 29, 2026

Explanation

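If the model assigns 50% probability to "the", 20% to "a", 10% to "this", and tiny amounts to thousands of others, top-p = 0.8 would sample only from {the, a, this}, whose cumulative probability is 80%. The probabilities of those three tokens are renormalized to sum to 1 before sampling, giving 62.5%, 25%, and 12.5%.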
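
The worked example above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function name `top_p_filter`, the dict-based input, and the tail tokens with their probabilities are assumptions made up for the example.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set of tokens whose cumulative
    probability reaches p, then renormalize so the kept set sums to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        # Small tolerance: 0.5 + 0.2 + 0.1 lands slightly below 0.8 in floating point.
        if cumulative >= p - 1e-9:
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

# Hypothetical distribution: the three head tokens from the example
# plus a short, made-up tail standing in for "thousands of others".
example_probs = {
    "the": 0.50, "a": 0.20, "this": 0.10,
    "that": 0.05, "an": 0.05, "one": 0.04, "zebra": 0.03, "quark": 0.03,
}

nucleus = top_p_filter(example_probs, p=0.8)
# Sample the next token from the renormalized nucleus only.
next_token = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
```

With p = 0.8 the nucleus contains only "the", "a", and "this"; with p = 1.0 every token survives, which matches the "no filtering" case.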

This avoids the failure mode where the model occasionally picks a wildly unlikely token from the long tail (which degrades coherence) while still allowing diversity within the high-probability region.

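Top-p is generally preferred over top-k because the size of the sampling set adapts to how confident the model is. When one token dominates, the nucleus may shrink to a single token; when probability is spread over many plausible tokens, the nucleus grows to include them. Top-k keeps the same number of candidates in both cases.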
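
To make the adaptivity concrete, the sketch below counts how many candidates each method keeps for a peaked and a flat distribution. The helper names and both distributions are hypothetical, chosen only to illustrate the contrast.

```python
def nucleus_size(probs, p):
    """Number of top-ranked tokens needed for cumulative probability to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            return count
    return len(probs)

def top_k_size(probs, k):
    """Top-k always keeps k candidates (or all of them if fewer exist)."""
    return min(k, len(probs))

# Made-up distributions over six tokens.
confident = [0.92, 0.03, 0.02, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]  # mass spread almost evenly

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, "top-p keeps", nucleus_size(dist, 0.9), "| top-k keeps", top_k_size(dist, 3))
```

Under these assumptions top-p = 0.9 keeps one token for the confident distribution and all six for the uncertain one, while top-k with k = 3 keeps three in both cases.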

Examples

top-p = 0.9: typical for chat assistants.
top-p = 1.0: no filtering; sample from the full distribution.

Frequently asked

How is Top-p related to Temperature?

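Both are sampling parameters applied at inference time, and they are often used together. Temperature reshapes the whole distribution: 0 is deterministic (always pick the most likely token), and higher values flatten the distribution and produce more diverse, surprising output. Top-p then cuts off the low-probability tail, typically after temperature has been applied.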

Temperature · Inference

Temperature is a generation parameter that controls randomness. 0 is deterministic (always pick the most likely token); higher values produce more diverse, surprising output.
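
A common way to realize this is to divide the logits by the temperature before the softmax. The sketch below assumes that formulation; the function name and the example logits are hypothetical, and treating temperature 0 as a one-hot pick of the top token is a convention chosen here to match the description above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Deterministic case: all probability on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature: more peaked
flat = softmax_with_temperature(logits, 2.0)   # higher temperature: flatter
```

The top token's probability is higher in `sharp` than in `flat`, which is why lower temperatures read as more predictable.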

Top-k · Inference

Top-k restricts token sampling to the k highest-probability tokens, then samples from that set. A simpler alternative to top-p.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.
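
Put together, one decoding step might look like the sketch below: temperature first, then top-p truncation, then a random draw. The function name `sample_next_token`, the token-to-logit dict, and the example scores are assumptions for illustration; real inference stacks operate on tensors over the full vocabulary.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """logits: dict mapping token -> raw score (hypothetical input format)."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(logits, key=logits.get)

    # 1. Temperature: rescale the logits, then softmax (max subtracted for stability).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # 2. Truncation: keep the smallest top-ranked set whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p - 1e-9:  # tolerance for floating-point rounding
            break

    # 3. Sample: random.choices treats the weights as relative, so no explicit renormalization.
    tokens = [tok for tok, _ in kept]
    masses = [prob for _, prob in kept]
    return rng.choices(tokens, weights=masses, k=1)[0]

example_logits = {"the": 3.0, "a": 2.0, "this": 1.0, "zebra": -5.0}  # made-up scores
token = sample_next_token(example_logits, temperature=1.0, top_p=0.9, rng=random.Random(0))
```

Passing a seeded `random.Random` instance as `rng` keeps runs reproducible, which helps when comparing parameter settings.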

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Sources

The Curious Case of Neural Text Degeneration (arXiv)