Inference · intermediate

Sampling

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Published May 29, 2026

Explanation

An LLM produces a score (logit) for every token in its vocabulary, and a softmax turns those scores into a probability distribution. Sampling is how one concrete token gets picked from that distribution. The simplest strategy is greedy decoding, which always picks the most likely token; the alternatives (temperature combined with top-p or top-k truncation) introduce controlled randomness.
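As a rough illustration of the difference between greedy picking and random sampling, here is a minimal sketch in plain Python. The logits, the token names, and the helper functions (softmax, greedy_pick, sample_pick) are all made up for the example; a real model would supply scores over its full vocabulary.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy_pick(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the next token after "The cat sat on the".
toy_logits = {"mat": 2.0, "sofa": 1.0, "roof": 0.5, "moon": -1.0}
probs = softmax(toy_logits)

rng = random.Random(0)  # seeded so repeated runs draw the same tokens
draws = [sample_pick(probs, rng) for _ in range(1000)]

print(greedy_pick(probs))                        # always "mat"
print({t: draws.count(t) for t in toy_logits})   # counts roughly track probs
```

Greedy returns the same token every time, while the sampled draws spread across the vocabulary in proportion to each token's probability.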

Sampling settings strongly shape the "feel" of model output. Too little randomness can make an excellent model sound dull and repetitive; too much can make it sound unhinged.

Examples

OpenAI default: temperature 1.0, top-p 1.0.
Anthropic default: temperature 1.0; top-p and top-k are optional overrides.

Frequently asked

How is Sampling related to Temperature?

Temperature is one of the parameters applied before sampling: it rescales the model's logits before a token is drawn. At 0, sampling reduces to greedy decoding (always pick the most likely token); higher values flatten the distribution and produce more diverse, surprising output.

Is Sampling considered intermediate?

Sampling is generally considered intermediate-level material in the AI and LLM space.

Temperature · Inference

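Temperature is a generation parameter that controls randomness by dividing the logits before the softmax. 0 is treated as greedy decoding (always pick the most likely token); values below 1 sharpen the distribution, and values above 1 flatten it, producing more diverse, surprising output.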
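A small sketch can make the effect concrete. Because softmax probabilities are proportional to exp(logit), dividing the logits by a temperature T is equivalent to raising each probability to the power 1/T and renormalizing, which is what the hypothetical apply_temperature helper below does. The starting distribution is invented for illustration.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution; equivalent to softmax(logits / temperature)."""
    if temperature == 0:
        # Limit case: all probability mass goes to the most likely token.
        best = max(probs, key=probs.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in probs}
    weights = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Hypothetical next-token distribution at temperature 1.0.
base = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}

for t in (0, 0.5, 1.0, 2.0):
    scaled = apply_temperature(base, t)
    print(t, {tok: round(p, 3) for tok, p in scaled.items()})
```

Lower temperatures push mass toward "mat"; higher temperatures move it toward the unlikely tokens.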

Top-p · Inference

Top-p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability reaches p, then samples from that set. Common values are 0.9 to 0.95.
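The truncation step can be sketched in a few lines. This is an illustrative implementation under the definition above, not any particular library's code; top_p_filter and the example distribution are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution.
base = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}

nucleus = top_p_filter(base, 0.9)
print(sorted(nucleus))  # keeps mat, roof, sofa; drops moon
```

The number of surviving tokens adapts to the distribution: a confident model may keep one or two tokens, a hesitant one many more.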

Top-k · Inference

Top-k restricts token sampling to the k highest-probability tokens, then samples from that set. It is a simpler alternative to top-p, but the cutoff does not adapt to how peaked the distribution is.
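For comparison, a minimal top-k sketch follows. The top_k_filter helper and both example distributions are invented for illustration; the point is that k stays fixed whether the model is confident or not.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {tok: prob / total for tok, prob in ranked}

# Two hypothetical distributions: one peaked, one nearly flat.
peaked = {"mat": 0.9, "sofa": 0.05, "roof": 0.03, "moon": 0.02}
flat = {"mat": 0.3, "sofa": 0.27, "roof": 0.23, "moon": 0.2}

# With k=2 both keep exactly two tokens, regardless of how much
# probability mass those two tokens actually cover.
print(top_k_filter(peaked, 2))
print(top_k_filter(flat, 2))
```

In the flat case, k=2 discards tokens that together held over 40% of the probability, which is the situation where top-p tends to behave more gracefully.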

Greedy Decoding · Inference

Greedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.
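A toy decoding loop shows why the result is deterministic. TOY_MODEL is a hypothetical lookup table standing in for a real model (it conditions only on the previous token), and greedy_decode is an illustrative helper.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"barks": 0.9, "</s>": 0.1},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", stop="</s>", max_steps=10):
    """Repeatedly take the most likely next token until the stop token."""
    tokens, current = [], start
    for _ in range(max_steps):
        options = model[current]
        current = max(options, key=options.get)  # argmax, no randomness
        if current == stop:
            break
        tokens.append(current)
    return tokens

print(greedy_decode(TOY_MODEL))  # ['the', 'cat', 'sleeps']
```

Running it again yields the same sequence, since no random draw is involved at any step.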

Beam Search · Inference

Beam search explores several candidate continuations in parallel, keeping only the highest-scoring partial sequences at each step (the number kept is the beam width). Common in translation; rare in modern LLM chat.
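The sketch below illustrates the idea on a hypothetical lookup-table model (TOY_MODEL, conditioning only on the previous token). The beam_search function is a simplified illustration that scores sequences by summed log-probability with no length normalization, which production implementations often add.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"bird": 0.95, "fish": 0.05},
    "cat": {"</s>": 1.0}, "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0}, "fish": {"</s>": 1.0},
}

def beam_search(model, beam_width, start="<s>", stop="</s>", max_steps=10):
    # Each beam is (log_probability, tokens_so_far, finished).
    beams = [(0.0, [start], False)]
    for _ in range(max_steps):
        candidates = []
        for score, tokens, done in beams:
            if done:
                candidates.append((score, tokens, True))
                continue
            for tok, prob in model[tokens[-1]].items():
                candidates.append(
                    (score + math.log(prob), tokens + [tok], tok == stop)
                )
        # Keep only the beam_width highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    best_score, best_tokens, _ = beams[0]
    words = [t for t in best_tokens if t not in (start, stop)]
    return words, math.exp(best_score)

for width in (1, 2):
    words, prob = beam_search(TOY_MODEL, width)
    print(width, words, round(prob, 2))
# width 1 -> ['the', 'cat'] 0.33
# width 2 -> ['a', 'bird'] 0.38
```

With a beam width of 1 the search behaves like greedy decoding and commits to "the"; with a width of 2 it keeps "a" alive long enough to find the more probable full sequence.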

Sources

Hugging Face — Generation strategies