Inference · intermediate

Beam Search

Beam search explores several candidate continuations in parallel, keeping the k highest-scoring partial sequences at each step, where k is the beam width. It is common in machine translation but rare in modern LLM chat.

Published May 29, 2026

Explanation

At each step, beam search expands every current candidate with every possible next token, scores each resulting sequence by its cumulative log-probability, and keeps the top beam_width of them. Candidates that end in an end-of-sequence token are set aside as complete, and the output is the highest-scoring complete sequence.
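The expand, score, and prune loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: toy_next_token_probs is a hypothetical stand-in for a real model, its probabilities are made up, and the sketch assumes no length normalization and a fixed "<eos>" end token.

```python
import math
from typing import Callable, Dict, List, Tuple

Tokens = Tuple[str, ...]


def toy_next_token_probs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical toy model: next-token probabilities for a given prefix."""
    table = {
        (): {"the": 0.5, "a": 0.4, "<eos>": 0.1},
        ("the",): {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
        ("a",): {"cat": 0.9, "dog": 0.05, "<eos>": 0.05},
    }
    # Any prefix not listed above ends the sequence with certainty.
    return table.get(prefix, {"<eos>": 1.0})


def beam_search(
    next_token_probs: Callable[[Tokens], Dict[str, float]],
    beam_width: int,
    max_len: int = 10,
    eos: str = "<eos>",
) -> Tuple[Tokens, float]:
    """Return (tokens, cumulative log-probability) of the best sequence found."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]
    finished: List[Tuple[Tokens, float]] = []

    for _ in range(max_len):
        # Expand every live candidate by every possible next token.
        candidates: List[Tuple[Tokens, float]] = []
        for tokens, score in beams:
            for token, p in next_token_probs(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + (token,), score + math.log(p)))

        # Keep only the top beam_width sequences by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))  # complete: set aside
            else:
                beams.append((tokens, score))  # still growing
        if not beams:
            break

    # Prefer complete sequences; fall back to unfinished ones if none completed.
    pool = finished or beams
    return max(pool, key=lambda c: c[1])


tokens, score = beam_search(toy_next_token_probs, beam_width=2)
print(tokens, round(math.exp(score), 2))
```

With beam_width=2 the toy model keeps both "the" and "a" after the first step, so the search can reach ("a", "cat", "<eos>") even though "a" was not the most likely first token.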

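Beam search is strong for sequence-to-sequence tasks with a clear correct answer, such as machine translation and summarization evaluation. It is rarely used in open-ended chat: it tends to produce safe, generic completions, and because it tracks beam_width sequences at once it costs roughly beam_width times as much compute per step as sampling a single sequence.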

Examples

Translation systems with beam width 4-10.
Decoding for ASR (speech-to-text).

Frequently asked

How is Beam Search related to Greedy Decoding?

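Greedy decoding is the special case of beam search with a beam width of 1: it always picks the single highest-probability next token. It is deterministic, fast, and often dull. A wider beam keeps several candidates alive, so beam search can find a higher-probability sequence that greedy decoding misses.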
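A small worked example shows why keeping more than one candidate can matter. The two-step distribution below is hypothetical, with made-up numbers chosen so that the most likely first token leads to a worse overall sequence. Greedy decoding commits to that token, while an exhaustive search over all two-token sequences finds the better one.

```python
# Hypothetical two-step toy distribution (numbers invented for illustration).
FIRST = {"the": 0.5, "a": 0.4, "an": 0.1}
SECOND = {
    "the": {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    "a": {"cat": 0.9, "dog": 0.05, "bird": 0.05},
    "an": {"cat": 0.1, "dog": 0.1, "bird": 0.8},
}


def greedy():
    """Pick the most likely token at each step, never looking back."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]


def best_overall():
    """Score every two-token sequence and return the most probable one."""
    scored = [
        ((first, second), FIRST[first] * SECOND[first][second])
        for first in FIRST
        for second in SECOND[first]
    ]
    return max(scored, key=lambda item: item[1])


print("greedy:", greedy())
print("best:  ", best_overall())
```

Greedy decoding returns ("the", "cat") with probability 0.5 × 0.4 = 0.2, while the best sequence is ("a", "cat") with probability 0.4 × 0.9 = 0.36. A beam of width 2 would keep "a" alive after the first step and so recover the better sequence without scoring every possibility.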

Is Beam Search considered intermediate?

Beam Search is generally considered intermediate-level material in the AI and LLM space.

Greedy Decoding · Inference

Greedy decoding always picks the single highest-probability next token. It is deterministic, fast, and often dull.

Sampling · Inference

Sampling is the act of choosing the next token from the model's output distribution, typically after applying temperature and a truncation strategy like top-p or top-k.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Sources

Hugging Face — Generation strategies