Beam Search

LLMs

A decoding strategy that keeps the top-k highest-probability partial sequences at each step instead of greedily picking one.


In one line

Track the top-k most likely partial sequences at every decoding step, expand each, and keep the best k overall.

What it actually means

Greedy decoding picks the single most likely next token at each step and commits. That's locally optimal but often globally bad: a slightly lower-probability token now can open up a much better continuation later. Beam search with width k keeps k candidate sequences alive, expands each by one token, scores all k × V continuations (V is the vocabulary size), and prunes back to the top k by cumulative log-probability. At the end you take the highest-scoring completed sequence. Beam size 1 reduces to greedy decoding; beam sizes of 4–8 are typical for translation.
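The expand-and-prune loop can be sketched in a few lines of plain Python. The toy model and its probabilities below are illustrative only; they are chosen so that greedy decoding takes the wrong first token while a beam of 2 recovers the best sequence.

```python
import math

def beam_search(next_logprobs, k, steps):
    """Toy beam search: expand each beam by one token, keep the best k.

    next_logprobs(seq) -> {token: log-probability} for the next token
    given the partial sequence seq (a tuple of tokens).
    """
    beams = [((), 0.0)]  # (partial sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in next_logprobs(seq).items()
        ]
        # prune the k * V candidates back to the top k
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy model where greedy (k=1) is misled: "a" wins locally,
# but "b" opens up a much better continuation.
def toy_model(seq):
    if not seq:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if seq[-1] == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    return {"x": math.log(0.9), "y": math.log(0.1)}

best_seq, best_score = beam_search(toy_model, k=2, steps=2)[0]
print(best_seq)  # ("b", "x"): P = 0.4 * 0.9 = 0.36, beats greedy's 0.6 * 0.5 = 0.30
```

With k=1 the same function picks "a" first and gets stuck with the worse total probability, which is exactly the locally-optimal failure described above.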

Why it matters

Beam search was the default for seq2seq translation and captioning systems. For chat-style LLMs it has mostly been replaced by temperature and top-p sampling, because beam search tends to produce overly generic, repetitive text (the "bland beam" problem). It is still useful when you want deterministic, high-likelihood output: structured generation, translation, or code completion against a strict grammar.
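For contrast, the nucleus (top-p) sampling mentioned above can also be sketched in plain Python. This is an illustrative implementation of the idea, not any library's API; the function name and signature are my own.

```python
import math
import random

def top_p_sample(logprobs, p=0.9, rng=random):
    """Nucleus sampling: sample from the smallest set of tokens
    whose cumulative probability mass reaches p."""
    items = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, lp in items:
        pr = math.exp(lp)
        kept.append((tok, pr))
        total += pr
        if total >= p:  # nucleus found: truncate the low-probability tail
            break
    # sample proportionally within the kept nucleus
    r = rng.random() * total
    acc = 0.0
    for tok, pr in kept:
        acc += pr
        if r <= acc:
            return tok
    return kept[-1][0]
```

Unlike beam search, this never commits to the single highest-scoring sequence, which is why it avoids the bland, repetitive outputs at the cost of determinism.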

Example

# Beam search with a Hugging Face-style generate() call
outputs = model.generate(
    input_ids,
    num_beams=4,              # beam width k
    no_repeat_ngram_size=3,   # never repeat any 3-gram within a beam
    early_stopping=True,      # stop once enough finished candidates exist
)

You’ll hear it when

  • Implementing machine translation or summarization.
  • Debating beam search vs nucleus sampling for a generation task.
  • Reading decoding-strategy sections of LLM papers.
  • Constrained decoding for structured outputs.
