Beam Search
A decoding strategy that keeps the top-k highest-probability partial sequences at each step instead of greedily picking one.
In one line
Track the top-k most likely partial sequences at every decoding step, expand each, and keep the best k overall.
What it actually means
Greedy decoding picks the single most likely next token at each step and commits. That is locally optimal but frequently globally bad: a slightly lower-probability token now might open up a much better continuation. Beam search with width k keeps k candidate sequences alive, expands each by one token, scores all k × V continuations (where V is the vocabulary size), and prunes back to the top k by cumulative log-probability. At the end you take the highest-scoring completed sequence. Beam size 1 is exactly greedy decoding; beam sizes of 4–8 are typical for translation.
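The expand-score-prune loop can be sketched with a toy next-token table. Everything here is invented for illustration (the `NEXT` table, the token names, the probabilities); a real decoder would query a model for the next-token distribution instead:

```python
import math

# Toy "model": probability of the next token given the last token.
# Hand-made for illustration only, not a real language model.
NEXT = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.6, "dog": 0.3, "</s>": 0.1},
    "a":   {"dog": 0.9, "</s>": 0.1},
    "cat": {"sat": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}

def beam_search(k=2, max_len=5):
    # Each beam is (cumulative log-prob, token list); start from "<s>".
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            # Expand each live beam by one token and score the continuation.
            for tok, p in NEXT[toks[-1]].items():
                cand = (logp + math.log(p), toks + [tok])
                (finished if tok == "</s>" else candidates).append(cand)
        # Prune: keep only the k highest-scoring partial sequences.
        beams = sorted(candidates, reverse=True)[:k]
        if not beams:
            break
    # Return the highest-scoring completed sequence.
    return max(finished)

print(beam_search(k=2))  # → (-0.798..., ['<s>', 'a', 'dog', '</s>'])
```

Note the beam finds "a dog" even though no single step forces it: "a" ties with "the" initially, but its continuation is far more probable, which is exactly the case greedy decoding gets wrong. Production implementations also length-normalize scores so short sequences aren't unfairly favored.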
Why it matters
Beam search was the default for seq2seq translation and captioning systems. For chat-style LLMs it has mostly been replaced by temperature/top-p sampling because beam search produces overly generic, repetitive text — the “bland beam” problem. Still useful when you want deterministic, high-likelihood output: structured generation, translation, code completion with a strict grammar.
Example
# Hugging Face transformers-style call; assumes `model`, `tokenizer`, and
# `input_ids` were set up earlier, e.g. via AutoModelForSeq2SeqLM.
outputs = model.generate(
    input_ids,
    num_beams=4,              # beam width k
    no_repeat_ngram_size=3,   # block any trigram from repeating
    early_stopping=True,      # stop once num_beams candidates have finished
)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
You’ll hear it when
- Implementing machine translation or summarization.
- Debating beam search vs nucleus sampling for a generation task.
- Reading decoding-strategy sections of LLM papers.
- Constrained decoding for structured outputs.