Inference / Decoding Strategies

Beam Search

Intermediate [3/5]
Beam decoding K-best search

Definition

Beam search is a decoding algorithm that keeps track of the k most probable partial sequences (beams) at each generation step, rather than just the single best option. It balances between greedy search (k=1) and exhaustive search.

Beam search is commonly used for translation and summarization where output quality matters more than diversity.

Key Concepts

  • Beam width (k): Number of candidates kept at each step
  • Log probability: Sequences scored by sum of log probs
  • Length penalty: Prevents preference for short sequences
  • Early stopping: Stop when k beams reach end token
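The concepts above can be put together in a compact sketch. This is a minimal, illustrative implementation, not a production decoder; `step_fn` and its `(token, probability)` interface are hypothetical stand-ins for a real model's next-token distribution.

```python
import math

def beam_search(step_fn, start, k=2, max_len=10, eos="<eos>"):
    """Minimal beam search sketch.

    step_fn(tokens) -> list of (token, prob) continuations.
    Sequences are scored by the sum of token log-probabilities.
    """
    beams = [([start], 0.0)]                      # (tokens, total log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:                 # beam already ended
                finished.append((tokens, score))
                continue
            for token, prob in step_fn(tokens):
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:                        # all beams finished
            break
        # Prune: keep only the k highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if len(finished) >= k:                    # early stopping
            break
    return max(finished + beams, key=lambda b: b[1])

# Toy next-token table (made-up probabilities for illustration)
def step_fn(tokens):
    table = {
        ("The",): [("cat", 0.3), ("dog", 0.25), ("quick", 0.1)],
        ("The", "cat"): [("sat", 0.4), ("is", 0.3)],
        ("The", "dog"): [("ran", 0.35), ("is", 0.25)],
    }
    return table.get(tuple(tokens), [("<eos>", 1.0)])

tokens, score = beam_search(step_fn, "The", k=2, max_len=5)
print(" ".join(tokens))   # → The cat sat <eos>
```

Note that scores are summed in log space: multiplying many probabilities underflows quickly, while adding log-probabilities stays numerically stable.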

Examples

Algorithm
Beam Search Visualization (k=2)
BEAM SEARCH WITH WIDTH 2:

Start: "The"

Step 1: P(token | "The")
  "cat"   (0.3)  ←── keep
  "dog"   (0.25) ←── keep
  "quick" (0.1)  ──  drop

Beams: ["The cat", "The dog"]

Step 2: Expand BOTH beams:
  "The cat sat" (0.3 × 0.4   = 0.12)   ←── keep
  "The cat is"  (0.3 × 0.3   = 0.09)   ←── keep (2nd best)
  "The dog ran" (0.25 × 0.35 = 0.0875)
  "The dog is"  (0.25 × 0.25 = 0.0625)

Beams: ["The cat sat", "The cat is"]

Step 3: Continue expanding...
  ...until EOS or max length

GREEDY (k=1) vs BEAM (k>1):
┌─────────────┬──────────────────────────┐
│ Greedy      │ "The cat sat quietly."   │
│ Beam (k=2)  │ "The cat sat on the mat."│
│ Beam (k=5)  │ "The cat sat by the fire"│
└─────────────┴──────────────────────────┘

Beam search often finds higher-probability sequences than greedy decoding.
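As a sanity check, the step-2 expansion arithmetic can be computed directly: each beam's probability is multiplied by each continuation's conditional probability, and only the top k products survive pruning. The probabilities here are the toy values from the walkthrough, not real model outputs.

```python
# Beam probabilities after step 1 and their candidate continuations
beams = {"The cat": 0.3, "The dog": 0.25}
continuations = {
    "The cat": [("sat", 0.4), ("is", 0.3)],
    "The dog": [("ran", 0.35), ("is", 0.25)],
}

# Score every expansion: P(beam) * P(token | beam)
candidates = {
    f"{beam} {tok}": p_beam * p_tok
    for beam, p_beam in beams.items()
    for tok, p_tok in continuations[beam]
}

# Prune to the k=2 highest-probability sequences
top2 = sorted(candidates, key=candidates.get, reverse=True)[:2]
print(top2)   # → ['The cat sat', 'The cat is']
```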
Implementation
Beam Search Parameters
BEAM SEARCH CONFIGURATION:

# Basic beam search
output = model.generate(
    input_ids,
    num_beams=5,            # beam width
    early_stopping=True,    # stop when all beams finish
    max_length=100
)

LENGTH PENALTY:
  score = log_prob / length^α
  α = 0: No penalty (prefers short)
  α = 1: Linear penalty (neutral)
  α > 1: Prefers longer sequences

# With length penalty
output = model.generate(
    input_ids,
    num_beams=5,
    length_penalty=1.2,       # favor longer
    no_repeat_ngram_size=3    # avoid repetition
)

DIVERSITY (diverse beam search):
output = model.generate(
    input_ids,
    num_beams=6,
    num_beam_groups=3,        # 3 groups of 2
    diversity_penalty=0.5     # penalize similar beams
)

WHEN TO USE BEAM SEARCH:
✓ Machine translation
✓ Summarization
✓ Code generation (structured output)
✗ Creative writing (too deterministic)
✗ Chatbots (lacks diversity)
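The length-penalty formula can be illustrated numerically. The log-probabilities below are made-up values, not from a real model; the point is that because summed log-probabilities are negative, dividing by length^α raises (makes less negative) the score of multi-token sequences as α grows.

```python
import math

# Toy 3-token sequence with made-up token probabilities
logps = [math.log(0.3), math.log(0.4), math.log(0.5)]
raw = sum(logps)                          # total log prob (negative)

for alpha in (0.0, 1.0, 1.2):
    score = raw / len(logps) ** alpha     # length-normalized score
    print(f"alpha={alpha}: score={score:.3f}")
```

With α = 0 the sequence keeps its raw (most negative) score, so shorter outputs win ties; larger α progressively offsets that bias toward brevity.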

Interactive Exercise

Calculate Beam Candidates

With beam width k=3 and vocabulary size 50,000, how many candidates are evaluated at each step before pruning to k?
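The exercise reduces to one multiplication: every one of the k beams is expanded over the full vocabulary before pruning back to k. A one-line check:

```python
# Each of the k beams is scored against the whole vocabulary,
# so k * |V| candidates are evaluated per step before pruning.
k, vocab_size = 3, 50_000
print(k * vocab_size)  # → 150000
```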

Pro Tips
  • Beam width 4-5 is usually sufficient; larger rarely helps
  • Beam search can produce repetitive outputs without n-gram blocking
  • For chatbots, sampling methods often outperform beam search
  • Contrastive search combines beam search quality with diversity

Related Terms