Inference / Decoding Strategies

Nucleus Sampling

Difficulty: Intermediate (3/5)
Also known as: Top-p sampling, Dynamic top-k

Definition

Nucleus sampling (top-p) selects from the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, which fixes the number of candidates in advance, nucleus sampling adapts the candidate set to the shape of the probability distribution.

This creates a dynamic "nucleus" of probable tokens, allowing more diversity when the model is uncertain and less when it's confident.
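Stated formally (this is the definition from the original nucleus sampling paper, Holtzman et al. 2019), the nucleus V^(p) is the smallest set of tokens whose probabilities sum to at least the threshold:

    \sum_{x \in V^{(p)}} P(x \mid x_{1:t-1}) \ge p

The next token is then drawn from V^(p) alone, with each probability renormalized by the nucleus's total mass. One caveat: the boundary convention differs across implementations (sum ≥ p in the paper, strictly > p in some code), which only matters when the cumulative sum lands exactly on p.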

Key Concepts

  • Top-p threshold: Cumulative probability cutoff (typically 0.9-0.95)
  • Adaptive selection: Number of tokens varies per step
  • Probability mass: Includes the smallest set of tokens that together hold a fraction p of the total probability
  • Combined with temperature: Often used together for fine control

Examples

Mechanism
How Nucleus Sampling Works
NUCLEUS SAMPLING (top-p = 0.9):

STEP 1: Sort tokens by probability (descending)

    Token    Prob   Cumulative
    "the"    0.40   0.40
    "a"      0.25   0.65
    "some"   0.15   0.80
    "this"   0.08   0.88
    "that"   0.05   0.93   ← exceeds 0.9 here
    "one"    0.03   0.96
    "any"    0.02   0.98
    ...      ...    ...

STEP 2: Include tokens until cumsum > p

    Nucleus = {"the", "a", "some", "this", "that"}
    (5 tokens, capturing 93% of the probability mass)

STEP 3: Renormalize and sample

    P("the")  = 0.40/0.93 = 0.43
    P("a")    = 0.25/0.93 = 0.27
    P("some") = 0.15/0.93 = 0.16
    P("this") = 0.08/0.93 = 0.09
    P("that") = 0.05/0.93 = 0.05

ADAPTIVE BEHAVIOR:

    High confidence (peaked distribution):
        "Paris" 0.95 → nucleus = {"Paris"} (1 token!)

    Low confidence (flat distribution):
        Each token ~0.05 → nucleus ≈ 18 tokens

TOP-K vs TOP-P:

    top_k=5   → always exactly 5 candidates (fixed)
    top_p=0.9 → anywhere from 1 to 1000+ candidates (adaptive)
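The same three steps in a minimal, self-contained Python sketch (NumPy only; nucleus_sample is a hypothetical helper named for illustration, not a library API, and the numbers mirror the table above):

    import numpy as np

    def nucleus_sample(tokens, probs, p=0.9, rng=None):
        rng = rng or np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        # STEP 1: sort by probability, descending
        order = np.argsort(probs)[::-1]
        sorted_probs = probs[order]
        cumulative = np.cumsum(sorted_probs)
        # STEP 2: smallest prefix whose cumulative probability reaches p
        cutoff = int(np.searchsorted(cumulative, p)) + 1
        # STEP 3: renormalize within the nucleus and sample
        nucleus_probs = sorted_probs[:cutoff] / cumulative[cutoff - 1]
        return tokens[rng.choice(order[:cutoff], p=nucleus_probs)]

    tokens = ["the", "a", "some", "this", "that", "one", "any"]
    probs  = [0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02]
    print(nucleus_sample(tokens, probs))   # usually "the"; never "one" or "any"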
Implementation
Using Nucleus Sampling
API USAGE:

    # OpenAI Python SDK (Anthropic's Claude API exposes top_p
    # similarly via client.messages.create)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        top_p=0.9,          # nucleus sampling
        # top_k is not typically exposed by hosted APIs
    )

    # HuggingFace
    output = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.92,
        top_k=0,            # disable top-k, use only top-p
        temperature=0.8,
    )

COMMON CONFIGURATIONS:

    Creative writing:       temperature=0.9, top_p=0.95
    Balanced conversation:  temperature=0.7, top_p=0.9
    Factual/consistent:     temperature=0.3, top_p=0.85
    Code generation:        temperature=0.2, top_p=0.95

IMPORTANT:

    - top_p=1.0 → no filtering (full distribution)
    - top_p=0.0 → only the highest-probability token (effectively greedy)
    - Don't combine high temperature AND high top_p (can produce incoherent output)
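That last warning is easier to see with numbers: in common implementations (HuggingFace, for example, applies its temperature warper before top-p filtering) the logits are temperature-scaled first, so a high temperature flattens the distribution and inflates the nucleus. A rough sketch, with a made-up 50-token vocabulary and a hypothetical nucleus_size helper:

    import numpy as np

    def nucleus_size(logits, temperature, top_p):
        # temperature first: higher T flattens the distribution
        scaled = np.asarray(logits) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # then count how many sorted tokens are needed to reach top_p
        cumulative = np.cumsum(np.sort(probs)[::-1])
        return int(np.searchsorted(cumulative, top_p)) + 1

    logits = np.linspace(5.0, 0.0, 50)     # invented logits for illustration
    print(nucleus_size(logits, 0.7, 0.9))  # low temperature → small nucleus
    print(nucleus_size(logits, 1.5, 0.9))  # high temperature → much larger nucleus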

Interactive Exercise

Compare Top-K vs Top-P

Given probabilities [0.5, 0.3, 0.1, 0.05, 0.03, 0.02], how many tokens would be selected with top_k=3 vs top_p=0.9?
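A quick way to check your answer is to reuse the "include until cumsum > p" rule from the Mechanism block (plain Python; note the boundary here is deliberately tricky, since the cumulative sum lands exactly on p partway through, so the answer depends on the ≥-vs-> convention mentioned in the Definition):

    probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]   # already sorted descending
    p, cum, count = 0.9, 0.0, 0
    for prob in probs:
        count += 1
        cum += prob
        if cum > p:          # strictly exceeds, per the rule above
            break
    print("top_k=3 keeps 3 tokens; top_p=0.9 keeps", count)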

Pro Tips
  • top_p=0.9 is a good default for most applications
  • Combine with temperature: lower temp + higher top_p = focused but diverse
  • For deterministic output, use temperature=0 (or greedy decoding) rather than shrinking top_p
  • Nucleus sampling cuts off the low-probability tail, so rare "garbage" tokens can't be sampled even from a flat distribution
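
On the determinism tip above: in HuggingFace the cleanest route is to disable sampling altogether rather than to shrink top_p (a minimal sketch, reusing model and input_ids from the Implementation block):

    # Greedy decoding: always take the argmax token, fully deterministic
    output = model.generate(input_ids, do_sample=False)

This matches the intent of temperature=0 in hosted APIs, though those are often only near-deterministic in practice.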

Related Terms