Inference / Decoding Strategies

Nucleus Sampling

Difficulty: Intermediate (3/5)
Also known as: Top-p sampling, Dynamic top-k

Definition

Nucleus sampling (top-p) selects from the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, which fixes the number of candidates in advance, nucleus sampling adapts the candidate set to the shape of the probability distribution.

This creates a dynamic "nucleus" of probable tokens, allowing more diversity when the model is uncertain and less when it's confident.
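Stated formally (this is the definition from the original nucleus sampling paper, Holtzman et al. 2019), the nucleus V^(p) is the smallest set of tokens whose probabilities sum to at least the threshold:

    \sum_{x \in V^{(p)}} P(x \mid x_{1:t-1}) \ge p

The next token is then drawn from V^(p) alone, with each probability renormalized by the nucleus's total mass. One caveat: the boundary convention differs across implementations (sum ≥ p in the paper, strictly > p in some code), which only matters when the cumulative sum lands exactly on p.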

Key Concepts

  • Top-p threshold: Cumulative probability cutoff (typically 0.9-0.95)
  • Adaptive selection: Number of tokens varies per step
  • Probability mass: Includes the smallest set of tokens that together hold a fraction p of the total probability
  • Combined with temperature: Often used together for fine control

Examples

Mechanism
How Nucleus Sampling Works
NUCLEUS SAMPLING (top-p = 0.9):

STEP 1: Sort tokens by probability (descending)

    Token    Prob   Cumulative
    "the"    0.40   0.40
    "a"      0.25   0.65
    "some"   0.15   0.80
    "this"   0.08   0.88
    "that"   0.05   0.93   ← exceeds 0.9 here
    "one"    0.03   0.96
    "any"    0.02   0.98
    ...      ...    ...

STEP 2: Include tokens until cumsum > p

    Nucleus = {"the", "a", "some", "this", "that"}
    (5 tokens, capturing 93% of the probability mass)

STEP 3: Renormalize and sample

    P("the")  = 0.40/0.93 = 0.43
    P("a")    = 0.25/0.93 = 0.27
    P("some") = 0.15/0.93 = 0.16
    P("this") = 0.08/0.93 = 0.09
    P("that") = 0.05/0.93 = 0.05

ADAPTIVE BEHAVIOR:

    High confidence (peaked distribution):
        "Paris" 0.95 → nucleus = {"Paris"} (1 token!)

    Low confidence (flat distribution):
        Each token ~0.05 → nucleus ≈ 18 tokens

TOP-K vs TOP-P:

    top_k=5   → always exactly 5 candidates (fixed)
    top_p=0.9 → anywhere from 1 to 1000+ candidates (adaptive)
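The same three steps in a minimal, self-contained Python sketch (NumPy only; nucleus_sample is a hypothetical helper named for illustration, not a library API, and the numbers mirror the table above):

    import numpy as np

    def nucleus_sample(tokens, probs, p=0.9, rng=None):
        rng = rng or np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        # STEP 1: sort by probability, descending
        order = np.argsort(probs)[::-1]
        sorted_probs = probs[order]
        cumulative = np.cumsum(sorted_probs)
        # STEP 2: smallest prefix whose cumulative probability reaches p
        cutoff = int(np.searchsorted(cumulative, p)) + 1
        # STEP 3: renormalize within the nucleus and sample
        nucleus_probs = sorted_probs[:cutoff] / cumulative[cutoff - 1]
        return tokens[rng.choice(order[:cutoff], p=nucleus_probs)]

    tokens = ["the", "a", "some", "this", "that", "one", "any"]
    probs  = [0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02]
    print(nucleus_sample(tokens, probs))   # usually "the"; never "one" or "any"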
Implementation
Using Nucleus Sampling
API USAGE:

    # OpenAI Python SDK (Anthropic's Claude API exposes top_p
    # similarly via client.messages.create)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        top_p=0.9,          # nucleus sampling
        # top_k is not typically exposed by hosted APIs
    )

    # HuggingFace
    output = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.92,
        top_k=0,            # disable top-k, use only top-p
        temperature=0.8,
    )

COMMON CONFIGURATIONS:

    Creative writing:       temperature=0.9, top_p=0.95
    Balanced conversation:  temperature=0.7, top_p=0.9
    Factual/consistent:     temperature=0.3, top_p=0.85
    Code generation:        temperature=0.2, top_p=0.95

IMPORTANT:

    - top_p=1.0 → no filtering (full distribution)
    - top_p=0.0 → only the highest-probability token (effectively greedy)
    - Don't combine high temperature AND high top_p (can produce incoherent output)
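That last warning is easier to see with numbers: in common implementations (HuggingFace, for example, applies its temperature warper before top-p filtering) the logits are temperature-scaled first, so a high temperature flattens the distribution and inflates the nucleus. A rough sketch, with a made-up 50-token vocabulary and a hypothetical nucleus_size helper:

    import numpy as np

    def nucleus_size(logits, temperature, top_p):
        # temperature first: higher T flattens the distribution
        scaled = np.asarray(logits) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # then count how many sorted tokens are needed to reach top_p
        cumulative = np.cumsum(np.sort(probs)[::-1])
        return int(np.searchsorted(cumulative, top_p)) + 1

    logits = np.linspace(5.0, 0.0, 50)     # invented logits for illustration
    print(nucleus_size(logits, 0.7, 0.9))  # low temperature → small nucleus
    print(nucleus_size(logits, 1.5, 0.9))  # high temperature → much larger nucleus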

Interactive Exercise

Compare Top-K vs Top-P

Given probabilities [0.5, 0.3, 0.1, 0.05, 0.03, 0.02], how many tokens would be selected with top_k=3 vs top_p=0.9?
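A quick way to check your answer is to reuse the "include until cumsum > p" rule from the Mechanism block (plain Python; note the boundary here is deliberately tricky, since the cumulative sum lands exactly on p partway through, so the answer depends on the ≥-vs-> convention mentioned in the Definition):

    probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]   # already sorted descending
    p, cum, count = 0.9, 0.0, 0
    for prob in probs:
        count += 1
        cum += prob
        if cum > p:          # strictly exceeds, per the rule above
            break
    print("top_k=3 keeps 3 tokens; top_p=0.9 keeps", count)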

Pro Tips
  • top_p=0.9 is a good default for most applications
  • Combine with temperature: lower temp + higher top_p = focused but diverse
  • For deterministic output, use temperature=0 (or greedy decoding) rather than shrinking top_p
  • Nucleus sampling cuts off the low-probability tail, so rare "garbage" tokens can't be sampled even from a flat distribution
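
On the determinism tip above: in HuggingFace the cleanest route is to disable sampling altogether rather than to shrink top_p (a minimal sketch, reusing model and input_ids from the Implementation block):

    # Greedy decoding: always take the argmax token, fully deterministic
    output = model.generate(input_ids, do_sample=False)

This matches the intent of temperature=0 in hosted APIs, though those are often only near-deterministic in practice.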

Related Terms