Model Parameters / Sampling Strategies

Top-K Sampling

Beginner [2/5]
K-truncation · Top-k filtering

Definition

Top-K sampling restricts the model's next-token selection to only the K most probable tokens. All other tokens are excluded before sampling, preventing the selection of unlikely (potentially nonsensical) tokens.

Lower K values produce more focused, predictable text; higher K values allow more diversity and creativity.

Key Concepts

  • K parameter: Number of top tokens to consider (e.g., 40, 50)
  • Probability renormalization: The kept tokens' probabilities are rescaled to sum to 1
  • Fixed cutoff: Always exactly K options, regardless of distribution
  • Diversity control: Balances creativity vs. coherence
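The filter-and-renormalize mechanism described above can be sketched in a few lines of pure Python (the helper name `top_k_sample` is made up for illustration):

```python
import random

def top_k_sample(probs, k):
    """Sample a token id from `probs`, keeping only the k most probable tokens."""
    # Rank token ids by probability, highest first, and keep the top k.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the surviving probabilities so they sum to 1 again.
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    # Sample one token id from the truncated, renormalized distribution.
    return random.choices(top, weights=weights, k=1)[0]

probs = [0.35, 0.25, 0.15, 0.10, 0.05, 0.04, 0.02, 0.01, 0.03]
token = top_k_sample(probs, k=4)  # only ids 0-3 can ever be returned
```

With k=1 this degenerates to greedy decoding; with k equal to the vocabulary size it is ordinary sampling.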

Examples

Visualization
How Top-K Works
Context: "The cat sat on the ___"

Token probabilities (before top-K):

  mat    ████████████████ 35%
  floor  ████████████     25%
  couch  ████████         15%
  bed    ██████           10%
  roof   ████              5%
  table  ███               4%
  hat    ██                2%
  moon   █                 1%
  pizza  ▏                 0.5%
  ...    ▏                 ...

With K=4 (only top 4 tokens):

  mat    ████████████████████ 41% (35/85)
  floor  ███████████████      29% (25/85)
  couch  ██████████           18% (15/85)
  bed    ████████             12% (10/85)
  [all others = 0%]

Probabilities are renormalized to sum to 100% among only the top 4 candidates!
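The renormalization arithmetic behind those percentages is just a division by the surviving mass; a quick check of the "cat sat on the mat" numbers:

```python
# Top-4 probabilities from the example ("The cat sat on the ___").
top4 = {"mat": 0.35, "floor": 0.25, "couch": 0.15, "bed": 0.10}

# The four survivors hold 85% of the original mass...
total = sum(top4.values())  # 0.85

# ...so each one is divided by 0.85 to make the truncated set sum to 1.
renorm = {tok: p / total for tok, p in top4.items()}
# mat ≈ 0.41, floor ≈ 0.29, couch ≈ 0.18, bed ≈ 0.12
```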
API Usage
Setting Top-K
# OpenAI: top_k is not exposed directly (only logit_bias is available)

# Claude API
response = anthropic.messages.create(
    model="claude-3-opus",
    max_tokens=100,
    top_k=40,  # Consider top 40 tokens
    messages=[{"role": "user", "content": "Write a poem"}],
)

# HuggingFace
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator(
    "The future of AI",
    top_k=50,  # Top 50 tokens
    do_sample=True,
    max_length=100,
)

# Common top_k values:
# K=1:      Greedy (deterministic)
# K=10-20:  Very focused
# K=40-50:  Balanced (common default)
# K=100+:   More diverse/creative

Interactive Exercise

Choose the Right K

What K value would you use for each scenario?

1. Generating legal contract text
2. Creative story writing
3. Code completion

Pro Tips
  • Top-K is often combined with temperature for finer control
  • Consider top-P (nucleus) sampling as an adaptive alternative
  • K=1 is equivalent to greedy decoding (always pick best)
  • Very high K with low temperature can behave much like a lower K with higher temperature — tune one knob at a time

Related Terms