
Speculative Decoding

Also known as: speculative sampling, draft-then-verify, assisted generation

Definition

Speculative Decoding is an inference acceleration technique where a smaller, faster "draft" model generates candidate tokens that a larger "target" model then verifies in parallel. This can significantly speed up generation while maintaining the exact output distribution of the target model.

The key insight is that verifying multiple tokens in parallel is faster than generating them one at a time with the large model.
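
This parallelism is possible because a single forward pass of a causal language model already produces next-token logits at every position of its input. Below is a minimal sketch of the verify step, assuming Hugging Face transformers; the gpt2 checkpoints are stand-ins for a real draft/target pair:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints; any draft/target pair sharing a tokenizer works.
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids
draft_ids = tok(" Paris. It is a beautiful", return_tensors="pt").input_ids

# ONE forward pass over prompt + drafted tokens yields the target's
# next-token logits at every position -- this is the parallel step.
with torch.no_grad():
    logits = target(torch.cat([prompt_ids, draft_ids], dim=-1)).logits
probs = torch.softmax(logits, dim=-1)

# probs[0, prompt_len - 1 + i] is the target's distribution over the
# i-th drafted token, so all K drafts can be scored in one pass.
prompt_len = prompt_ids.shape[-1]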

Key Concepts

  • Draft model: Small, fast model that proposes tokens
  • Target model: Large, accurate model that verifies
  • Parallel verification: Check multiple tokens at once
  • Distribution matching: Output identical to target-only (see the sketch below)
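
To make the distribution-matching rule concrete (the probabilities below are invented for illustration): a drafted token that the target likes less than the draft did is only kept part of the time.

import random

# Toy numbers: the target assigns the drafted token probability 0.6,
# while the draft assigned it 0.8.
p_target, p_draft = 0.6, 0.8

# Keep the token with probability min(1, p_target / p_draft) = 0.75.
keep = random.random() < min(1.0, p_target / p_draft)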

Examples

Process
How Speculative Decoding Works
STANDARD AUTOREGRESSIVE DECODING:

Prompt: "The capital of France is"

Step 1: Model(prompt)         → "Paris" [~100ms]
Step 2: Model(prompt + Paris) → "."     [~100ms]
Step 3: Model(...Paris.)      → "It"    [~100ms]
Step 4: Model(...It)          → "is"    [~100ms]
...

Total: 1 token per forward pass
Time: N tokens × ~100ms = slow for large models

---

SPECULATIVE DECODING:

Prompt: "The capital of France is"

DRAFT PHASE (small model, fast):
  Draft model generates K tokens quickly:
  → "Paris. It is a beautiful" [~20ms total]

VERIFY PHASE (large model, parallel):
  Target model verifies all K tokens at once.
  Input: prompt + "Paris. It is a beautiful"
  For each position, check P_target(token) vs P_draft(token):

  - "Paris"     ✓ accepted (high prob under target)
  - "."         ✓ accepted
  - "It"        ✓ accepted
  - "is"        ✓ accepted
  - "a"         ✓ accepted
  - "beautiful" ✗ rejected (target prefers "major") → regenerate from here

  One forward pass accepted 5 tokens and corrected a 6th! [~100ms]

---

SPEEDUP VISUALIZATION:

Standard (7B model):
  Token: |===|===|===|===|===|===|===|===|
  Time:  100 100 100 100 100 100 100 100  = 800ms

Speculative (1B draft + 7B verify):
  Draft:  |=|        5 tokens in 20ms
  Verify: |========| all 5 verified in 100ms
  Next:   |=|        draft again...
  Time:   ~240ms for the same 8 tokens

SPEEDUP: 2-3× typical, up to 5× for easy text

ACCEPTANCE RATE:
- Easy text (common patterns): ~80-90% accepted
- Hard text (creative/rare):   ~50-60% accepted
- Higher acceptance = more speedup
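
The walkthrough's timings generalize to a simple cost model. With per-token acceptance rate α and draft length K, the expected number of tokens produced per draft/verify cycle is (1 − α^(K+1)) / (1 − α), where the +1 accounts for the corrected (or bonus) token. A back-of-envelope sketch; all timings here are assumptions, not measurements:

def expected_speedup(alpha, K, t_draft, t_target):
    """Idealized model: ignores overheads such as KV-cache bookkeeping."""
    # Expected tokens per cycle, including the correction/bonus token.
    tokens_per_cycle = (1 - alpha ** (K + 1)) / (1 - alpha)
    cycle_time = K * t_draft + t_target           # K draft steps + 1 verify pass
    baseline_time = tokens_per_cycle * t_target   # target-only cost, same tokens
    return baseline_time / cycle_time

# Numbers loosely matching the visualization above (4ms/draft token, 100ms verify):
print(expected_speedup(alpha=0.8, K=5, t_draft=4, t_target=100))  # ≈ 3.1×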
Algorithm
Speculative Sampling Algorithm
SPECULATIVE SAMPLING ALGORITHM:

def speculative_decode(prompt, draft_model, target_model, K=5):
    output = list(prompt)
    while not done(output):
        # Step 1: draft K tokens with the small model (sequential, but cheap).
        draft_tokens = []
        draft_dists = []  # keep the FULL draft distribution at each position
        for _ in range(K):
            q = draft_model(output + draft_tokens)  # distribution over vocab
            token = sample(q)
            draft_tokens.append(token)
            draft_dists.append(q)

        # Step 2: one parallel forward pass of the target model returns its
        # distribution at every drafted position (plus one past the end).
        target_dists = target_model.batch_forward(output, draft_tokens)

        # Step 3: accept/reject each drafted token, left to right.
        for i, token in enumerate(draft_tokens):
            p, q = target_dists[i], draft_dists[i]
            # Accept with probability min(1, p_target / p_draft).
            if random() < min(1, p[token] / q[token]):
                output.append(token)
            else:
                # Reject: sample from the residual distribution
                # norm(max(0, p_target - p_draft)), then re-draft.
                residual = normalize(elementwise_max(p - q, 0))
                output.append(sample(residual))
                break
        else:
            # All K accepted: the target's distribution one position past the
            # last drafted token yields a free "bonus" token.
            output.append(sample(target_dists[K]))
    return output

---

KEY INSIGHT - WHY IT'S LOSSLESS:

The acceptance criterion

    accept if random() < min(1, p_target / p_draft)

together with resampling rejections from norm(max(0, p_target - p_draft))
guarantees the final output has EXACTLY the same distribution as standard
decoding with the target model alone.

When the draft agrees with the target: most tokens are accepted.
When it disagrees: tokens are rejected and resampled correctly.

---

CHOOSING A DRAFT MODEL:

Good draft models:
- Same architecture, fewer layers (7B → 1B)
- Same tokenizer (critical!)
- Similar training data
- Distilled from the target model

Examples:
- Llama-70B target + Llama-7B draft
- GPT-4 target + GPT-3.5 draft
- Custom distilled small model

SPEEDUP FACTORS:
┌─────────────────────┬──────────────────────────┐
│ Factor              │ Impact on Speedup        │
├─────────────────────┼──────────────────────────┤
│ Draft/Target ratio  │ Smaller draft = faster   │
│ Acceptance rate     │ Higher = more speedup    │
│ K (draft length)    │ Tune for acceptance      │
│ Text difficulty     │ Easy text = more speedup │
└─────────────────────┴──────────────────────────┘
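
To check the losslessness claim empirically, here is a self-contained Monte Carlo sketch over a toy 4-token vocabulary (both distributions are invented for illustration): drafting from q, applying the accept rule, and resampling rejections from the residual reproduces the target distribution p.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.1, 0.1])      # target distribution (toy)
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft distribution (toy)

residual = np.clip(p - q, 0, None)      # norm(max(0, p - q))
residual /= residual.sum()

counts = np.zeros(4)
for _ in range(200_000):
    tok = rng.choice(4, p=q)                       # draft proposes a token
    if rng.random() < min(1.0, p[tok] / q[tok]):   # acceptance criterion
        counts[tok] += 1
    else:
        counts[rng.choice(4, p=residual)] += 1     # resample from residual

print(counts / counts.sum())  # ≈ [0.5, 0.3, 0.1, 0.1], i.e., exactly p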

Interactive Exercise

Calculate Speedup

Your target model takes 100ms per token. Your draft model takes 10ms per token and has an 80% acceptance rate (on average 4 out of 5 drafted tokens are accepted). If you draft K=5 tokens per cycle, what's the approximate speedup?
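
One way to work it out, using the exercise's simplified accounting (4 of 5 drafts accepted per cycle, plus the corrected token the target supplies on rejection):

t_draft, t_target, K = 10, 100, 5      # ms per draft token, ms per verify pass
tokens_per_cycle = 4 + 1               # 4 accepted drafts + 1 corrected token
cycle_time = K * t_draft + t_target    # 5*10 + 100 = 150 ms
ms_per_token = cycle_time / tokens_per_cycle   # 150 / 5 = 30 ms/token
speedup = t_target / ms_per_token      # 100 / 30 ≈ 3.3×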

Pro Tips
  • Match tokenizers between draft and target (critical!) - see the check after this list
  • Tune K based on acceptance rate - higher acceptance allows larger K
  • Speedup is task-dependent - measure on your actual workload
  • Works best when draft and target distributions are similar
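
For the tokenizer tip above, a quick sanity check using Hugging Face transformers (the checkpoints are placeholders for your actual pair):

from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")       # placeholder draft
target_tok = AutoTokenizer.from_pretrained("gpt2-xl")   # placeholder target

# Identical vocabularies mean draft token IDs are valid target token IDs,
# which speculative decoding requires.
assert draft_tok.get_vocab() == target_tok.get_vocab(), "tokenizer mismatch!"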
