
Speculative Decoding

Also known as: speculative sampling, draft-then-verify, assisted generation

Definition

Speculative Decoding is an inference acceleration technique where a smaller, faster "draft" model generates candidate tokens that a larger "target" model then verifies in parallel. This can significantly speed up generation while maintaining the exact output distribution of the target model.

The key insight is that verifying multiple tokens in parallel is faster than generating them one at a time with the large model.
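
This parallelism is possible because a single forward pass of a causal language model already produces next-token logits at every position of its input. Below is a minimal sketch of the verify step, assuming Hugging Face transformers; the gpt2 checkpoints are stand-ins for a real draft/target pair:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints; any draft/target pair sharing a tokenizer works.
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids
draft_ids = tok(" Paris. It is a beautiful", return_tensors="pt").input_ids

# ONE forward pass over prompt + drafted tokens yields the target's
# next-token logits at every position -- this is the parallel step.
with torch.no_grad():
    logits = target(torch.cat([prompt_ids, draft_ids], dim=-1)).logits
probs = torch.softmax(logits, dim=-1)

# probs[0, prompt_len - 1 + i] is the target's distribution over the
# i-th drafted token, so all K drafts can be scored in one pass.
prompt_len = prompt_ids.shape[-1]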

Key Concepts

  • Draft model: Small, fast model that proposes tokens
  • Target model: Large, accurate model that verifies
  • Parallel verification: Check multiple tokens at once
  • Distribution matching: Output identical to target-only (see the sketch below)
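
To make the distribution-matching rule concrete (the probabilities below are invented for illustration): a drafted token that the target likes less than the draft did is only kept part of the time.

import random

# Toy numbers: the target assigns the drafted token probability 0.6,
# while the draft assigned it 0.8.
p_target, p_draft = 0.6, 0.8

# Keep the token with probability min(1, p_target / p_draft) = 0.75.
keep = random.random() < min(1.0, p_target / p_draft)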

Examples

Process
How Speculative Decoding Works
STANDARD AUTOREGRESSIVE DECODING:

Prompt: "The capital of France is"

Step 1: Model(prompt)         → "Paris" [~100ms]
Step 2: Model(prompt + Paris) → "."     [~100ms]
Step 3: Model(...Paris.)      → "It"    [~100ms]
Step 4: Model(...It)          → "is"    [~100ms]
...

Total: 1 token per forward pass
Time: N tokens × ~100ms = slow for large models

---

SPECULATIVE DECODING:

Prompt: "The capital of France is"

DRAFT PHASE (small model, fast):
  Draft model generates K tokens quickly:
  → "Paris. It is a beautiful" [~20ms total]

VERIFY PHASE (large model, parallel):
  Target model verifies all K tokens at once.
  Input: prompt + "Paris. It is a beautiful"
  For each position, check P_target(token) vs P_draft(token):

  - "Paris"     ✓ accepted (high prob under target)
  - "."         ✓ accepted
  - "It"        ✓ accepted
  - "is"        ✓ accepted
  - "a"         ✓ accepted
  - "beautiful" ✗ rejected (target prefers "major") → regenerate from here

  One forward pass accepted 5 tokens and corrected a 6th! [~100ms]

---

SPEEDUP VISUALIZATION:

Standard (7B model):
  Token: |===|===|===|===|===|===|===|===|
  Time:  100 100 100 100 100 100 100 100  = 800ms

Speculative (1B draft + 7B verify):
  Draft:  |=|        5 tokens in 20ms
  Verify: |========| all 5 verified in 100ms
  Next:   |=|        draft again...
  Time:   ~240ms for the same 8 tokens

SPEEDUP: 2-3× typical, up to 5× for easy text

ACCEPTANCE RATE:
- Easy text (common patterns): ~80-90% accepted
- Hard text (creative/rare):   ~50-60% accepted
- Higher acceptance = more speedup
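
The walkthrough's timings generalize to a simple cost model. With per-token acceptance rate α and draft length K, the expected number of tokens produced per draft/verify cycle is (1 − α^(K+1)) / (1 − α), where the +1 accounts for the corrected (or bonus) token. A back-of-envelope sketch; all timings here are assumptions, not measurements:

def expected_speedup(alpha, K, t_draft, t_target):
    """Idealized model: ignores overheads such as KV-cache bookkeeping."""
    # Expected tokens per cycle, including the correction/bonus token.
    tokens_per_cycle = (1 - alpha ** (K + 1)) / (1 - alpha)
    cycle_time = K * t_draft + t_target           # K draft steps + 1 verify pass
    baseline_time = tokens_per_cycle * t_target   # target-only cost, same tokens
    return baseline_time / cycle_time

# Numbers loosely matching the visualization above (4ms/draft token, 100ms verify):
print(expected_speedup(alpha=0.8, K=5, t_draft=4, t_target=100))  # ≈ 3.1×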
Algorithm
Speculative Sampling Algorithm
SPECULATIVE SAMPLING ALGORITHM:

def speculative_decode(prompt, draft_model, target_model, K=5):
    output = list(prompt)
    while not done(output):
        # Step 1: draft K tokens with the small model (sequential, but cheap).
        draft_tokens = []
        draft_dists = []  # keep the FULL draft distribution at each position
        for _ in range(K):
            q = draft_model(output + draft_tokens)  # distribution over vocab
            token = sample(q)
            draft_tokens.append(token)
            draft_dists.append(q)

        # Step 2: one parallel forward pass of the target model returns its
        # distribution at every drafted position (plus one past the end).
        target_dists = target_model.batch_forward(output, draft_tokens)

        # Step 3: accept/reject each drafted token, left to right.
        for i, token in enumerate(draft_tokens):
            p, q = target_dists[i], draft_dists[i]
            # Accept with probability min(1, p_target / p_draft).
            if random() < min(1, p[token] / q[token]):
                output.append(token)
            else:
                # Reject: sample from the residual distribution
                # norm(max(0, p_target - p_draft)), then re-draft.
                residual = normalize(elementwise_max(p - q, 0))
                output.append(sample(residual))
                break
        else:
            # All K accepted: the target's distribution one position past the
            # last drafted token yields a free "bonus" token.
            output.append(sample(target_dists[K]))
    return output

---

KEY INSIGHT - WHY IT'S LOSSLESS:

The acceptance criterion

    accept if random() < min(1, p_target / p_draft)

together with resampling rejections from norm(max(0, p_target - p_draft))
guarantees the final output has EXACTLY the same distribution as standard
decoding with the target model alone.

When the draft agrees with the target: most tokens are accepted.
When it disagrees: tokens are rejected and resampled correctly.

---

CHOOSING A DRAFT MODEL:

Good draft models:
- Same architecture, fewer layers (7B → 1B)
- Same tokenizer (critical!)
- Similar training data
- Distilled from the target model

Examples:
- Llama-70B target + Llama-7B draft
- GPT-4 target + GPT-3.5 draft
- Custom distilled small model

SPEEDUP FACTORS:
┌─────────────────────┬──────────────────────────┐
│ Factor              │ Impact on Speedup        │
├─────────────────────┼──────────────────────────┤
│ Draft/Target ratio  │ Smaller draft = faster   │
│ Acceptance rate     │ Higher = more speedup    │
│ K (draft length)    │ Tune for acceptance      │
│ Text difficulty     │ Easy text = more speedup │
└─────────────────────┴──────────────────────────┘
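
To check the losslessness claim empirically, here is a self-contained Monte Carlo sketch over a toy 4-token vocabulary (both distributions are invented for illustration): drafting from q, applying the accept rule, and resampling rejections from the residual reproduces the target distribution p.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.1, 0.1])      # target distribution (toy)
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft distribution (toy)

residual = np.clip(p - q, 0, None)      # norm(max(0, p - q))
residual /= residual.sum()

counts = np.zeros(4)
for _ in range(200_000):
    tok = rng.choice(4, p=q)                       # draft proposes a token
    if rng.random() < min(1.0, p[tok] / q[tok]):   # acceptance criterion
        counts[tok] += 1
    else:
        counts[rng.choice(4, p=residual)] += 1     # resample from residual

print(counts / counts.sum())  # ≈ [0.5, 0.3, 0.1, 0.1], i.e., exactly p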

Interactive Exercise

Calculate Speedup

Your target model takes 100ms per token. Your draft model takes 10ms per token and has an 80% acceptance rate (on average 4 out of 5 drafted tokens are accepted). If you draft K=5 tokens per cycle, what's the approximate speedup?
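
One way to work it out, using the exercise's simplified accounting (4 of 5 drafts accepted per cycle, plus the corrected token the target supplies on rejection):

t_draft, t_target, K = 10, 100, 5      # ms per draft token, ms per verify pass
tokens_per_cycle = 4 + 1               # 4 accepted drafts + 1 corrected token
cycle_time = K * t_draft + t_target    # 5*10 + 100 = 150 ms
ms_per_token = cycle_time / tokens_per_cycle   # 150 / 5 = 30 ms/token
speedup = t_target / ms_per_token      # 100 / 30 ≈ 3.3×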

Pro Tips
  • Match tokenizers between draft and target (critical!) - see the check after this list
  • Tune K based on acceptance rate - higher acceptance allows larger K
  • Speedup is task-dependent - measure on your actual workload
  • Works best when draft and target distributions are similar
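
For the tokenizer tip above, a quick sanity check using Hugging Face transformers (the checkpoints are placeholders for your actual pair):

from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")       # placeholder draft
target_tok = AutoTokenizer.from_pretrained("gpt2-xl")   # placeholder target

# Identical vocabularies mean draft token IDs are valid target token IDs,
# which speculative decoding requires.
assert draft_tok.get_vocab() == target_tok.get_vocab(), "tokenizer mismatch!"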
