Frameworks & Tools / Inference Engines

vLLM

Advanced [4/5]
Virtual LLM · PagedAttention · inference · high-throughput serving

Definition

vLLM is a high-throughput, memory-efficient LLM inference and serving library. Its key innovation is PagedAttention, which manages the key-value (KV) cache like virtual memory, enabling efficient batching of requests with different sequence lengths.

vLLM can deliver 2-24x higher throughput than conventional serving stacks, with the largest gains over naive, unbatched Transformers serving, which has made it a popular choice for production LLM deployments.

Key Concepts

  • PagedAttention: Manages the KV cache in non-contiguous, fixed-size blocks
  • Continuous batching: Dynamically adds and removes requests from the running batch (sketched below)
  • Memory efficiency: Near-optimal GPU memory utilization
  • OpenAI-compatible API: Drop-in replacement for OpenAI API clients
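
To make continuous batching concrete, here is a deliberately simplified, hypothetical scheduler loop in plain Python. The Request class and step_batch function are illustrative stand-ins, not vLLM APIs: new requests join the running batch as soon as a slot frees up, and finished requests leave immediately instead of waiting for the whole batch to drain.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step_batch(running):
    """One decode step: generate one token for every running request (stubbed)."""
    for req in running:
        req.generated.append("<tok>")  # placeholder for a real model forward pass

def serve(incoming, max_batch_size=4):
    waiting, running = deque(incoming), []
    while waiting or running:
        # Admit new requests as soon as slots free up; static batching would
        # instead wait for the entire current batch to finish.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step_batch(running)
        # Retire finished requests immediately, freeing their slots.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]

serve([Request("Hello", 3), Request("Hi there", 5), Request("What is AI?", 2)])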

Examples

Innovation
PagedAttention Explained
THE PROBLEM: KV CACHE MEMORY WASTE

Traditional approach:
┌─────────────────────────────────────────────────┐
│ Request 1: "Hello" (5 tokens, needs 100 tokens) │
│ KV Cache: [█████░░░░░░░░░░░░░░░...100 slots]    │
│ Used: 5%    Wasted: 95%                         │
├─────────────────────────────────────────────────┤
│ Request 2: "Hi there" (8 tokens, needs 100)     │
│ KV Cache: [████████░░░░░░░░░░░░...100 slots]    │
│ Used: 8%    Wasted: 92%                         │
└─────────────────────────────────────────────────┘

Problem: Must pre-allocate max length!
→ Can't fit many requests in GPU memory
→ Memory is mostly empty/wasted

---

PAGEDATTENTION SOLUTION:
Like virtual memory in operating systems!

Physical GPU Memory (blocks):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ A1│ B1│ A2│ C1│ B2│ A3│ C2│ B3│
└───┴───┴───┴───┴───┴───┴───┴───┘

Logical view per request:
Request A: [A1] → [A2] → [A3]  (non-contiguous!)
Request B: [B1] → [B2] → [B3]
Request C: [C1] → [C2]

Benefits:
✓ Blocks allocated on-demand as tokens generated
✓ No pre-allocation waste
✓ Blocks can be anywhere in memory
✓ Easy to share common prefixes

MEMORY COMPARISON:
┌─────────────────────┬─────────────┬──────────┐
│ Metric              │ Traditional │ vLLM     │
├─────────────────────┼─────────────┼──────────┤
│ Memory utilization  │ 20-40%      │ >90%     │
│ Concurrent requests │ 10-50       │ 100-500+ │
│ Throughput          │ 1x          │ 2-24x    │
└─────────────────────┴─────────────┴──────────┘
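
A quick back-of-the-envelope calculation shows why this matters. The numbers below assume Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, fp16 cache); they are illustrative, so check the real model config before relying on them.

# KV cache sizing sketch (illustrative, Llama-2-7B-like dimensions)
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")   # ~0.50 MiB

max_len, actual_len = 2048, 100          # pre-allocated slot vs. tokens actually used
used = actual_len * kv_per_token

# Traditional: reserve the full maximum length up front
reserved_traditional = max_len * kv_per_token
print(f"traditional utilization: {used / reserved_traditional:.1%}")  # ~4.9%

# Paged: reserve only the blocks actually touched (16 tokens per block, vLLM's default)
block_size = 16
blocks = -(-actual_len // block_size)    # ceiling division
reserved_paged = blocks * block_size * kv_per_token
print(f"paged utilization: {used / reserved_paged:.1%}")               # ~89%
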
Usage
Running vLLM Server
BASIC VLLM USAGE:

# Install vLLM
pip install vllm

# Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000

# Now use it with the standard OpenAI client!
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Local server ignores the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

---

ADVANCED CONFIGURATION:

# High-throughput serving: two GPUs (tensor parallel), 90% of GPU memory,
# up to 256 concurrent sequences, AWQ quantization (requires an AWQ checkpoint)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256 \
  --quantization awq

---

PYTHON API (offline batch inference):

from vllm import LLM, SamplingParams

# Initialize the engine
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Batch inference (much faster than sequential!)
prompts = [
    "What is AI?",
    "Explain quantum computing.",
    "Write a haiku about coding.",
    # ... hundreds more
]

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Process everything at once - vLLM batches requests efficiently
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)

---

PERFORMANCE COMPARISON:

Task: generate 1000 responses (avg 100 tokens each)

Naive sequential (Hugging Face Transformers):
  Time: ~45 minutes
  GPU utilization: 30-40%

vLLM batched:
  Time: ~3 minutes
  GPU utilization: 85-95%

SPEEDUP: ~15x faster!
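
The OpenAI-compatible endpoint also supports token streaming through the standard client. A minimal sketch, assuming the server from the basic example above is running on port 8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=True yields chunks as tokens are generated instead of one final response
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()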

Interactive Exercise

Choose Serving Strategy

You need to serve a 7B-parameter model on 1 GPU with 24 GB of VRAM, with expected traffic of 50 concurrent users and support for contexts up to 4K tokens. Would you use vLLM, and how would you configure it?

Pro Tips
  • Use quantization (AWQ, GPTQ) to fit larger models in limited VRAM
  • Set gpu-memory-utilization to 0.85-0.95 for production (see the example configuration below)
  • Monitor KV cache usage to tune max-num-seqs
  • vLLM's OpenAI-compatible API makes migration easy
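
Putting several of these tips together, a single 24 GB GPU setup might look like the sketch below. The AWQ checkpoint name and numeric values are illustrative starting points to tune against real traffic, not benchmark-backed recommendations; the same knobs exist as flags on the OpenAI-compatible server.

from vllm import LLM, SamplingParams

# Hypothetical single-GPU configuration mirroring the tips above
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # example AWQ-quantized 7B checkpoint
    quantization="awq",                    # smaller weights leave more room for KV cache
    gpu_memory_utilization=0.9,            # keep ~10% headroom
    max_model_len=4096,                    # 4K-token context
    max_num_seqs=128,                      # tune against observed KV cache usage
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)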

Related Terms