Model Architecture & Training / Deployment & Operations

Inference

Beginner [2/5]
Model inference · Prediction · Forward pass

Definition

Inference is the process of using a trained model to generate outputs from inputs. When you send a prompt to an LLM and get a response, the model is performing inference. It's the "using" phase as opposed to the "training" phase.

Inference happens every time you interact with ChatGPT, Claude, or any AI assistant—the model applies what it learned during training to your specific request.

Key Concepts

  • Latency: Time from input to output (speed)
  • Throughput: Requests processed per second
  • Cost: Computational resources needed
  • Batch inference: Processing multiple inputs together
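To make the last concept concrete, here is a minimal sketch of batch inference (it assumes PyTorch is available; the tiny nn.Linear is just a stand-in for a real trained model). It contrasts running requests one at a time with stacking them into a single batched forward pass:

import torch
import torch.nn as nn

model = nn.Linear(16, 4)    # stand-in for a real trained model
model.eval()                # inference mode: disables dropout / batch-norm updates

requests = [torch.randn(16) for _ in range(8)]   # 8 separate user inputs

with torch.no_grad():                      # no gradients needed at inference time
    # One forward pass per request: simple, but poor accelerator utilization
    outputs_single = [model(x) for x in requests]

    # Batch inference: stack the inputs and run one forward pass for all of them
    batch = torch.stack(requests)          # shape: (8, 16)
    outputs_batched = model(batch)         # shape: (8, 4)

Batching the eight requests keeps the GPU busy with one larger matrix multiply instead of eight small ones, which is why it improves throughput.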

Examples

Training vs Inference
Two Phases of ML
TRAINING (done once or rarely)
├── Process: Learn from data
├── Time: Days to months
├── Cost: $$$$ to $$$$$$
├── Hardware: Many GPUs
├── Weights: Updated
└── When: Before deployment

INFERENCE (done constantly)
├── Process: Generate outputs
├── Time: Milliseconds to seconds
├── Cost: $ per request
├── Hardware: Can be single GPU
├── Weights: Frozen (no updates)
└── When: Every user request

Example:
• GPT-4 training: Months, millions of dollars
• GPT-4 inference: <1 second, fraction of a cent
Training creates the model; inference uses it.
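In code, the two phases look roughly like this. This is a minimal PyTorch-style sketch; the toy linear model and random data are placeholders, not a real workload:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# TRAINING: weights are updated from data (done once, before deployment)
model.train()
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in training batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()           # compute gradients
    optimizer.step()          # update weights

# INFERENCE: weights are frozen, the model just maps inputs to outputs
model.eval()
with torch.no_grad():         # no gradients, no weight updates
    prediction = model(torch.randn(1, 10))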
Inference Metrics
What Matters for Production
LATENCY
Time to first token (TTFT): 200ms
Time per output token: 30ms
Total time for 100 output tokens: 200ms + (30ms × 100) = 3,200ms ≈ 3.2s

THROUGHPUT
Tokens per second: ~33 tokens/s (1s ÷ 0.03s per token)
Requests per minute: varies by request length

COST BREAKDOWN
Input tokens: $0.01 per 1K tokens
Output tokens: $0.03 per 1K tokens
A request with 500 input + 500 output tokens:
(500 × $0.01/1,000) + (500 × $0.03/1,000) = $0.005 + $0.015 = $0.02

SCALING FACTORS
• More users = more compute needed
• Longer outputs = higher cost
• Streaming = better UX, same cost
Understanding inference metrics helps optimize costs and user experience.
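The numbers above can be reproduced with a short calculation. The sketch below uses the illustrative figures from this entry (200ms TTFT, 30ms per output token, $0.01/$0.03 per 1K input/output tokens); these are example rates, not any provider's actual pricing:

# Illustrative inference metrics (example numbers, not real provider pricing)
TTFT_S = 0.200                  # time to first token, in seconds
PER_TOKEN_S = 0.030             # time per output token, in seconds
INPUT_COST_PER_1K = 0.01        # dollars per 1K input tokens
OUTPUT_COST_PER_1K = 0.03       # dollars per 1K output tokens

def latency_s(output_tokens: int) -> float:
    """Total time to generate a response of `output_tokens` tokens."""
    return TTFT_S + PER_TOKEN_S * output_tokens

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request."""
    return (input_tokens * INPUT_COST_PER_1K / 1000
            + output_tokens * OUTPUT_COST_PER_1K / 1000)

print(f"Latency for 100 tokens: {latency_s(100):.1f}s")        # 3.2s
print(f"Throughput: {1 / PER_TOKEN_S:.0f} tokens/s")           # ~33 tokens/s
print(f"Cost of 500 in + 500 out: ${cost_usd(500, 500):.3f}")  # $0.020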
Inference Optimization
Making Inference Faster
TECHNIQUES TO SPEED UP INFERENCE:

1. QUANTIZATION
   Reduce precision (fp32 → int8)
   → Smaller model, faster math

2. BATCHING
   Process multiple requests together
   → Better GPU utilization

3. KV CACHING
   Cache key-value pairs for attention
   → Avoid recomputing them for each new token

4. SPECULATIVE DECODING
   Small model proposes tokens, large model verifies them
   → Faster for certain patterns

5. DISTILLATION
   Train a smaller model to mimic the large one
   → Similar quality, much faster
Production systems use many techniques to optimize inference.
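As a concrete illustration of the first technique, here is a minimal sketch of symmetric int8 quantization using NumPy. It is a simplified illustration only; production inference stacks use more sophisticated schemes (per-channel scales, calibration data, quantized kernels):

import numpy as np

# A stand-in weight matrix in fp32 (a real model has millions of these values)
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Symmetric quantization: map the fp32 range onto the int8 range [-127, 127]
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# The int8 weights are 4x smaller than fp32; dequantizing recovers an
# approximation of the original values, so some precision is lost
weights_dequant = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - weights_dequant).max()
print(f"Max quantization error: {max_error:.4f}")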

Interactive Exercise

Calculate Inference Cost

You're building a chatbot with these specs:

  • Average prompt: 500 tokens
  • Average response: 300 tokens
  • Expected: 10,000 conversations/day
  • Input cost: $0.01/1K tokens, Output: $0.03/1K tokens

What's your daily inference cost?
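One way to work through the exercise is with a short calculation like the sketch below. It treats each conversation as a single prompt/response pair, as the spec implies, and uses the example rates given above (not real provider pricing):

# Chatbot spec from the exercise above
PROMPT_TOKENS = 500
RESPONSE_TOKENS = 300
CONVERSATIONS_PER_DAY = 10_000
INPUT_COST_PER_1K = 0.01    # dollars per 1K input tokens
OUTPUT_COST_PER_1K = 0.03   # dollars per 1K output tokens

input_cost = CONVERSATIONS_PER_DAY * PROMPT_TOKENS * INPUT_COST_PER_1K / 1000
output_cost = CONVERSATIONS_PER_DAY * RESPONSE_TOKENS * OUTPUT_COST_PER_1K / 1000

print(f"Daily input cost:  ${input_cost:,.2f}")
print(f"Daily output cost: ${output_cost:,.2f}")
print(f"Daily total:       ${input_cost + output_cost:,.2f}")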

Pro Tips
  • Output tokens typically cost 2-3x more than input tokens
  • Shorter prompts = lower latency AND lower cost
  • Consider caching frequent responses (see the sketch after this list)
  • Monitor token usage to catch runaway costs
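The caching tip can start as simply as memoizing responses for exact-repeat prompts. In the minimal sketch below, call_model is a hypothetical placeholder for whatever model or API actually serves your requests:

# Minimal response cache for exact-repeat prompts
cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    # Placeholder: in a real system this would run inference or call an API
    return f"model response to: {prompt!r}"

def cached_generate(prompt: str) -> str:
    if prompt in cache:
        return cache[prompt]           # cache hit: no inference cost, near-zero latency
    response = call_model(prompt)      # cache miss: pay for one inference call
    cache[prompt] = response
    return response

Real systems typically bound the cache size (e.g., an LRU policy) and may match semantically similar prompts rather than exact strings, but the cost-saving idea is the same.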

Related Terms