Model Architecture & Training / Deployment & Operations

Inference

Beginner [2/5]
Model inference · Prediction · Forward pass

Definition

Inference is the process of using a trained model to generate outputs from inputs. When you send a prompt to an LLM and get a response, the model is performing inference. It's the "using" phase as opposed to the "training" phase.

Inference happens every time you interact with ChatGPT, Claude, or any AI assistant—the model applies what it learned during training to your specific request.

Key Concepts

  • Latency: Time from input to output (speed)
  • Throughput: Requests processed per second
  • Cost: Computational resources needed
  • Batch inference: Processing multiple inputs together
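To make the last concept concrete, here is a minimal sketch of batch inference (it assumes PyTorch is available; the tiny nn.Linear is just a stand-in for a real trained model). It contrasts running requests one at a time with stacking them into a single batched forward pass:

import torch
import torch.nn as nn

model = nn.Linear(16, 4)    # stand-in for a real trained model
model.eval()                # inference mode: disables dropout / batch-norm updates

requests = [torch.randn(16) for _ in range(8)]   # 8 separate user inputs

with torch.no_grad():                      # no gradients needed at inference time
    # One forward pass per request: simple, but poor accelerator utilization
    outputs_single = [model(x) for x in requests]

    # Batch inference: stack the inputs and run one forward pass for all of them
    batch = torch.stack(requests)          # shape: (8, 16)
    outputs_batched = model(batch)         # shape: (8, 4)

Batching the eight requests keeps the GPU busy with one larger matrix multiply instead of eight small ones, which is why it improves throughput.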

Examples

Training vs Inference
Two Phases of ML
TRAINING (done once or rarely)
├── Process: Learn from data
├── Time: Days to months
├── Cost: $$$$ to $$$$$$
├── Hardware: Many GPUs
├── Weights: Updated
└── When: Before deployment

INFERENCE (done constantly)
├── Process: Generate outputs
├── Time: Milliseconds to seconds
├── Cost: $ per request
├── Hardware: Can be single GPU
├── Weights: Frozen (no updates)
└── When: Every user request

Example:
• GPT-4 training: Months, millions of dollars
• GPT-4 inference: <1 second, fraction of a cent
Training creates the model; inference uses it.
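In code, the two phases look roughly like this. This is a minimal PyTorch-style sketch; the toy linear model and random data are placeholders, not a real workload:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# TRAINING: weights are updated from data (done once, before deployment)
model.train()
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in training batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()           # compute gradients
    optimizer.step()          # update weights

# INFERENCE: weights are frozen, the model just maps inputs to outputs
model.eval()
with torch.no_grad():         # no gradients, no weight updates
    prediction = model(torch.randn(1, 10))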
Inference Metrics
What Matters for Production
LATENCY
Time to first token (TTFT): 200ms
Time per output token: 30ms
Total time for 100 output tokens: 200ms + (30ms × 100) = 3,200ms ≈ 3.2s

THROUGHPUT
Tokens per second: ~33 tokens/s (1s ÷ 0.03s per token)
Requests per minute: varies by request length

COST BREAKDOWN
Input tokens: $0.01 per 1K tokens
Output tokens: $0.03 per 1K tokens
A request with 500 input + 500 output tokens:
(500 × $0.01/1,000) + (500 × $0.03/1,000) = $0.005 + $0.015 = $0.02

SCALING FACTORS
• More users = more compute needed
• Longer outputs = higher cost
• Streaming = better UX, same cost
Understanding inference metrics helps optimize costs and user experience.
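The numbers above can be reproduced with a short calculation. The sketch below uses the illustrative figures from this entry (200ms TTFT, 30ms per output token, $0.01/$0.03 per 1K input/output tokens); these are example rates, not any provider's actual pricing:

# Illustrative inference metrics (example numbers, not real provider pricing)
TTFT_S = 0.200                  # time to first token, in seconds
PER_TOKEN_S = 0.030             # time per output token, in seconds
INPUT_COST_PER_1K = 0.01        # dollars per 1K input tokens
OUTPUT_COST_PER_1K = 0.03       # dollars per 1K output tokens

def latency_s(output_tokens: int) -> float:
    """Total time to generate a response of `output_tokens` tokens."""
    return TTFT_S + PER_TOKEN_S * output_tokens

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request."""
    return (input_tokens * INPUT_COST_PER_1K / 1000
            + output_tokens * OUTPUT_COST_PER_1K / 1000)

print(f"Latency for 100 tokens: {latency_s(100):.1f}s")        # 3.2s
print(f"Throughput: {1 / PER_TOKEN_S:.0f} tokens/s")           # ~33 tokens/s
print(f"Cost of 500 in + 500 out: ${cost_usd(500, 500):.3f}")  # $0.020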
Inference Optimization
Making Inference Faster
TECHNIQUES TO SPEED UP INFERENCE:

1. QUANTIZATION
   Reduce precision (fp32 → int8)
   → Smaller model, faster math

2. BATCHING
   Process multiple requests together
   → Better GPU utilization

3. KV CACHING
   Cache key-value pairs for attention
   → Avoid recomputing them for each new token

4. SPECULATIVE DECODING
   Small model proposes tokens, large model verifies them
   → Faster for certain patterns

5. DISTILLATION
   Train a smaller model to mimic the large one
   → Similar quality, much faster
Production systems use many techniques to optimize inference.
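As a concrete illustration of the first technique, here is a minimal sketch of symmetric int8 quantization using NumPy. It is a simplified illustration only; production inference stacks use more sophisticated schemes (per-channel scales, calibration data, quantized kernels):

import numpy as np

# A stand-in weight matrix in fp32 (a real model has millions of these values)
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Symmetric quantization: map the fp32 range onto the int8 range [-127, 127]
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# The int8 weights are 4x smaller than fp32; dequantizing recovers an
# approximation of the original values, so some precision is lost
weights_dequant = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - weights_dequant).max()
print(f"Max quantization error: {max_error:.4f}")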

Interactive Exercise

Calculate Inference Cost

You're building a chatbot with these specs:

  • Average prompt: 500 tokens
  • Average response: 300 tokens
  • Expected: 10,000 conversations/day
  • Input cost: $0.01/1K tokens, Output: $0.03/1K tokens

What's your daily inference cost?
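One way to work through the exercise is with a short calculation like the sketch below. It treats each conversation as a single prompt/response pair, as the spec implies, and uses the example rates given above (not real provider pricing):

# Chatbot spec from the exercise above
PROMPT_TOKENS = 500
RESPONSE_TOKENS = 300
CONVERSATIONS_PER_DAY = 10_000
INPUT_COST_PER_1K = 0.01    # dollars per 1K input tokens
OUTPUT_COST_PER_1K = 0.03   # dollars per 1K output tokens

input_cost = CONVERSATIONS_PER_DAY * PROMPT_TOKENS * INPUT_COST_PER_1K / 1000
output_cost = CONVERSATIONS_PER_DAY * RESPONSE_TOKENS * OUTPUT_COST_PER_1K / 1000

print(f"Daily input cost:  ${input_cost:,.2f}")
print(f"Daily output cost: ${output_cost:,.2f}")
print(f"Daily total:       ${input_cost + output_cost:,.2f}")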

Pro Tips
  • Output tokens typically cost 2-3x more than input tokens
  • Shorter prompts = lower latency AND lower cost
  • Consider caching frequent responses (see the sketch after this list)
  • Monitor token usage to catch runaway costs
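The caching tip can start as simply as memoizing responses for exact-repeat prompts. In the minimal sketch below, call_model is a hypothetical placeholder for whatever model or API actually serves your requests:

# Minimal response cache for exact-repeat prompts
cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    # Placeholder: in a real system this would run inference or call an API
    return f"model response to: {prompt!r}"

def cached_generate(prompt: str) -> str:
    if prompt in cache:
        return cache[prompt]           # cache hit: no inference cost, near-zero latency
    response = call_model(prompt)      # cache miss: pay for one inference call
    cache[prompt] = response
    return response

Real systems typically bound the cache size (e.g., an LRU policy) and may match semantically similar prompts rather than exact strings, but the cost-saving idea is the same.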

Related Terms