Optimization / Model Compression

Quantization

Intermediate [3/5]
Weight quantization · Model compression · Precision reduction

Definition

Quantization reduces the numerical precision of model weights and/or activations, typically from 32-bit or 16-bit floating point to lower bit-widths like 8-bit or 4-bit integers. This decreases memory usage and can speed up inference with minimal accuracy loss.

Quantization is essential for deploying large models on resource-constrained devices and reducing serving costs.
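
The core mechanics fit in a few lines of NumPy: quantize an FP32 weight matrix to INT8 with a single absmax scale, dequantize it back, and compare sizes and round-trip error. This is a minimal sketch of symmetric per-tensor quantization, not any specific library's implementation:

import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)          # stand-in FP32 weight matrix
scale = np.abs(w).max() / 127.0                              # absmax scaling: one scale per tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # quantize
w_hat = q.astype(np.float32) * scale                         # dequantize

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB")                 # ~67 MB
print(f"INT8 size: {q.nbytes / 1e6:.0f} MB")                 # ~17 MB (4x smaller)
print(f"max abs round-trip error: {np.abs(w - w_hat).max():.4f}")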

Key Concepts

  • Weight quantization: Reduce the precision of the model weights (applied in every scheme)
  • Activation quantization: Quantize intermediate values too (harder, because activations have outliers and a wide dynamic range)
  • Post-training quantization (PTQ): Quantize an already-trained model, using a small calibration set to choose scales
  • Quantization-aware training (QAT): Simulate quantization during training so the model learns to tolerate it (see the sketch below)
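
The PTQ/QAT distinction can be made concrete with a small PyTorch-style sketch of QAT's "fake quantization" using a straight-through estimator (STE). The function below is an illustrative assumption, not the API of any particular framework:

import torch

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """QAT-style fake quantization: round in the forward pass,
    pass gradients straight through in the backward pass (STE)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit
    scale = x.detach().abs().max() / qmax      # symmetric absmax scale
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses x_q, backward behaves like identity
    return x + (x_q - x).detach()

# During QAT, weights and activations pass through fake_quant in forward(),
# so the network learns to tolerate the rounding error it will see at inference.
# In PTQ, by contrast, the scales are chosen after training from calibration data.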

Examples

Methods
Quantization Approaches
QUANTIZATION METHODS:

1. UNIFORM QUANTIZATION (INT8):
   Linearly map FP values to integers
   quant(x) = round(x / scale) + zero_point
   dequant(q) = (q - zero_point) × scale

   Example:
   FP16 range: [-1.0, 1.0]
   INT8 range: [-128, 127]
   scale = 2.0 / 255 ≈ 0.0078
   0.5 → round(0.5 / 0.0078) = 64

2. NON-UNIFORM (NF4 - NormalFloat4):
   Optimized for normally distributed weights
   16 quantization levels placed at quantiles
   Better for LLM weights than uniform INT4

3. GPTQ (Post-Training):
   Layer-by-layer quantization
   Minimizes reconstruction error
   Uses calibration data for accuracy

4. AWQ (Activation-Aware):
   Protects important weights based on activations
   Better quality than naive quantization

5. GGUF/GGML (llama.cpp):
   CPU-optimized quantization
   Multiple formats: Q4_0, Q4_K_M, Q5_K_M, etc.
   K = k-quant (mixed precision per block)
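
As a rough illustration of the non-uniform idea behind NF4 (method 2 above), the sketch below builds 16 levels from the quantiles of a standard normal distribution and snaps each normalized weight to the nearest one. The real NF4 codebook and block-wise scaling in bitsandbytes differ in the details; this is a toy version under those assumptions:

import numpy as np

w = np.random.randn(1024).astype(np.float32)        # LLM weights are roughly normal
absmax = np.abs(w).max()
w_norm = w / absmax                                  # normalize to [-1, 1] (like block-wise absmax)

# 16 levels placed at quantiles of N(0, 1), rescaled to [-1, 1]
# (approximated here by sampling; the real NF4 codebook is fixed)
levels = np.quantile(np.random.randn(100_000), np.linspace(0.02, 0.98, 16))
levels = (levels / np.abs(levels).max()).astype(np.float32)

idx = np.abs(w_norm[:, None] - levels[None, :]).argmin(axis=1)   # 4-bit index per weight
w_hat = levels[idx] * absmax                                      # dequantize

print("mean abs error:", np.abs(w - w_hat).mean())
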
Comparison
Precision vs Quality Trade-offs
PRECISION COMPARISON (LLaMA 7B):

┌───────────┬─────────┬─────────┬───────────┐
│ Precision │ Size    │ Memory  │ Quality   │
├───────────┼─────────┼─────────┼───────────┤
│ FP32      │ 28 GB   │ ~32 GB  │ Baseline  │
│ FP16      │ 14 GB   │ ~16 GB  │ ≈ Same    │
│ BF16      │ 14 GB   │ ~16 GB  │ ≈ Same    │
│ INT8      │ 7 GB    │ ~8 GB   │ ~99%      │
│ INT4      │ 3.5 GB  │ ~4 GB   │ ~95-98%   │
│ 2-bit     │ 1.75 GB │ ~2 GB   │ ~85-90%   │
└───────────┴─────────┴─────────┴───────────┘

SPEED IMPROVEMENTS:
INT8: 1.5-2× faster than FP16 (with INT8 hardware)
INT4: 2-4× faster (memory bandwidth bound)

POPULAR FORMATS:

GPTQ (GPU):
# Load with auto-gptq
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",
    device="cuda:0"
)

AWQ (GPU):
# Load with autoawq
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ"
)

GGUF (CPU/Metal):
# Use with llama.cpp or llama-cpp-python
./main -m llama-2-7b.Q4_K_M.gguf

BitsAndBytes (Training):
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    load_in_4bit=True  # For QLoRA
)
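
The "Size" column above is just parameter count times bits per weight. A quick back-of-the-envelope helper (weights only, ignoring activations, KV cache, and quantization metadata such as scales):

PARAMS = 7e9  # LLaMA 7B

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4), ("2-bit", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>9}: {gb:6.2f} GB")
# FP32 ~28, FP16 ~14, INT8 ~7, INT4 ~3.5, 2-bit ~1.75 GB, matching the table above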

Interactive Exercise

Quantization Levels

INT8 has 256 possible values (2^8). How many distinct values can INT4 and INT2 represent?

Pro Tips
  • 4-bit is the sweet spot for most LLMs (4× smaller than FP16 at roughly 95% of the quality)
  • K-quants (e.g. Q4_K_M) usually give better quality than the simple Q4_0 format at a similar size
  • Larger models tolerate quantization better (a 70B model at Q4 loses less relative quality than a 7B model at Q4)
  • Always benchmark on your specific use case (see the sketch below)
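
One simple way to run that benchmark is to compare perplexity on your own text before and after quantization. A rough sketch using the Hugging Face transformers API (the model name and sample text are placeholders; long texts would need to be chunked to fit the context window):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str, **load_kwargs) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean cross-entropy over the sample
    return torch.exp(loss).item()

sample = "Paste a representative sample of your own domain text here."
fp16 = perplexity("meta-llama/Llama-2-7b-hf", sample, torch_dtype=torch.float16, device_map="auto")
int4 = perplexity("meta-llama/Llama-2-7b-hf", sample, load_in_4bit=True, device_map="auto")
print(f"FP16 perplexity: {fp16:.2f}  4-bit perplexity: {int4:.2f}")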

Related Terms