Optimization / Model Compression

Quantization

Intermediate [3/5]
Weight quantization · Model compression · Precision reduction

Definition

Quantization reduces the numerical precision of model weights and/or activations, typically from 32-bit or 16-bit floating point to lower bit-widths like 8-bit or 4-bit integers. This decreases memory usage and can speed up inference with minimal accuracy loss.

Quantization is essential for deploying large models on resource-constrained devices and reducing serving costs.
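
The core mechanics fit in a few lines of NumPy: quantize an FP32 weight matrix to INT8 with a single absmax scale, dequantize it back, and compare sizes and round-trip error. This is a minimal sketch of symmetric per-tensor quantization, not any specific library's implementation:

import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)          # stand-in FP32 weight matrix
scale = np.abs(w).max() / 127.0                              # absmax scaling: one scale per tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # quantize
w_hat = q.astype(np.float32) * scale                         # dequantize

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB")                 # ~67 MB
print(f"INT8 size: {q.nbytes / 1e6:.0f} MB")                 # ~17 MB (4x smaller)
print(f"max abs round-trip error: {np.abs(w - w_hat).max():.4f}")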

Key Concepts

  • Weight quantization: Reduce the precision of the model weights (applied in every scheme)
  • Activation quantization: Quantize intermediate values too (harder, because activations have outliers and a wide dynamic range)
  • Post-training quantization (PTQ): Quantize an already-trained model, using a small calibration set to choose scales
  • Quantization-aware training (QAT): Simulate quantization during training so the model learns to tolerate it (see the sketch below)
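
The PTQ/QAT distinction can be made concrete with a small PyTorch-style sketch of QAT's "fake quantization" using a straight-through estimator (STE). The function below is an illustrative assumption, not the API of any particular framework:

import torch

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """QAT-style fake quantization: round in the forward pass,
    pass gradients straight through in the backward pass (STE)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit
    scale = x.detach().abs().max() / qmax      # symmetric absmax scale
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses x_q, backward behaves like identity
    return x + (x_q - x).detach()

# During QAT, weights and activations pass through fake_quant in forward(),
# so the network learns to tolerate the rounding error it will see at inference.
# In PTQ, by contrast, the scales are chosen after training from calibration data.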

Examples

Methods
Quantization Approaches
QUANTIZATION METHODS:

1. UNIFORM QUANTIZATION (INT8):
   Linearly map FP values to integers
   quant(x) = round(x / scale) + zero_point
   dequant(q) = (q - zero_point) × scale

   Example:
   FP16 range: [-1.0, 1.0]
   INT8 range: [-128, 127]
   scale = 2.0 / 255 ≈ 0.0078
   0.5 → round(0.5 / 0.0078) = 64

2. NON-UNIFORM (NF4 - NormalFloat4):
   Optimized for normally distributed weights
   16 quantization levels placed at quantiles
   Better for LLM weights than uniform INT4

3. GPTQ (Post-Training):
   Layer-by-layer quantization
   Minimizes reconstruction error
   Uses calibration data for accuracy

4. AWQ (Activation-Aware):
   Protects important weights based on activations
   Better quality than naive quantization

5. GGUF/GGML (llama.cpp):
   CPU-optimized quantization
   Multiple formats: Q4_0, Q4_K_M, Q5_K_M, etc.
   K = k-quant (mixed precision per block)
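
As a rough illustration of the non-uniform idea behind NF4 (method 2 above), the sketch below builds 16 levels from the quantiles of a standard normal distribution and snaps each normalized weight to the nearest one. The real NF4 codebook and block-wise scaling in bitsandbytes differ in the details; this is a toy version under those assumptions:

import numpy as np

w = np.random.randn(1024).astype(np.float32)        # LLM weights are roughly normal
absmax = np.abs(w).max()
w_norm = w / absmax                                  # normalize to [-1, 1] (like block-wise absmax)

# 16 levels placed at quantiles of N(0, 1), rescaled to [-1, 1]
# (approximated here by sampling; the real NF4 codebook is fixed)
levels = np.quantile(np.random.randn(100_000), np.linspace(0.02, 0.98, 16))
levels = (levels / np.abs(levels).max()).astype(np.float32)

idx = np.abs(w_norm[:, None] - levels[None, :]).argmin(axis=1)   # 4-bit index per weight
w_hat = levels[idx] * absmax                                      # dequantize

print("mean abs error:", np.abs(w - w_hat).mean())
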
Comparison
Precision vs Quality Trade-offs
PRECISION COMPARISON (LLaMA 7B):

┌───────────┬─────────┬─────────┬───────────┐
│ Precision │ Size    │ Memory  │ Quality   │
├───────────┼─────────┼─────────┼───────────┤
│ FP32      │ 28 GB   │ ~32 GB  │ Baseline  │
│ FP16      │ 14 GB   │ ~16 GB  │ ≈ Same    │
│ BF16      │ 14 GB   │ ~16 GB  │ ≈ Same    │
│ INT8      │ 7 GB    │ ~8 GB   │ ~99%      │
│ INT4      │ 3.5 GB  │ ~4 GB   │ ~95-98%   │
│ 2-bit     │ 1.75 GB │ ~2 GB   │ ~85-90%   │
└───────────┴─────────┴─────────┴───────────┘

SPEED IMPROVEMENTS:
INT8: 1.5-2× faster than FP16 (with INT8 hardware)
INT4: 2-4× faster (memory bandwidth bound)

POPULAR FORMATS:

GPTQ (GPU):
# Load with auto-gptq
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",
    device="cuda:0"
)

AWQ (GPU):
# Load with autoawq
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ"
)

GGUF (CPU/Metal):
# Use with llama.cpp or llama-cpp-python
./main -m llama-2-7b.Q4_K_M.gguf

BitsAndBytes (Training):
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    load_in_4bit=True  # For QLoRA
)
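
The "Size" column above is just parameter count times bits per weight. A quick back-of-the-envelope helper (weights only, ignoring activations, KV cache, and quantization metadata such as scales):

PARAMS = 7e9  # LLaMA 7B

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4), ("2-bit", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>9}: {gb:6.2f} GB")
# FP32 ~28, FP16 ~14, INT8 ~7, INT4 ~3.5, 2-bit ~1.75 GB, matching the table above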

Interactive Exercise

Quantization Levels

INT8 has 256 possible values (2^8). How many distinct values can INT4 and INT2 represent?

Pro Tips
  • 4-bit is the sweet spot for most LLMs (4× smaller than FP16 at roughly 95% of the quality)
  • K-quants (e.g. Q4_K_M) usually give better quality than the simple Q4_0 format at a similar size
  • Larger models tolerate quantization better (a 70B model at Q4 loses less relative quality than a 7B model at Q4)
  • Always benchmark on your specific use case (see the sketch below)
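
One simple way to run that benchmark is to compare perplexity on your own text before and after quantization. A rough sketch using the Hugging Face transformers API (the model name and sample text are placeholders; long texts would need to be chunked to fit the context window):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str, **load_kwargs) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean cross-entropy over the sample
    return torch.exp(loss).item()

sample = "Paste a representative sample of your own domain text here."
fp16 = perplexity("meta-llama/Llama-2-7b-hf", sample, torch_dtype=torch.float16, device_map="auto")
int4 = perplexity("meta-llama/Llama-2-7b-hf", sample, load_in_4bit=True, device_map="auto")
print(f"FP16 perplexity: {fp16:.2f}  4-bit perplexity: {int4:.2f}")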

Related Terms