Training / Fine-Tuning

QLoRA

Advanced [4/5]
Quantized LoRA · 4-bit LoRA

Definition

QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of large models on a single GPU. The base model is loaded in 4-bit precision while LoRA adapters are trained in higher precision (BF16/FP16).

QLoRA makes it possible to fine-tune a 65B-parameter model on a single 48 GB GPU, and a ~33B model on a single 24 GB consumer GPU, with minimal quality loss.
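
The core pattern, as a minimal sketch (assumes a CUDA GPU with the bitsandbytes and peft packages installed; a deliberately small model is used purely for illustration, while the full 70B setup appears in the Implementation example below): load the base model in 4-bit NF4, then attach trainable LoRA adapters on top.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model stored in 4-bit NF4; LoRA adapters will train in bfloat16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",              # tiny model, illustration only
    quantization_config=bnb,
    device_map="auto"
)
model = get_peft_model(base, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
))
model.print_trainable_parameters()    # only the small LoRA adapters require gradients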

Key Concepts

  • 4-bit NormalFloat (NF4): Data type whose 16 levels sit at quantiles of a normal distribution, matching the typical distribution of pretrained weights
  • Double quantization: Quantizes the quantization constants themselves for extra memory savings (see the sketch after this list)
  • Paged optimizers: Offloads optimizer states to CPU when needed
  • BF16 adapters: LoRA weights trained in higher precision
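
NF4 stores one FP32 scaling constant per block of 64 weights; double quantization compresses those constants as well. A back-of-the-envelope sketch of the per-parameter storage cost, using the block sizes reported in the QLoRA paper (64 weights per block, 256 constants per second-level block):

# Per-parameter storage cost of NF4, with and without double quantization
WEIGHT_BITS     = 4      # NF4 payload per weight
BLOCK           = 64     # weights per quantization block
CONST_BITS_FP32 = 32     # one FP32 absmax constant per block
SECOND_BLOCK    = 256    # constants per second-level block (double quant)
CONST_BITS_8BIT = 8      # constants re-quantized to 8-bit floats

plain  = WEIGHT_BITS + CONST_BITS_FP32 / BLOCK
double = WEIGHT_BITS + CONST_BITS_8BIT / BLOCK + CONST_BITS_FP32 / (BLOCK * SECOND_BLOCK)
print(f"NF4 alone:          {plain:.3f} bits/param")          # 4.500
print(f"NF4 + double quant: {double:.3f} bits/param")         # ~4.127
print(f"saving:             {plain - double:.3f} bits/param") # ~0.373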

Examples

Memory
QLoRA Memory Savings
MEMORY COMPARISON FOR 65B MODEL:

FULL FINE-TUNING (FP16):
  Model weights:     130 GB
  Gradients:         130 GB
  Optimizer (Adam):  260 GB
  Activations:        50 GB
  ─────────────────────────
  Total:            ~570 GB   (needs 8× A100 80GB!)

LORA (FP16 base):
  Model weights:     130 GB   (frozen, no grads)
  LoRA params:       0.5 GB
  Gradients:         0.5 GB
  Optimizer:           1 GB
  ─────────────────────────
  Total:            ~132 GB   (still needs 2× A100)

QLORA (4-bit base):
  Model weights:     ~33 GB   (4-bit quantized)
  LoRA params:       0.5 GB   (BF16)
  Gradients:         0.5 GB
  Optimizer:           1 GB
  Activations:        ~5 GB   (with gradient checkpointing)
  ─────────────────────────
  Total:             ~40 GB   (fits a single 48 GB GPU)

BREAKTHROUGH:
  65B fine-tuning: 8× A100 80GB → a single 48 GB GPU (e.g., RTX A6000)
  Cost: a multi-GPU A100 node → one workstation-class GPU, roughly an order of magnitude cheaper per hour
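
Since the 4-bit weights dominate the footprint, the base-model term can be estimated directly. A sketch, assuming the ~4.13 effective bits/param of NF4 with double quantization; adapters, optimizer state and activations add a few GB on top:

# Rough base-weight footprint at a given effective bit-width
def quantized_weight_gb(n_params_billion, bits_per_param=4.13):
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for size_b in (7, 13, 33, 65):
    print(f"{size_b:>3}B base weights: ~{quantized_weight_gb(size_b):.1f} GB")
# 7B ~3.6 GB, 13B ~6.7 GB, 33B ~17.0 GB, 65B ~33.6 GB
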
Implementation
QLoRA Training Setup
QLORA WITH BITSANDBYTES + PEFT:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    r=64,                                  # higher rank for larger models
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj"       # MLP too
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Train with gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()

HARDWARE REQUIREMENTS (rough estimates; actual usage varies with
sequence length, batch size and LoRA rank):
┌─────────────┬──────────────────────────┐
│ Model Size  │ GPU Memory (QLoRA)       │
├─────────────┼──────────────────────────┤
│ 7B          │ ~6 GB   (RTX 3060)       │
│ 13B         │ ~10 GB  (RTX 3080)       │
│ 30B         │ ~20 GB  (RTX 4090)       │
│ 65B         │ ~40 GB  (RTX A6000 48GB) │
│ 70B         │ ~46 GB  (A100 80GB)      │
└─────────────┴──────────────────────────┘
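
To actually run training, the Hugging Face Trainer can drive the model above, with a paged 8-bit AdamW standing in for the paged optimizer mentioned in Key Concepts. A sketch, assuming a tokenized train_dataset and its tokenizer already exist (the output path and hyperparameters are placeholders, not recommendations); in practice, prepare_model_for_kbit_training from peft is also usually applied to the quantized base model before get_peft_model.

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="qlora-out",              # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,      # effective batch size of 16
    learning_rate=2e-4,
    bf16=True,                           # compute in bfloat16
    optim="paged_adamw_8bit",            # paged optimizer from bitsandbytes
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the QLoRA model built above
    args=args,
    train_dataset=train_dataset,         # assumed: a tokenized causal-LM dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # assumed tokenizer
)
trainer.train()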

Interactive Exercise

Memory Reduction

A 13B parameter model uses 26GB in FP16. How much memory does the 4-bit quantized version use (ignoring other overhead)?
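
For checking your answer afterwards, a one-line sketch of the arithmetic (FP16 stores 16 bits per parameter, so 4-bit storage is a quarter of the size):

print(26 * 4 / 16)   # 6.5 GB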

Pro Tips
  • Use NF4 quantization type for best quality (not INT4)
  • Enable double quantization for ~0.4 bits/param extra savings
  • Gradient checkpointing trades compute for memory
  • QLoRA quality matches full fine-tuning on most benchmarks

Related Terms