Training / Fine-Tuning

QLoRA

Advanced [4/5]
Quantized LoRA · 4-bit LoRA

Definition

QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of large models on a single GPU. The base model is loaded in 4-bit precision while LoRA adapters are trained in higher precision (BF16/FP16).

QLoRA makes it possible to fine-tune a 65B-parameter model on a single 48 GB GPU, and a ~33B model on a single 24 GB consumer GPU, with minimal quality loss.
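
The core pattern, as a minimal sketch (assumes a CUDA GPU with the bitsandbytes and peft packages installed; a deliberately small model is used purely for illustration, while the full 70B setup appears in the Implementation example below): load the base model in 4-bit NF4, then attach trainable LoRA adapters on top.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model stored in 4-bit NF4; LoRA adapters will train in bfloat16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",              # tiny model, illustration only
    quantization_config=bnb,
    device_map="auto"
)
model = get_peft_model(base, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
))
model.print_trainable_parameters()    # only the small LoRA adapters require gradients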

Key Concepts

  • 4-bit NormalFloat (NF4): Data type whose 16 levels sit at quantiles of a normal distribution, matching the typical distribution of pretrained weights
  • Double quantization: Quantizes the quantization constants themselves for extra memory savings (see the sketch after this list)
  • Paged optimizers: Offloads optimizer states to CPU when needed
  • BF16 adapters: LoRA weights trained in higher precision
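
NF4 stores one FP32 scaling constant per block of 64 weights; double quantization compresses those constants as well. A back-of-the-envelope sketch of the per-parameter storage cost, using the block sizes reported in the QLoRA paper (64 weights per block, 256 constants per second-level block):

# Per-parameter storage cost of NF4, with and without double quantization
WEIGHT_BITS     = 4      # NF4 payload per weight
BLOCK           = 64     # weights per quantization block
CONST_BITS_FP32 = 32     # one FP32 absmax constant per block
SECOND_BLOCK    = 256    # constants per second-level block (double quant)
CONST_BITS_8BIT = 8      # constants re-quantized to 8-bit floats

plain  = WEIGHT_BITS + CONST_BITS_FP32 / BLOCK
double = WEIGHT_BITS + CONST_BITS_8BIT / BLOCK + CONST_BITS_FP32 / (BLOCK * SECOND_BLOCK)
print(f"NF4 alone:          {plain:.3f} bits/param")          # 4.500
print(f"NF4 + double quant: {double:.3f} bits/param")         # ~4.127
print(f"saving:             {plain - double:.3f} bits/param") # ~0.373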

Examples

Memory
QLoRA Memory Savings
MEMORY COMPARISON FOR 65B MODEL:

FULL FINE-TUNING (FP16):
  Model weights:     130 GB
  Gradients:         130 GB
  Optimizer (Adam):  260 GB
  Activations:        50 GB
  ─────────────────────────
  Total:            ~570 GB   (needs 8× A100 80GB!)

LORA (FP16 base):
  Model weights:     130 GB   (frozen, no grads)
  LoRA params:       0.5 GB
  Gradients:         0.5 GB
  Optimizer:           1 GB
  ─────────────────────────
  Total:            ~132 GB   (still needs 2× A100)

QLORA (4-bit base):
  Model weights:     ~33 GB   (4-bit quantized)
  LoRA params:       0.5 GB   (BF16)
  Gradients:         0.5 GB
  Optimizer:           1 GB
  Activations:        ~5 GB   (with gradient checkpointing)
  ─────────────────────────
  Total:             ~40 GB   (fits a single 48 GB GPU)

BREAKTHROUGH:
  65B fine-tuning: 8× A100 80GB → a single 48 GB GPU (e.g., RTX A6000)
  Cost: a multi-GPU A100 node → one workstation-class GPU, roughly an order of magnitude cheaper per hour
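
Since the 4-bit weights dominate the footprint, the base-model term can be estimated directly. A sketch, assuming the ~4.13 effective bits/param of NF4 with double quantization; adapters, optimizer state and activations add a few GB on top:

# Rough base-weight footprint at a given effective bit-width
def quantized_weight_gb(n_params_billion, bits_per_param=4.13):
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for size_b in (7, 13, 33, 65):
    print(f"{size_b:>3}B base weights: ~{quantized_weight_gb(size_b):.1f} GB")
# 7B ~3.6 GB, 13B ~6.7 GB, 33B ~17.0 GB, 65B ~33.6 GB
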
Implementation
QLoRA Training Setup
QLORA WITH BITSANDBYTES + PEFT:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    r=64,                                  # higher rank for larger models
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj"       # MLP too
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Train with gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()

HARDWARE REQUIREMENTS (rough estimates; actual usage varies with
sequence length, batch size and LoRA rank):
┌─────────────┬──────────────────────────┐
│ Model Size  │ GPU Memory (QLoRA)       │
├─────────────┼──────────────────────────┤
│ 7B          │ ~6 GB   (RTX 3060)       │
│ 13B         │ ~10 GB  (RTX 3080)       │
│ 30B         │ ~20 GB  (RTX 4090)       │
│ 65B         │ ~40 GB  (RTX A6000 48GB) │
│ 70B         │ ~46 GB  (A100 80GB)      │
└─────────────┴──────────────────────────┘
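
To actually run training, the Hugging Face Trainer can drive the model above, with a paged 8-bit AdamW standing in for the paged optimizer mentioned in Key Concepts. A sketch, assuming a tokenized train_dataset and its tokenizer already exist (the output path and hyperparameters are placeholders, not recommendations); in practice, prepare_model_for_kbit_training from peft is also usually applied to the quantized base model before get_peft_model.

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="qlora-out",              # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,      # effective batch size of 16
    learning_rate=2e-4,
    bf16=True,                           # compute in bfloat16
    optim="paged_adamw_8bit",            # paged optimizer from bitsandbytes
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the QLoRA model built above
    args=args,
    train_dataset=train_dataset,         # assumed: a tokenized causal-LM dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # assumed tokenizer
)
trainer.train()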

Interactive Exercise

Memory Reduction

A 13B parameter model uses 26GB in FP16. How much memory does the 4-bit quantized version use (ignoring other overhead)?
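
For checking your answer afterwards, a one-line sketch of the arithmetic (FP16 stores 16 bits per parameter, so 4-bit storage is a quarter of the size):

print(26 * 4 / 16)   # 6.5 GB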

Pro Tips
  • Use NF4 quantization type for best quality (not INT4)
  • Enable double quantization for ~0.4 bits/param extra savings
  • Gradient checkpointing trades compute for memory
  • QLoRA quality matches full fine-tuning on most benchmarks

Related Terms