Quantization reduces the numerical precision of model weights and/or activations, typically from 32-bit or 16-bit floating point to lower bit-widths such as 8-bit or 4-bit integers. This cuts memory usage and can speed up inference, often with only a small loss in accuracy.
Quantization is essential for deploying large models on resource-constrained devices and reducing serving costs.
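To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common scheme among many. The function names and the use of NumPy are illustrative assumptions, not tied to any particular library:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 with a single symmetric scale (a sketch)."""
    # Choose the scale so the largest-magnitude weight maps to the int8 extreme.
    scale = np.abs(weights).max() / 127.0
    # Round to the nearest integer step and clip to the representable range.
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by scale / 2
```

The round-trip error per weight is at most half a quantization step (scale / 2), which is why moderate bit-widths like int8 usually change model outputs very little; real deployments refine this basic recipe with per-channel scales, asymmetric zero points, or calibration on sample data.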