Quantization
In one line
Storing model weights and activations in lower precision (int8, int4) so the model is smaller, cheaper, and faster — at some quality cost.
What it actually means
A trained model stores its weights as fp32 or fp16 floats. Quantization maps each tensor onto a lower-precision integer grid using a scale (and sometimes a zero point) per tensor, per channel, or per group. Common targets: int8 (halves memory relative to fp16), int4 (quarters it), and exotic formats like NF4, used by QLoRA. You can quantize weights only (cheap, modest speedup) or weights and activations (bigger speedup, more accuracy risk). Post-training quantization (PTQ) is the easy path; quantization-aware training (QAT) preserves more quality at the cost of extra training cycles.
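The mapping can be sketched in a few lines. This is a minimal per-tensor symmetric scheme (single scale, no zero point), illustrative only and not any particular library's API:

```python
# Minimal per-tensor symmetric int8 quantization (illustrative sketch).
# Real implementations work on tensors and add per-channel/per-group scales.

def quantize_int8(weights):
    """Map a list of floats onto a symmetric int8 grid."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127  # one scale shared by the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer grid."""
    return [qi * scale for qi in q]

weights = [0.7421875, -0.31, 0.002, -0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each reconstructed value is within half a grid step (scale / 2)
# of the original; that gap is the quantization error.
```

Per-channel or per-group scales shrink that error further, because a single outlier weight no longer stretches the grid for the whole tensor.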
Why it matters
Quantization is what makes it possible to run a 70B-parameter model on a single GPU, or a 7B model on a laptop. For inference at scale it's a direct cost lever: fewer GPUs, lower memory bandwidth, faster batches. The tradeoff is quality regression, especially on long-tail prompts and reasoning, so always re-evaluate after quantizing.
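The memory arithmetic behind "70B on a single GPU" is straightforward. A rough sketch, counting weights only (it ignores activations, KV cache, and quantization metadata such as scales, so real footprints are somewhat larger):

```python
# Back-of-the-envelope weight memory by precision (weights only).

def weight_gb(n_params, bits_per_weight):
    """Gigabytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"70B at {name}: {weight_gb(70e9, bits):.0f} GB")
# 70B at fp16: 140 GB  -> does not fit on an 80 GB GPU
# 70B at int8:  70 GB  -> fits
# 70B at int4:  35 GB  -> fits with room for KV cache
```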
Example
fp16 weight : 0.7421875
scale : 0.0078125
quant_int8 : round(0.7421875 / 0.0078125) = 95
dequantized : 95 * 0.0078125 = 0.7421875
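The round trip above is exact only because 0.7421875 happens to sit exactly on the grid (95 × 0.0078125). A value between grid points picks up rounding error, bounded by half a grid step. A quick sketch with the same scale:

```python
# Same scale as the worked example above (1/128), but a value
# that does not land exactly on the integer grid.
scale = 0.0078125
w = 0.744                 # hypothetical off-grid weight
q = round(w / scale)      # snaps to the nearest grid point
approx = q * scale        # dequantized value
error = abs(approx - w)   # bounded by scale / 2
```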
You’ll hear it when
- Picking a model size that fits your serving GPU.
- Reading about GGUF, AWQ, GPTQ, or bitsandbytes.
- Running an LLM locally with Ollama or llama.cpp.
- Doing QLoRA fine-tuning.
- Investigating a quality regression that started after a serving change.