Quantization

MLOps

In one line

Storing model weights and activations in lower precision (int8, int4) so the model is smaller, cheaper, and faster — at some quality cost.

What it actually means

A trained model stores its weights as fp32 or fp16 floats. Quantization maps each tensor onto a lower-precision integer grid using a scale factor (and sometimes a zero point) per tensor, per channel, or per group. Common targets: int8 (halves memory relative to fp16), int4 (quarters it), and exotic formats like NF4 used by QLoRA. You can quantize weights only (cheap, modest speedup) or weights and activations together (bigger speedup, more accuracy risk). Post-training quantization (PTQ) is the easy path; quantization-aware training (QAT) preserves more quality at the cost of extra training cycles.
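A minimal sketch of the symmetric per-tensor case described above — the function names and the four-element tensor are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: one scale maps floats onto a 255-level grid."""
    scale = float(np.abs(w).max()) / 127.0   # largest weight lands on +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats; error per element is at most scale / 2."""
    return q.astype(np.float32) * scale

w = np.array([0.74, -0.31, 0.05, -1.20], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())     # bounded by scale / 2
```

Per-channel or per-group variants simply compute one scale per slice instead of one for the whole tensor, which keeps outlier weights from inflating everyone else's grid spacing.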

Why it matters

Quantization is what makes it possible to run a 70B-parameter model on a single GPU, or a 7B on a laptop. For inference at scale it’s a direct cost lever — fewer GPUs, lower memory bandwidth, faster batches. The tradeoff is quality regression, especially on long-tail prompts and reasoning, so you should always re-evaluate after quantizing.
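The memory arithmetic behind that claim is back-of-envelope simple. A hedged sketch (the helper name is made up, and it counts only weights — the KV cache and activations add more on top):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """GB needed for weights alone: params * bits per param / 8 bits per byte."""
    return n_params * bits / 8 / 1e9

fp16_70b = weight_memory_gb(70e9, 16)  # 140 GB: needs multiple GPUs
int8_70b = weight_memory_gb(70e9, 8)   # 70 GB
int4_70b = weight_memory_gb(70e9, 4)   # 35 GB: fits one 40 GB-class card
```

Because decode-time inference is usually memory-bandwidth-bound, shrinking the weights also speeds up token generation, not just loading.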

Example

fp16 weight  : 0.7421875
scale        : 0.0078125
quant_int8   : round(0.7421875 / 0.0078125) = 95
dequantized  : 95 * 0.0078125 = 0.7421875
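This particular weight happens to land exactly on the grid (0.7421875 = 95/128, with scale 1/128), so the round-trip is lossless. As a runnable check:

```python
scale = 0.0078125             # 1/128, a power of two, so it's exact in binary
w = 0.7421875                 # 95/128: sits exactly on a grid point
q = round(w / scale)          # quantize: 95
w_hat = q * scale             # dequantize: recovers 0.7421875 exactly
```

Weights that fall between grid points are rounded to the nearest one, so in general dequantization incurs an error of up to half the scale.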

You’ll hear it when

  • Picking a model size that fits your serving GPU.
  • Reading about GGUF, AWQ, GPTQ, or bitsandbytes.
  • Running an LLM locally with Ollama or llama.cpp.
  • Doing QLoRA fine-tuning.
  • Investigating a quality regression that started after a serving change.

Related terms