Infrastructure · intermediate
Quantization
Quantization converts model weights from 16- or 32-bit floating-point values to lower-precision types such as INT8 or INT4, so the model needs less memory and usually runs faster, typically at the cost of a small quality loss.
Explanation
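A 70B-parameter model in BF16 needs ~140 GB for its weights alone (70 billion parameters × 2 bytes each). In INT4 it needs ~35 GB, small enough to fit on a single 48 GB GPU. Quantization can be applied after training (post-training methods such as GPTQ and AWQ, or the quantized types stored in llama.cpp's GGUF files) or built into training (QAT, quantization-aware training).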
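The memory figures above follow from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming decimal gigabytes and counting only the weights (the helper name `weight_memory_gb` is made up for illustration; real quantized formats also store scales and other metadata, so actual files are somewhat larger):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# Weight footprint of a 70B-parameter model at common precisions.
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
# FP32: 280 GB
# BF16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```

The KV cache and activations come on top of these numbers, so the weights need to fit with some room to spare.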
Modern quantization schemes lose very little of the model's capability at INT8 and usually a small but measurable amount at INT4; the exact drop depends on the model, the method, and the task. INT2/INT3 schemes exist but typically degrade quality noticeably.
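To see why fewer bits cost more quality, here is a minimal sketch of the simplest scheme: symmetric round-to-nearest quantization with a single absmax scale per tensor. It is not GPTQ or AWQ, which use finer-grained scales and extra error-reduction steps; the function names and sample weights are illustrative assumptions.

```python
def quantize(weights, bits):
    """Symmetric absmax quantization of a list of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4, 1 for INT2
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest round-trip error introduced by quantizing to `bits` bits."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up sample weights, purely for illustration.
weights = [0.42, -1.30, 0.07, 0.88, -0.51, 1.27]
for bits in (8, 4, 2):
    print(f"INT{bits}: max round-trip error = {max_abs_error(weights, bits):.4f}")
```

Each bit removed roughly doubles the spacing between representable values, so the worst-case rounding error grows quickly as precision drops.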
Examples
- Llama-3-70B quantized to INT4 running on a single 48 GB GPU instead of needing two 80 GB A100s for the BF16 weights.
- GGUF quantized weights running on a laptop via llama.cpp.
When to use quantization
When inference memory or speed is the binding constraint and you can tolerate a small quality drop.
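When memory is the binding constraint, one way to frame the decision is to pick the highest precision whose weights still fit in the available memory. The sketch below assumes a hypothetical `headroom` fraction reserved for the KV cache and activations; both the function name and the 20% default are illustrative, not a recommendation.

```python
def widest_precision_that_fits(num_params, budget_gb, headroom=0.2):
    """Return the highest-precision format whose weights fit in the budget.

    `headroom` is an assumed fraction of memory kept free for the KV cache
    and activations; tune it for the actual workload.
    """
    usable_bytes = budget_gb * 1e9 * (1 - headroom)
    # Try formats from highest to lowest precision.
    for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
        if num_params * bits / 8 <= usable_bytes:
            return name
    return None  # Nothing fits: use more GPUs or a smaller model.


print(widest_precision_that_fits(70e9, 48))  # INT4
print(widest_precision_that_fits(8e9, 24))   # BF16
print(widest_precision_that_fits(70e9, 24))  # None
```

If the unquantized model already fits with room to spare, quantization is only worth considering for speed.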
Frequently asked
How is Quantization related to Inference?
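Quantization and Inference are both infrastructure concepts. Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache. Quantization is mainly an inference-time optimization: the weights must be held in memory and read for every generated token, so storing them in fewer bits lowers the memory footprint and can speed up generation.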
Is Quantization considered intermediate?
Quantization is generally considered intermediate-level material in the AI and LLM space.