ModelTerms

Infrastructure · intermediate

Quantization

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

Explanation

A 70B-parameter model in BF16 needs ~140 GB for its weights alone (2 bytes per parameter). In INT4 it needs ~35 GB (half a byte per parameter), small enough to fit on a single 48 GB GPU. Quantization can be applied post-training (GPTQ, AWQ, or the GGUF quantized formats used by llama.cpp) or built into training (QAT, quantization-aware training).
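The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below is a rough estimate only. The function name and the dtype table are illustrative, and it ignores the KV cache, activations, and the small overhead of per-group scales that real quantized formats carry.

```python
# Weight-only memory estimate: parameters x bits per weight / 8 bits per byte.
# Assumption: decimal gigabytes (1 GB = 1e9 bytes); no KV cache or activations.

BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GB for the given dtype."""
    bits = BITS_PER_WEIGHT[dtype]
    return num_params * bits / 8 / 1e9


# Print the estimate for a 70B-parameter model at each precision.
for dtype in BITS_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

Budget extra memory on top of these numbers for the KV cache and runtime buffers.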

Modern quantization schemes preserve most of the model's capability at INT8 and, with careful calibration, much of it at INT4. INT2 and INT3 schemes exist but typically degrade quality noticeably.
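To see why fewer bits cost more quality, here is a minimal sketch of naive symmetric round-to-nearest quantization on synthetic weights. It is not GPTQ or AWQ, which use calibration data and per-group scales to do considerably better. The Gaussian weight distribution and the helper names are assumptions made for illustration.

```python
import random


def quantize_dequantize(weights: list[float], bits: int) -> list[float]:
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)              # map to the nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the signed integer range
        out.append(q * scale)             # map back to a float
    return out


def mean_abs_error(a: list[float], b: list[float]) -> float:
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Assumption: weights drawn from a zero-mean Gaussian, a common rough model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute error: {err:.6f}")
```

The reconstruction error grows as the bit width shrinks, because the same value range is covered by far fewer integer levels.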

Examples

  • Llama-3-70B quantized to INT4 running on a single 48 GB GPU instead of needing two 80 GB A100s.
  • GGUF quantized weights running on a laptop via llama.cpp.

When to use quantization

Use quantization when inference memory or speed is the binding constraint and you can tolerate a small quality drop.
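One way to turn that rule of thumb into a first-pass check is to pick the highest precision whose weights fit in the available GPU memory. The sketch below is illustrative only: the function name, the candidate list, and the 10% headroom reserved for KV cache and activations are all assumptions, not measured values.

```python
from typing import Optional

# Candidate precisions, highest first (bits per weight).
CANDIDATES = [("BF16", 16), ("INT8", 8), ("INT4", 4)]


def pick_precision(num_params: float, gpu_memory_gb: float,
                   headroom: float = 0.1) -> Optional[str]:
    """Return the highest-precision dtype whose weights fit, or None.

    headroom is the fraction of memory kept free for KV cache and
    activations (a hypothetical default; tune it for your workload).
    """
    budget_gb = gpu_memory_gb * (1.0 - headroom)
    for name, bits in CANDIDATES:
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb <= budget_gb:
            return name
    return None


print(pick_precision(70e9, 48))    # 70B model on one 48 GB GPU
print(pick_precision(70e9, 160))   # 70B model across two 80 GB GPUs
```

If the function returns None, quantization alone is not enough and the model needs more GPUs, a smaller variant, or offloading.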

Frequently asked

How is Quantization related to Inference?

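Quantization and Inference are both infrastructure concepts. Inference is what happens when you actually run a trained model on new input; for LLMs that means generating tokens one at a time, with sampling and a KV cache. Quantization shrinks the weights that inference has to load, which is where its memory and speed savings show up.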

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is a major reason it dramatically outperforms naive serving setups.

LoRA · Training

LoRA is a parameter-efficient fine-tuning method that freezes a model's original weights and learns small low-rank update matrices alongside them. This makes fine-tuning cheap enough to run on a single GPU.
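The savings from low-rank updates can be checked with a quick parameter count. The sketch below assumes a single 4096 x 4096 projection and rank 8. Both numbers and the function name are hypothetical, chosen only to illustrate the arithmetic.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for one weight matrix.

    The frozen matrix W has d_out * d_in entries. The low-rank update is
    B @ A with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  trainable share: {lora / full:.2%}")
```

Only the two small matrices receive gradients, so optimizer state shrinks by the same ratio.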

Mixed Precision · Infrastructure

Mixed-precision training does the bulk of forward and backward computation in 16-bit floats (BF16 or FP16) while keeping master weights and certain accumulations in 32-bit. The result is faster training and a smaller memory footprint, typically with no measurable loss in accuracy.

Distillation · Training

Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.

BFloat16 · Infrastructure

BFloat16 is a 16-bit floating-point format with FP32's 8-bit exponent range but only 7 stored mantissa bits (8 bits of precision counting the implicit leading bit). It is the default precision for LLM training and most inference.
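Because BFloat16 shares FP32's sign and exponent layout (1 sign bit, 8 exponent bits, 7 stored mantissa bits), a value can be converted by keeping the top 16 bits of its FP32 pattern. The sketch below does this with plain truncation. Real hardware and libraries usually round to nearest even instead, so treat this as a simplified illustration with hypothetical helper names.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 pattern (truncation, no rounding)."""
    return fp32_bits(x) >> 16


def from_bfloat16_bits(b: int) -> float:
    """Expand a 16-bit pattern back to a float by zero-filling the low bits."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]


def split_fields(b: int) -> tuple[int, int, int]:
    """Return (sign, exponent, mantissa) with widths of 1, 8, and 7 bits."""
    return (b >> 15) & 0x1, (b >> 7) & 0xFF, b & 0x7F


for value in (1.0, 3.14159, 1e38):
    bits = to_bfloat16_bits(value)
    print(f"{value!r}: bits=0x{bits:04X} fields={split_fields(bits)} "
          f"round-trip={from_bfloat16_bits(bits)!r}")
```

The round trip keeps very large magnitudes that FP16 could not represent, but only about two to three significant decimal digits survive.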

QLoRA · Training

QLoRA fine-tunes a 4-bit quantized base model with LoRA adapters, letting you train 70B-class models on a single 48 GB GPU at near-full fine-tuning quality.
