Skip to main content
ModelTerms

Infrastructure · intermediate

BFloat16 (BF16)

BFloat16 is a 16-bit floating-point format with FP32's exponent range but only 8 bits of mantissa. The default precision for LLM training and most inference.

Explanation

Storing weights and activations in 16 bits halves memory vs FP32 and roughly halves bandwidth requirements. But FP16 has a tiny exponent range that causes gradient underflow during training.

BFloat16 keeps FP32's 8-bit exponent (huge range) and trims the mantissa to 8 bits (less precision). For neural networks, range matters more than fine precision — so BF16 trains stably out of the box, no loss scaling needed.

All modern frontier training runs in BF16 (sometimes with mixed-precision FP8 for memory-bound layers).

Examples

  • Llama 3 trained end-to-end in BF16.
  • NVIDIA H100 native BF16 throughput ~989 TFLOPs.

Frequently asked

What is BFloat16?

BFloat16 is a 16-bit floating-point format with FP32's exponent range but only 8 bits of mantissa. The default precision for LLM training and most inference.

What is an example of bfloat16?

Llama 3 trained end-to-end in BF16.

How is BFloat16 related to Mixed Precision?

BFloat16 and Mixed Precision are both infrastructure concepts. Mixed-precision training does the bulk of forward and backward computation in 16-bit floats (BF16 or FP16) while keeping master weights and certain accumulations in 32-bit. Faster, smaller, same accuracy.

Is BFloat16 considered intermediate?

BFloat16 is generally considered intermediate-level material in the AI and LLM space.

Mixed PrecisionInfrastructure

Mixed-precision training does the bulk of forward and backward computation in 16-bit floats (BF16 or FP16) while keeping master weights and certain accumulations in 32-bit. Faster, smaller, same accuracy.

QuantizationInfrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

GPUInfrastructure

GPUs are the parallel processors that train and run nearly every modern AI model. Their throughput on matrix multiplication is what makes deep learning practical.

Training ComputeTraining

Training compute is the total floating-point operations used to pretrain a model, usually expressed as FLOPs (e.g. 10^25 FLOPs). It is the headline number governments now regulate.

QLoRATraining

QLoRA fine-tunes a 4-bit quantized base model with LoRA adapters, letting you train 70B-class models on a single 48 GB GPU at near-full fine-tuning quality.

Side-by-side comparisons

Sources