Comparison

Tensor Parallelism vs Training Compute

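Tensor Parallelism and Training Compute are both common AI/LLM terms, but they answer different questions: one describes how a model is split across GPUs, the other measures how much computation a training run consumed. Here is a quick side-by-side.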

When you would reach for Tensor Parallelism

Tensor Parallelism comes up when the question is about infrastructure: a model's layers are too large, or too slow, to run on one GPU, so each layer is sharded across several.

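Example: Llama 3 70B in BF16 needs about 140 GB for its weights alone, more than a single 80 GB H100 can hold. Split across 4× H100 with TP=4, each GPU stores about 35 GB of weights.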
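The arithmetic behind that example can be sketched in a few lines. This is a rough estimate, not a capacity planner: it assumes weights are split evenly across tensor-parallel ranks and ignores activations, KV cache, and framework overhead. The function name `weight_gb_per_gpu` is hypothetical.

```python
# Back-of-the-envelope weight memory per GPU under tensor parallelism.
# Assumptions (not from any library): even split across ranks, weights only,
# 1 GB = 1e9 bytes.

def weight_gb_per_gpu(n_params: float, bytes_per_param: int, tp_degree: int) -> float:
    """Return the weight memory (GB) each GPU holds for a given TP degree."""
    total_gb = n_params * bytes_per_param / 1e9
    return total_gb / tp_degree

H100_GB = 80  # memory per H100, as stated in the example above

total_gb = weight_gb_per_gpu(70e9, 2, 1)    # BF16 = 2 bytes per parameter
per_gpu_gb = weight_gb_per_gpu(70e9, 2, 4)  # TP=4

print(f"total weights:  {total_gb:.0f} GB (fits on one H100: {total_gb <= H100_GB})")
print(f"per GPU (TP=4): {per_gpu_gb:.0f} GB (fits on one H100: {per_gpu_gb <= H100_GB})")
```

The memory left on each GPU after the weights is what remains for activations and the KV cache, which is why the real requirement is always higher than this estimate.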

When you would reach for Training Compute

Training Compute comes up when the question is about training: how much computation a pretraining run consumed, and how models compare on that scale.

Example: GPT-3 took roughly 3 × 10^23 FLOPs to pretrain.
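A common way to arrive at a number like this is the rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below applies it to GPT-3. The parameter count (175 billion) and token count (300 billion) are not stated in the text above and are taken here as assumptions from the published GPT-3 figures. The function name is hypothetical.

```python
# Rule-of-thumb estimate: training FLOPs ~= 6 * parameters * training tokens.
# This is an approximation for dense transformers, not an exact accounting.

def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total pretraining compute in FLOPs."""
    return 6 * n_params * n_tokens

# Assumed inputs for GPT-3 (not given in the surrounding text).
gpt3_params = 175e9
gpt3_tokens = 300e9

gpt3_flops = estimate_training_flops(gpt3_params, gpt3_tokens)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")
```

Under these assumptions the estimate lands near 3 × 10^23 FLOPs, consistent with the figure quoted above.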

Frequently asked

What is the difference between Tensor Parallelism and Training Compute?

Tensor Parallelism: Tensor parallelism shards individual layers across multiple GPUs. Each matrix multiplication is split so that different GPUs compute different slices of the output dimension in parallel, and the slices are then combined.

Training Compute: Training compute is the total number of floating-point operations used to pretrain a model, usually expressed in FLOPs (e.g. 10^25 FLOPs). It is the headline number that compute-threshold regulation refers to; the EU AI Act, for example, uses a 10^25 FLOP threshold for general-purpose models.
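The splitting of a matrix multiplication along the output dimension can be illustrated without any GPUs. The sketch below is a toy, single-process simulation in plain Python: each "GPU" is a slice of the weight matrix's columns, and the final concatenation stands in for the gather step a real system would run over the interconnect. All function names are hypothetical, and real frameworks implement this very differently.

```python
# Toy simulation of column-parallel matrix multiplication (Y = X @ W).
# Matrices are lists of rows; each "GPU" owns a contiguous block of W's columns.

def matmul(x, w):
    """Plain matrix multiply: x is (rows x k), w is (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, tp):
    """Split w into tp shards along the output (column) dimension."""
    cols = len(w[0])
    assert cols % tp == 0, "output dimension must divide evenly across shards"
    step = cols // tp
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(tp)]

def column_parallel_matmul(x, w, tp):
    """Each shard computes its own output columns; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

reference = matmul(x, w)
sharded = column_parallel_matmul(x, w, tp=2)
print(sharded == reference)
```

Every shard needs the full input `x` but only its own slice of `w`, which is why this scheme reduces per-GPU weight memory while adding a communication step to reassemble the output.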

When should I use Tensor Parallelism vs Training Compute?

Tensor Parallelism is the right concept when you are deciding how to fit or run a model across several GPUs. Training Compute applies when you are sizing, budgeting, or comparing training runs.

Are Tensor Parallelism and Training Compute the same thing?

No. Tensor Parallelism is an infrastructure technique for splitting a model across GPUs; Training Compute is a measure of how much computation training used. They are related but address different parts of the AI stack.