Comparison

Pipeline Parallelism vs Training Compute

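Pipeline Parallelism and Training Compute are both common AI/LLM terms, but they answer different questions: the first is a way to spread a model across GPUs, the second is a measure of how much computation a training run consumed. Here is a quick side-by-side.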

When you would reach for Pipeline Parallelism

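Pipeline Parallelism comes up when the question is about infrastructure: a model is too large for one GPU, so its layers must be divided across devices. It is usually combined with tensor parallelism (TP) and data parallelism (DP).

Example: a 405B model trained on 4096 GPUs with TP=8 within each node × PP=16 across the pod × DP=32 across pods (8 × 16 × 32 = 4096).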
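The arithmetic in that layout can be checked with a short sketch. The degrees come from the example above; the rank ordering (TP varies fastest, then PP, then DP) and the helper name `rank_to_coords` are assumptions made for illustration, since real frameworks let you choose the ordering.

```python
from math import prod

# Parallelism degrees taken from the example above.
DEGREES = {"tp": 8, "pp": 16, "dp": 32}

# Every GPU belongs to exactly one (dp, pp, tp) cell, so the degrees multiply.
total_gpus = prod(DEGREES.values())


def rank_to_coords(rank, tp=8, pp=16, dp=32):
    """Map a flat GPU rank to (dp, pp, tp) indices.

    Assumed ordering (hypothetical): TP varies fastest, then PP, then DP,
    so the 8 ranks of one node share a tensor-parallel group.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


print(total_gpus)
print(rank_to_coords(8))  # first GPU of the second pipeline stage
```

With this ordering, ranks 0 to 7 form one tensor-parallel group, and each block of 128 consecutive ranks (8 × 16) holds one full copy of the model.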

When you would reach for Training Compute

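Training Compute comes up when the question is about the training run itself: how much total computation a model took to pretrain, how that compares with other models, or whether it crosses a reporting threshold.

Example: GPT-3 is commonly estimated at ~3 × 10^23 FLOPs.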
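A figure like this can be reproduced with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below assumes that approximation, plus GPT-3's published size of 175B parameters and roughly 300B training tokens; neither number appears in the text above, and the function name is hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate pretraining FLOPs with the 6 * N * D rule of thumb.

    Roughly 2 FLOPs per parameter per token for the forward pass and
    4 for the backward pass. This is an estimate, not an exact count.
    """
    return 6 * n_params * n_tokens


# Assumed inputs: GPT-3 at 175B parameters, ~300B training tokens.
gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")
```

The result lands at about 3.15 × 10^23, in line with the ~3 × 10^23 figure quoted above.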

Frequently asked

What is the difference between Pipeline Parallelism and Training Compute?

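Pipeline Parallelism: Pipeline parallelism splits the model by layer across GPUs, so that GPU 1 holds layers 0-15, GPU 2 holds layers 16-31, and so on. Each batch is cut into micro-batches that flow through the stages like an assembly line, which keeps several GPUs busy at once.

Training Compute: Training compute is the total number of floating-point operations used to pretrain a model, usually expressed in FLOPs (e.g. 10^25 FLOPs). It is the headline number regulators now use as a threshold; the EU AI Act, for example, presumes systemic risk for general-purpose models trained with more than 10^25 FLOPs.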
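The layer split described for pipeline parallelism can be sketched in a few lines. This is a minimal illustration, assuming contiguous and near-equal stages; `partition_layers` is a hypothetical helper, and production frameworks often balance stages by memory or compute cost rather than by layer count.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices 0..num_layers-1 into contiguous pipeline stages.

    Stages differ in size by at most one layer; earlier stages take the
    remainder when the division is uneven.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Matches the example above: two GPUs holding layers 0-15 and 16-31.
stages = partition_layers(32, 2)
for gpu, layers in enumerate(stages, start=1):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```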

When should I use Pipeline Parallelism vs Training Compute?

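They are not alternatives. Pipeline Parallelism is the right concept when you are deciding how to fit and schedule a large model across many GPUs. Training Compute applies when you are budgeting, comparing, or reporting the total cost of a training run.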

Are Pipeline Parallelism and Training Compute the same thing?

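No. Pipeline Parallelism is an infrastructure technique for distributing a model across devices; Training Compute is a quantity that measures how much work a training run performed. They are related, since the parallelism strategy affects how efficiently the hardware delivers that compute, but they address different parts of the AI stack.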