Infrastructure · advanced
Pipeline Parallelism (PP)
Pipeline parallelism splits the model by layer across GPUs — GPU 1 holds layers 0-15, GPU 2 holds 16-31, etc. Forward passes flow through the pipeline like an assembly line.
Explanation
Where tensor parallelism splits within a layer, pipeline parallelism splits between layers. Each GPU handles a contiguous slice of layers and passes activations downstream.
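As a minimal sketch of the contiguous split described above, the snippet below assigns layer indices to stages. The helper name `partition_layers` is hypothetical, and the even split with earlier stages absorbing any remainder is an assumption; real frameworks often balance stages by compute or memory cost instead.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # Assumption: the first `extra` stages each take one additional layer.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Reproduces the split from the summary: two GPUs, 32 layers.
    for gpu, layers in enumerate(partition_layers(32, 2), start=1):
        print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

Each stage runs only its own range of layers and hands the resulting activations to the next stage.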
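The challenge is the pipeline bubble: idle time at the start and end of each minibatch, while the pipeline fills and drains. GPipe shrinks the bubble by splitting each minibatch into microbatches, so that with p stages and m microbatches the idle fraction is roughly (p - 1) / (m + p - 1). 1F1B scheduling interleaves forward and backward passes from different microbatches; in its basic form the bubble is the same as GPipe's, but peak activation memory drops because each stage keeps at most about p microbatches in flight instead of m.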
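To make the scheduling trade-off concrete, here is a small sketch under simplifying assumptions: all stages take equal time, communication is free, and the function names are hypothetical. It computes the commonly quoted bubble fraction and lists the order of forward (F) and backward (B) operations one stage would run under a non-interleaved 1F1B schedule, next to a simplified GPipe order.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a pipeline step, assuming equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def gpipe_order(num_microbatches: int) -> list[tuple[str, int]]:
    """All forwards, then all backwards (backward order simplified here)."""
    fwd = [("F", i) for i in range(num_microbatches)]
    bwd = [("B", i) for i in range(num_microbatches)]
    return fwd + bwd


def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> list[tuple[str, int]]:
    """Operation order for one stage: warmup forwards, 1F1B steady state, cooldown."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < num_microbatches:      # steady state: one forward, then one backward
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < num_microbatches:      # cooldown: drain the remaining backwards
        ops.append(("B", b)); b += 1
    return ops


def peak_in_flight(ops: list[tuple[str, int]]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    p, m = 4, 8
    print("bubble fraction:", bubble_fraction(p, m))
    print("GPipe peak in flight:", peak_in_flight(gpipe_order(m)))
    print("1F1B stage 0 peak in flight:", peak_in_flight(one_f_one_b_order(0, p, m)))
```

Raising the number of microbatches shrinks the bubble in both schedules; the 1F1B ordering mainly bounds how many activations a stage has to keep.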
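Pipeline parallelism scales across nodes: stages exchange only activations (and their gradients) at stage boundaries using point-to-point sends, so it tolerates slower interconnects better than tensor parallelism, which needs collective communication inside every layer. It is a standard way to train models that don't fit on a single node, and is often combined with TP within a node and DP across pipeline replicas.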
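A rough back-of-the-envelope sketch of why stage boundaries are cheap to cross: one activation tensor per microbatch moves between adjacent stages, while tensor parallelism communicates inside every layer. The shapes below are hypothetical, 2-byte elements are assumed, and the tensor-parallel figure assumes a Megatron-style layout with two all-reduces per transformer layer in the forward pass. It counts payload sizes only, not actual wire traffic.

```python
def boundary_bytes(micro_batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes sent from one pipeline stage to the next for one microbatch (forward)."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def tp_allreduce_payload_bytes(layers_per_stage: int, micro_batch: int, seq_len: int,
                               hidden: int, bytes_per_elem: int = 2) -> int:
    """Total all-reduce payload for one microbatch's forward pass through a stage.

    Assumption: two all-reduces per transformer layer, each over a tensor the
    same size as the activation passed between stages.
    """
    per_tensor = boundary_bytes(micro_batch, seq_len, hidden, bytes_per_elem)
    return 2 * layers_per_stage * per_tensor


if __name__ == "__main__":
    # Hypothetical shapes, chosen only for illustration.
    mb, seq, hid, layers = 1, 8192, 16384, 8
    pp = boundary_bytes(mb, seq, hid)
    tp = tp_allreduce_payload_bytes(layers, mb, seq, hid)
    print(f"pipeline boundary: {pp / 2**20:.0f} MiB per microbatch")
    print(f"tensor-parallel payload in the same stage: {tp / 2**20:.0f} MiB per microbatch")
```

Under these assumptions the tensor-parallel payload grows with the number of layers in a stage, while the pipeline transfer stays at one tensor per boundary, which is why TP is usually kept on the fast intra-node links.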
Examples
- A 405B model trained on 4096 GPUs: TP=8 within node × PP=16 across the pod × DP=32 across pods.
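The arithmetic behind this layout can be checked with a short sketch. The rank ordering (TP varies fastest, then PP, then DP) is an assumption chosen to match "TP within node, PP across the pod, DP across pods"; frameworks differ in how they order these dimensions. The helper names are hypothetical, and the per-GPU parameter count assumes a perfectly even split.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    tp: int  # position inside the tensor-parallel group (within a node)
    pp: int  # pipeline stage index
    dp: int  # data-parallel replica index


TP, PP, DP = 8, 16, 32


def rank_to_coord(rank: int, tp: int = TP, pp: int = PP, dp: int = DP) -> Coord:
    """Map a global rank to (tp, pp, dp), assuming TP is the fastest-varying axis."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    return Coord(tp=rank % tp, pp=(rank // tp) % pp, dp=rank // (tp * pp))


def params_per_gpu(total_params: float, tp: int = TP, pp: int = PP) -> float:
    """Parameters held per GPU, assuming an even split over the TP x PP shards."""
    return total_params / (tp * pp)


if __name__ == "__main__":
    print("world size:", TP * PP * DP)             # 8 * 16 * 32 GPUs
    print("rank 8 ->", rank_to_coord(8))            # first GPU of the second stage
    print("params per GPU:", params_per_gpu(405e9))
```

Data parallelism does not reduce the per-GPU parameter count in this sketch; each of the 32 replicas holds a full copy of the model spread over its own 128 GPUs.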
Frequently asked
How is Pipeline Parallelism related to Tensor Parallelism?
Both are forms of model parallelism and are commonly combined. Tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so different GPUs compute different output dimensions in parallel. Pipeline parallelism instead assigns whole layers to different GPUs and passes activations from one stage to the next.
Is Pipeline Parallelism considered advanced?
Pipeline Parallelism is generally considered advanced-level material in the AI and LLM space.