Infrastructure · advanced

Pipeline Parallelism (PP)

Pipeline parallelism splits the model by layer across GPUs — GPU 1 holds layers 0-15, GPU 2 holds 16-31, etc. Forward passes flow through the pipeline like an assembly line.

Published May 31, 2026

Explanation

Where tensor parallelism splits within a layer, pipeline parallelism splits between layers. Each GPU handles a contiguous slice of layers and passes activations downstream.
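As a minimal sketch of the "contiguous slice of layers" idea, the snippet below assigns layer indices to pipeline stages. The helper name partition_layers is hypothetical and the even split is an assumption; real frameworks may balance stages by compute or memory cost instead.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer when the split is uneven.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Mirrors the 32-layer, 2-GPU split used in the summary above.
for gpu, layers in enumerate(partition_layers(32, 2), start=1):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

Each stage runs its own range of layers and hands the resulting activations to the next stage.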

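The challenge is the pipeline bubble: idle time at the start and end of each minibatch, while the pipeline is filling or draining. GPipe shrinks the bubble by splitting each minibatch into microbatches so that several stages can work at once. 1F1B (one forward, one backward) scheduling goes further by interleaving forward and backward passes from different microbatches, which mainly reduces the activations each stage has to hold in memory.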
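To make the bubble concrete, here is a rough estimate of the idle fraction under a simplifying assumption: every stage takes the same time per microbatch. With p stages and m microbatches, a minibatch then takes m + p - 1 time slots per stage, of which p - 1 are idle. The function name bubble_fraction is illustrative, not taken from any library.

```python
from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle share of a stage's time, assuming equal per-microbatch time on every stage."""
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("need at least one stage and one microbatch")
    # Total slots per stage: m useful + (p - 1) spent filling and draining.
    return Fraction(p - 1, m + p - 1)


# More microbatches per minibatch means a smaller bubble.
for m in (1, 4, 16, 64):
    frac = bubble_fraction(4, m)
    print(f"stages=4 microbatches={m:>2} bubble={float(frac):.1%}")
```

With a single microbatch and four stages, three quarters of each stage's time is idle, which is why microbatching matters.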

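Pipeline parallelism scales across nodes: each stage only exchanges activations (and, in the backward pass, their gradients) with its neighboring stages, so slower inter-node links are tolerable. It is a standard way to train models that don't fit on a single node, and is often combined with tensor parallelism (TP) within a node and data parallelism (DP) across pipeline replicas.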
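A back-of-the-envelope sketch of what crosses a stage boundary, assuming a transformer whose stage output is one activation tensor of shape (microbatch, sequence length, hidden size). The dimensions below are hypothetical and chosen only for illustration, and the 2 bytes per element assumes a 16-bit format.

```python
def boundary_bytes(micro_batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes sent to the next stage for one microbatch's forward pass."""
    return micro_batch * seq_len * hidden * bytes_per_elem


# Hypothetical sizes: 1 sequence of 4096 tokens, hidden size 8192, 16-bit activations.
size = boundary_bytes(micro_batch=1, seq_len=4096, hidden=8192)
print(f"{size / 2**20:.0f} MiB per microbatch per stage boundary")
```

The transfer is a single point-to-point send per microbatch, and its size depends on the activation shape rather than on how many parameters the stage holds.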

Examples

A 405B model trained on 4096 GPUs: TP=8 within node × PP=16 across the pod × DP=32 across pods.
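The arithmetic behind this layout can be checked directly: 8 × 16 × 32 = 4096. The sketch below also maps a global rank to (DP, PP, TP) coordinates. The ordering used here (TP varies fastest, then PP, then DP) is an assumption made for illustration; actual frameworks define their own rank layouts.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    dp: int  # which pipeline replica
    pp: int  # which pipeline stage
    tp: int  # which tensor-parallel shard within the stage


def rank_to_coords(rank: int, tp_size: int, pp_size: int, dp_size: int) -> Coords:
    """Map a global rank to grid coordinates, assuming TP is the fastest-varying axis."""
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coords(dp=dp, pp=pp, tp=tp)


TP, PP, DP = 8, 16, 32
print("world size:", TP * PP * DP)
print("rank 8:", rank_to_coords(8, TP, PP, DP))
```

Under this ordering, the 8 ranks that share a pipeline stage are consecutive, which matches keeping tensor parallelism inside one node.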

Frequently asked

What is Pipeline Parallelism?

Pipeline parallelism splits the model by layer across GPUs — GPU 1 holds layers 0-15, GPU 2 holds 16-31, etc. Forward passes flow through the pipeline like an assembly line.

What is an example of pipeline parallelism?

A 405B model trained on 4096 GPUs: TP=8 within node × PP=16 across the pod × DP=32 across pods.

How is Pipeline Parallelism related to Tensor Parallelism?

Pipeline Parallelism and Tensor Parallelism are complementary ways to spread a model across GPUs, and they are often used together. Pipeline parallelism splits the model between layers, while tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.

Is Pipeline Parallelism considered advanced?

Pipeline Parallelism is generally considered advanced-level material in the AI and LLM space.

Tensor Parallelism · Infrastructure

Tensor parallelism shards individual layers across multiple GPUs — splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.

GPU · Infrastructure

GPUs are the parallel processors that train and run nearly every modern AI model. Their throughput on matrix multiplication is what makes deep learning practical.

Training Compute · Training

Training compute is the total floating-point operations used to pretrain a model, usually expressed as FLOPs (e.g. 10^25 FLOPs). It is the headline number governments now regulate.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Sources

GPipe paper (arXiv)