Infrastructure · advanced
Tensor Parallelism (TP)
Tensor parallelism shards individual layers across multiple GPUs — splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.
Explanation
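For models too large to fit on one GPU (or too slow to run on one), each layer's weight matrices are split column-wise or row-wise across N GPUs. Each GPU multiplies by its own shard: a column-wise split produces a slice of the output, while a row-wise split produces a partial sum of the full output. The partial results are then combined with a collective operation, typically an all-reduce after each block.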
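The column-wise and row-wise splits can be sketched without any GPU at all. The example below is a minimal simulation in plain Python, assuming a two-layer MLP block where the first weight matrix is split by columns and the second by rows; the helper names (`split_columns`, `split_rows`, `all_reduce_sum`, `tp_mlp`) are hypothetical, and the "devices" are just loop iterations. It is meant to show why the sharded result equals the unsharded one, not how any particular framework implements it.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU; it can run locally on each column shard."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, n):
    """Split a matrix into n equal column blocks (one per simulated device)."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]


def split_rows(m, n):
    """Split a matrix into n equal row blocks (one per simulated device)."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]


def all_reduce_sum(partials):
    """Element-wise sum of per-device partial outputs (stand-in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tp_mlp(x, w1, w2, n_devices):
    """Simulated tensor-parallel MLP: column-split w1, row-split w2, one all-reduce."""
    w1_shards = split_columns(w1, n_devices)
    w2_shards = split_rows(w2, n_devices)
    partials = []
    for w1_local, w2_local in zip(w1_shards, w2_shards):
        hidden_local = relu(matmul(x, w1_local))        # slice of the hidden activation
        partials.append(matmul(hidden_local, w2_local))  # partial sum of the output
    return all_reduce_sum(partials)


x = [[1, -2], [3, 4]]                      # 2 tokens, model dim 2
w1 = [[1, 0, -1, 2], [2, -1, 0, 1]]        # 2 x 4 (hidden dim 4)
w2 = [[1, 2], [0, 1], [-1, 0], [3, -2]]    # 4 x 2

reference = matmul(relu(matmul(x, w1)), w2)
sharded = tp_mlp(x, w1, w2, n_devices=2)
print(sharded == reference)
```

Pairing a column split with a row split means the hidden activation never has to be gathered in full: each simulated device keeps its own slice, and only the final output needs a single summation across devices.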
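Tensor parallelism needs communication at every layer, so it depends on a fast interconnect: NVLink within a node rather than the network between nodes. In practice the TP degree is usually capped at the number of GPUs in a node (commonly 8); beyond that, the all-reduce has to cross the slower inter-node network and its overhead starts to dominate.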
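A rough way to see the interconnect sensitivity is a bandwidth-only estimate. The sketch below assumes a ring all-reduce, in which each GPU sends 2 × (N − 1) / N times the message size, and ignores latency and any overlap with compute. The tensor shape and both bandwidth figures are illustrative placeholders, not measurements of any specific hardware.

```python
def ring_all_reduce_bytes(message_bytes, n_gpus):
    """Bytes each GPU sends in a ring all-reduce of one message."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def all_reduce_seconds(message_bytes, n_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only time estimate; ignores latency and overlap with compute."""
    return ring_all_reduce_bytes(message_bytes, n_gpus) / bandwidth_bytes_per_s


# Hypothetical activation tensor: 4096 tokens x 8192 hidden size x 2 bytes (BF16).
message = 4096 * 8192 * 2

# Placeholder bandwidths, chosen only to show the gap between link types.
FAST_LINK = 400e9  # bytes/s, stand-in for an intra-node link
SLOW_LINK = 50e9   # bytes/s, stand-in for an inter-node link

intra = all_reduce_seconds(message, 8, FAST_LINK)
inter = all_reduce_seconds(message, 16, SLOW_LINK)
print(f"TP=8 on fast link:  {intra * 1e6:.0f} us per all-reduce")
print(f"TP=16 on slow link: {inter * 1e6:.0f} us per all-reduce")
```

Because this cost is paid in every layer, on every forward and backward pass, even a modest slowdown per all-reduce adds up quickly once the TP group spans more than one node.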
TP is typically combined with pipeline parallelism (which splits the model by layer across GPUs rather than within layers) and data parallelism (which replicates the model and splits the data) in large-scale training and inference. NeMo, Megatron, vLLM, and TensorRT-LLM all implement TP.
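When the three schemes are combined, the total GPU count is the product TP × PP × DP, and each global rank gets one coordinate along each axis. The sketch below shows one possible layout, assuming TP is the fastest-varying axis so that each TP group occupies consecutive ranks (and therefore, on 8-GPU nodes, a single node). The function name and the axis ordering are assumptions for illustration; real frameworks make the ordering configurable.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_rank, pp_rank, tp_rank), with TP varying fastest."""
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


# Hypothetical job: TP=8 within a node, 4 pipeline stages, 2 data-parallel replicas.
TP, PP, DP = 8, 4, 2
for rank in (0, 7, 8, 63):
    print(rank, rank_to_coords(rank, TP, PP, DP))
```

With this ordering, ranks 0 to 7 share the same pipeline stage and replica and differ only in their TP coordinate, which keeps the chattiest communication on the fastest links.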
Examples
- Llama 3 70B in BF16 (~140 GB of weights) split across 4× H100 (80 GB each) with TP=4, so each GPU holds roughly 35 GB of weights.
- Frontier training runs commonly use TP=8 within node + pipeline parallelism + data parallelism across nodes.
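The memory arithmetic behind the first example can be checked in a few lines. This is a back-of-the-envelope sketch that assumes the weights divide evenly across the TP group and counts weights only; activations, KV cache, and any tensors a framework chooses to replicate are not included. The function name is hypothetical.

```python
def weight_gb_per_gpu(n_params, bytes_per_param, tp):
    """Approximate weight memory per GPU in GB, assuming an even split across TP ranks."""
    return n_params * bytes_per_param / tp / 1e9


N_PARAMS = 70_000_000_000  # 70B parameters
BF16_BYTES = 2             # bytes per parameter in BF16
GPU_MEMORY_GB = 80         # H100 capacity from the example above

total = weight_gb_per_gpu(N_PARAMS, BF16_BYTES, tp=1)
per_gpu = weight_gb_per_gpu(N_PARAMS, BF16_BYTES, tp=4)
print(f"total weights: {total:.0f} GB")
print(f"per GPU at TP=4: {per_gpu:.0f} GB, headroom: {GPU_MEMORY_GB - per_gpu:.0f} GB")
```

The headroom matters as much as the fit: whatever is left after the weights is what remains for activations and, in inference, the KV cache.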
Frequently asked
What is Tensor Parallelism?
Tensor parallelism shards individual layers across multiple GPUs — splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.
What is an example of tensor parallelism?
Llama 3 70B in BF16 (~140 GB of weights) split across 4× H100 (80 GB each) with TP=4, so each GPU holds roughly 35 GB of weights.
How is Tensor Parallelism related to Pipeline Parallelism?
Both are forms of model parallelism, but they cut the model along different axes. Tensor parallelism splits each layer's weights across GPUs, so every GPU takes part in every layer. Pipeline parallelism splits the model by layer across GPUs: GPU 1 holds layers 0-15, GPU 2 holds layers 16-31, and so on. Forward passes flow through the pipeline like an assembly line.
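The layer assignment described for pipeline parallelism can be sketched as a simple contiguous partition. The example assumes a hypothetical 64-layer model on 4 pipeline stages and spreads any remainder over the first stages; the function name is made up for illustration, and real schedulers may balance stages differently.

```python
def pipeline_stages(n_layers, n_stages):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    base, extra = divmod(n_layers, n_stages)
    stages = []
    start = 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)  # first `extra` stages get one more layer
        stages.append(range(start, start + size))
        start += size
    return stages


for stage, layers in enumerate(pipeline_stages(64, 4)):
    print(f"stage {stage}: layers {layers.start}-{layers.stop - 1}")
```

Under tensor parallelism there is no such table: every GPU in the TP group holds a shard of all 64 layers instead of all of some layers.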
Is Tensor Parallelism considered advanced?
Tensor Parallelism is generally considered advanced-level material in the AI and LLM space.