Infrastructure · advanced

Tensor Parallelism (TP)

Tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so that every GPU computes a different part of the result in parallel.

Published May 31, 2026

Explanation

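For models too big to fit on one GPU (or too slow to run on one), the weight matrix of each layer is split column-wise or row-wise across N GPUs. With a column-wise split, each GPU produces a slice of the output; with a row-wise split, each GPU produces a partial sum of the full output. In the Megatron-LM scheme a column-split layer feeds a row-split layer, so the partial results only need to be combined once per attention or MLP block, via an all-reduce.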
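The column-wise and row-wise splits described above can be simulated on a single machine. The following is a minimal NumPy sketch, not any framework's implementation: the "GPUs" are just list entries, the final sum stands in for the all-reduce, and the function names (mlp_reference, mlp_tensor_parallel) are hypothetical. It pairs a column-split first matrix with a row-split second matrix, the pattern used for MLP blocks in the Megatron-LM paper.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU; it is elementwise, so each shard can
    # apply it locally without talking to the other shards
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp_reference(x, a, b):
    # Unsharded two-layer MLP: Y = GeLU(X A) B
    return gelu(x @ a) @ b


def mlp_tensor_parallel(x, a, b, n_gpus):
    # First weight matrix is split column-wise: each shard owns a slice
    # of the hidden (output) dimension.
    a_shards = np.split(a, n_gpus, axis=1)
    # Second weight matrix is split row-wise: each shard consumes exactly
    # the hidden slice it produced and emits a partial sum of the output.
    b_shards = np.split(b, n_gpus, axis=0)
    partials = [gelu(x @ a_i) @ b_i for a_i, b_i in zip(a_shards, b_shards)]
    # Summing the partial outputs is what the all-reduce does across GPUs.
    return np.sum(partials, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))    # 2 tokens, model width 8
a = rng.standard_normal((8, 32))   # up-projection
b = rng.standard_normal((32, 8))   # down-projection

full = mlp_reference(x, a, b)
sharded = mlp_tensor_parallel(x, a, b, n_gpus=4)
```

Here full and sharded agree up to floating-point rounding. Because the nonlinearity sits between a column split and a row split, no communication is needed until the final sum.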

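Because every block ends in an all-reduce, tensor parallelism needs a fast interconnect. In practice it is kept within a single node, where GPUs communicate over NVLink, rather than spread across nodes over the slower network. Nodes typically hold 8 GPUs, so beyond TP=8 the all-reduce has to cross the network and its overhead tends to dominate.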
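A rough way to see why the interconnect matters is to estimate how long one all-reduce takes. The sketch below assumes a ring all-reduce, in which each GPU sends 2 × (N − 1) / N times the message size, and ignores latency and overlap with compute. The two bandwidth constants and the activation shape are illustrative placeholders chosen for this example, not measured or vendor-quoted figures.

```python
def ring_allreduce_bytes_per_gpu(message_bytes, n_gpus):
    """Bytes each GPU sends in a ring all-reduce of message_bytes."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def allreduce_seconds(message_bytes, n_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only time estimate; latency is ignored."""
    return ring_allreduce_bytes_per_gpu(message_bytes, n_gpus) / bandwidth_bytes_per_s


# Hypothetical activation tensor: 4096 tokens x 8192 hidden x 2 bytes (BF16)
activation_bytes = 4096 * 8192 * 2

# Placeholder bandwidths (assumptions for illustration only)
INTRA_NODE_BW = 400e9   # bytes/s, NVLink-class link
INTER_NODE_BW = 50e9    # bytes/s, network-class link

t_intra = allreduce_seconds(activation_bytes, 8, INTRA_NODE_BW)
t_inter = allreduce_seconds(activation_bytes, 8, INTER_NODE_BW)
slowdown = t_inter / t_intra
```

Under these assumptions the slowdown is simply the ratio of the two bandwidths, and that cost is paid at every block of every layer, which is why TP groups are usually kept inside one node.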

In large-scale training and inference, tensor parallelism is used together with pipeline parallelism (which splits the model by layer across GPUs rather than within layers) and data parallelism (which replicates the model and splits the data). NeMo, Megatron, vLLM, and TensorRT-LLM all implement TP.
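When the three forms are combined, the total GPU count is the product TP × PP × DP, and each GPU has a coordinate along each axis. The sketch below shows one possible rank layout, with the TP index varying fastest so that consecutive ranks, which usually sit in the same node, form a TP group. The ordering and the helper names (rank_to_coords, tp_group) are assumptions for illustration; real frameworks make the layout configurable and may order the axes differently.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index


def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coords(dp=dp, pp=pp, tp=tp)


def tp_group(rank, tp_size):
    """Global ranks that share this rank's layers and all-reduce together."""
    start = rank - rank % tp_size
    return list(range(start, start + tp_size))


# Example: 64 GPUs as TP=8 (one node), PP=4 stages, DP=2 replicas
coords = rank_to_coords(13, tp_size=8, pp_size=4, dp_size=2)
group = tp_group(13, tp_size=8)
```

With 8 GPUs per node and ranks assigned node by node, every TP group in this layout stays inside one node, while pipeline and data-parallel traffic crosses nodes.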

Examples

Llama 3 70B in BF16 (~140 GB of weights) split across 4× H100 (80 GB each) with TP=4, or about 35 GB of weights per GPU.
Frontier training runs commonly use TP=8 within node + pipeline parallelism + data parallelism across nodes.
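The arithmetic behind the first example can be written out as a short sketch. It assumes a round 70 billion parameters at 2 bytes each (BF16) and an even split of all weights across the TP group; in practice some small tensors are replicated rather than sharded, and the KV cache and activations also need memory. The function names are hypothetical.

```python
def weight_gb(n_params, bytes_per_param):
    """Total weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def per_gpu_weight_gb(n_params, bytes_per_param, tp_size):
    """Weight memory per GPU, assuming an even split across the TP group."""
    return weight_gb(n_params, bytes_per_param) / tp_size


total_gb = weight_gb(70e9, 2)                 # whole model in BF16
shard_gb = per_gpu_weight_gb(70e9, 2, 4)      # one GPU's share at TP=4
headroom_gb = 80 - shard_gb                   # left on an 80 GB H100
```

This gives 140 GB in total, 35 GB per GPU, and roughly 45 GB per GPU left over for the KV cache, activations, and runtime overhead.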

Frequently asked

How is Tensor Parallelism related to Pipeline Parallelism?

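Both split a model across GPUs, but along different axes. Tensor parallelism splits within each layer, so every GPU holds a slice of every layer. Pipeline parallelism splits the model by layer: GPU 1 holds layers 0-15, GPU 2 holds layers 16-31, and so on, and forward passes flow through the stages like an assembly line. The two are commonly combined, with tensor parallelism inside a node and pipeline parallelism across nodes.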

Sources

Megatron-LM paper (arXiv)