Comparison

Tensor Parallelism vs vLLM

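Tensor Parallelism and vLLM are both common AI/LLM terms, but they are different kinds of things: tensor parallelism is a technique for splitting a model across GPUs, while vLLM is a serving engine that can apply that technique. Here is a quick side-by-side.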

When you would reach for Tensor Parallelism

Tensor Parallelism comes up when the question is how to run a model whose weights do not fit in a single GPU's memory, or how to spread the work of each layer across several GPUs.

Example: Llama 3 70B in BF16 (~140 GB of weights) does not fit on one 80 GB H100. Split across 4× H100 with TP=4, each GPU holds roughly 35 GB of weights.
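The arithmetic behind that example can be sketched in a few lines. This is a rough estimate only: it assumes a round 70 billion parameters, 2 bytes per parameter for BF16, and an even split of all weights across GPUs. The function names are hypothetical, and the calculation ignores KV cache, activations, and any parameters a real framework keeps replicated on every GPU.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_gpu_weight_gb(num_params, bytes_per_param, tp_degree):
    """Approximate weight memory per GPU, assuming an even split (an idealization)."""
    return weight_memory_gb(num_params, bytes_per_param) / tp_degree


# Assumed inputs: 70e9 parameters, BF16 (2 bytes each), TP=4, 80 GB per GPU.
total_gb = weight_memory_gb(70e9, 2)
shard_gb = per_gpu_weight_gb(70e9, 2, 4)
headroom_gb = 80 - shard_gb  # left over for KV cache, activations, and overhead

print(f"total weights: {total_gb:.0f} GB")
print(f"per GPU at TP=4: {shard_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

The headroom figure matters in practice because the KV cache for concurrent requests has to live in whatever memory the weights leave free.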

When you would reach for vLLM

vLLM comes up when the question is how to serve a model to many concurrent requests with high throughput and efficient use of GPU memory.

Example: serving Llama 3 70B at high QPS on 4× H100 with vLLM.
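A minimal sketch of that setup using vLLM's offline Python API might look like the following. The model identifier, prompt, and sampling values are assumptions chosen for illustration, and the helper names `build_engine_kwargs` and `serve_demo` are hypothetical. Running `serve_demo` requires vLLM installed, access to the model weights, and four GPUs; a production deployment would more likely use vLLM's OpenAI-compatible server rather than this offline path.

```python
def build_engine_kwargs(model, num_gpus):
    """Collect the engine arguments used below (hypothetical helper)."""
    return {
        "model": model,
        "tensor_parallel_size": num_gpus,  # shard each layer across this many GPUs
        "dtype": "bfloat16",
    }


def serve_demo():
    """Load the model across 4 GPUs and generate one completion."""
    # Imported here so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    # Assumed Hugging Face model id; substitute the checkpoint you actually use.
    kwargs = build_engine_kwargs("meta-llama/Meta-Llama-3-70B-Instruct", 4)
    llm = LLM(**kwargs)

    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
    for output in outputs:
        print(output.outputs[0].text)
```

Note that `tensor_parallel_size=4` is where the two terms on this page meet: vLLM is the engine, and tensor parallelism is how it spreads the model over the four GPUs.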

Frequently asked

What is the difference between Tensor Parallelism and vLLM?

Tensor Parallelism: Tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so that different GPUs compute different output dimensions in parallel.

vLLM: vLLM is an open-source, high-throughput LLM serving engine. Its PagedAttention KV cache manager is a major reason it outperforms naive serving setups.
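The idea of splitting a matrix multiplication by output dimension can be illustrated with a toy, single-process sketch. It uses plain Python lists instead of GPUs, and the function names are hypothetical. Each "shard" stands in for the slice of the weight matrix that one GPU would hold, and the final concatenation stands in for the gather step that real frameworks perform over the interconnect.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split the weight matrix into equal column blocks, one per simulated GPU."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("output dimension must divide evenly across shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Compute x @ w by letting each shard produce a slice of the output columns."""
    shards = split_columns(w, num_shards)
    # Each partial product would run on its own GPU; every GPU sees the full input x.
    partials = [matmul(x, shard) for shard in shards]
    # Gather: concatenate the partial outputs along the output dimension.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, 2) == matmul(x, w))
```

The sharded result matches the unsharded one because each output column depends only on the input and that column of the weights, which is what makes this split possible without changing the math.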

When should I use Tensor Parallelism vs vLLM?

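Tensor Parallelism is the right concept when the problem is fitting or partitioning a model across GPUs. vLLM applies when the problem is serving a model efficiently to many requests. They are not alternatives: vLLM can run a model with tensor parallelism, so in a multi-GPU deployment you often use both.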

Are Tensor Parallelism and vLLM the same thing?

No. Tensor Parallelism is a model-partitioning technique; vLLM is a serving engine. Both sit in the infrastructure layer of the AI stack and are often used together, but they solve different problems.