ModelTerms

Comparison

Inference vs vLLM

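Inference and vLLM are both common AI/LLM terms, but they sit at different levels: inference is the act of running a trained model, while vLLM is a specific piece of software for serving that work at scale. Here is a quick side-by-side.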

When you would reach for Inference

Inference comes up when the question is about running a trained model on new input: what happens between a prompt going in and tokens coming out.

A ChatGPT response: one inference call per turn.
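To make "one inference call" concrete, here is a minimal sketch of the token-by-token loop behind a single LLM response: a prefill pass over the prompt, then repeated sampling with a cache that grows by one entry per token. Everything in it is an assumption for illustration: `toy_forward`, the five-word `VOCAB`, and the scoring rule stand in for a real model, and the "KV cache" is just a list of token ids rather than per-layer key/value tensors.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # hypothetical toy vocabulary
EOS = 0


def toy_forward(token_id, kv_cache):
    """Hypothetical stand-in for a model forward pass on ONE new token.

    A real model would append that token's key/value tensors for every
    layer; here the cache simply records the token id.
    """
    kv_cache.append(token_id)
    # Toy scoring rule: strongly prefer the next vocabulary entry, cyclically.
    preferred = (kv_cache[-1] + 1) % len(VOCAB)
    return [3.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def sample(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    weights = [math.exp(score / temperature) for score in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(prompt_ids, max_new_tokens=8, temperature=0.0, seed=0):
    """Run one inference call: prefill the prompt, then decode token by token."""
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    rng = random.Random(seed)
    kv_cache = []

    # Prefill: feed the prompt once so the cache holds every prompt token.
    for token_id in prompt_ids:
        logits = toy_forward(token_id, kv_cache)

    # Decode: each step processes only the newest token and reuses the cache.
    output = []
    for _ in range(max_new_tokens):
        next_id = sample(logits, temperature, rng)
        if next_id == EOS:
            break
        output.append(next_id)
        logits = toy_forward(next_id, kv_cache)
    return output


if __name__ == "__main__":
    print([VOCAB[i] for i in generate([1])])
```

The point of the cache is visible in the decode loop: each step passes in only the newest token, and the work already done for earlier tokens is reused rather than recomputed.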

When you would reach for vLLM

vLLM comes up when the question is about serving infrastructure: how to run LLM inference for many concurrent requests with high throughput on the GPUs you have.

Serving Llama 3 70B at high QPS on 4 H100s with vLLM.

Frequently asked

What is the difference between Inference and vLLM?

Inference: Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

vLLM: vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager stores the cache in fixed-size blocks instead of one contiguous region per request, which cuts wasted GPU memory and is the main reason it achieves much higher throughput than naive serving setups.
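The block-based idea behind a PagedAttention-style KV cache manager can be sketched in a few lines. This is a toy illustration under stated assumptions, not vLLM's implementation or API: the class `PagedKVCache`, its method names, and the tiny block counts are all hypothetical, and no tensors are stored, only the bookkeeping that maps each sequence to physical blocks.

```python
class PagedKVCache:
    """Toy block-table allocator (hypothetical names, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        cache.append_token("a")
    print(cache.block_tables["a"], len(cache.free_blocks))
```

Because memory is claimed one block at a time, a sequence wastes at most one partially filled block, and blocks freed by a finished request are immediately reusable by any other request. A naive setup that reserves the maximum sequence length per request up front cannot do either.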

When should I use Inference vs vLLM?

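The two are not alternatives. Inference is the right concept when you are reasoning about what a model does at run time, such as latency per token, sampling settings, or KV cache size. vLLM is a tool you choose when you need to serve that inference workload efficiently to many users.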

Are Inference and vLLM the same thing?

No. Inference is the process of running a trained model; vLLM is serving infrastructure that performs inference efficiently. They are related but address different parts of the AI stack.