Comparison
Inference vs vLLM
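Inference and vLLM are both common LLM terms, but they sit at different levels: inference is the act of running a trained model on new input, while vLLM is one specific engine for doing that at scale. Here is a quick side-by-side.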
When you would reach for Inference
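Inference comes up when the question is about what happens when a trained model runs on new input: how tokens are generated one at a time, how they are sampled, and what the KV cache does.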
A ChatGPT response: one inference call per turn.
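To make a single inference call concrete, here is a minimal sketch of the loop it runs internally: process the prompt, then repeatedly sample one token and feed it back in, reusing a cache so earlier tokens are not reprocessed. Everything here is an assumption for illustration: `toy_forward` is a hypothetical stand-in for a real model's forward pass, the vocabulary and scores are made up, and the "KV cache" is just a list of token ids rather than real key/value tensors.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]  # made-up vocabulary
EOS_ID = 0


def toy_forward(token_id, kv_cache):
    """Hypothetical stand-in for one model step on ONE new token.

    A real model would attend over cached keys and values; this toy only
    records the token in the cache and returns made-up scores (logits).
    """
    kv_cache.append(token_id)  # the cache grows by one entry per step
    step = len(kv_cache)
    return [float(((token_id + 1) * (i + 3) + step) % 7) for i in range(len(VOCAB))]


def sample(logits, temperature, rng):
    """Pick the next token id: greedy at temperature 0, softmax sampling otherwise."""
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(prompt_ids, max_new_tokens=8, temperature=1.0, seed=0):
    """Run one 'inference call': prefill the prompt, then decode token by token."""
    if not prompt_ids:
        raise ValueError("prompt_ids must contain at least one token id")
    rng = random.Random(seed)
    kv_cache = []
    logits = None
    for token_id in prompt_ids:  # prefill: fill the cache from the prompt
        logits = toy_forward(token_id, kv_cache)
    output = []
    for _ in range(max_new_tokens):  # decode: one new token per iteration
        next_id = sample(logits, temperature, rng)
        if next_id == EOS_ID:
            break
        output.append(next_id)
        logits = toy_forward(next_id, kv_cache)  # only the new token is processed
    return output, kv_cache
```

Calling `generate([1, 2], max_new_tokens=5)` returns the sampled token ids together with the final cache. The point of the sketch is the shape of the loop: each step processes only the newest token because everything earlier already lives in the cache.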
When you would reach for vLLM
vLLM comes up when the question is about serving infrastructure: how to run a model for many concurrent requests with high throughput.
Serving Llama 3 70B at high QPS on 4 H100s with vLLM.
Frequently asked
What is the difference between Inference and vLLM?
Inference: Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.
vLLM: vLLM is an open-source, high-throughput LLM serving engine. Its PagedAttention KV cache manager stores the cache in fixed-size blocks, much like pages in virtual memory, which cuts wasted GPU memory and lets more requests be batched together. That is the main reason it outperforms naive serving setups.
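The paging idea behind a block-based KV cache can be sketched in a few lines. This is a toy illustration only, not vLLM's actual implementation: the class `PagedKVCache` and its methods are hypothetical names, and it tracks block bookkeeping without storing any real key/value tensors. It shows how each sequence gets a block table mapping its tokens to fixed-size physical blocks that are allocated on demand and returned to a shared pool when the sequence finishes.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical; bookkeeping only, no tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())  # allocate only when needed
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With 4 blocks of 2 tokens each, a 3-token sequence occupies 2 blocks and a 1-token sequence occupies 1, so no sequence has to reserve memory for its maximum possible length up front. Freeing a sequence makes its blocks immediately reusable by other requests.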
When should I use Inference vs vLLM?
Inference is the right concept when you are discussing how a model produces output at run time. vLLM applies when you need to deploy a model and serve it efficiently to many users.
Are Inference and vLLM the same thing?
No. Inference is the process of running a trained model; vLLM is a piece of serving infrastructure that performs inference efficiently. They are related but address different parts of the AI stack.