Infrastructure · advanced

vLLM

vLLM is an open-source, high-throughput serving engine for large language models (LLMs). Its PagedAttention KV cache manager is the main reason it outperforms naive serving setups that reserve one contiguous KV cache region per request.

Explanation

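vLLM treats the KV cache like virtual memory: the cache is split into fixed-size blocks (pages) that need not be contiguous, and a per-request block table maps logical token positions to physical blocks, much as a page table does in an operating system. Blocks are allocated on demand and can be shared across requests that have a common prefix, which avoids the waste of reserving a maximum-length contiguous region for every request. In the vLLM project's own launch benchmarks, this gave up to 24x higher throughput than Hugging Face Transformers and up to 3.5x higher than Hugging Face Text Generation Inference. The gains are largest for variable-length requests and large batches.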
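The block-table idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, and the block size of 16 tokens is an assumed value chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """One request: a token count plus its logical-to-physical block table."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def release_all(self, allocator: BlockAllocator) -> None:
        # Return every block to the pool so other requests can reuse it.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()
for _ in range(20):  # 20 tokens occupy 2 blocks
    seq_a.append_token(allocator)
for _ in range(5):   # 5 tokens occupy 1 block
    seq_b.append_token(allocator)
```

Each sequence wastes at most one partially filled block, and the physical blocks it holds do not have to be adjacent. A contiguous allocator would instead have to reserve space for the longest output each request might produce.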

It supports continuous batching, speculative decoding, quantized models, and most open-source architectures. Many companies build their production LLM serving on top of vLLM or its derivatives.
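To see why continuous batching matters, the following toy simulation counts decode steps under two scheduling policies. It is a simplified sketch under stated assumptions, not vLLM's scheduler: the function names are hypothetical, each step is assumed to produce one token per running request, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request's slot is refilled from the queue before the next step."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token per running request;
        # requests with no tokens left drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Output lengths in tokens: a few long requests mixed with short ones.
output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static = static_batching_steps(output_lengths, batch_size=4)          # 200
continuous = continuous_batching_steps(output_lengths, batch_size=4)  # 110
```

With static batching, the short requests in each batch leave their slots idle until the 100-token request finishes. Continuous batching hands those slots to waiting requests immediately, so the same workload completes in fewer steps.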

Examples

  • Serving Llama 3 70B at high QPS on 4 H100s with vLLM.
  • Cloud LLM providers built on vLLM under the hood.

Frequently asked

How is vLLM related to KV Cache?

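vLLM's central technique, PagedAttention, is a way of managing the KV cache. The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. After the model weights, it is typically the largest memory cost of LLM inference, and it grows with both batch size and sequence length. vLLM stores it in fixed-size blocks so that more requests fit in GPU memory at once.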
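A back-of-the-envelope calculation shows how large the KV cache gets. The sketch below assumes the published Llama 3 70B architecture (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 16-bit cache values. The helper name `kv_cache_bytes_per_token` is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed configuration: 80 layers, 8 KV heads, head dim 128, 2 bytes per value.
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
per_8k_sequence = per_token * 8192

print(per_token // 1024, "KiB per token")                      # 320 KiB per token
print(per_8k_sequence / 2**30, "GiB per 8192-token sequence")  # 2.5 GiB per 8192-token sequence
```

At roughly 2.5 GiB for a single 8192-token sequence, a batch of a few dozen long requests can use tens of gigabytes, which is why wasted KV cache space directly limits batch size and throughput.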

Is vLLM considered advanced?

vLLM is generally considered advanced-level material in the AI and LLM space.