Inference · advanced
Continuous Batching (in-flight batching)
Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
Explanation
Static batching waits for every request in a batch to finish before starting the next batch. When response lengths vary, the slots held by shorter responses sit idle until the longest one completes, so much of the GPU's capacity is wasted.
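The waste can be put in rough numbers. The sketch below is a simplification, not a measurement: it treats each request as occupying one batch slot and counts how many slot-steps actually produce a token. The function name and the output lengths are hypothetical, and slot utilization is only a proxy for real GPU utilization.

```python
def static_batch_utilization(output_lengths):
    """Fraction of batch slot-steps that produce a token under static batching.

    The batch runs for max(output_lengths) decode steps, and every slot stays
    reserved for the whole run, even after its request has finished.
    """
    steps = max(output_lengths)
    useful = sum(output_lengths)            # slot-steps that emit a token
    reserved = steps * len(output_lengths)  # slot-steps held by the batch
    return useful / reserved


# Hypothetical output lengths (in tokens) for one batch of four requests.
lengths = [20, 50, 100, 400]
print(f"{static_batch_utilization(lengths):.1%}")  # prints 35.6%
```

In this made-up batch, the single 400-token response keeps three mostly idle slots reserved for the entire run.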
Continuous batching evicts completed requests from the batch as soon as they finish and admits waiting requests into the freed slots. Batch membership can therefore change at every decode step. The GPU stays much closer to fully loaded, and effective throughput is often 5-20× higher than naive batching, depending on the workload.
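A toy simulation makes the difference concrete. This is a minimal sketch under strong assumptions, not how any production engine is implemented: all requests are already queued, each running request emits exactly one token per decode step, prefill cost and KV cache memory limits are ignored, and `static_steps`, `continuous_steps`, and the request lengths are hypothetical names and values chosen for illustration.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_steps(lengths, batch_size):
    """Decode steps when a finished request frees its slot immediately."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        running = [r - 1 for r in running]
        # Evict requests that have just finished.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


# Hypothetical workload: one long response for every three short ones.
lengths = [100, 10, 10, 10] * 4
print(static_steps(lengths, batch_size=4))      # 400
print(continuous_steps(lengths, batch_size=4))  # 160
```

With the same four slots, the continuous schedule finishes the same work in fewer decode steps because short requests no longer wait on the longest request in their batch.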
vLLM popularized continuous batching with PagedAttention as its memory backbone. TGI, TensorRT-LLM, and most production engines now implement variants.
Examples
- An illustrative scenario: a vLLM server handling 200 concurrent users with variable-length responses can hold GPU utilization around 95%, versus roughly 30% with static batching.
- Moving from a homegrown server to vLLM with continuous batching can cut production inference cost per token by roughly 4-8×.
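The link between throughput and cost per token is simple arithmetic: on fixed hardware, cost per token falls in proportion to the throughput gain. The sketch below assumes a flat hourly GPU price and steady throughput; the function name, the price, and both throughput figures are hypothetical inputs, not benchmark results.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Serving cost per one million generated tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures: the same GPU before and after a 6x throughput gain.
baseline = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=500)
improved = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=3000)
print(f"baseline: ${baseline:.2f}, improved: ${improved:.2f} per million tokens")
print(f"cost reduction: {baseline / improved:.1f}x")
```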
Frequently asked
What is Continuous Batching?
Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.
What is an example of continuous batching?
An illustrative scenario: a vLLM server handling 200 concurrent users with variable-length responses can hold GPU utilization around 95%, versus roughly 30% with static batching.
How is Continuous Batching related to vLLM?
vLLM is an open-source, high-throughput LLM serving engine that popularized continuous batching. Its PagedAttention KV cache manager is the memory backbone that lets requests join and leave the batch at every decode step, and the combination is a major reason vLLM outperforms naive serving setups.
Is Continuous Batching considered advanced?
Continuous Batching is generally considered advanced-level material in the AI and LLM space.