ModelTerms

Inference · advanced

Continuous Batching (in-flight batching)

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.

Explanation

Static batching waits for every request in a batch to finish before starting the next batch. When response lengths vary, the slots of requests that finish early sit idle until the longest response completes, which wastes GPU capacity.

Continuous batching evicts completed requests from the batch immediately and slots waiting ones in, so batch membership can change at every decode step. The GPU stays close to fully loaded, and effective throughput is often 5-20× higher than with naive batching.
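
The difference between the two policies can be sketched with a toy scheduler. This is a minimal simulation, not how vLLM or any real engine is implemented: it assumes every request costs one batch slot per decode step, ignores prefill and KV cache memory limits, and the function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes.

    Returns (decode_steps, slot_utilization).
    """
    steps = 0
    busy_slots = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)       # the whole batch waits for the longest request
        busy_slots += sum(batch)  # slots that actually produced a token
    return steps, busy_slots / (steps * batch_size)


def continuous_batching(lengths, batch_size):
    """Finished requests are replaced from the queue before every decode step.

    Returns (decode_steps, slot_utilization).
    """
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    busy_slots = 0
    while waiting or running:
        # Fill free slots with waiting requests.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token.
        steps += 1
        busy_slots += len(running)
        # Drop requests that just emitted their last token.
        running = [r - 1 for r in running if r > 1]
    return steps, busy_slots / (steps * batch_size)


# Hypothetical workload: output lengths in tokens, two long requests among short ones.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching(lengths, batch_size=4))
print(continuous_batching(lengths, batch_size=4))
```

On this workload the static policy needs 20 decode steps at 40% slot utilization, while the continuous policy needs 12 steps at about 67%. The gap grows as the spread between short and long responses widens.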

vLLM popularized continuous batching with PagedAttention as its memory backbone. TGI, TensorRT-LLM, and most production engines now implement variants.

Examples

  • A vLLM server: 200 concurrent users with variable-length responses; GPU utilization stays at 95% vs ~30% on static batching.
  • Production inference cost per token drops by 4-8× moving from a homegrown server to vLLM with continuous batching.

Frequently asked

How is Continuous Batching related to vLLM?

vLLM is an open-source, high-throughput LLM serving engine that implements continuous batching. Its PagedAttention KV cache manager allocates cache memory in fixed-size blocks, which makes it cheap to add requests to and remove them from the running batch.

Is Continuous Batching considered advanced?

Continuous Batching is generally considered advanced-level material in the AI and LLM space.

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.
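
To make the memory cost concrete, the cache size can be estimated from the model shape. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token; the function name and the example dimensions are hypothetical rather than taken from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (bytes_per_elem=2 assumes fp16/bf16)."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token / 2**20, "MiB per token")
print(per_request / 2**30, "GiB per 4096-token request")
```

With these assumed dimensions the cache costs 0.5 MiB per token, or 2 GiB for a single 4096-token request, which is why the number of requests that fit in a batch is usually bounded by cache memory.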

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Time per Output Token · Inference

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
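
As a minimal sketch of that definition, TPOT can be computed from the arrival times of the streamed tokens. The function name and the timestamps are hypothetical; the time to the first token is deliberately excluded because only the gaps between tokens are averaged.

```python
def time_per_output_token(token_timestamps):
    """Mean gap in seconds between consecutive output tokens."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    gaps = [later - earlier
            for earlier, later in zip(token_timestamps, token_timestamps[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical arrival times in seconds since the request was sent.
timestamps = [0.50, 0.54, 0.58, 0.62]
print(time_per_output_token(timestamps))  # roughly 0.04 s per token
```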

Tensor Parallelism · Infrastructure

Tensor parallelism shards individual layers across multiple GPUs, splitting each matrix multiplication so that different GPUs compute different output dimensions in parallel.
