Inference · advanced

Continuous Batching (inflight batching)

Continuous batching lets new requests join an in-flight batch on the next decode step rather than waiting for the current batch to finish, dramatically raising GPU utilization on variable-length workloads.

Published May 31, 2026

Explanation

Static batching waits for all requests in a batch to finish before starting the next batch. With variable response lengths, the slots held by shorter responses sit idle once those responses finish, and the GPU stays underused until the longest response completes.
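The idle time described above can be put into numbers with a small sketch. The function name, the batch size, and the response lengths are all made up for illustration, and the model is deliberately crude: one decode step produces one token per active request, and prefill cost and memory limits are ignored.

```python
from typing import List, Tuple


def static_batching_steps(lengths: List[int], batch_size: int) -> Tuple[int, float]:
    """Return (decode steps, slot utilization) for static batching.

    Hypothetical toy model: each batch runs until its longest request
    finishes, so every batch costs max(batch) decode steps.
    """
    total_steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_steps += max(batch)  # short requests wait for the longest one
    useful_slot_steps = sum(lengths)  # slot-steps that produced a token
    utilization = useful_slot_steps / (total_steps * batch_size)
    return total_steps, utilization


# Illustrative output lengths in tokens (assumed, not measured).
lengths = [10, 200, 15, 30, 180, 20, 25, 12]
steps, util = static_batching_steps(lengths, batch_size=4)
print(steps, round(util, 2))  # 380 decode steps, about 0.32 utilization
```

In this toy workload two long responses dominate, so roughly two thirds of the available slot-steps produce nothing.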

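Continuous batching evicts completed requests from the batch immediately and slots waiting requests into the freed positions, so the batch composition can change at every decode step. The GPU stays close to fully loaded, and effective throughput is often 5-20× higher than with naive batching, depending on how much response lengths vary.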
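A minimal sketch of that scheduling loop, under the same simplifying assumptions as any toy model: one decode step yields one token per active request, and prefill cost and KV cache limits are ignored. The function name and the workload are hypothetical, and this is not how any particular engine implements its scheduler.

```python
from collections import deque
from typing import List


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Count decode steps when freed slots are refilled at every step.

    Hypothetical toy scheduler: `lengths` are the output lengths (in
    tokens) of queued requests, served in arrival order.
    """
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token.
        running = [remaining - 1 for remaining in running]
        # Evict finished requests so their slots are free next step.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Illustrative output lengths in tokens (assumed, not measured).
lengths = [10, 200, 15, 30, 180, 20, 25, 12]
steps = continuous_batching_steps(lengths, batch_size=4)
utilization = sum(lengths) / (steps * 4)
print(steps, round(utilization, 2))  # 200 decode steps, about 0.61 utilization
```

With these lengths, a static scheduler that groups the same requests four at a time in arrival order would need 200 + 180 = 380 steps, because each group waits for its longest member. The gain in a real system depends on the length distribution and on memory headroom.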

vLLM popularized continuous batching with PagedAttention as its memory backbone. TGI, TensorRT-LLM, and most production engines now implement variants.

Examples

A vLLM server: 200 concurrent users with variable-length responses; GPU utilization stays at 95% vs ~30% on static batching.
Production inference cost per token drops by 4-8× when moving from a homegrown server to vLLM with continuous batching.

Frequently asked

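What is Continuous Batching?

Continuous batching, also called inflight batching, is a serving technique in which new requests join an in-flight batch on the next decode step instead of waiting for the current batch to finish.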

What is an example of continuous batching?

A vLLM server: 200 concurrent users with variable-length responses; GPU utilization stays at 95% vs ~30% on static batching.

How is Continuous Batching related to vLLM?

vLLM is an open-source high-throughput LLM serving engine that popularized continuous batching, using PagedAttention as its memory backbone. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

Is Continuous Batching considered advanced?

Continuous Batching is generally considered advanced-level material in the AI and LLM space.

Related terms

vLLM · Infrastructure

vLLM is an open-source high-throughput LLM serving engine. Its PagedAttention KV cache manager is the reason it dramatically outperforms naive serving setups.

KV Cache · Architecture

The KV cache stores the key and value vectors of all earlier tokens during generation so the model does not recompute them at every step. It is the main memory cost of LLM inference.
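To see why this is the main memory cost, here is a back-of-the-envelope estimate. The function name and the model shape (32 layers, 32 key/value heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from this page.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit (fp16/bf16) cache entries
) -> int:
    """Estimate KV cache size for one sequence (hypothetical helper).

    The factor of 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Assumed model shape, chosen only for illustration.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
per_sequence = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token / 2**20, "MiB per token")        # 0.5 MiB per token
print(per_sequence / 2**30, "GiB per sequence")  # 2.0 GiB for 4096 tokens
```

Under these assumptions a single 4096-token sequence holds 2 GiB of cache, which is why the number of requests a batch can hold is usually bounded by memory rather than by compute.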

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Time per Output Token · Inference

Time per output token (TPOT) is the average wall-clock delay between consecutive generated tokens during streaming. It determines how fast text appears once generation starts.
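As a small sketch of that definition, TPOT can be computed from the arrival times of streamed tokens. The function name and the timestamps below are hypothetical, and the time to the first token is excluded because it is a separate latency.

```python
from typing import List


def time_per_output_token(token_times: List[float]) -> float:
    """Average gap in seconds between consecutive tokens (hypothetical helper).

    `token_times` holds the wall-clock arrival time of each generated
    token, in order. N tokens give N - 1 gaps.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Assumed timestamps: first token at 0.50 s, then one token every 40 ms.
arrivals = [0.50, 0.54, 0.58, 0.62, 0.66]
tpot = time_per_output_token(arrivals)
print(round(tpot * 1000), "ms per token")  # 40 ms per token
```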

Tensor Parallelism · Infrastructure

Tensor parallelism shards individual layers across multiple GPUs — splitting each matrix multiplication so different GPUs compute different output dimensions in parallel.
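The split along output dimensions can be illustrated without any GPU. The sketch below uses plain Python lists, treats each "shard" as a stand-in for one device, and assumes the number of output columns divides evenly by the shard count. All names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Compute x @ w by splitting w's columns across shards (toy sketch).

    Each shard stands in for one GPU: it holds a slice of the output
    columns and computes only that slice of the result.
    """
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("output columns must divide evenly across shards")
    width = cols // num_shards
    partials = []
    for shard in range(num_shards):
        w_shard = [row[shard * width:(shard + 1) * width] for row in w]
        partials.append(matmul(x, w_shard))  # would run on its own device
    # Gather: concatenate each row's slices along the output dimension.
    return [
        [value for part in partials for value in part[i]]
        for i in range(len(x))
    ]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
result = column_parallel_matmul(x, w, num_shards=2)
print(result)  # [[1, 2, 4, 7], [3, 4, 10, 15]], same as matmul(x, w)
```

In a real deployment the concatenation step corresponds to a collective communication between devices, which this sketch leaves out.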
