If you serve large language models at any scale, the single biggest free lunch on the table in 2026 is continuous batching. The technique pre-dates the term — it was published as "iteration-level scheduling" in the Orca paper at OSDI 2022 — but the production reference implementation is what landed in vLLM, and the rest of the ecosystem has been catching up since.
This is the explainer we wished had existed when we first read the paper. No hand-waving, just what continuous batching is, why it works, and what it costs.
The problem static batching cannot solve
Naive LLM inference batches requests at the request level. You wait for batch_size requests to arrive, run them through the model, and return the outputs. The catastrophic problem is that LLM outputs have wildly variable lengths. One request might generate 50 tokens; another in the same batch might generate 2,000.
With static batching, the batch finishes when the longest request finishes. Every other slot in the GPU is wasted, generating padding tokens that get discarded.
Request A (50 tokens)    |#                                 |  done at step 50, idles
Request B (2000 tokens)  |##################################|  done at step 2000
Request C (200 tokens)   |###                               |  done at step 200, idles
Request D (1500 tokens)  |##########################        |  done at step 1500
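A back-of-envelope check on that toy batch, using the four lengths from the diagram:

```python
# Static batching: the whole batch runs until the longest sequence finishes.
lengths = [50, 2000, 200, 1500]               # tokens each request generates
useful = sum(lengths)                          # 3,750 real tokens
total = max(lengths) * len(lengths)            # 8,000 slot-steps of GPU work
print(f"utilization = {useful / total:.0%}")   # ~47%, even in this tiny batch
```

Real traces are worse: bigger batches and heavier-tailed length distributions push the wasted fraction up.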
GPU utilization on real workloads with static batching: 20 to 35 percent. That is the baseline continuous batching attacks.
How continuous batching works
Continuous batching schedules at the iteration level rather than the request level. Each model forward pass processes one token from each active sequence. When a sequence finishes (emits an end-of-sequence token), it leaves the batch and a new request takes its slot — on the very next iteration.
The slot assignment looks like this for the same workload:
Iter   Active slots
1      [A, B, C, D]
50     [E, B, C, D]   <- A finished, E joined
200    [E, B, F, D]   <- C finished, F joined
1500   [E, B, F, G]   <- D finished, G joined
2000   [E, H, I, J]   <- B, F, G finished; three new joined
Slots never idle waiting for the longest sequence. The GPU stays busy.
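The whole trick fits in a short loop. Here is a minimal sketch of iteration-level scheduling, assuming a `step_fn` that runs one forward pass over the active sequences and returns those that emitted end-of-sequence; the names are ours, not vLLM's internals:

```python
from collections import deque

def continuous_batching_loop(step_fn, waiting: deque, max_batch: int = 4):
    """Iteration-level scheduling: admit and retire sequences every step."""
    active = []
    while active or waiting:
        # Refill free slots from the queue on every iteration,
        # not once per batch as static batching does.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step: every active sequence advances by one token.
        finished = step_fn(active)
        # Finished sequences leave immediately; their slots are
        # reused on the very next iteration.
        active = [seq for seq in active if seq not in finished]
```

Everything hard about a real implementation lives inside `step_fn`: batched KV cache reads, prefill handling, per-sequence state.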
The reference paper is Yu and colleagues, Orca: A Distributed Serving System for Transformer-Based Generative Models, OSDI 2022. The first widely deployed implementation is vLLM, github.com/vllm-project/vllm.
The throughput numbers
Throughput improvements measured on Llama 2 13B, A100 80GB, ShareGPT trace.
| Approach | Tokens/sec | Multiplier |
|---|---|---|
| Naive HuggingFace pipeline | 220 | 1.0x |
| Static batching, batch=8 | 850 | 3.9x |
| Static batching, batch=32 | 1,400 | 6.4x |
| Continuous batching (vLLM) | 5,100 | 23.2x |
| Continuous batching plus PagedAttention | 6,200 | 28.2x |
The 23x figure quoted in vLLM's launch blog is from this exact comparison. PagedAttention on top adds another 15 to 25 percent by eliminating KV cache fragmentation. For the deep mechanics, see our vLLM continuous batching deep dive.
Why GPU utilization triples
Three compounding effects.
First, no padding waste. With static batching, every step a slot spends after its own sequence has finished is padding. With continuous batching, every token in the batch is real work.
Second, higher effective batch size. Because slots free up continuously, you can run with a configured batch size of 256 even though the average number of concurrently active requests is 50. The model sees a batch large enough to amortize the cost of loading weights from HBM.
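Why amortization dominates: decode is memory-bound, and every decode step streams the full weights from HBM once, no matter how many sequences share the pass. A rough ceiling, assuming FP16 weights for the 13B model above and ~2 TB/s of A100-class HBM bandwidth (both round numbers, not measurements):

```python
# Decode-throughput ceiling from weight streaming alone. Ignores KV cache
# traffic, activations, and compute, so treat it as an upper bound.
weights_gb = 13e9 * 2 / 1e9          # 13B params in FP16 -> ~26 GB
hbm_gb_per_s = 2000                  # ~2 TB/s HBM bandwidth, rounded
steps_per_sec = hbm_gb_per_s / weights_gb   # each step reads all weights once
for batch in (1, 8, 64, 256):
    print(f"batch {batch:>3}: <= {steps_per_sec * batch:>8,.0f} tokens/sec")
# batch 1 caps out near ~77 tok/s; batch 256 near ~19,700 tok/s
```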
Third, better tail latency. A single 2,000-token request no longer blocks the 50-token requests behind it. P99 latency drops by 4x to 10x in our measurements.
What continuous batching costs
It is not free. Three real costs.
Implementation complexity. The runtime has to track per-sequence state across iterations, manage KV cache slots, and recycle memory as sequences join and leave. It is hard. This is why most teams use vLLM, TGI, or SGLang rather than rolling their own.
Memory pressure. With more concurrent sequences active, KV cache fragmentation matters. PagedAttention solves this but adds bookkeeping overhead of about 3 to 5 percent.
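That bookkeeping is essentially a per-sequence block table. A toy version, purely illustrative and nothing like vLLM's actual data structures:

```python
class ToyBlockTable:
    """Paged KV cache sketch: token positions map to fixed-size physical blocks."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # free physical block ids
        self.tables: dict[str, list[int]] = {}    # sequence id -> its blocks

    def append_token(self, seq_id: str, pos: int) -> None:
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:            # crossed a block boundary
            table.append(self.free.pop())         # allocate one more block
        # an attention kernel would look up table[pos // self.block_size]

    def release(self, seq_id: str) -> None:
        # Finished sequences hand back whole blocks, so new sequences can
        # always reuse them: no fragmentation, just lookup overhead.
        self.free.extend(self.tables.pop(seq_id, []))
```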
Scheduler overhead. At very small request sizes (single-token classification, embeddings) the scheduler bookkeeping outweighs the benefit. Continuous batching is for generation, not embedding lookups.
Implementations that ship continuous batching
| System | Continuous batching | KV cache layout | Best for |
|---|---|---|---|
| vLLM | Yes, reference impl | PagedAttention | General-purpose serving |
| TGI 2.x | Yes | Block-based | Hugging Face native |
| SGLang | Yes plus prefix sharing | RadixAttention | Structured generation |
| TensorRT-LLM | Yes ("in-flight batching") | Custom | NVIDIA stack |
| Triton + TRT-LLM | Yes | Custom | Existing Triton shops |
| LMDeploy | Yes | Block-based | Strong on Qwen, InternLM |
| DeepSpeed-FastGen | Yes | Dynamic SplitFuse | Microsoft stack |
For a feature-by-feature comparison see our LLM serving frameworks 2026 guide.
The interaction with chunked prefill
A subtle but important point. Continuous batching deals with the decode phase. The prefill phase — processing the entire prompt — is a separate scheduling problem. If a 4K-token prefill arrives, naive continuous batching pauses decode for everyone while the prefill runs.
Chunked prefill, introduced in vLLM 0.5 and standard since 0.7, breaks the prefill into small chunks and interleaves them with decode iterations. The result is more uniform per-iteration latency. Without chunked prefill, P99 spikes by 3x to 8x when long-context requests arrive.
Set this in vLLM with --enable-chunked-prefill. In TGI 2.4 and later it is on by default.
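The same switch is reachable from vLLM's Python API; the keyword mirrors the CLI flag, and the model name below is just an example:

```python
from vllm import LLM

# enable_chunked_prefill mirrors the --enable-chunked-prefill CLI flag.
# On recent vLLM versions it is on by default, so this is explicit insurance.
llm = LLM(model="meta-llama/Llama-2-13b-hf", enable_chunked_prefill=True)
```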
When continuous batching does not help
Continuous batching is not the right answer for:
- Embedding workloads. Single forward pass, no decode loop. Use Triton with dynamic batching instead.
- Diffusion image generation. The compute pattern is fundamentally different. Use stream-batched diffusion runners.
- Single-user latency-critical chat. If you have one user and need lowest possible latency, batch size 1 with no continuous batching is fastest. The technique trades latency for throughput when you have multiple concurrent users.
- Long-output structured generation with constraints. Constrained decoding (JSON schema) breaks some continuous batching assumptions; SGLang handles this better than naive vLLM.
Routing across batched serving fleets
Once continuous batching pushes a single GPU to 6,000 tokens per second, the bottleneck moves to traffic shaping. Hot requests should land on warm KV caches; cold requests can take the long path. A gateway like Swfte Connect routes by prompt prefix hash so repeat queries hit the same backend, which combines with PagedAttention's prefix sharing to multiply throughput. For the architectural picture see our pillar on Ray and alternatives.
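The affinity idea itself fits in a few lines. A generic sketch of prefix-hash routing, not Swfte Connect's actual API; the backend URLs and the 256-character prefix are arbitrary choices:

```python
import hashlib

def pick_backend(prompt: str, backends: list[str], prefix_chars: int = 256) -> str:
    """Route by a hash of the prompt prefix, so repeat queries that share a
    system prompt or few-shot preamble land on the replica whose KV cache
    blocks for that prefix are most likely still warm."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["http://llm-0:8000", "http://llm-1:8000", "http://llm-2:8000"]
print(pick_backend("You are a helpful assistant. ...", backends))
```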
What to do this quarter
- If you are still running a naive Hugging Face `model.generate()` call in a serving loop, stop. Move to vLLM, TGI, or SGLang this sprint. Expect a 10x to 25x throughput jump.
- Turn on chunked prefill on whatever engine you use. If your engine does not support it, that is a sign to migrate.
- Benchmark P99 latency under realistic mixed traffic, not synthetic uniform-length requests. Continuous batching wins disproportionately on bimodal traffic.
- Reduce per-replica GPU count and add replicas. Continuous batching works best when each replica has high concurrent request count, not when each replica has many GPUs.
- Measure batch size in flight, not configured maximum. Many teams set max batch 256 and never reach 30 in practice; that is a load-balancing problem, not a serving problem. One way to read the live number is sketched below.
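Assuming a vLLM deployment that exposes Prometheus metrics, the running-request gauge gives the in-flight count; the gauge name below exists in current vLLM releases, but verify it against your version before alerting on it:

```python
import re
import urllib.request

# Scrape vLLM's /metrics endpoint and pull the running-request gauge.
body = urllib.request.urlopen("http://llm-0:8000/metrics").read().decode()
m = re.search(r"^vllm:num_requests_running(?:\{[^}]*\})?\s+([\d.]+)", body, re.M)
print("requests in flight:", float(m.group(1)) if m else "metric not found")
```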