vLLM is the reference implementation of continuous batching for LLMs. The first stable release shipped in mid-2023; by 2026 it is the default serving engine across most open-source LLM deployments. The interesting bit is not that it batches; it is how it batches, what tradeoffs it picked, and where the design assumptions break.
This deep dive is for engineers running vLLM in production who want to understand what knobs matter, why throughput plateaus, and what the SGLang and TensorRT-LLM teams did differently.
Three concepts that make vLLM fast
The headline 23x throughput claim from vLLM's launch comes from three layered ideas.
Continuous batching schedules at the iteration level rather than the request level. Sequences enter and leave the batch every forward pass. Background and full breakdown in our continuous batching explainer.
PagedAttention treats KV cache memory like virtual memory. Each request's KV is split into fixed-size blocks (typically 16 tokens). Blocks are allocated on demand from a global pool, eliminating fragmentation. The vLLM team's SOSP 2023 paper is the canonical reference.
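The mechanics are easy to picture with a toy allocator. This is a minimal sketch in the spirit of PagedAttention, not vLLM's actual classes; BlockPool and its methods are invented for illustration.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, vLLM's default

class BlockPool:
    """Toy global pool of fixed-size KV blocks, in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # physical block IDs
        self.block_tables: dict[str, list[int]] = {}  # request -> logical-to-physical map

    def ensure_capacity(self, request_id: str, num_tokens: int) -> bool:
        """Grow a request's block table to hold num_tokens; return False if out of blocks."""
        needed = math.ceil(num_tokens / BLOCK_SIZE)
        table = self.block_tables.setdefault(request_id, [])
        while len(table) < needed:
            if not self.free:
                return False          # no free blocks: the caller must preempt someone
            table.append(self.free.pop())
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished (or preempted) request's blocks to the pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
```

Because blocks come from a shared pool and are returned the moment a sequence finishes or is preempted, there is no per-request reservation left to fragment.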
Prefix caching identifies common prompt prefixes (system prompts, few-shot examples) and reuses their KV cache across requests. With shared system prompts, throughput goes up another 30 to 60 percent.
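The keying is per block of the prompt: a block's hash covers the entire prefix up to and including it, so two requests that share a system prompt share the same leading hashes. A toy version, with invented names and none of vLLM's eviction logic:

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full 16-token block, chained so each hash identifies the whole prefix."""
    hashes, running = [], hashlib.sha256()
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes

# cache: prefix-block hash -> physical KV block already resident on the GPU
kv_block_cache: dict[str, int] = {}

def reusable_blocks(token_ids: list[int]) -> int:
    """Count leading prompt blocks whose KV is already cached and can be skipped at prefill."""
    count = 0
    for h in block_hashes(token_ids):
        if h not in kv_block_cache:
            break
        count += 1
    return count
```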
Stack the three and you get 23x to 28x over naive serving on representative traces.
What scheduling looks like inside vLLM
The scheduler runs once per iteration. On each call it:
- Lists active sequences in the running queue.
- Picks new sequences from the waiting queue if there is free KV cache and chunk budget.
- If KV pressure is high, preempts running sequences to free blocks, starting with the most recently arrived.
- Runs one model forward pass over all running sequences.
- Appends one generated token per sequence to its KV blocks.
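A heavily simplified version of that loop, reusing the BlockPool sketch from earlier and inventing the sequence objects and the model_forward call, looks roughly like this:

```python
from collections import deque

def schedule_iteration(running: deque, waiting: deque, pool, token_budget: int):
    """One iteration-level scheduling step. Toy model, not vLLM's actual Scheduler."""
    # Admit waiting sequences while KV blocks and the per-iteration token budget allow.
    while waiting and waiting[0].num_tokens <= token_budget and \
            pool.ensure_capacity(waiting[0].request_id, waiting[0].num_tokens + 1):
        seq = waiting.popleft()
        token_budget -= seq.num_tokens               # prefill tokens count against the budget
        running.append(seq)

    # Under KV pressure, preempt from the back of the running queue (last arrived)
    # until every running sequence has room for one more generated token.
    while running and not all(
            pool.ensure_capacity(s.request_id, s.num_tokens + 1) for s in running):
        victim = running.pop()
        pool.release(victim.request_id)
        victim.needs_recompute = True                # default preemption mode: recompute later
        waiting.appendleft(victim)

    # One forward pass over everything still running; each sequence gains one token.
    for seq, token in zip(running, model_forward(list(running))):   # hypothetical model call
        seq.append_token(token)                      # new token's KV lands in the last block
```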
Preemption is the part teams overlook. Under load, vLLM will recompute or swap the KV of preempted sequences. Both modes work; recomputation is cheaper at short context lengths, swapping wins at long context. The default is recomputation; swapping is used for sequence groups that cannot simply be recomputed (beam search, parallel sampling), and --swap-space sizes the CPU buffer that makes swapping viable.
Tuning knobs that actually matter
After running vLLM in production for a year, we have a short list of settings that move the throughput needle.
| Flag | Default | Recommended | Effect |
|---|---|---|---|
| --gpu-memory-utilization | 0.9 | 0.92-0.95 | More KV blocks, more concurrency |
| --max-num-seqs | 256 | 256-512 | Cap concurrent sequences |
| --max-num-batched-tokens | auto | 8192-16384 | Per-iteration token budget |
| --enable-chunked-prefill | on (0.7+) | on | Smooths P99 latency |
| --enable-prefix-caching | off | on | 30-60% throughput on shared prompts |
| --block-size | 16 | 16 | Leave alone |
| --swap-space | 4 GiB | 16-32 GiB | Reduce recompute on preemption |
| --tensor-parallel-size | 1 | 1-8 | Match to model size |
The single highest-value flag for most workloads is --enable-prefix-caching. If you have a system prompt that all requests share, this alone is a 30 percent throughput gain.
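As a concrete starting point, here is the tuned configuration expressed through the Python entrypoint; the CLI flags map one-to-one onto these engine arguments, and the model name and exact values are illustrative rather than a universal recommendation.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,      # --gpu-memory-utilization
    max_num_seqs=384,                 # --max-num-seqs
    max_num_batched_tokens=8192,      # --max-num-batched-tokens
    enable_chunked_prefill=True,      # --enable-chunked-prefill
    enable_prefix_caching=True,       # --enable-prefix-caching
    swap_space=16,                    # --swap-space, in GiB
    tensor_parallel_size=1,           # --tensor-parallel-size
    # kv_cache_dtype="fp8",           # FP8 KV cache; validate quality first (see below)
)
```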
Throughput across configurations
Llama 3.1 8B Instruct, A100 80GB, ShareGPT trace.
Baseline (vLLM 0.7, defaults) |#################### | 4,200 tok/s
+ chunked prefill |####################### | 4,800 tok/s
+ prefix caching |########################## | 5,500 tok/s
+ tuned max-num-batched-tokens |#############################_| 6,200 tok/s
+ FP8 KV cache |#################################| 7,100 tok/s
Numbers are from internal benchmarks and are reproducible with the vLLM benchmark harness. FP8 KV cache (added in vLLM 0.6) cuts KV memory in half and unlocks more concurrency, with a measurable but small quality impact. Validate on your eval set.
Where vLLM's scheduler breaks down
Three honest weaknesses.
Long-context starvation. A single 100K-token prefill can stall every other request on that replica for several seconds, even with chunked prefill. SGLang handles this slightly better with its priority-aware scheduler.
Constrained decoding tax. When clients request structured output (JSON schema, regex constraints), vLLM falls back to a slower path that disables some batching optimizations. The Outlines-on-vLLM integration has improved through 2025 but still costs 15 to 30 percent of throughput compared to unconstrained generation.
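The constrained path is usually reached through the OpenAI-compatible API. A hedged sketch, assuming a vLLM server on localhost and its guided_json extra-body parameter; the schema and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# guided_json routes the request through the structured-output backend,
# which is where the 15-30 percent throughput tax shows up.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the person mentioned: Ada, 36."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```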
Multi-LoRA scheduling. vLLM's multi-LoRA support batches requests across different adapters in one forward pass, but the scheduling is round-robin rather than fairness-aware. If one LoRA gets bursty traffic, others see latency spikes.
For a side-by-side with TGI, SGLang, and TensorRT-LLM see our LLM serving frameworks 2026 comparison.
Production deployment patterns
Three patterns dominate in 2026.
Single-replica direct. vLLM container behind an HTTP load balancer. Simple, works up to roughly 8,000 tokens/sec on one A100. Good for early production.
Ray Serve plus vLLM workers. Multiple vLLM replicas behind Ray Serve, autoscaled by Ray. Fault-tolerant, supports rolling updates, and integrates with broader Ray clusters. The recommended pattern at scale; see our Ray pillar for the surrounding architecture.
A third pattern, OpenAI-compatible front + vLLM back, is what most teams ship. The front speaks the OpenAI API, validates auth and rate limits, and proxies to vLLM's /v1/completions. This is exactly the role a gateway like Swfte Connect plays — it speaks OpenAI to clients, batches by prompt prefix, and routes to your local vLLM cluster, falling back to managed APIs when local capacity saturates.
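A stripped-down sketch of that front-end logic, with hypothetical endpoints and model names; a real gateway adds auth, rate limiting, and prefix-aware batching on top:

```python
from openai import OpenAI

local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
managed = OpenAI()   # managed fallback; reads OPENAI_API_KEY from the environment

def complete(messages: list[dict], model: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Try the local vLLM cluster first; fall back to a managed API on saturation."""
    try:
        resp = local.chat.completions.create(model=model, messages=messages, timeout=30)
    except Exception:       # timeouts and 429s when local capacity is saturated
        resp = managed.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```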
When to pick something else
| Need | Better choice |
|---|---|
| Hugging Face native deployment, simplest ops | TGI |
| Structured generation, JSON schema heavy | SGLang |
| NVIDIA Triton existing infrastructure | TensorRT-LLM |
| AMD ROCm GPUs | TGI or vLLM with ROCm wheel |
| Apple Silicon for dev | MLX or llama.cpp |
| Multi-modal (vision-language) | vLLM 0.7+, but verify your model is supported |
vLLM is the right default. The list above is when "default" is wrong.
Cost per million tokens
Self-hosted, A100 80GB on AWS spot, May 2026 pricing, 70 percent utilization.
| Configuration | Tok/s | Cost / 1M tok |
|---|---|---|
| vLLM 0.7 default | 4,200 | $0.27 |
| vLLM tuned (above) | 6,200 | $0.18 |
| vLLM tuned + FP8 KV | 7,100 | $0.16 |
| TGI 2.4 default | 4,400 | $0.26 |
| SGLang 0.4 | 6,800 | $0.17 |
| Managed Together AI | n/a | $0.45 |
| OpenAI GPT-4o mini | n/a | $0.60 |
Self-hosted with tuning is roughly half the cost of the cheapest managed API. Whether that math works for you depends on operational headcount; many teams accept the higher API cost to avoid running GPUs.
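The arithmetic behind the table is simple. A sketch with the hourly GPU price as an explicit input, since spot pricing moves around; the $2.80/hr figure below is an assumption, not a quote:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float,
                            utilization: float = 0.7) -> float:
    """Cost of generating one million tokens on a single GPU replica."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# With the tuned 6,200 tok/s figure from the table and an assumed ~$2.80/hr A100 spot price:
print(f"${cost_per_million_tokens(2.80, 6200):.2f} per 1M tokens")   # ~= $0.18
```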
What to do this quarter
- Upgrade to vLLM 0.7 or later. Earlier versions miss chunked prefill, FP8 KV, and prefix caching default-on improvements.
- Turn on --enable-prefix-caching if you have any shared system prompt. It is a free 30 percent throughput improvement on most chat workloads.
- Benchmark with the official benchmark_serving.py harness and a real ShareGPT trace, not synthetic uniform-length traffic.
- Move from single-replica direct to Ray Serve plus vLLM workers before you cross 50,000 requests per minute.
- Track P99 latency on long-context requests separately. If you see spikes above 10x median, investigate scheduler preemption; --swap-space 32 often fixes it.