vLLM is the reference implementation of continuous batching for LLMs. The first stable release shipped in mid-2023; by 2026 it is the default serving engine across most open-source LLM deployments. The interesting bit is not that it batches; it is how it batches, what tradeoffs it picked, and where the design assumptions break.
This deep dive is for engineers running vLLM in production who want to understand what knobs matter, why throughput plateaus, and what the SGLang and TensorRT-LLM teams did differently.
Three concepts that make vLLM fast
The headline 23x throughput claim from vLLM's launch comes from three layered ideas.
Continuous batching schedules at the iteration level rather than the request level. Sequences enter and leave the batch every forward pass. Background and full breakdown in our continuous batching explainer.
PagedAttention treats KV cache memory like virtual memory. Each request's KV is split into fixed-size blocks (typically 16 tokens). Blocks are allocated on demand from a global pool, eliminating fragmentation. The vLLM team's SOSP 2023 paper is the canonical reference.
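The mechanics are easy to picture with a toy allocator. This is a minimal sketch in the spirit of PagedAttention, not vLLM's actual classes; BlockPool and its methods are invented for illustration.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, vLLM's default

class BlockPool:
    """Toy global pool of fixed-size KV blocks, in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # physical block IDs
        self.block_tables: dict[str, list[int]] = {}  # request -> logical-to-physical map

    def ensure_capacity(self, request_id: str, num_tokens: int) -> bool:
        """Grow a request's block table to hold num_tokens; return False if out of blocks."""
        needed = math.ceil(num_tokens / BLOCK_SIZE)
        table = self.block_tables.setdefault(request_id, [])
        while len(table) < needed:
            if not self.free:
                return False          # no free blocks: the caller must preempt someone
            table.append(self.free.pop())
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished (or preempted) request's blocks to the pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
```

Because blocks come from a shared pool and are returned the moment a sequence finishes or is preempted, there is no per-request reservation left to fragment.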
Prefix caching identifies common prompt prefixes (system prompts, few-shot examples) and reuses their KV cache across requests. With shared system prompts, throughput goes up another 30 to 60 percent.
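The keying is per block of the prompt: a block's hash covers the entire prefix up to and including it, so two requests that share a system prompt share the same leading hashes. A toy version, with invented names and none of vLLM's eviction logic:

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full 16-token block, chained so each hash identifies the whole prefix."""
    hashes, running = [], hashlib.sha256()
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes

# cache: prefix-block hash -> physical KV block already resident on the GPU
kv_block_cache: dict[str, int] = {}

def reusable_blocks(token_ids: list[int]) -> int:
    """Count leading prompt blocks whose KV is already cached and can be skipped at prefill."""
    count = 0
    for h in block_hashes(token_ids):
        if h not in kv_block_cache:
            break
        count += 1
    return count
```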
Stack the three and you get 23x to 28x over naive serving on representative traces.
What scheduling looks like inside vLLM
The scheduler runs once per iteration. On each call it:
- Lists active sequences in the running queue.
- Picks new sequences from the waiting queue if there is free KV cache and chunk budget.
- If KV pressure is high, preempts running sequences to free blocks, starting with the most recently arrived.
- Runs one model forward pass over all running sequences.
- Appends one generated token per sequence to its KV blocks.
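A heavily simplified version of that loop, reusing the BlockPool sketch from earlier and inventing the sequence objects and the model_forward call, looks roughly like this:

```python
from collections import deque

def schedule_iteration(running: deque, waiting: deque, pool, token_budget: int):
    """One iteration-level scheduling step. Toy model, not vLLM's actual Scheduler."""
    # Admit waiting sequences while KV blocks and the per-iteration token budget allow.
    while waiting and waiting[0].num_tokens <= token_budget and \
            pool.ensure_capacity(waiting[0].request_id, waiting[0].num_tokens + 1):
        seq = waiting.popleft()
        token_budget -= seq.num_tokens               # prefill tokens count against the budget
        running.append(seq)

    # Under KV pressure, preempt from the back of the running queue (last arrived)
    # until every running sequence has room for one more generated token.
    while running and not all(
            pool.ensure_capacity(s.request_id, s.num_tokens + 1) for s in running):
        victim = running.pop()
        pool.release(victim.request_id)
        victim.needs_recompute = True                # default preemption mode: recompute later
        waiting.appendleft(victim)

    # One forward pass over everything still running; each sequence gains one token.
    for seq, token in zip(running, model_forward(list(running))):   # hypothetical model call
        seq.append_token(token)                      # new token's KV lands in the last block
```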
Preemption is the part teams overlook. Under load, vLLM will recompute or swap the KV of preempted sequences. Both modes work; recomputation is cheaper at short context lengths, swapping wins at long context. The default is recomputation; swapping is used for sequence groups that cannot simply be recomputed (beam search, parallel sampling), and --swap-space sizes the CPU buffer that makes swapping viable.
Tuning knobs that actually matter
After running vLLM in production for a year, we have a short list of settings that move the throughput needle.
| Flag | Default | Recommended | Effect |
|---|---|---|---|
| --gpu-memory-utilization | 0.9 | 0.92-0.95 | More KV blocks, more concurrency |
| --max-num-seqs | 256 | 256-512 | Cap concurrent sequences |
| --max-num-batched-tokens | auto | 8192-16384 | Per-iteration token budget |
| --enable-chunked-prefill | on (0.7+) | on | Smooths P99 latency |
| --enable-prefix-caching | off | on | 30-60% throughput on shared prompts |
| --block-size | 16 | 16 | Leave alone |
| --swap-space | 4 GiB | 16-32 GiB | Reduce recompute on preemption |
| --tensor-parallel-size | 1 | 1-8 | Match to model size |
The single highest-value flag for most workloads is --enable-prefix-caching. If you have a system prompt that all requests share, this alone is a 30 percent throughput gain.
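As a concrete starting point, here is the tuned configuration expressed through the Python entrypoint; the CLI flags map one-to-one onto these engine arguments, and the model name and exact values are illustrative rather than a universal recommendation.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,      # --gpu-memory-utilization
    max_num_seqs=384,                 # --max-num-seqs
    max_num_batched_tokens=8192,      # --max-num-batched-tokens
    enable_chunked_prefill=True,      # --enable-chunked-prefill
    enable_prefix_caching=True,       # --enable-prefix-caching
    swap_space=16,                    # --swap-space, in GiB
    tensor_parallel_size=1,           # --tensor-parallel-size
    # kv_cache_dtype="fp8",           # FP8 KV cache; validate quality first (see below)
)
```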
Throughput across configurations
Llama 3.1 8B Instruct, A100 80GB, ShareGPT trace.
Baseline (vLLM 0.7, defaults) |#################### | 4,200 tok/s
+ chunked prefill |####################### | 4,800 tok/s
+ prefix caching |########################## | 5,500 tok/s
+ tuned max-num-batched-tokens |#############################_| 6,200 tok/s
+ FP8 KV cache |#################################| 7,100 tok/s
Numbers are from internal benchmarks and are reproducible with the vLLM benchmark harness. FP8 KV cache (added in vLLM 0.6) cuts KV memory in half and unlocks more concurrency, with a measurable but small quality impact. Validate on your eval set.
Where vLLM's scheduler breaks down
Three honest weaknesses.
Long-context starvation. A single 100K-token prefill can stall every other request on that replica for several seconds, even with chunked prefill. SGLang handles this slightly better with its priority-aware scheduler.
Constrained decoding tax. When clients request structured output (JSON schema, regex constraints), vLLM falls back to a slower path that disables some batching optimizations. The Outlines-on-vLLM integration has improved through 2025 but still costs 15 to 30 percent of throughput compared to unconstrained generation.
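The constrained path is usually reached through the OpenAI-compatible API. A hedged sketch, assuming a vLLM server on localhost and its guided_json extra-body parameter; the schema and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# guided_json routes the request through the structured-output backend,
# which is where the 15-30 percent throughput tax shows up.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the person mentioned: Ada, 36."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```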
Multi-LoRA scheduling. vLLM's multi-LoRA support batches requests across different adapters in one forward pass, but the scheduling is round-robin rather than fairness-aware. If one LoRA gets bursty traffic, others see latency spikes.
For a side-by-side with TGI, SGLang, and TensorRT-LLM see our LLM serving frameworks 2026 comparison.
Production deployment patterns
Three patterns dominate in 2026.
Single-replica direct. vLLM container behind an HTTP load balancer. Simple, works up to roughly 8,000 tokens/sec on one A100. Good for early production.
Ray Serve plus vLLM workers. Multiple vLLM replicas behind Ray Serve, autoscaled by Ray. Fault-tolerant, supports rolling updates, and integrates with broader Ray clusters. The recommended pattern at scale; see our Ray pillar for the surrounding architecture.
A third pattern, OpenAI-compatible front + vLLM back, is what most teams ship. The front speaks the OpenAI API, validates auth and rate limits, and proxies to vLLM's /v1/completions. This is exactly the role a gateway like Swfte Connect plays — it speaks OpenAI to clients, batches by prompt prefix, and routes to your local vLLM cluster, falling back to managed APIs when local capacity saturates.
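A stripped-down sketch of that front-end logic, with hypothetical endpoints and model names; a real gateway adds auth, rate limiting, and prefix-aware batching on top:

```python
from openai import OpenAI

local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
managed = OpenAI()   # managed fallback; reads OPENAI_API_KEY from the environment

def complete(messages: list[dict], model: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Try the local vLLM cluster first; fall back to a managed API on saturation."""
    try:
        resp = local.chat.completions.create(model=model, messages=messages, timeout=30)
    except Exception:       # timeouts and 429s when local capacity is saturated
        resp = managed.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```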
When to pick something else
| Need | Better choice |
|---|---|
| Hugging Face native deployment, simplest ops | TGI |
| Structured generation, JSON schema heavy | SGLang |
| NVIDIA Triton existing infrastructure | TensorRT-LLM |
| AMD ROCm GPUs | TGI or vLLM with ROCm wheel |
| Apple Silicon for dev | MLX or llama.cpp |
| Multi-modal (vision-language) | vLLM 0.7+, but verify your model is supported |
vLLM is the right default. The list above is when "default" is wrong.
Cost per million tokens
Self-hosted, A100 80GB on AWS spot, May 2026 pricing, 70 percent utilization.
| Configuration | Tok/s | Cost / 1M tok |
|---|---|---|
| vLLM 0.7 default | 4,200 | $0.27 |
| vLLM tuned (above) | 6,200 | $0.18 |
| vLLM tuned + FP8 KV | 7,100 | $0.16 |
| TGI 2.4 default | 4,400 | $0.26 |
| SGLang 0.4 | 6,800 | $0.17 |
| Managed Together AI | n/a | $0.45 |
| OpenAI GPT-4o mini | n/a | $0.60 |
Self-hosted with tuning is roughly half the cost of the cheapest managed API. Whether that math works for you depends on operational headcount; many teams accept the higher API cost to avoid running GPUs.
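The arithmetic behind the table is simple. A sketch with the hourly GPU price as an explicit input, since spot pricing moves around; the $2.80/hr figure below is an assumption, not a quote:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float,
                            utilization: float = 0.7) -> float:
    """Cost of generating one million tokens on a single GPU replica."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# With the tuned 6,200 tok/s figure from the table and an assumed ~$2.80/hr A100 spot price:
print(f"${cost_per_million_tokens(2.80, 6200):.2f} per 1M tokens")   # ~= $0.18
```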
What to do this quarter
- Upgrade to vLLM 0.7 or later. Earlier versions miss chunked prefill, FP8 KV, and prefix caching default-on improvements.
- Turn on --enable-prefix-caching if you have any shared system prompt. It is a free 30 percent throughput improvement on most chat workloads.
- Benchmark with the official benchmark_serving.py harness and a real ShareGPT trace, not synthetic uniform-length traffic.
- Move from single-replica direct to Ray Serve plus vLLM workers before you cross 50,000 requests per minute.
- Track P99 latency on long-context requests separately. If you see spikes above 10x median, investigate scheduler preemption; --swap-space 32 often fixes it.