LLM serving used to be a one-horse race. By 2026 there are six serious open-source frameworks plus several managed alternatives, and choosing among them is harder than it should be because every benchmark is from a vendor with a model in the fight. This is the comparison we wish vendors published.
The criteria: throughput, latency, model coverage, operational cost, structured-generation support, and what happens when something breaks.
The shortlist
| Framework | Latest release | Maintainer | Best for |
|---|---|---|---|
| vLLM | 0.7+ | vLLM project (Berkeley + community) | General-purpose default |
| TGI | 2.4+ | Hugging Face | HF native deployments |
| SGLang | 0.4+ | LMSYS | Structured gen and complex prompts |
| TensorRT-LLM | 0.13+ | NVIDIA | NVIDIA-only, max throughput |
| LMDeploy | 0.7+ | Shanghai AI Lab | Qwen, InternLM, Chinese models |
| Ray Serve + vLLM | Ray 2.10+ | Anyscale + community | Multi-replica orchestration |
Honorable mentions: DeepSpeed-FastGen (Microsoft, narrow but fast), llama.cpp (CPU and Apple Silicon dev), MLX (Apple Silicon prod is rare but exists).
Headline throughput
Llama 3.1 8B Instruct, A100 80GB, ShareGPT trace, May 2026 versions.
SGLang 0.4            |#################################| 6,800 tok/s
vLLM 0.7 (tuned)      |##############################   | 6,200 tok/s
Ray Serve + vLLM      |#############################    | 6,050 tok/s
LMDeploy 0.7          |############################     | 5,800 tok/s
TGI 2.4               |#####################            | 4,400 tok/s
TensorRT-LLM (Triton) |###################              | 4,000 tok/s
Two surprises. First, SGLang edges out vLLM on this trace because its prefix-aware scheduling reuses the system prompt shared across ShareGPT conversations. Second, TensorRT-LLM lands below vLLM despite being NVIDIA's flagship. The ranking flips on H100 with FP8: at the time of writing, TensorRT-LLM with an FP8 KV cache runs ahead of vLLM on raw throughput, but the integration cost is higher.
For mechanical detail on continuous batching see our vLLM deep dive and the continuous batching explainer.
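Every framework here exposes an OpenAI-compatible endpoint (TensorRT-LLM via Triton, as the feature matrix below notes), so you can produce a chart like the one above against your own traffic with a single harness. A rough sketch using the openai Python client; the endpoint URL, model name, and prompt list are placeholders, and a real run should replay your production trace at realistic concurrency.

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Placeholders: point at whichever framework's OpenAI-compatible endpoint you are testing.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
PROMPTS = ["Summarize the plot of Hamlet."] * 64  # replace with your real request trace

async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    # completion_tokens counts generated tokens only.
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(p) for p in PROMPTS))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:,.0f} output tok/s across {len(PROMPTS)} concurrent requests")

asyncio.run(main())
```

Measuring on your own trace matters more than any vendor chart; the gaps above can compress or widen with prompt length, output length, and prefix sharing.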
Feature matrix
| Feature | vLLM | TGI | SGLang | TRT-LLM | LMDeploy | Ray Serve |
|---|---|---|---|---|---|---|
| Continuous batching | Yes | Yes | Yes | Yes | Yes | Delegated |
| PagedAttention or equiv | Yes | Yes | Yes (Radix) | Yes | Yes | Delegated |
| Prefix caching | Yes | Yes | Yes | Yes | Partial | Delegated |
| Chunked prefill | Yes | Yes | Yes | Yes | Yes | Delegated |
| FP8 KV cache | Yes | Yes | Yes | Yes | Yes | Delegated |
| Structured (JSON) | Outlines | Outlines | Native | Custom | Outlines | Delegated |
| Multi-LoRA | Yes | Yes | Yes | Limited | Partial | Delegated |
| AMD ROCm | Yes | Yes | Partial | No | No | Yes |
| Multi-replica autoscale | No | No | No | No | No | Yes |
| OpenAI-compatible API | Yes | Yes | Yes | Via Triton | Yes | Yes |
Ray Serve is in a different category from the rest: it does not implement an inference engine, it orchestrates one. The standard production pattern is Ray Serve plus vLLM workers, which gives you continuous batching from vLLM, and autoscaling plus rolling updates from Ray Serve.
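A minimal sketch of that pattern, using vLLM's synchronous LLM class for brevity; production deployments typically put vLLM's async engine or its OpenAI-compatible server behind Ray Serve, and the model name and replica count here are illustrative.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(
    num_replicas=2,                     # Ray Serve handles placement and rolling updates
    ray_actor_options={"num_gpus": 1},  # one GPU per vLLM worker
)
class VLLMWorker:
    def __init__(self, model_name: str):
        # Each replica owns its own vLLM engine; continuous batching happens inside vLLM.
        from vllm import LLM, SamplingParams
        self.llm = LLM(model=model_name)
        self.sampling = SamplingParams(temperature=0.7, max_tokens=256)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # generate() blocks; a production version would use vLLM's AsyncLLMEngine instead.
        outputs = self.llm.generate([body["prompt"]], self.sampling)
        return {"text": outputs[0].outputs[0].text}

# Illustrative model; swap in whatever you actually serve.
app = VLLMWorker.bind("meta-llama/Llama-3.1-8B-Instruct")
# serve.run(app)  # exposes the HTTP endpoint; autoscaling is configured via autoscaling_config
```

The division of labor is the point: the vLLM engine inside each replica does the token-level batching, while Ray Serve covers the replica-level concerns the matrix marks as "Delegated".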
Operational cost
Time-to-first-production-deploy estimates for an experienced infra team, including TLS, auth, metrics, and rolling deploy.
| Framework | First deploy | Day-2 ops burden | Multi-tenant ready |
|---|---|---|---|
| vLLM | 1 day | Medium | With work |
| TGI | 0.5 day | Low | Yes |
| SGLang | 1.5 days | Medium | With work |
| TensorRT-LLM | 1 week | High | Yes (Triton) |
| LMDeploy | 1 day | Low | Partial |
| Ray Serve + vLLM | 2-3 days | Medium | Yes |
TGI wins on initial speed because Hugging Face built it as a turnkey HTTP service. TensorRT-LLM loses on initial speed because NVIDIA's deployment story routes through Triton plus a model repository, which is a real learning curve. The day-2 picture is kinder to Triton than the first-deploy number suggests: the ongoing burden stays high, but Triton has been ops-hardened for a decade and its outage handling and observability are mature.
Model coverage
| Model family | vLLM | TGI | SGLang | TRT-LLM | LMDeploy |
|---|---|---|---|---|---|
| Llama 1/2/3/3.1 | Yes | Yes | Yes | Yes | Yes |
| Qwen 1.5/2/2.5 | Yes | Yes | Yes | Yes | Yes (best) |
| DeepSeek V2/V3 | Yes | Yes | Yes | Yes | Yes |
| Mistral, Mixtral | Yes | Yes | Yes | Yes | Yes |
| Phi 3.5/4 | Yes | Yes | Yes | Yes | Partial |
| Gemma 2 | Yes | Yes | Yes | Yes | Partial |
| InternLM | Partial | Partial | Yes | Yes | Yes (best) |
| Vision-language (Llava) | Yes | Yes | Yes | Yes | Yes |
| State-space (Mamba) | Yes | Partial | Yes | No | No |
vLLM has the broadest coverage. LMDeploy is the right pick if you primarily serve Qwen or InternLM. TGI has the cleanest path for vanilla Hugging Face deployments.
Structured generation
This is the area with the biggest divergence in 2026.
- vLLM uses Outlines or LM Format Enforcer. Works, but disables some scheduler optimizations and costs 15 to 30 percent throughput.
- SGLang has structured generation as a first-class feature with its own grammar engine. It is the right default for JSON-heavy or function-calling workloads.
- TGI uses Outlines. Same tradeoff as vLLM.
- TensorRT-LLM has custom logits processors. Fast but development-heavy.
If your workload is 80 percent JSON output, SGLang's structured-generation advantage is worth more than any raw-throughput gap between the engines.
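To make the vLLM/TGI side of that tradeoff concrete: vLLM's OpenAI-compatible server accepts a guided_json extension parameter that routes decoding through its constrained-generation backend. A sketch with an illustrative schema and endpoint; SGLang exposes the same idea through its native grammar engine rather than a bolt-on.

```python
from openai import OpenAI

# Illustrative: a vLLM OpenAI-compatible server with its structured-output backend enabled.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_usd": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total_usd"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    # vLLM-specific extension: constrain decoding to this JSON schema.
    extra_body={"guided_json": invoice_schema},
    max_tokens=256,
)
print(resp.choices[0].message.content)  # should parse as schema-conformant JSON, barring truncation
```

This constrained-decoding path is where the 15 to 30 percent throughput tax above comes from; SGLang's grammar engine is built to reduce it.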
When each is the right call
| Pick | If |
|---|---|
| vLLM | You are choosing once and want the broad default |
| TGI | You live in the Hugging Face ecosystem and want minimal ops |
| SGLang | You generate structured output or have shared prefixes |
| TensorRT-LLM | You are NVIDIA-pinned and chasing maximum throughput |
| LMDeploy | You serve Chinese open-source models primarily |
| Ray Serve + vLLM | You need multi-replica orchestration and autoscaling |
For the higher-level architecture see our Ray vs alternatives pillar.
Citations and reading
- vLLM: github.com/vllm-project/vllm and the PagedAttention paper, SOSP 2023.
- TGI: github.com/huggingface/text-generation-inference.
- SGLang: github.com/sgl-project/sglang and the LMSYS blog.
- TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM.
- Anyscale's 2025 LLM serving benchmark for the cross-framework numbers.
Routing across frameworks in production
Most teams above a certain scale do not pick one framework. They run vLLM as the workhorse, SGLang for a structured-output service, and a managed API as overflow. The piece that ties this together is a routing layer. Swfte Connect speaks the OpenAI API to clients and routes by request type — JSON-mode requests to SGLang, long-context to TensorRT-LLM, everything else to vLLM, with managed APIs as fallback. That hybrid pattern keeps each framework focused on what it is good at.
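To make the pattern concrete, here is a toy routing function; it is not Swfte Connect's implementation, the backend URLs and classification rules are placeholders, and a real router also needs retries, health checks, and budget-aware spill to the managed tier.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    base_url: str

# Placeholder endpoints for the hybrid layout described above.
SGLANG  = Backend("sglang",  "http://sglang.internal:30000/v1")
TRT_LLM = Backend("trt-llm", "http://triton.internal:8000/v1")
VLLM    = Backend("vllm",    "http://vllm.internal:8000/v1")
MANAGED = Backend("managed", "https://api.example-provider.com/v1")  # overflow only

LONG_CONTEXT_TOKENS = 32_000  # illustrative threshold

def pick_backend(request: dict, prompt_tokens: int, healthy: set[str]) -> Backend:
    """Route an OpenAI-style chat request to the framework best suited to it."""
    if request.get("response_format", {}).get("type") in ("json_object", "json_schema"):
        candidate = SGLANG   # structured output: native grammar engine
    elif prompt_tokens > LONG_CONTEXT_TOKENS:
        candidate = TRT_LLM  # long context: the high-throughput H100/FP8 path
    else:
        candidate = VLLM     # everything else: the workhorse
    # Fallback: if the chosen self-hosted pool is unhealthy, spill to the managed API.
    return candidate if candidate.name in healthy else MANAGED
```

Clients still speak plain OpenAI-compatible HTTP; only the router knows there are several pools behind it.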
What to do this quarter
- If you are still on a single framework "because that is what we picked," benchmark two more on your real traffic. The 30 to 50 percent throughput differences shown above are workload-dependent.
- If you generate structured output and are not on SGLang, run a one-week SGLang spike. Most JSON-heavy teams move within a quarter.
- Stand up Ray Serve in front of your vLLM replicas. Manual replica management is the most common operational pain we see in 2026 LLM platforms.
- If you are NVIDIA H100-rich, schedule a TensorRT-LLM evaluation. The throughput ceiling is genuinely higher despite the integration cost.
- Treat managed APIs as overflow, not primary. Self-hosting while your volume is below break-even and then leaning on managed APIs once you are above it gets the economics backwards; self-host the steady bulk once volume justifies it and let managed capacity absorb the low-volume tail and spikes.