LLM serving used to be a one-horse race. By 2026 there are six serious open-source frameworks plus several managed alternatives, and choosing among them is harder than it should be because every benchmark is from a vendor with a model in the fight. This is the comparison we wish vendors published.
The criteria: throughput, latency, model coverage, operational cost, structured-generation support, and what happens when something breaks.
The shortlist
| Framework | Latest release | Maintainer | Best for |
|---|---|---|---|
| vLLM | 0.7+ | vLLM project (Berkeley + community) | General-purpose default |
| TGI | 2.4+ | Hugging Face | HF native deployments |
| SGLang | 0.4+ | LMSYS | Structured gen and complex prompts |
| TensorRT-LLM | 0.13+ | NVIDIA | NVIDIA-only, max throughput |
| LMDeploy | 0.7+ | Shanghai AI Lab | Qwen, InternLM, Chinese models |
| Ray Serve + vLLM | Ray 2.10+ | Anyscale + community | Multi-replica orchestration |
Honorable mentions: DeepSpeed-FastGen (Microsoft, narrow but fast), llama.cpp (CPU and Apple Silicon dev), MLX (Apple Silicon prod is rare but exists).
Headline throughput
Llama 3.1 8B Instruct, A100 80GB, ShareGPT trace, May 2026 versions.
SGLang 0.4            |#################################| 6,800 tok/s
vLLM 0.7 (tuned)      |##############################   | 6,200 tok/s
Ray Serve + vLLM      |#############################    | 6,050 tok/s
LMDeploy 0.7          |############################     | 5,800 tok/s
TGI 2.4               |#####################            | 4,400 tok/s
TensorRT-LLM (Triton) |###################              | 4,000 tok/s
Two surprises. First, SGLang edges out vLLM on this trace because its prefix-aware scheduling reuses the system prompt shared across ShareGPT conversations. Second, TensorRT-LLM lands below vLLM despite being NVIDIA's flagship. The ranking flips on H100 with FP8: at the time of writing, TensorRT-LLM with an FP8 KV cache runs ahead of vLLM on raw throughput, but the integration cost is higher.
For mechanical detail on continuous batching see our vLLM deep dive and the continuous batching explainer.
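Every framework here exposes an OpenAI-compatible endpoint (TensorRT-LLM via Triton, as the feature matrix below notes), so you can produce a chart like the one above against your own traffic with a single harness. A rough sketch using the openai Python client; the endpoint URL, model name, and prompt list are placeholders, and a real run should replay your production trace at realistic concurrency.

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Placeholders: point at whichever framework's OpenAI-compatible endpoint you are testing.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
PROMPTS = ["Summarize the plot of Hamlet."] * 64  # replace with your real request trace

async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    # completion_tokens counts generated tokens only.
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(p) for p in PROMPTS))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:,.0f} output tok/s across {len(PROMPTS)} concurrent requests")

asyncio.run(main())
```

Measuring on your own trace matters more than any vendor chart; the gaps above can compress or widen with prompt length, output length, and prefix sharing.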
Feature matrix
| Feature | vLLM | TGI | SGLang | TRT-LLM | LMDeploy | Ray Serve |
|---|---|---|---|---|---|---|
| Continuous batching | Yes | Yes | Yes | Yes | Yes | Delegated |
| PagedAttention or equiv | Yes | Yes | Yes (Radix) | Yes | Yes | Delegated |
| Prefix caching | Yes | Yes | Yes | Yes | Partial | Delegated |
| Chunked prefill | Yes | Yes | Yes | Yes | Yes | Delegated |
| FP8 KV cache | Yes | Yes | Yes | Yes | Yes | Delegated |
| Structured (JSON) | Outlines | Outlines | Native | Custom | Outlines | Delegated |
| Multi-LoRA | Yes | Yes | Yes | Limited | Partial | Delegated |
| AMD ROCm | Yes | Yes | Partial | No | No | Yes |
| Multi-replica autoscale | No | No | No | No | No | Yes |
| OpenAI-compatible API | Yes | Yes | Yes | Via Triton | Yes | Yes |
Ray Serve is in a different category from the rest: it does not implement an inference engine, it orchestrates one. The standard production pattern is Ray Serve plus vLLM workers, which gives you continuous batching from vLLM, and autoscaling plus rolling updates from Ray Serve.
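A minimal sketch of that pattern, using vLLM's synchronous LLM class for brevity; production deployments typically put vLLM's async engine or its OpenAI-compatible server behind Ray Serve, and the model name and replica count here are illustrative.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(
    num_replicas=2,                     # Ray Serve handles placement and rolling updates
    ray_actor_options={"num_gpus": 1},  # one GPU per vLLM worker
)
class VLLMWorker:
    def __init__(self, model_name: str):
        # Each replica owns its own vLLM engine; continuous batching happens inside vLLM.
        from vllm import LLM, SamplingParams
        self.llm = LLM(model=model_name)
        self.sampling = SamplingParams(temperature=0.7, max_tokens=256)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # generate() blocks; a production version would use vLLM's AsyncLLMEngine instead.
        outputs = self.llm.generate([body["prompt"]], self.sampling)
        return {"text": outputs[0].outputs[0].text}

# Illustrative model; swap in whatever you actually serve.
app = VLLMWorker.bind("meta-llama/Llama-3.1-8B-Instruct")
# serve.run(app)  # exposes the HTTP endpoint; autoscaling is configured via autoscaling_config
```

The division of labor is the point: the vLLM engine inside each replica does the token-level batching, while Ray Serve covers the replica-level concerns the matrix marks as "Delegated".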
Operational cost
Time-to-first-production-deploy estimates for an experienced infra team, including TLS, auth, metrics, and rolling deploy.
| Framework | First deploy | Day-2 ops burden | Multi-tenant ready |
|---|---|---|---|
| vLLM | 1 day | Medium | With work |
| TGI | 0.5 day | Low | Yes |
| SGLang | 1.5 days | Medium | With work |
| TensorRT-LLM | 1 week | High | Yes (Triton) |
| LMDeploy | 1 day | Low | Partial |
| Ray Serve + vLLM | 2-3 days | Medium | Yes |
TGI wins on initial speed because Hugging Face built it as a turnkey HTTP service. TensorRT-LLM loses on initial speed because NVIDIA's deployment story routes through Triton plus a model repository, which is a real learning curve. The day-2 picture is kinder to Triton than the first-deploy number suggests: the ongoing burden stays high, but Triton has been ops-hardened for a decade and its outage handling and observability are mature.
Model coverage
| Model family | vLLM | TGI | SGLang | TRT-LLM | LMDeploy |
|---|---|---|---|---|---|
| Llama 1/2/3/3.1 | Yes | Yes | Yes | Yes | Yes |
| Qwen 1.5/2/2.5 | Yes | Yes | Yes | Yes | Yes (best) |
| DeepSeek V2/V3 | Yes | Yes | Yes | Yes | Yes |
| Mistral, Mixtral | Yes | Yes | Yes | Yes | Yes |
| Phi 3.5/4 | Yes | Yes | Yes | Yes | Partial |
| Gemma 2 | Yes | Yes | Yes | Yes | Partial |
| InternLM | Partial | Partial | Yes | Yes | Yes (best) |
| Vision-language (Llava) | Yes | Yes | Yes | Yes | Yes |
| State-space (Mamba) | Yes | Partial | Yes | No | No |
vLLM has the broadest coverage. LMDeploy is the right pick if you primarily serve Qwen or InternLM. TGI has the cleanest path for vanilla Hugging Face deployments.
Structured generation
This is the area with the biggest divergence in 2026.
- vLLM uses Outlines or LM Format Enforcer. Works, but disables some scheduler optimizations and costs 15 to 30 percent throughput.
- SGLang has structured generation as a first-class feature with its own grammar engine. It is the right default for JSON-heavy or function-calling workloads.
- TGI uses Outlines. Same tradeoff as vLLM.
- TensorRT-LLM has custom logits processors. Fast but development-heavy.
If your workload is 80 percent JSON output, SGLang's structured-generation advantage is worth more than any raw-throughput gap between the engines.
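To make the vLLM/TGI side of that tradeoff concrete: vLLM's OpenAI-compatible server accepts a guided_json extension parameter that routes decoding through its constrained-generation backend. A sketch with an illustrative schema and endpoint; SGLang exposes the same idea through its native grammar engine rather than a bolt-on.

```python
from openai import OpenAI

# Illustrative: a vLLM OpenAI-compatible server with its structured-output backend enabled.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_usd": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total_usd"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    # vLLM-specific extension: constrain decoding to this JSON schema.
    extra_body={"guided_json": invoice_schema},
    max_tokens=256,
)
print(resp.choices[0].message.content)  # should parse as schema-conformant JSON, barring truncation
```

This constrained-decoding path is where the 15 to 30 percent throughput tax above comes from; SGLang's grammar engine is built to reduce it.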
When each is the right call
| Pick | If |
|---|---|
| vLLM | You are choosing once and want the broad default |
| TGI | You live in the Hugging Face ecosystem and want minimal ops |
| SGLang | You generate structured output or have shared prefixes |
| TensorRT-LLM | You are NVIDIA-pinned and chasing maximum throughput |
| LMDeploy | You serve Chinese open-source models primarily |
| Ray Serve + vLLM | You need multi-replica orchestration and autoscaling |
For the higher-level architecture see our Ray vs alternatives pillar.
Citations and reading
- vLLM: github.com/vllm-project/vllm and the PagedAttention paper, SOSP 2023.
- TGI: github.com/huggingface/text-generation-inference.
- SGLang: github.com/sgl-project/sglang and the LMSYS blog.
- TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM.
- Anyscale's 2025 LLM serving benchmark for the cross-framework numbers.
Routing across frameworks in production
Most teams above a certain scale do not pick one framework. They run vLLM as the workhorse, SGLang for a structured-output service, and a managed API as overflow. The piece that ties this together is a routing layer. Swfte Connect speaks the OpenAI API to clients and routes by request type — JSON-mode requests to SGLang, long-context to TensorRT-LLM, everything else to vLLM, with managed APIs as fallback. That hybrid pattern keeps each framework focused on what it is good at.
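To make the pattern concrete, here is a toy routing function; it is not Swfte Connect's implementation, the backend URLs and classification rules are placeholders, and a real router also needs retries, health checks, and budget-aware spill to the managed tier.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    base_url: str

# Placeholder endpoints for the hybrid layout described above.
SGLANG  = Backend("sglang",  "http://sglang.internal:30000/v1")
TRT_LLM = Backend("trt-llm", "http://triton.internal:8000/v1")
VLLM    = Backend("vllm",    "http://vllm.internal:8000/v1")
MANAGED = Backend("managed", "https://api.example-provider.com/v1")  # overflow only

LONG_CONTEXT_TOKENS = 32_000  # illustrative threshold

def pick_backend(request: dict, prompt_tokens: int, healthy: set[str]) -> Backend:
    """Route an OpenAI-style chat request to the framework best suited to it."""
    if request.get("response_format", {}).get("type") in ("json_object", "json_schema"):
        candidate = SGLANG   # structured output: native grammar engine
    elif prompt_tokens > LONG_CONTEXT_TOKENS:
        candidate = TRT_LLM  # long context: the high-throughput H100/FP8 path
    else:
        candidate = VLLM     # everything else: the workhorse
    # Fallback: if the chosen self-hosted pool is unhealthy, spill to the managed API.
    return candidate if candidate.name in healthy else MANAGED
```

Clients still speak plain OpenAI-compatible HTTP; only the router knows there are several pools behind it.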
What to do this quarter
- If you are still on a single framework "because that is what we picked," benchmark two more on your real traffic. The 30 to 50 percent throughput differences shown above are workload-dependent.
- If you generate structured output and are not on SGLang, run a one-week SGLang spike. Most JSON-heavy teams move within a quarter.
- Stand up Ray Serve in front of your vLLM replicas. Manual replica management is the most common operational pain we see in 2026 LLM platforms.
- If you are NVIDIA H100-rich, schedule a TensorRT-LLM evaluation. The throughput ceiling is genuinely higher despite the integration cost.
- Treat managed APIs as overflow, not primary. Self-hosting while your volume is below break-even and then leaning on managed APIs once you are above it gets the economics backwards; self-host the steady bulk once volume justifies it and let managed capacity absorb the low-volume tail and spikes.