
LLM serving used to be a one-horse race. By 2026 there are six serious open-source frameworks plus several managed alternatives, and choosing among them is harder than it should be because nearly every benchmark on offer comes from a vendor with a horse in the race. This is the comparison we wish vendors published.

The criteria: throughput, latency, model coverage, operational cost, structured-generation support, and what happens when something breaks.

The shortlist

| Framework | Latest release | Maintainer | Best for |
| --- | --- | --- | --- |
| vLLM | 0.7+ | vLLM project (Berkeley + community) | General-purpose default |
| TGI | 2.4+ | Hugging Face | HF native deployments |
| SGLang | 0.4+ | LMSYS | Structured gen and complex prompts |
| TensorRT-LLM | 0.13+ | NVIDIA | NVIDIA-only, max throughput |
| LMDeploy | 0.7+ | Shanghai AI Lab | Qwen, InternLM, Chinese models |
| Ray Serve + vLLM | Ray 2.10+ | Anyscale + community | Multi-replica orchestration |

Honorable mentions: DeepSpeed-FastGen (Microsoft; narrow but fast), llama.cpp (CPU and Apple Silicon development), MLX (production on Apple Silicon is rare, but it exists).

Headline throughput

Llama 3.1 8B Instruct, A100 80GB, ShareGPT trace, May 2026 versions.

SGLang 0.4              |#################################| 6,800 tok/s
vLLM 0.7 (tuned)        |##############################   | 6,200 tok/s
Ray Serve + vLLM        |#############################    | 6,050 tok/s
LMDeploy 0.7            |############################     | 5,800 tok/s
TGI 2.4                 |######################           | 4,400 tok/s
TensorRT-LLM (Triton)   |####################             | 4,000 tok/s

Two surprises. First, SGLang edges out vLLM on this trace because its prefix-aware scheduling exploits the shared ShareGPT system prompt. Second, TensorRT-LLM lands below vLLM despite being NVIDIA's flagship. The ranking flips on H100 with FP8: at the time of writing, TensorRT-LLM with an FP8 KV cache pulls ahead of vLLM on raw throughput, but the integration cost is higher.

For mechanical detail on continuous batching see our vLLM deep dive and the continuous batching explainer.
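
If you want to rank these frameworks on your own traffic before committing to a full benchmark, a crude concurrency probe against any OpenAI-compatible endpoint is enough to get a first number. A minimal sketch, assuming Python with the openai client installed and a server at localhost:8000; the model id, prompt, and concurrency are illustrative placeholders, not a recommended methodology.

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(prompt: str) -> int:
    # Any OpenAI-compatible server (vLLM, TGI, SGLang, LMDeploy) accepts this call.
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def run(prompts: list[str], concurrency: int = 64) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(p: str) -> int:
        async with sem:
            return await one_request(p)

    start = time.perf_counter()
    tokens = await asyncio.gather(*(bounded(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:,.0f} output tok/s at concurrency {concurrency}")

# asyncio.run(run(["Summarize continuous batching in two sentences."] * 512))
```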

Feature matrix

| Feature | vLLM | TGI | SGLang | TRT-LLM | LMDeploy | Ray Serve |
| --- | --- | --- | --- | --- | --- | --- |
| Continuous batching | Yes | Yes | Yes | Yes | Yes | Delegated |
| PagedAttention or equiv | Yes | Yes | Yes (Radix) | Yes | Yes | Delegated |
| Prefix caching | Yes | Yes | Yes | Yes | Partial | Delegated |
| Chunked prefill | Yes | Yes | Yes | Yes | Yes | Delegated |
| FP8 KV cache | Yes | Yes | Yes | Yes | Yes | Delegated |
| Structured (JSON) | Outlines | Outlines | Native | Custom | Outlines | Delegated |
| Multi-LoRA | Yes | Yes | Yes | Limited | Partial | Delegated |
| AMD ROCm | Yes | Yes | Partial | No | No | Yes |
| Multi-replica autoscale | No | No | No | No | No | Yes |
| OpenAI-compatible API | Yes | Yes | Yes | Via Triton | Yes | Yes |

Ray Serve is in a different category from the rest: it does not implement an inference engine; it orchestrates one. The standard production pattern is Ray Serve in front of vLLM workers, which gives you continuous batching from vLLM plus autoscaling and rolling updates from Ray Serve.
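
In code the pattern is small. A minimal sketch, assuming Ray 2.x and vLLM installed on the same image; the model id, replica count, and route are placeholders, and a real deployment would use vLLM's async engine and proper request batching rather than the blocking LLM class shown here.

```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class VLLMWorker:
    def __init__(self) -> None:
        # Each replica owns one engine and one GPU; Ray Serve handles routing,
        # autoscaling, and rolling updates across replicas.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

    async def __call__(self, request):
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 256))
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}

app = VLLMWorker.bind()
# serve.run(app, route_prefix="/generate")
```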

Operational cost

Time-to-first-production-deploy estimates for an experienced infra team, including TLS, auth, metrics, and rolling deploy.

| Framework | First deploy | Day-2 ops | Multi-tenant ready |
| --- | --- | --- | --- |
| vLLM | 1 day | Medium | With work |
| TGI | 0.5 day | Low | Yes |
| SGLang | 1.5 days | Medium | With work |
| TensorRT-LLM | 1 week | High | Yes (Triton) |
| LMDeploy | 1 day | Low | Partial |
| Ray Serve + vLLM | 2-3 days | Medium | Yes |

TGI wins on initial speed because Hugging Face built it as a turnkey HTTP service. TensorRT-LLM loses on initial speed because NVIDIA's deployment story routes through Triton plus a model repository, which is a real learning curve. The day-2 picture inverts: Triton has been ops-hardened for a decade and outage handling is mature.

Model coverage

| Model family | vLLM | TGI | SGLang | TRT-LLM | LMDeploy |
| --- | --- | --- | --- | --- | --- |
| Llama 1/2/3/3.1 | Yes | Yes | Yes | Yes | Yes |
| Qwen 1.5/2/2.5 | Yes | Yes | Yes | Yes | Yes (best) |
| DeepSeek V2/V3 | Yes | Yes | Yes | Yes | Yes |
| Mistral, Mixtral | Yes | Yes | Yes | Yes | Yes |
| Phi 3.5/4 | Yes | Yes | Yes | Yes | Partial |
| Gemma 2 | Yes | Yes | Yes | Yes | Partial |
| InternLM | Partial | Partial | Yes | Yes | Yes (best) |
| Vision-language (Llava) | Yes | Yes | Yes | Yes | Yes |
| State-space (Mamba) | Yes | Partial | Yes | No | No |

vLLM has the broadest coverage. LMDeploy is the right pick if you primarily serve Qwen or InternLM. TGI has the cleanest path for vanilla Hugging Face deployments.

Structured generation

This is the area with the biggest divergence in 2026.

  • vLLM uses Outlines or LM Format Enforcer. It works, but constrained decoding disables some scheduler optimizations and costs 15 to 30 percent of throughput.
  • SGLang has structured generation as a first-class feature with its own grammar engine. It is the right default for JSON-heavy or function-calling workloads.
  • TGI uses Outlines. Same tradeoff as vLLM.
  • TensorRT-LLM has custom logits processors. Fast but development-heavy.

If your workload is 80 percent JSON output, the SGLang advantage on structured generation is worth more than the raw throughput edge.
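
To make the integration concrete: most of these frameworks expose schema-constrained decoding through their OpenAI-compatible API as an extension field. A hedged sketch; the guided_json spelling below is the vLLM-style extension and an assumption to verify against your framework and version, since each engine names the constraint parameter differently.

```python
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Current weather in Oslo, as JSON."}],
    # Extension field constraining decoding to the schema; the exact spelling
    # varies by framework and version, so verify before relying on it.
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)
```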

When each is the right call

| Pick | If |
| --- | --- |
| vLLM | You are choosing once and want the broad default |
| TGI | You live in the Hugging Face ecosystem and want minimal ops |
| SGLang | You generate structured output or have shared prefixes |
| TensorRT-LLM | You are NVIDIA-pinned and chasing maximum throughput |
| LMDeploy | You serve Chinese open-source models primarily |
| Ray Serve + vLLM | You need multi-replica orchestration and autoscaling |

For the higher-level architecture see our Ray vs alternatives pillar.

Routing across frameworks in production

Most teams above a certain scale do not pick one framework. They run vLLM as the workhorse, SGLang for a structured-output service, and a managed API as overflow. The piece that ties this together is a routing layer. Swfte Connect speaks the OpenAI API to clients and routes by request type — JSON-mode requests to SGLang, long-context to TensorRT-LLM, everything else to vLLM, with managed APIs as fallback. That hybrid pattern keeps each framework focused on what it is good at.
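
A minimal sketch of that routing decision, independent of any gateway product; the backend URLs, queue-depth threshold, and 32K-token cutoff are illustrative assumptions, not a description of Swfte Connect's internals.

```python
BACKENDS = {
    "structured": "http://sglang.internal:30000/v1",      # JSON-mode requests
    "long_context": "http://trtllm.internal:8000/v1",     # very long prompts
    "default": "http://vllm.internal:8000/v1",            # everything else
    "overflow": "https://api.managed-provider.example/v1",
}

def pick_backend(request: dict, queue_depth: int, max_queue: int = 64) -> str:
    """Return the base URL to proxy this OpenAI-style request to."""
    if queue_depth > max_queue:
        return BACKENDS["overflow"]  # spill to the managed API under load
    fmt = request.get("response_format", {}).get("type")
    if fmt in ("json_object", "json_schema"):
        return BACKENDS["structured"]
    if request.get("prompt_tokens_estimate", 0) > 32_000:
        return BACKENDS["long_context"]
    return BACKENDS["default"]
```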

What to do this quarter

  1. If you are still on a single framework "because that is what we picked," benchmark two more on your real traffic. The 30 to 50 percent throughput differences shown above are workload-dependent.
  2. If you generate structured output and are not on SGLang, run a one-week SGLang spike. Most JSON-heavy teams move within a quarter.
  3. Stand up Ray Serve in front of your vLLM replicas. Manual replica management is the most common operational pain we see in 2026 LLM platforms.
  4. If you are NVIDIA H100-rich, schedule a TensorRT-LLM evaluation. The throughput ceiling is genuinely higher despite the integration cost.
  5. Treat managed APIs as overflow, not primary. Running self-hosted while you are below break-even volume and switching to managed once you are above it is a worse outcome than the inverse; a rough break-even sketch follows this list.
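
The break-even arithmetic behind item 5, with deliberately placeholder prices; substitute your real GPU rate, the sustained throughput you measured on your own traffic, and your provider's per-token pricing.

```python
# All three inputs are placeholder assumptions, not quoted prices.
gpu_cost_per_hour = 2.50           # assumed A100 80GB on-demand rate, USD
sustained_tok_per_s = 6_000        # from your own benchmark, not the chart above
managed_price_per_1m_tok = 0.60    # assumed managed-API output-token price, USD

tokens_per_hour = sustained_tok_per_s * 3600
self_hosted_per_1m_tok = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)
print(f"self-hosted: ${self_hosted_per_1m_tok:.3f} per 1M tokens, "
      f"managed: ${managed_price_per_1m_tok:.2f} per 1M tokens")
# Self-hosting only wins while the GPU stays busy; idle hours raise the
# effective per-token cost, which is why overflow-to-managed works.
```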
