Online inference (a user typing, a chatbot replying) is the part of LLM serving that gets the marketing. Batch inference (processing 50 million documents overnight) is the part that pays for it. The cost gap between the two is enormous: well-tuned batch inference costs 5x to 10x less per token than online serving, and 30x less than the synchronous OpenAI API.
This guide covers what batch inference looks like in 2026, how to build it cheaply, and the mistakes that turn a 70 percent saving into a 20 percent saving.
Why batch is so much cheaper
Three economic effects compound.
Higher GPU utilization. Online serving needs spare capacity for traffic spikes. Batch jobs run a queue at saturation. Sustained 90 percent versus sustained 35 percent is a 2.5x cost difference on the same hardware.
Larger effective batch sizes. Online traffic gives you maybe 50 concurrent active requests. A batch job lets you submit 5,000. The model amortizes weight loading over far more tokens per forward pass.
Spot instances are usable. Online serving cannot tolerate spot interruption mid-request. Batch inference checkpoints and resumes; spot is fine. AWS p4d.24xlarge spot is roughly 30 percent of on-demand in 2026.
Add these together and a self-hosted batch pipeline running vLLM on Ray Data is consistently 5x to 10x cheaper than the same workload through OpenAI's batch API.
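A back-of-envelope check on how the first and third effects multiply, treating the batching gain as part of engine throughput rather than pricing it separately:

```python
# Rough compounding of the utilization and spot effects described above.
util_gain = 0.90 / 0.35   # saturated batch queue vs. spiky online, ~2.6x
spot_gain = 1 / 0.30      # spot at ~30% of the on-demand price, ~3.3x
print(f"{util_gain * spot_gain:.1f}x")  # ~8.6x before the batching gain
```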
Cost per million tokens at scale
Llama 3.1 70B Instruct, 1024 input + 256 output tokens.
| Pipeline | $/1M tokens | Notes |
|---|---|---|
| Self-hosted vLLM, A100 spot, Ray Data | $0.42 | 90% util, FP8 KV |
| Self-hosted SGLang, H100 spot | $0.31 | Same model on better hardware |
| Together AI batch tier | $0.88 | Half the price of online |
| OpenAI batch API (GPT-4o mini) | $0.30 | 50% off online price |
| OpenAI online API (GPT-4o mini) | $0.60 | Reference |
| Anthropic batch (Claude Haiku) | $0.40 | 50% off online |
| Bedrock provisioned throughput | $1.20 | Higher floor, predictable |
The OpenAI batch API is interesting: at $0.30/M for GPT-4o mini, it is competitive with self-hosting once you count the operational overhead. The math flips for large models: per the table, the same Llama 70B workload costs $0.88/M on Together's batch tier versus $0.31 to $0.42 self-hosted, and that gap widens with scale.
A reference batch pipeline
The pattern that most teams settle on by 2026:
- Ingest — input documents land in S3 or GCS as Parquet, JSONL, or per-row files.
- Plan — Ray Data reads the dataset and shards across worker actors.
- Embed or generate — each Ray actor runs a vLLM engine instance; rows flow through it via continuous batching.
- Persist — outputs written back to object storage with run metadata.
- Validate — a small set of canary rows are checked against an oracle or eval suite.
The control plane is Ray; the inference engine is vLLM, SGLang, or TGI. For the framework comparison, see our LLM serving frameworks 2026 guide and the broader Ray vs alternatives pillar.
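Wired together, the plan/generate/persist steps are compact. A minimal sketch, assuming a Parquet dataset with a `prompt` column at placeholder S3 paths; shown with an 8B model on one GPU per actor, since a 70B deployment mainly changes `num_gpus` and `tensor_parallel_size`:

```python
import ray
from vllm import LLM, SamplingParams

class Generator:
    def __init__(self):
        # One offline vLLM engine per Ray actor; continuous batching
        # happens inside generate().
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):
        outs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["completion"] = [o.outputs[0].text for o in outs]
        return batch

ds = ray.data.read_parquet("s3://my-bucket/docs/")   # ingest + plan
ds = ds.map_batches(
    Generator,
    concurrency=8,    # eight engine actors across the cluster
    num_gpus=1,       # one GPU pinned to each actor
    batch_size=512,   # rows handed to each generate() call
)
ds.write_parquet("s3://my-bucket/outputs/")          # persist
```

The validate step can then be a second, small pass over sampled output rows.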
Throughput on a real cluster
Llama 3.1 70B, FP8 quantized, 8 x H100 80GB, ShareGPT-distributed inputs.
```text
Naive HF pipeline      |##                                |    720 tok/s
TGI 2.4                |####################              |  7,200 tok/s
vLLM 0.7 default       |##########################        |  9,400 tok/s
vLLM tuned + Ray Data  |###############################   | 11,000 tok/s
SGLang 0.4 + Ray Data  |##################################| 12,400 tok/s
```
Going from a naive pipeline to a tuned Ray Data plus vLLM stack is a 15x throughput gain on the same hardware. That is the difference between roughly $4 per million tokens and $0.30.
The four mistakes that erase the savings
Running batch through an online endpoint. Hitting your vLLM HTTP server with a 5-million-row dataset gets you online-tier scheduling and online-tier latency variance. Use vLLM's offline LLM engine directly, or use Ray Data's vLLM connector; expect roughly a 2x throughput improvement over HTTP-mediated batches.
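The offline entry point, for reference, is just the LLM class rather than the server. A minimal sketch, with a placeholder model and synthetic prompts:

```python
from vllm import LLM, SamplingParams

# Synthetic stand-ins for the real dataset rows.
prompts = [f"Classify support ticket {i}: ..." for i in range(10_000)]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
# generate() schedules the whole list itself via continuous batching;
# no HTTP hop, no online-tier scheduler in the way.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))
```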
Padding by accident. Some inference frameworks pad inputs to the maximum batch length. With variable-length documents this is a 2x to 4x token waste. Make sure your engine supports variable-length batching (vLLM, SGLang, and TGI all do).
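How bad the waste gets depends on your length distribution. A toy check with invented document lengths:

```python
# Toy estimate of padding waste: everything in a batch is padded to the
# longest row, so you compute max_len * rows tokens. Lengths are made up.
lengths = [380, 520, 410, 1600, 450]               # tokens per document
waste = max(lengths) * len(lengths) / sum(lengths)
print(f"{waste:.1f}x tokens computed vs. needed")  # ~2.4x on this batch
```

One long outlier in a batch of short documents is all it takes to land in the 2x to 4x range.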
Wrong instance type. A100 80GB is the default and often wrong. For 70B models in FP8, a single H100 fits more concurrency than two A100s and is cheaper per token despite the higher hourly price. For 8B models, an L40S or even a 4090 in FP8 is the sweet spot.
Not using spot. The two reasons people skip spot are checkpointing complexity and interruption rate. Both are solved problems in 2026: writing outputs per shard makes resume a matter of skipping finished work, and AWS spot interruption rates on p4d are below 5 percent per day. The 70 percent compute cost reduction is worth the operational work.
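A resume-by-skipping sketch, assuming one Parquet file per input shard, a deterministic output key per shard, and the s3fs package; paths and names are placeholders:

```python
import ray
import s3fs

IN, OUT = "my-bucket/shards/", "my-bucket/outputs/"
fs = s3fs.S3FileSystem()

# Anything already present under OUT survived previous (interrupted) runs.
done = {p.rsplit("/", 1)[-1] for p in fs.ls(OUT)}
todo = [p for p in fs.ls(IN) if p.rsplit("/", 1)[-1] not in done]

for shard in todo:
    ds = ray.data.read_parquet(f"s3://{shard}")
    # ... apply the Generator stage from the pipeline sketch above ...
    ds.write_parquet(f"s3://{OUT}{shard.rsplit('/', 1)[-1]}")
```

After an interruption you rerun the same job; completed shards are skipped and only in-flight work is repeated.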
Embedding workloads are even better
Embeddings are a pure forward pass with no decode loop, so continuous batching has nothing to schedule, but raw throughput is 5 to 10x higher than generation.
| Model | Tokens/sec on A100 | $ / 1M tokens |
|---|---|---|
| BGE-large (335M) | 320,000 | $0.004 |
| E5-mistral-7B | 38,000 | $0.030 |
| Cohere Embed v3 | 14,000 | $0.080 |
| OpenAI text-embedding-3-large | n/a (managed) | $0.130 |
A self-hosted BGE-large pipeline embeds the entire English Wikipedia in roughly 4 hours on one A100. At the table's throughput that is about 4.6 billion tokens, or roughly $18 of compute at $0.004 per million; the same job through a managed API at $0.13 per million tokens runs around $600.
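The pipeline itself has the same Ray Data shape as the generation sketch above. A minimal sketch, assuming the sentence-transformers package and placeholder paths and column names:

```python
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # BGE-large, as in the table above; one model instance per GPU actor.
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

    def __call__(self, batch):
        vecs = self.model.encode(list(batch["text"]), batch_size=256)
        batch["embedding"] = vecs.tolist()
        return batch

ds = ray.data.read_parquet("s3://my-bucket/corpus/")
ds = ds.map_batches(Embedder, concurrency=1, num_gpus=1, batch_size=2048)
ds.write_parquet("s3://my-bucket/embeddings/")
```

No decode loop means no sampling parameters and no KV cache; the engine is just a model in a loop.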
When to use a managed batch API anyway
Some teams should not run their own batch pipeline.
- One-off jobs under 100M tokens. Setup cost is real. OpenAI batch is fine.
- No GPU infrastructure team. If hiring infra is not on the table, managed wins.
- Compliance demands a specific provider. Bedrock or Azure OpenAI is the only option in some regulated industries.
The break-even is roughly 1 billion tokens per month. Below that, managed batch is operationally cheaper. Above that, self-hosted dominates economically.
Routing batch and online together
Production systems usually run both. Online serving uses managed APIs for tail traffic, self-hosted for the hot path. Batch runs nightly on a separate Ray cluster against the same model weights. A gateway like Swfte Connect sits in front of online inference and routes by prompt characteristics — large prompt to local Llama, complex reasoning to Anthropic — while batch jobs hit the model engines directly without the gateway. This separation keeps the gateway control plane simple and the batch path zero-overhead.
For tuning the generation engine itself, see our vLLM continuous batching deep dive.
Canonical references
- The vLLM project repository at github.com/vllm-project/vllm for the offline engine API.
- Ray Data documentation for the Ray Data batch inference pattern.
- The Anyscale 2025 batch LLM cost study, linked from the Ray documentation at docs.ray.io, for reproducible benchmarks.
What to do this quarter
- List every workload that calls an LLM more than 1 million times per month. If it is not latency-sensitive, it is a batch candidate.
- Move the largest batch workload to a self-hosted Ray Data plus vLLM pipeline this quarter. Expect 5x to 10x cost reduction.
- Switch managed online traffic to the batch tier on OpenAI or Anthropic where SLAs allow. 50 percent off for free.
- Use spot instances. The interruption-rate fear is overblown in 2026; the savings are not.
- Track cost per million tokens as a first-class metric, separate from request count. It will keep the right pipeline in production.