AI API pricing, all in one place
Live per-1M-token rates for every major LLM API in May 2026. Each provider page covers every tier, prompt-caching discount, batch pricing, context window, and four concrete cost scenarios you can copy into your finance model.
Provider pricing pages
OpenAI
OpenAI API pricing
Flagship: GPT-5.5 Pro at $30.00 / $180.00 per 1M. Cheapest: GPT-5 nano at $0.05 / $0.40 per 1M.
8 model tiers · prompt caching available · batch available
Anthropic
Claude API pricing
Flagship: Claude Opus 4.7 at $5.00 / $25.00 per 1M. Cheapest: Claude Haiku 3.5 at $0.80 / $4.00 per 1M.
3 model tiers · prompt caching available · batch available
Google DeepMind
Gemini API pricing
Flagship: Gemini 3.1 Pro at $3.50 / $10.50 per 1M. Cheapest: Gemini 2.5 Flash at $0.07 / $0.30 per 1M.
4 model tiers · prompt caching available · batch available
xAI
Grok API pricing
Flagship: Grok 4 at $5.00 / $15.00 per 1M. Cheapest: Grok 4 mini at $0.30 / $0.50 per 1M.
3 model tiers · prompt caching not available · batch not available
DeepSeek
DeepSeek API pricing
Flagship: DeepSeek V4 Pro at $1.74 / $3.48 per 1M. Cheapest: DeepSeek V4 Flash at $0.14 / $0.28 per 1M.
4 model tiers · prompt caching available · batch not available
How to read 2026 LLM pricing
Every major provider now offers three or four model tiers per generation, with a roughly 10× price spread between the smallest and the flagship. The headline number. "per 1M tokens", is only the starting point. The realised cost on production workloads depends on three multipliers stacked on top:
| Tier | Approximate price range | Production fit |
|---|---|---|
| Frontier (Opus / Pro / Gemini Pro) | $3-30 input · $10-180 output per 1M | Hardest reasoning, novel code, agentic planning |
| Workhorse (Sonnet / GPT-5.5 / Gemini 3.0) | $1-5 input · $5-30 output per 1M | Production default for most workloads |
| Small (Haiku / mini / Flash / DeepSeek V4) | $0.10-1 input · $0.25-4 output per 1M | High-volume classification, extraction, routing |
| Nano / Open Flash | <$0.20 input · <$0.50 output per 1M | Extreme throughput, simple structured output |
Multiplier 1: prompt caching. Every major provider now offers a discount of 75-90% on the cached prefix of a request when the same tokens are reused within a TTL window. For workloads that share a system prompt or pass the same codebase repeatedly (most coding agents, most RAG chat apps), the cached input price is the right number to plan against; not the headline rate.
Multiplier 2, and batch inference. A flat 50% discount in exchange for accepting up to 24-hour completion latency. For overnight back-office processing, model evaluation runs, and synthetic data generation, batching halves spend with no quality change.
Multiplier 3. routing. The biggest savings come from not running every request through the flagship. A gateway routes simple chat to the small tier, hard reasoning to the flagship, and falls back to a second provider when the first degrades. Real production fleets settle into a 70/25/5 ratio (small / workhorse / flagship) and cut model spend 60-80% versus naive single-model deployments.
Cost scenarios at a glance
Four representative workloads, costed against each provider's 2026 mainline tier. Numbers exclude caching and batch discounts, apply 0.1-0.5× as a realistic factor for typical production usage.
| Workload | GPT-5.5 | Claude Sonnet 4 | Gemini 3.0 | DeepSeek V4 |
|---|---|---|---|---|
| 1M chat in / 100K out | $8.00 | $4.50 | $1.63 | $0.65 |
| 10M RAG in / 1M out | $80.00 | $45.00 | $16.25 | $6.50 |
| Agent loop: 50K context × 8 turns × 1K out | ~$13 | ~$6 | ~$2.30 | ~$1 |
| Long-context; 800K in / 5K out | $4.15 | $2.48 | $1.02 | $0.41 |
FAQ
Which LLM API is cheapest in 2026?
On raw per-token cost, DeepSeek V4 Flash leads at $0.14 input / $0.28 output per 1M tokens. roughly 1/100th of GPT-5.5 Pro. For mainstream production workloads, Gemini 2.5 Flash and Claude Haiku 3.5 are the next cheapest tier, both under $1 per 1M input tokens.
How do prompt caching and batch discounts work?
Most major providers offer two cost mechanisms. Prompt caching returns a 75-90% discount on the cached prefix portion of a request when the same tokens are reused within a TTL window (typically 5 min – 1 hr). Batch inference offers a flat 50% discount in exchange for accepting up to 24-hour completion latency. Stacked, they can cut effective token spend by 70-95% on cacheable workloads.
What is the cheapest way to use GPT-5.5 in production?
Use the non-Pro GPT-5.5 model ($5 input / $30 output) plus prompt caching for repeated prefixes, plus batch mode for latency-tolerant work. For high volume, a gateway like Swfte routes the request to GPT-5.5 mini ($0.50 / $2) when complexity is low and only escalates to GPT-5.5 or Pro when needed.
Are open-weights models actually cheaper?
Yes on the hosted endpoints, DeepSeek V4 Pro at $1.74/$3.48 is ~1/8 the cost of GPT-5.5 Pro for Arena-comparable quality. Self-hosting can be cheaper still at high throughput (10M+ tokens/day) on rented GPUs, but only after factoring in operations, GPU utilisation, and serving framework cost (vLLM, TGI, TRT-LLM).
How much does typical agentic usage cost per month?
For an engineering team of 10 running agentic coding (Cursor + Claude Code) and back-office agents through a gateway: $500-2,000/mo on models, plus the platform fee. The variance comes from cache hit rate (50-90% on repeated codebases) and whether you route hard tasks to Opus vs Sonnet.
How often do API prices change?
Frontier model prices have fallen ~40% per generation over the last two years and step-change down with each new model family. Cheaper tiers (mini, flash, nano, haiku) typically launch at 1/10th to 1/30th of the flagship price. Plan budgets quarterly, not annually.
What about fine-tuning and dedicated capacity?
OpenAI, Google, and DeepSeek offer fine-tuning on most tiers; Anthropic does not. Dedicated capacity (Provisioned Throughput on Anthropic, Provisioned Capacity Units on OpenAI / Google) prices at a flat hourly rate independent of token count: economic above ~30-50% sustained utilisation.
Do these prices include vision and audio?
Vision tokens are usually billed per-image at a per-tile rate (one image = 85-1,500 tokens depending on resolution). Audio inputs are billed per-minute on OpenAI Realtime ($0.06/min input, $0.24/min output for GPT-5.5 Realtime) and per-token after transcription elsewhere.
Run every API through one gateway
Swfte routes traffic across Anthropic, OpenAI, Google, DeepSeek, Grok, and self-hosted models. Per-team cost ceilings, prompt caching, audit log.
Free tier · OpenAI-compatible API · SOC2 Type II · On-prem available