Updated May 15, 2026 · 8 min read

AI API pricing, all in one place

Live per-1M-token rates for every major LLM API in May 2026. Each provider page covers every tier, prompt-caching discount, batch pricing, context window, and four concrete cost scenarios you can copy into your finance model.

How to read 2026 LLM pricing

Every major provider now offers three or four model tiers per generation, with a roughly 10× price spread between the smallest and the flagship. The headline number. "per 1M tokens", is only the starting point. The realised cost on production workloads depends on three multipliers stacked on top:

TierApproximate price rangeProduction fit
Frontier (Opus / Pro / Gemini Pro)$3-30 input · $10-180 output per 1MHardest reasoning, novel code, agentic planning
Workhorse (Sonnet / GPT-5.5 / Gemini 3.0)$1-5 input · $5-30 output per 1MProduction default for most workloads
Small (Haiku / mini / Flash / DeepSeek V4)$0.10-1 input · $0.25-4 output per 1MHigh-volume classification, extraction, routing
Nano / Open Flash<$0.20 input · <$0.50 output per 1MExtreme throughput, simple structured output

Multiplier 1: prompt caching. Every major provider now offers a discount of 75-90% on the cached prefix of a request when the same tokens are reused within a TTL window. For workloads that share a system prompt or pass the same codebase repeatedly (most coding agents, most RAG chat apps), the cached input price is the right number to plan against; not the headline rate.

Multiplier 2, and batch inference. A flat 50% discount in exchange for accepting up to 24-hour completion latency. For overnight back-office processing, model evaluation runs, and synthetic data generation, batching halves spend with no quality change.

Multiplier 3. routing. The biggest savings come from not running every request through the flagship. A gateway routes simple chat to the small tier, hard reasoning to the flagship, and falls back to a second provider when the first degrades. Real production fleets settle into a 70/25/5 ratio (small / workhorse / flagship) and cut model spend 60-80% versus naive single-model deployments.

Cost scenarios at a glance

Four representative workloads, costed against each provider's 2026 mainline tier. Numbers exclude caching and batch discounts, apply 0.1-0.5× as a realistic factor for typical production usage.

WorkloadGPT-5.5Claude Sonnet 4Gemini 3.0DeepSeek V4
1M chat in / 100K out$8.00$4.50$1.63$0.65
10M RAG in / 1M out$80.00$45.00$16.25$6.50
Agent loop: 50K context × 8 turns × 1K out~$13~$6~$2.30~$1
Long-context; 800K in / 5K out$4.15$2.48$1.02$0.41

FAQ

Which LLM API is cheapest in 2026?

On raw per-token cost, DeepSeek V4 Flash leads at $0.14 input / $0.28 output per 1M tokens. roughly 1/100th of GPT-5.5 Pro. For mainstream production workloads, Gemini 2.5 Flash and Claude Haiku 3.5 are the next cheapest tier, both under $1 per 1M input tokens.

How do prompt caching and batch discounts work?

Most major providers offer two cost mechanisms. Prompt caching returns a 75-90% discount on the cached prefix portion of a request when the same tokens are reused within a TTL window (typically 5 min – 1 hr). Batch inference offers a flat 50% discount in exchange for accepting up to 24-hour completion latency. Stacked, they can cut effective token spend by 70-95% on cacheable workloads.

What is the cheapest way to use GPT-5.5 in production?

Use the non-Pro GPT-5.5 model ($5 input / $30 output) plus prompt caching for repeated prefixes, plus batch mode for latency-tolerant work. For high volume, a gateway like Swfte routes the request to GPT-5.5 mini ($0.50 / $2) when complexity is low and only escalates to GPT-5.5 or Pro when needed.

Are open-weights models actually cheaper?

Yes on the hosted endpoints, DeepSeek V4 Pro at $1.74/$3.48 is ~1/8 the cost of GPT-5.5 Pro for Arena-comparable quality. Self-hosting can be cheaper still at high throughput (10M+ tokens/day) on rented GPUs, but only after factoring in operations, GPU utilisation, and serving framework cost (vLLM, TGI, TRT-LLM).

How much does typical agentic usage cost per month?

For an engineering team of 10 running agentic coding (Cursor + Claude Code) and back-office agents through a gateway: $500-2,000/mo on models, plus the platform fee. The variance comes from cache hit rate (50-90% on repeated codebases) and whether you route hard tasks to Opus vs Sonnet.

How often do API prices change?

Frontier model prices have fallen ~40% per generation over the last two years and step-change down with each new model family. Cheaper tiers (mini, flash, nano, haiku) typically launch at 1/10th to 1/30th of the flagship price. Plan budgets quarterly, not annually.

What about fine-tuning and dedicated capacity?

OpenAI, Google, and DeepSeek offer fine-tuning on most tiers; Anthropic does not. Dedicated capacity (Provisioned Throughput on Anthropic, Provisioned Capacity Units on OpenAI / Google) prices at a flat hourly rate independent of token count: economic above ~30-50% sustained utilisation.

Do these prices include vision and audio?

Vision tokens are usually billed per-image at a per-tile rate (one image = 85-1,500 tokens depending on resolution). Audio inputs are billed per-minute on OpenAI Realtime ($0.06/min input, $0.24/min output for GPT-5.5 Realtime) and per-token after transcription elsewhere.

Run every API through one gateway

Swfte routes traffic across Anthropic, OpenAI, Google, DeepSeek, Grok, and self-hosted models. Per-team cost ceilings, prompt caching, audit log.

Free tier · OpenAI-compatible API · SOC2 Type II · On-prem available