Updated May 15, 2026 · 8 min read

OpenAI API pricing (May 2026)

TL;DR: OpenAI runs a four-tier model line in 2026. GPT-5.5 Pro (flagship), GPT-5.5 (workhorse), GPT-5.5 mini (cheap), GPT-5 nano (cheapest). Per-1M input/output rates span 600× across the line. Automatic prompt caching at 75% off and batch inference at 50% off make the realised cost much lower than the headline rates.

Every OpenAI model and its per-token price

ModelInput / 1MCached input / 1MOutput / 1MContextNotes
GPT-5.5 Pro$30.00$7.500$180.001MReasoning + multimodal flagship. AAII 59 — #1 reasoning leaderboard.
GPT-5.5$5.00$1.250$30.00256KDefault tier. Strong general-purpose with tool use + voice.
GPT-5.5 mini$0.50$0.125$2.00128KCheapest GPT-5 tier — good for classification, simple chat.
GPT-5 nano$0.05$0.013$0.40128KSmallest. Routing, classification, extraction at high throughput.
GPT-4o$2.50$1.250$10.00128KLegacy multimodal flagship. Still supported.
GPT-4o mini$0.15$0.075$0.60128KLegacy small tier. Cheap, fast, multimodal.
o3 (reasoning)$10.00$2.500$40.00200KDeep reasoning model — math, code, research.
gpt-oss-120b$1.00$3.00128KOpen-weights. Self-hostable; benchmark pricing on hosted endpoints varies.

All prices in USD per 1 million tokens. Last reviewed 2026-05-15. Provider pricing pages are authoritative, confirm before contracting.

How OpenAI pricing actually works

OpenAI API pricing in 2026 follows the same shape across every model family, a per-1M-token input rate, a per-1M-token output rate (typically 4-6× input), an automatic cached-input rate at 25% of the headline, and an optional batch mode at 50% off. The flagship-to-cheapest spread is roughly 600× ($30 input on GPT-5.5 Pro down to $0.05 on GPT-5 nano), which means model routing is far more cost-relevant than negotiating volume discounts.

The OpenAI API is the most widely deployed LLM API in production. Roughly 65% of developers using AI tools touch it daily according to the 2025 Stack Overflow survey, and the API powers most of the Fortune 500's customer-facing AI products. Microsoft's Azure OpenAI variant serves the same models inside an enterprise procurement and compliance surface, which is the default in regulated industries.

OpenAI offers 50% batch discount, automatic prompt caching, and ZDR on enterprise contracts. Azure OpenAI exposes the same models with VPC residency.

Prompt caching: the 90% discount most teams ignore

OpenAI's prompt caching is automatic: no opt-in required. Any prefix of 1,024 or more tokens that has been seen in the last 5-10 minutes is served from cache at 25% of the normal input rate. The cache hit rate depends on how prompts are structured: stable content (system prompt, tools, RAG context) at the start, dynamic content at the end. Reverse that and you bust the cache on every turn.

On a typical coding agent run that re-sends the same 200K-token codebase across 10 turns, prompt caching reduces effective input cost by 80-90%. The cached-input column in the table above is the right number to plug into a production budget; the headline input rate is the "new conversation" rate, not the steady-state rate.

Batch inference, and half-price overnight

The batch API accepts a JSONL file of up to 50,000 requests and returns completions within 24 hours at 50% of normal pricing for both input and output. There is no minimum size. The right workloads are anything where a few hours of latency is fine; nightly evaluation runs, document enrichment pipelines, synthetic training data generation, large-scale classification.

Batch + cache stack. The combined effective rate for a cache-warm, batched call is often 5-10% of the headline price. For workloads like nightly eval suites, large-scale classification, document enrichment, and synthetic data generation, batching is free money.

Four real production cost scenarios

WorkloadDetailHeadline costWith cacheWith batch
Chat (GPT-5.5)1M tokens in, 100K tokens out$5.00 + $3.00 = $8.00$1.25 + $3.00 = $4.25$2.50 + $1.50 = $4.00
RAG (GPT-5.5)10M context in (80% repeat), 1M out$50.00 + $30.00 = $80.00$12.50 + $30.00 = $42.50$25.00 + $15.00 = $40.00
Agent loop (GPT-5.5)50K context × 8 turns × 1K out~$13~$5N/A, and interactive
High-volume classification (GPT-5.5 mini)100M in, 5M out (extraction)$50 + $10 = $60$12.50 + $10 = $22.50$25 + $5 = $30

The routing pattern that cuts OpenAI spend 60-80%

The dominant pattern in 2026 is intent-based routing. Simple queries, and chat, classification, extraction. route to GPT-5 nano or GPT-5.5 mini at sub-cent cost. Workhorse queries route to GPT-5.5 at $5/$30. Hard-reasoning queries (math, multi-step planning, code refactoring) promote to GPT-5.5 Pro. The routing decision is usually made by a tiny classifier model running in front, or by tool-call signals from the previous turn.

A typical production fleet settles into a 70/25/5 split. 70% of requests handled by the smallest competent tier, 25% by the mid-tier workhorse, 5% promoted to the flagship. Done well, this cuts model spend 60-80% versus naive single-model use without any measurable quality drop on the bulk of requests.

With an AI gateway in front, the routing rule is one config block: declare a default model, declare promotion triggers, declare a fallback to a second provider for availability. Applications keep using a single OpenAI-compatible endpoint. See Swfte for a managed runtime that bundles the gateway, observability, eval, and per-team cost ceilings.

Enterprise considerations

Enterprise procurement for OpenAI typically goes through one of three channels. Direct OpenAI Enterprise contracts include zero data retention by default, dedicated capacity options, and a customer success motion. Azure OpenAI exposes the same models with Microsoft enterprise terms, regional residency, and integration with Entra ID / Defender. AWS Bedrock does not offer GPT-5.5, Bedrock is for Anthropic, Meta, Mistral, and AI21 models. For multi-cloud or VPC residency, route through a gateway that can address all three.

  • Prompt caching: Available use it from day one; the headline rate is misleading without it.
  • Batch inference: Available, 50% discount, up to 24h SLA.
  • Fine-tuning: Supported on most tiers.
  • On-prem / VPC: Available via Bedrock / Vertex / Azure or direct VPC contract.
  • Zero data retention: Available; default on enterprise contracts.

How OpenAI compares to the rest of the market

Against Claude (Anthropic), OpenAI is broader: multimodal, voice, image generation, fine-tuning, open weights; but Claude is preferred for coding and agent loops. Against Gemini (Google), GPT-5.5 leads reasoning but Gemini wins on price (Gemini 3.0 is 4× cheaper than GPT-5.5 at comparable quality on most general workloads). Against DeepSeek, GPT-5.5 leads quality on the hardest benchmarks but DeepSeek V4 Pro is ~1/8 the cost for similar Arena Elo on bulk workloads.

For a full side-by-side, see the API pricing index and the AI model leaderboard for quality / speed / value rankings.

Frequently asked questions about OpenAI API pricing

What is the OpenAI API pricing in 2026?

GPT-5.5 Pro is $30 per 1M input tokens and $180 per 1M output tokens. The non-Pro GPT-5.5 is $5 / $30. GPT-5.5 mini is $0.50 / $2. GPT-5 nano is $0.05 / $0.40. All tiers support automatic prompt caching at 75% off and batch inference at 50% off.

How does GPT-5.5 pricing compare to GPT-4o?

GPT-5.5 ($5 / $30) is double the input price of GPT-4o ($2.50 / $10) but with substantially better reasoning, tool use, and a much larger 256K context. Most production teams have moved off GPT-4o by mid-2026 unless cost is paramount and quality is sufficient.

Is GPT-5.5 Pro worth 6× the price of GPT-5.5?

For most workloads, no. GPT-5.5 Pro shines on the hardest reasoning tasks (AAII 59, the leaderboard #1) and 1M context. The standard pattern: use GPT-5.5 as default and route to Pro only when a reasoning-difficulty signal trips. usually 5-10% of traffic.

How does OpenAI prompt caching work?

OpenAI prompt caching is automatic, any prefix of 1,024+ tokens that has been sent in the last 5-10 minutes is served from cache at 25% of the normal input rate. No code changes required. To maximise hit rate, structure prompts with stable prefixes (system prompt, tools, RAG context) before any dynamic content.

What is the batch API discount?

50% off both input and output, in exchange for a 24-hour completion SLA. Submit a JSONL file of up to 50,000 requests via the batch endpoint. Best for evaluation runs, content generation pipelines, and back-office processing where seconds-of-latency don't matter.

Does OpenAI offer fine-tuning?

Yes: on GPT-5.5 mini and most legacy models. Fine-tuning pricing is per-1M tokens of training data plus a per-1M-tokens premium on inference (typically 2-3× the base model rate). For most cases, prompt engineering + retrieval is cheaper than fine-tuning.

Can I get on-prem GPT-5.5?

Through Azure OpenAI for VPC residency on Microsoft cloud, with the same models and pricing. True on-prem (air-gapped) is not currently offered for GPT-5.5; only gpt-oss-120b and 20b have open weights for full self-hosting.

What is gpt-oss?

OpenAI's open-weights line; gpt-oss-120b (the big one) and gpt-oss-20b. Apache 2.0 licensed for commercial use. Quality lands between GPT-4o and GPT-5.5 mini at fraction of the cost on hosted endpoints, and fully self-hostable on 8× H100 or smaller hardware for the 20B variant.

What about ChatGPT Enterprise vs the API?

They're different products. ChatGPT Enterprise is a managed chat app priced per seat with included usage. The API is per-token raw access. For developer use cases, the API plus a gateway is always cheaper than per-seat ChatGPT. For end-user productivity, ChatGPT Enterprise is the right unit.

How can I reduce OpenAI API costs?

Five levers, in order of impact: (1) Route most traffic to GPT-5.5 mini or nano. (2) Turn on prompt caching and structure prompts for cache hits. (3) Batch any latency-tolerant workload for 50% off. (4) Use a multi-provider gateway to fall back to a cheaper Claude / DeepSeek / Gemini tier when quality permits. (5) Set per-team budget ceilings so runaway agents don't burn the month's budget.

Run OpenAI on a gateway you control

Swfte routes traffic across every major provider, enforces prompt caching, applies per-team budgets, and logs every request for audit. OpenAI-compatible API. Free tier.

Free tier · SOC2 Type II · On-prem / VPC available