Per Million Tokens — The True Cost

Headline pricing is the floor, not the bill. Six adders compound to push the effective cost 1.5-3x above list. Updated May 2026 with the latest tokenizer, batching, and tier behaviour from each provider.

Six adders that turn list price into bill price

New tokenizer (Claude Opus 4.7): +35%

Claude Opus 4.7 ships with a new tokenizer that produces roughly 35% more tokens for the same input text than Opus 4.6. The list price is unchanged at $5/$25 per 1M tokens, so the effective cost rose by roughly a third.
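A minimal sketch of why an unchanged list price still costs more after a tokenizer change. The prices are the article's $5/$25 figures; the token counts are hypothetical illustration values.

```python
# Same workload, same list price, new tokenizer: cost tracks token count.
list_input_price = 5.00                  # $ per 1M input tokens (unchanged)
tokens_old = 1_000_000                   # hypothetical: workload on Opus 4.6
tokens_new = int(tokens_old * 1.35)      # same text under the new tokenizer

cost_old = tokens_old / 1e6 * list_input_price   # $5.00
cost_new = tokens_new / 1e6 * list_input_price   # $6.75, a +35% adder
```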

System-prompt bloat: +18%

Production prompts grow over time as you add tools, examples, and guardrails. The advertised per-call cost ignores the fact that most of your tokens are the same system prompt sent on every request, at least until you add prompt caching.

Tool / function-calling round-trips: +25%

Every agentic loop is multiple billed turns: a "single user request" is often 3-5 LLM calls under the hood. Per-million-token pricing measures one turn; production measures the whole conversation.
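The per-turn re-billing can be sketched numerically. All token counts below are hypothetical illustration values; the point is that each turn re-sends the whole growing context as input.

```python
# A "single request" with 4 tool calls bills the context on every turn.
system_prompt = 2_000    # hypothetical token counts
user_message = 500
tool_result = 800        # appended to the context after each tool call

billed_input = 0
context = system_prompt + user_message
for turn in range(5):    # 4 tool-call turns + 1 final-answer turn
    billed_input += context   # the full context is input on every turn
    context += tool_result    # and it grows before the next turn

# One naive turn would bill 2,500 input tokens; the whole conversation
# bills 20,500 because every turn re-sends everything so far.
```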

Rate-limit retries (429s): +8%

Production traffic includes retries on 429s, which bill the input a second time on the retry. Most providers still meter the tokens on the failed attempt.
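A minimal sketch of the retry adder, assuming failed attempts are metered and each attempt fails independently with the same probability. The failure rate shown is a hypothetical value chosen to match the article's +8% figure.

```python
# Expected billed attempts per request when failures are retried and metered.
# If each attempt fails with probability p, the geometric series gives 1/(1-p).
def expected_attempts(p_fail: float) -> float:
    return 1.0 / (1.0 - p_fail)

# A steady ~7.4% failure rate works out to roughly a +8% billing adder:
# expected_attempts(0.074) ≈ 1.08
```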

Output padding (chain-of-thought, JSON wrappers): +22%

Reasoning models bill the full thinking trace at output rates, and structured-output mode bills the JSON wrapper. The visible answer is often only a fraction of the metered output.
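A quick sketch of visible versus metered output. The token counts are hypothetical illustration values chosen to land on the article's +22% figure.

```python
# Metered output vs the answer the user actually sees.
visible_answer = 500     # hypothetical: tokens shown to the user
thinking_trace = 80      # billed at output rates, never shown
json_wrapper = 30        # structured-output scaffolding, also billed

metered_output = visible_answer + thinking_trace + json_wrapper   # 610
padding_adder = metered_output / visible_answer - 1               # 0.22
```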

Priority / SLA pricing tier: +50%

Most providers offer a "priority" or strict-SLA tier at +50% over standard list. Many enterprise procurement teams default to it without realising that the standard tier hits the SLA 99% of the time.

Compounding the four production-baseline adders (system-prompt bloat, tool-call overhead, rate-limit retries, output padding) gives an effective multiplier of 1.94x over list. Add tokenizer drift on top if you switched to Claude Opus 4.7, and the priority-tier surcharge if you defaulted to it.
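The compounding above can be checked in a few lines. The percentages are the article's own estimates; the key point is that adders multiply rather than add.

```python
# Adders compound multiplicatively: four "small" percentages land near 2x.
baseline_adders = {
    "system_prompt_bloat": 0.18,
    "tool_call_roundtrips": 0.25,
    "rate_limit_retries": 0.08,
    "output_padding": 0.22,
}

def effective_multiplier(adders: dict[str, float]) -> float:
    m = 1.0
    for pct in adders.values():
        m *= 1.0 + pct
    return m

base = effective_multiplier(baseline_adders)   # ~1.94x over list
with_tokenizer = base * 1.35                   # ~2.62x if on Opus 4.7 (+162%)
with_priority = base * 1.50                    # ~2.92x if on the priority tier
```

The 2.62x figure is exactly where the +162% deltas for Claude Opus 4.7 in the tables below come from; the other models carry the 1.94x baseline (+94%).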

Agentic task with 4 tool calls

A real production agent task: user message + system prompt + tool definitions, then 4 inference rounds (one per tool call) before the final answer. List-price comparison vs effective cost with hidden adders.

Model              List $/mo   Effective $/mo   Delta
Claude Opus 4.7    $2,500      $6,559           +162%
GPT-5.5            $2,700      $5,247           +94%
Gemini 3.1 Pro     $1,470      $2,857           +94%
DeepSeek V4 Pro    $661        $1,285           +94%

RAG summary on a 30K token doc

Retrieval-augmented summarization. The "1M token context" headline matters less than the cost of stuffing 30K tokens into the prompt for every call.

Model              List $/mo   Effective $/mo   Delta
Claude Opus 4.7    $1,650      $4,329           +162%
GPT-5.5            $1,680      $3,265           +94%
Gemini 3.1 Pro     $1,113      $2,163           +94%
DeepSeek V4 Pro    $543        $1,055           +94%
DeepSeek V4 Flash  $43.68      $84.89           +94%

Simple classification (1K in, 20 out)

A short ticket classifier. The cost ratio between cheapest and most expensive model on this workload is enormous — and the quality difference is near-zero for a well-defined task.

Model              List $/mo   Effective $/mo   Delta
Claude Opus 4.7    $2,750      $7,215           +162%
GPT-5.5            $2,800      $5,442           +94%
Gemini 3.1 Pro     $1,855      $3,605           +94%
DeepSeek V4 Pro    $905        $1,758           +94%
DeepSeek V4 Flash  $72.80      $141             +94%

What to actually do about the gap

The bill-vs-list gap is not a vendor conspiracy. It is the difference between describing one inference call and describing a production workload. Closing the gap is mechanical, not magical:

  1. Measure your effective rate, not your list rate. Your actual bill divided by the requests you served is the only number that matters at quarter-end.
  2. Adopt prompt caching everywhere. Anthropic offers 90% off cached input. OpenAI auto-caches recently-seen prefixes. Gemini caches at the API level. The system prompt is the obvious cache target: it is identical on every call.
  3. Default to standard tier; opt into priority on a per-route basis. Most workloads don't need the SLA, and the +50% surcharge compounds with every other adder.
  4. Cascade and route. Push the trivial 60% of traffic to a cheaper model such as DeepSeek V4 Flash or Gemini 3.1 Pro. The quality delta on simple work is rounding error; the cost delta is 10-200x. See our Mixture-of-Routers deep-dive and Model-Mixing Cost Savings calculator.
  5. Re-tokenize when providers re-tokenize. Claude Opus 4.7's new tokenizer is a 35% adder hidden in plain sight. Re-baseline your evals and your bill the day a provider ships a new model family.
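The caching recommendation in step 2 can be quantified. A minimal sketch, assuming the 90% cached-input discount cited above and a hypothetical cache-hit share of input tokens.

```python
# Blended effective input price with prompt caching.
def blended_input_price(list_price: float, cached_share: float,
                        cache_discount: float = 0.90) -> float:
    """list_price: $ per 1M input tokens; cached_share: fraction of input
    tokens served from cache; cache_discount: discount on cached tokens."""
    cached_price = list_price * (1.0 - cache_discount)
    return cached_share * cached_price + (1.0 - cached_share) * list_price

# Hypothetical: if 70% of input tokens are a cached system prompt at $5/1M
# list, the blended rate is 0.7 * $0.50 + 0.3 * $5.00 = $1.85 per 1M tokens.
```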

Sources: official provider pricing pages (2026-05-06), Artificial Analysis production-cost benchmarks, and Swfte Connect telemetry on a representative SaaS workload mix.