Updated May 15, 2026 · 7 min read

Gemini API pricing (May 2026)

TL;DR: Gemini is the cheapest frontier line in 2026: Gemini 3.0 at $1.25 / $3.75 is ~4× cheaper than GPT-5.5 and ~2× cheaper than Claude Sonnet 4. Gemini 3.1 Pro leads science reasoning (GPQA Diamond 94.3%) and offers a 2M-token context window. Gemini 2.5 Flash at $0.075 input is the cheapest mainstream tier across all frontier providers.

Every Gemini model and its per-token price

ModelInput / 1MCached input / 1MOutput / 1MContextNotes
Gemini 3.1 Pro$3.50$0.875$10.502MScience + multimodal leader. GPQA Diamond 94.3%. 2M tokens of context.
Gemini 3.0$1.25$0.313$3.751MDefault tier. Strong general-purpose with long context.
Gemini 2.5 Pro$1.25$0.313$5.001MLegacy generation but still widely used.
Gemini 2.5 Flash$0.07$0.019$0.301MCheapest Gemini tier. Excellent for high-volume classification + extraction.

All prices in USD per 1 million tokens. Last reviewed 2026-05-15. Provider pricing pages are authoritative, confirm before contracting.

How Gemini pricing actually works

Gemini API pricing follows Google’s usual platform shape; a free / cheap tier on AI Studio for prototyping, a production tier on Vertex AI with the same per-token rates plus enterprise governance, and explicit context caching as a first-class primitive. The Gemini 3 family stretches the price range from $0.075 input (Flash) to $3.50 input (3.1 Pro), and a 47× spread between cheapest and flagship.

Gemini is the price leader for high-volume production workloads. document processing, classification, extraction, and multimodal pipelines. The 2M context window makes it the default choice for whole-codebase or whole-document-collection analysis. Google’s integration with the rest of GCP (BigQuery, Vertex AI Workbench, Cloud Storage) makes it the natural pick for organisations already on Google Cloud.

Google offers context caching, batch mode (50% off), and Vertex AI for VPC deployment. Free tier on AI Studio for prototyping.

Prompt caching: the 90% discount most teams ignore

Google’s context caching is explicit, you create a cached_content resource server-side and reference it on subsequent requests. The cached tokens are served at 25% of the standard input rate, with a per-hour storage fee that’s typically negligible. Right fit for: RAG sessions over a stable corpus, multi-turn codebase analysis, and any workload where the same large prefix is reused for tens of minutes.

On a typical coding agent run that re-sends the same 200K-token codebase across 10 turns, prompt caching reduces effective input cost by 80-90%. The cached-input column in the table above is the right number to plug into a production budget; the headline input rate is the "new conversation" rate, not the steady-state rate.

Batch inference, and half-price overnight

Vertex AI Batch Predictions ships at 50% off both input and output with a 24-hour SLA. Best for nightly document enrichment, evaluation runs, and back-office processing. Batch + cache combine well: a cached, batched call on Gemini 2.5 Flash can land at <$0.05 per 1M effective input tokens.

Batch + cache stack. The combined effective rate for a cache-warm, batched call is often 5-10% of the headline price. For workloads like nightly eval suites, large-scale classification, document enrichment, and synthetic data generation, batching is free money.

Four real production cost scenarios

WorkloadDetailHeadline costWith cacheWith batch
Chat (Gemini 3.0)1M tokens in, 100K tokens out$1.25 + $0.375 = $1.625$0.3125 + $0.375 = $0.6875$0.625 + $0.1875 = $0.8125
Long-context RAG (Gemini 3.1 Pro)1.5M context in, 50K out$5.25 + $0.525 = $5.775$1.31 + $0.525 = $1.84N/A, interactive
Multimodal (Gemini 3.1 Pro)500K text + 100 images, 10K out~$2.40~$0.80~$1.20
High-volume classification (Gemini 2.5 Flash)100M in, 5M out$7.50 + $1.50 = $9.00$1.875 + $1.50 = $3.375$3.75 + $0.75 = $4.50

The routing pattern that cuts Gemini spend 60-80%

Common production pattern: default to Gemini 2.5 Flash or Gemini 3.0 for cost-sensitive bulk traffic, promote to Gemini 3.1 Pro for multimodal or long-context turns, fall back to Claude Sonnet 4 for coding-specific intents where Claude has the quality edge. A gateway in front handles the routing decision on a per-request basis.

A typical production fleet settles into a 70/25/5 split. 70% of requests handled by the smallest competent tier, 25% by the mid-tier workhorse, 5% promoted to the flagship. Done well, this cuts model spend 60-80% versus naive single-model use without any measurable quality drop on the bulk of requests.

With an AI gateway in front, the routing rule is one config block: declare a default model, declare promotion triggers, declare a fallback to a second provider for availability. Applications keep using a single OpenAI-compatible endpoint. See Swfte for a managed runtime that bundles the gateway, observability, eval, and per-team cost ceilings.

Enterprise considerations

Vertex AI is the enterprise surface: VPC Service Controls, customer-managed encryption keys, IAM, audit logs via Cloud Audit, and integration with the rest of GCP’s data stack. Pricing matches AI Studio per-token. For air-gapped deployments, Google does not currently offer fully self-hosted Gemini; the closest equivalent is Gemma 3 open weights, which are downloadable for local inference but not the same model as Gemini.

  • Prompt caching: Available use it from day one; the headline rate is misleading without it.
  • Batch inference: Available, 50% discount, up to 24h SLA.
  • Fine-tuning: Supported on most tiers.
  • On-prem / VPC: Available via Bedrock / Vertex / Azure or direct VPC contract.
  • Zero data retention: Available; default on enterprise contracts.

How Gemini compares to the rest of the market

Against OpenAI, Gemini wins on raw price and on long-context / multimodal workloads, and GPT-5.5 has the reasoning edge but Gemini is 4× cheaper for general use. Against Claude, Gemini is cheaper and has a larger context window, but Claude leads coding. Against DeepSeek, Gemini Flash is competitive on price while DeepSeek wins on open-weights sovereignty. The standard combination for cost-optimised production: Gemini 2.5 Flash for bulk, Gemini 3.1 Pro for hard turns, Claude for code, GPT-5.5 for voice / images.

For a full side-by-side, see the API pricing index and the AI model leaderboard for quality / speed / value rankings.

Frequently asked questions about Gemini API pricing

What is Gemini API pricing in 2026?

Gemini 3.1 Pro is $3.50 per 1M input tokens and $10.50 per 1M output tokens, with a 2M-token context window. Gemini 3.0 is $1.25 / $3.75. Gemini 2.5 Flash, the cheap tier, is $0.075 / $0.30. Context caching is available at 25% of input rate.

Is Gemini cheaper than GPT-5.5 or Claude?

Yes by a wide margin. Gemini 3.0 ($1.25 / $3.75) is roughly 4× cheaper than GPT-5.5 ($5 / $30) and ~2× cheaper than Claude Sonnet 4 ($3 / $15) at comparable quality on most general workloads. Gemini 2.5 Flash at $0.075 / $0.30 is the cheapest mainstream tier among the frontier providers.

How does the 2M context window pricing work?

Gemini 3.1 Pro and Gemini 3.0 both support 2M tokens of context. by far the largest in the market. Pricing is per-token throughout, with no surcharge for tokens beyond a threshold (unlike earlier Gemini 1.5 Pro pricing). Context caching is available for the long stable prefix.

What is Gemini 3.1 Pro best for?

Multimodal and scientific reasoning. Gemini 3.1 Pro leads GPQA Diamond at 94.3%, the #1 globally on PhD-level science questions: and excels at image + video understanding. For pure language coding and writing, Claude and GPT-5.5 typically edge ahead.

How does context caching work on Gemini?

Google's context caching is explicit; you create a cached_content resource and pass its ID on subsequent requests. The cached tokens cost 25% of the standard input rate while cached, plus a per-hour storage fee. Best fit for RAG over a stable corpus or codebase analysis sessions that run for tens of minutes.

Does Gemini support batch inference?

Yes. Vertex AI Batch Predictions support 50% off pricing with a 24-hour SLA. Best for high-volume document processing and evaluation runs.

Can I fine-tune Gemini?

Yes, on Vertex AI. Both supervised fine-tuning and adapter-based fine-tuning are supported on Gemini 2.5 Flash and several smaller variants. Pricing follows standard Vertex pricing for training compute plus a small premium on tuned-model inference.

AI Studio vs Vertex AI pricing?

AI Studio is the free / cheap tier with rate limits. best for prototyping. Vertex AI is the production tier with VPC residency, audit logs, enterprise SLAs, and IAM integration. Per-token pricing is identical between the two for paid usage; Vertex adds the enterprise wrapper.

Run Gemini on a gateway you control

Swfte routes traffic across every major provider, enforces prompt caching, applies per-team budgets, and logs every request for audit. OpenAI-compatible API. Free tier.

Free tier · SOC2 Type II · On-prem / VPC available