# Cost of RAG: AI Model Pricing Compared (May 2026)
Retrieval-augmented generation is the workhorse of production LLM apps: pull relevant chunks, stuff them into the prompt, ask the model to answer. We price the canonical 8K-in / 400-out RAG query across ten major LLMs and four embedding models.
## The reference scenario
- Task: RAG query with 8K input tokens (retrieved chunks + question) and 400 output tokens
- Input tokens per call: 8,000 (system prompt ~1K + retrieved chunks ~6K + user query ~1K)
- Output tokens per call: 400
- Monthly volume: 500,000 queries (production RAG application)
- Total tokens / month: 4,200M (4.2B)
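Every per-call figure below is the same weighted sum of input and output prices. A minimal sketch of the arithmetic, using the DeepSeek V4 Pro rates quoted later in this article ($1.74 in / $3.48 out per 1M tokens):

```python
def rag_call_cost(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one RAG call in dollars, given per-1M-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# DeepSeek V4 Pro: $1.74 / $3.48 per 1M tokens (from this article)
per_call = rag_call_cost(8_000, 400, 1.74, 3.48)
per_month = per_call * 500_000  # 500K queries/month

print(f"per call:  ${per_call:.6f}")    # per call:  $0.015312
print(f"per month: ${per_month:,.0f}")  # per month: $7,656
```

Note how input tokens dominate: 8,000 input tokens cost ~10x more than the 400 output tokens even though output is priced 2x higher per token.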
## LLM cost across 10 models, sorted cheapest first
| Rank | Model | Per call | Per month | vs cheapest |
|---|---|---|---|---|
| 1 | Gemini 2.0 Flash | $0.000960 | $480 | — |
| 2 | DeepSeek V4 Flash | $0.001232 | $616 | 1.3x |
| 3 | Claude 3.5 Haiku | $0.0080 | $4,000 | 8.3x |
| 4 | Qwen 3.6 Plus | $0.0134 | $6,720 | 14.0x |
| 5 | DeepSeek V4 Pro | $0.0153 | $7,656 | 15.9x |
| 6 | Claude Sonnet 4 | $0.0300 | $15,000 | 31.3x |
| 7 | Gemini 3.1 Pro | $0.0322 | $16,100 | 33.5x |
| 8 | Claude Opus 4.7 | $0.0500 | $25,000 | 52.1x |
| 9 | GPT-5.5 | $0.0520 | $26,000 | 54.2x |
| 10 | GPT-5.5 Pro | $0.3120 | $156,000 | 325.0x |
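The "vs cheapest" column is just each model's monthly spend divided by the cheapest row. A quick sketch reproducing a few rows of the table:

```python
# Monthly spend at 500K queries, from the table above (subset of rows)
monthly = {
    "Gemini 2.0 Flash": 480,
    "DeepSeek V4 Flash": 616,
    "Claude 3.5 Haiku": 4_000,
    "Qwen 3.6 Plus": 6_720,
    "GPT-5.5 Pro": 156_000,
}
cheapest = min(monthly.values())
for model, cost in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} ${cost:>8,}  {cost / cheapest:.1f}x")
```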
## Monthly LLM spend at 500K queries

    Gemini 2.0 Flash    #................................... $480
    DeepSeek V4 Flash   #................................... $616
    Claude 3.5 Haiku    #................................... $4,000
    Qwen 3.6 Plus       ##.................................. $6,720
    DeepSeek V4 Pro     ##.................................. $7,656
    Claude Sonnet 4     ###................................. $15,000
    Gemini 3.1 Pro      ####................................ $16,100
    Claude Opus 4.7     ######.............................. $25,000
    GPT-5.5             ######.............................. $26,000
    GPT-5.5 Pro         #################################### $156,000
## Per-call cost

    Gemini 2.0 Flash    #............................. $0.000960
    DeepSeek V4 Flash   #............................. $0.0012
    Claude 3.5 Haiku    #............................. $0.0080
    Qwen 3.6 Plus       #............................. $0.0134
    DeepSeek V4 Pro     #............................. $0.0153
    Claude Sonnet 4     ###........................... $0.0300
    Gemini 3.1 Pro      ###........................... $0.0322
    Claude Opus 4.7     #####......................... $0.0500
    GPT-5.5             #####......................... $0.0520
    GPT-5.5 Pro         ############################## $0.3120
## Embedding model pricing (separate from LLM inference)
Embeddings are billed separately. They are typically a small fraction of total RAG cost — but corpus re-embedding can be non-trivial (millions of tokens, often hundreds of dollars).
| Model | Vendor | $ / 1M tokens | Notes |
|---|---|---|---|
| text-embedding-3-large | OpenAI | $0.1300 | Default for new RAG builds. 3,072 dim, strong retrieval quality on English-heavy corpora. |
| embed-multilingual-v3 | Cohere | $0.1000 | 100+ languages. Strong on retrieval over multilingual or non-English corpora. |
| voyage-3 | Voyage AI | $0.0600 | Anthropic-recommended embedding model. Best benchmark scores on technical / code-heavy retrieval as of mid-2026. |
| BGE-M3 | BAAI / self-hosted | $0.00 | Open-weight. Free to self-host. Multilingual and competitive quality vs commercial embeddings. |
At 50 tokens per query embedded with OpenAI text-embedding-3-large, 500K queries/month costs roughly $3.25 — negligible vs LLM inference. Corpus embedding is the bigger cost: a 100M-token knowledge base costs ~$13K with 3-large or $0 with self-hosted BGE-M3.
## Which model wins for RAG?
Recommended pick: DeepSeek V4 Pro. At $1.74 / $3.48 per 1M tokens it is the cheapest model with frontier-adjacent retrieval reasoning. On the 500K-query/month workload it lands at roughly $7.7K, about 3.4x cheaper than GPT-5.5 and 20x cheaper than GPT-5.5 Pro for very similar answer quality on grounded RAG tasks.
Runner-up: Gemini 3.1 Pro. When the retriever is loose (you want to stuff 50+ chunks into the context), Gemini 3.1 Pro's 2M-token window is a structural advantage. At $3.50 / $10.50 per 1M tokens it is the natural pick for long-context RAG. Honorable mentions: Claude Sonnet 4 and Qwen 3.6 Plus are both reasonable mid-tier choices.
## When to use a cheap model
- FAQ / knowledge-base queries with short well-defined answers
- Internal employee search (not customer-facing)
- High-volume product-search RAG (e-commerce style)
- Workloads where the retriever does most of the work and the LLM is just a phraser
- RAG over English-only or single-domain corpora
## When to use a frontier model
- Customer-facing RAG with reputational risk (legal, medical, financial)
- Multi-step reasoning across retrieved chunks
- Loose retrieval where the model needs to identify which chunks are relevant
- Cross-lingual RAG (query in one language, corpus in another)
- Agentic RAG with tool use and multi-turn refinement
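The cheap-vs-frontier split above lends itself to a two-tier router. A minimal sketch, assuming a hypothetical complexity heuristic (`route` is illustrative, not a production classifier) and the per-call costs from the table (DeepSeek V4 Flash vs GPT-5.5):

```python
CHEAP_COST, FRONTIER_COST = 0.0012, 0.0520  # per call: DeepSeek V4 Flash vs GPT-5.5

def route(query: str) -> str:
    """Hypothetical heuristic: long or comparative queries go to the frontier model."""
    if len(query.split()) > 30 or "compare" in query.lower():
        return "frontier"
    return "cheap"

def blended_monthly_cost(frontier_share: float, queries: int = 500_000) -> float:
    """Monthly spend when a fraction of traffic goes to the frontier model."""
    return queries * (frontier_share * FRONTIER_COST
                      + (1 - frontier_share) * CHEAP_COST)

# Sending only 20% of traffic to the frontier model:
print(f"${blended_monthly_cost(0.20):,.0f}")  # $5,680
```

At 20% frontier traffic the blend lands around $5,680/month, versus $26,000 if everything went to GPT-5.5, which is why routing is usually the first cost lever in production RAG.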
## Embedding cost is small but corpus re-embedding is not
Per-query embedding cost at production scale is rounding error — a few dollars a month at 500K queries. The trap is corpus re-embedding: every time you change embedding models or chunking strategy you re-embed the entire knowledge base. For a 100M-token corpus that is $13K with OpenAI 3-large, $6K with Voyage-3, $10K with Cohere v3, or $0 with self-hosted BGE-M3. Plan re-embedding into your budget at the same cadence as model upgrades.
## Related
- Token Cost Calculator
- Cheap vs Expensive Model Comparison
- Model-Mixing Cost Savings
- AI Model Leaderboard
Pricing data sourced from official provider pages and OpenRouter, May 6, 2026. Embedding pricing from OpenAI, Cohere, and Voyage AI public pricing pages. BGE-M3 is open-weight under the MIT license.