Cost of RAG: AI Model Pricing Compared (May 2026)

Retrieval-augmented generation is the workhorse of production LLM apps: pull relevant chunks, stuff them into the prompt, ask the model to answer. We price the canonical 8K-in / 400-out RAG query across every major LLM and embedding model.

The reference scenario

  • Task: RAG query with 8K input tokens (retrieved chunks + question) and 400 output tokens
  • Input tokens per call: 8,000 (system prompt ~1K + retrieved chunks ~6K + user query ~1K)
  • Output tokens per call: 400
  • Monthly volume: 500,000 queries (production RAG application)
  • Total tokens / month: 4,200M (8,400 tokens per call × 500K calls); the cost arithmetic is sketched just below
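
Every per-call and per-month figure in this post comes from the same arithmetic. Here is a minimal Python sketch using the reference scenario above and, as a worked example, the DeepSeek V4 Pro rates quoted later in the post (the function and constant names are ours, not any vendor's API):

    # Minimal sketch of the cost arithmetic used throughout this post.
    # Rates are USD per 1M tokens; the example rates are the DeepSeek V4 Pro
    # figures quoted in the recommendation section ($1.74 in / $3.48 out).

    INPUT_TOKENS = 8_000        # system prompt + retrieved chunks + user query
    OUTPUT_TOKENS = 400
    QUERIES_PER_MONTH = 500_000

    def rag_cost(in_rate_per_m: float, out_rate_per_m: float) -> tuple[float, float]:
        """Return (per-call cost, monthly cost) for the reference RAG query."""
        per_call = (INPUT_TOKENS * in_rate_per_m + OUTPUT_TOKENS * out_rate_per_m) / 1_000_000
        return per_call, per_call * QUERIES_PER_MONTH

    per_call, per_month = rag_cost(1.74, 3.48)
    print(f"${per_call:.4f} per call, ${per_month:,.0f} per month")
    # -> $0.0153 per call, $7,656 per month (matches the table below)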

LLM cost across 10 models, sorted cheapest first

Rank  Model               Per call     Per month    vs cheapest
1     Gemini 2.0 Flash    $0.000960    $480         1.0x
2     DeepSeek V4 Flash   $0.0012      $616         1.3x
3     Claude 3.5 Haiku    $0.0080      $4,000       8.3x
4     Qwen 3.6 Plus       $0.0134      $6,720       14.0x
5     DeepSeek V4 Pro     $0.0153      $7,656       15.9x
6     Claude Sonnet 4     $0.0300      $15,000      31.3x
7     Gemini 3.1 Pro      $0.0322      $16,100      33.5x
8     Claude Opus 4.7     $0.0500      $25,000      52.1x
9     GPT-5.5             $0.0520      $26,000      54.2x
10    GPT-5.5 Pro         $0.3120      $156,000     325.0x

Monthly LLM spend at 500K queries

Gemini 2.0 Flash       #................................... $480
DeepSeek V4 Flash      #................................... $616
Claude 3.5 Haiku       #................................... $4,000
Qwen 3.6 Plus          ##.................................. $6,720
DeepSeek V4 Pro        ##.................................. $7,656
Claude Sonnet 4        ###................................. $15,000
Gemini 3.1 Pro         ####................................ $16,100
Claude Opus 4.7        ######.............................. $25,000
GPT-5.5                ######.............................. $26,000
GPT-5.5 Pro            #################################### $156,000

Per-call cost

Gemini 2.0 Flash       #............................. $0.000960
DeepSeek V4 Flash      #............................. $0.0012
Claude 3.5 Haiku       #............................. $0.0080
Qwen 3.6 Plus          #............................. $0.0134
DeepSeek V4 Pro        #............................. $0.0153
Claude Sonnet 4        ###........................... $0.0300
Gemini 3.1 Pro         ###........................... $0.0322
Claude Opus 4.7        #####......................... $0.0500
GPT-5.5                #####......................... $0.0520
GPT-5.5 Pro            ############################## $0.3120

Embedding model pricing (separate from LLM inference)

Embeddings are billed separately. They are typically a small fraction of total RAG cost, but corpus re-embedding can be non-trivial: a knowledge base that runs to billions of tokens costs hundreds of dollars per full re-embed at commercial rates.

Model                   Vendor              $ / 1M tokens   Notes
text-embedding-3-large  OpenAI              $0.13           Default for new RAG builds. 3,072 dimensions, strong retrieval quality on English-heavy corpora.
embed-multilingual-v3   Cohere              $0.10           100+ languages. Strong on retrieval over multilingual or non-English corpora.
voyage-3                Voyage AI           $0.06           Anthropic-recommended embedding model. Best benchmark scores on technical / code-heavy retrieval as of mid-2026.
BGE-M3                  BAAI / self-hosted  $0.00           Open-weight. Free to self-host. Multilingual and competitive quality vs commercial embeddings.

At 50 tokens per query embedded with OpenAI text-embedding-3-large, 500K queries/month costs roughly $3.25 (25M tokens at $0.13 per 1M), negligible next to LLM inference. Corpus embedding is the bigger line item: a 100M-token knowledge base costs about $13 with 3-large or $0 with self-hosted BGE-M3, and the bill scales linearly with corpus size from there.
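
As a sanity check on the per-query figure, a minimal sketch of the same arithmetic (the constants and names are ours):

    # Per-query embedding cost at production volume.
    EMBED_RATE_PER_M = 0.13        # USD per 1M tokens, text-embedding-3-large
    TOKENS_PER_QUERY = 50
    QUERIES_PER_MONTH = 500_000

    monthly_tokens = TOKENS_PER_QUERY * QUERIES_PER_MONTH          # 25M tokens
    monthly_cost = monthly_tokens / 1_000_000 * EMBED_RATE_PER_M
    print(f"{monthly_tokens / 1e6:.0f}M tokens -> ${monthly_cost:.2f}/month")
    # -> 25M tokens -> $3.25/month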

Which model wins for RAG?

Recommended pick: DeepSeek V4 Pro. At $1.74 / $3.48 per 1M tokens it is the cheapest model with frontier-adjacent retrieval reasoning. On the 500K-query/month workload it lands at roughly $7.7K, about 3.4x cheaper than GPT-5.5 and roughly 20x cheaper than GPT-5.5 Pro for very similar answer quality on grounded RAG tasks.

Runner-up: Gemini 3.1 Pro. When the retriever is loose (you want to stuff 50+ chunks into the context), Gemini 3.1 Pro's 2M-token window is a structural advantage. At $3.50 / $10.50 per 1M tokens it is the natural pick for long-context RAG. Honorable mentions in the mid-tier (GPT-5.5 Mini class): Claude Sonnet 4 and Qwen 3.6 Plus are both reasonable choices.

When to use a cheap model

  • FAQ / knowledge-base queries with short well-defined answers
  • Internal employee search (not customer-facing)
  • High-volume product-search RAG (e-commerce style)
  • Workloads where the retriever does most of the work and the LLM is just a phraser
  • RAG over English-only or single-domain corpora

When to use a frontier model

  • Customer-facing RAG with reputational risk (legal, medical, financial)
  • Multi-step reasoning across retrieved chunks
  • Loose retrieval where the model needs to identify which chunks are relevant
  • Cross-lingual RAG (query in one language, corpus in another)
  • Agentic RAG with tool use and multi-turn refinement

Embedding cost is small but corpus re-embedding is not

Per-query embedding cost at production scale is rounding error: a few dollars a month at 500K queries. The trap is corpus re-embedding: every time you change embedding models or chunking strategy you re-embed the entire knowledge base. For a 100M-token corpus that is about $13 with OpenAI 3-large, $6 with Voyage-3, $10 with Cohere v3, or $0 with self-hosted BGE-M3; for billion-token enterprise corpora the same re-embed runs into the hundreds or thousands of dollars. Plan re-embedding into your budget at the same cadence as model upgrades; a quick way to estimate it is sketched below.
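
A minimal sketch for budgeting a full re-embed, assuming the per-1M rates from the table above (the dictionary and helper function are ours, not any vendor's API):

    # Corpus re-embedding cost: corpus_tokens / 1M * rate_per_1M.
    EMBED_RATES_PER_M = {
        "text-embedding-3-large": 0.13,   # OpenAI
        "embed-multilingual-v3":  0.10,   # Cohere
        "voyage-3":               0.06,   # Voyage AI
        "BGE-M3 (self-hosted)":   0.00,   # open-weight; ignores GPU/ops cost
    }

    def reembed_cost(corpus_tokens: int, rate_per_m: float) -> float:
        return corpus_tokens / 1_000_000 * rate_per_m

    corpus_tokens = 100_000_000   # the 100M-token example corpus from this post
    for model, rate in EMBED_RATES_PER_M.items():
        print(f"{model:24s} ${reembed_cost(corpus_tokens, rate):,.2f}")
    # text-embedding-3-large   $13.00
    # embed-multilingual-v3    $10.00
    # voyage-3                 $6.00
    # BGE-M3 (self-hosted)     $0.00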

Sources

Pricing data sourced from official provider pages and OpenRouter, 2026-05-06. Embedding pricing from OpenAI, Cohere, and Voyage AI public pricing pages. BGE-M3 is open-weight under the MIT license.