Cost of RAG: AI Model Pricing Compared (May 2026)

Retrieval-augmented generation is the workhorse of production LLM apps: pull relevant chunks, stuff them into the prompt, ask the model to answer. We price the canonical 8K-in / 400-out RAG query across every major LLM and embedding model.

The reference scenario

  • Task: RAG query with 8K input tokens (retrieved chunks + question) and 400 output tokens
  • Input tokens per call: 8,000 (system prompt ~1K + retrieved chunks ~6K + user query ~1K)
  • Output tokens per call: 400
  • Monthly volume: 500,000 queries (production RAG application)
  • Total tokens / month: 4,200M (8,400 tokens per call × 500K calls); the cost arithmetic is sketched just below
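
Every per-call and per-month figure in this post comes from the same arithmetic. Here is a minimal Python sketch using the reference scenario above and, as a worked example, the DeepSeek V4 Pro rates quoted later in the post (the function and constant names are ours, not any vendor's API):

    # Minimal sketch of the cost arithmetic used throughout this post.
    # Rates are USD per 1M tokens; the example rates are the DeepSeek V4 Pro
    # figures quoted in the recommendation section ($1.74 in / $3.48 out).

    INPUT_TOKENS = 8_000        # system prompt + retrieved chunks + user query
    OUTPUT_TOKENS = 400
    QUERIES_PER_MONTH = 500_000

    def rag_cost(in_rate_per_m: float, out_rate_per_m: float) -> tuple[float, float]:
        """Return (per-call cost, monthly cost) for the reference RAG query."""
        per_call = (INPUT_TOKENS * in_rate_per_m + OUTPUT_TOKENS * out_rate_per_m) / 1_000_000
        return per_call, per_call * QUERIES_PER_MONTH

    per_call, per_month = rag_cost(1.74, 3.48)
    print(f"${per_call:.4f} per call, ${per_month:,.0f} per month")
    # -> $0.0153 per call, $7,656 per month (matches the table below)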

LLM cost across 10 models, sorted cheapest first

Rank  Model               Per call     Per month    vs cheapest
1     Gemini 2.0 Flash    $0.000960    $480         1.0x
2     DeepSeek V4 Flash   $0.0012      $616         1.3x
3     Claude 3.5 Haiku    $0.0080      $4,000       8.3x
4     Qwen 3.6 Plus       $0.0134      $6,720       14.0x
5     DeepSeek V4 Pro     $0.0153      $7,656       15.9x
6     Claude Sonnet 4     $0.0300      $15,000      31.3x
7     Gemini 3.1 Pro      $0.0322      $16,100      33.5x
8     Claude Opus 4.7     $0.0500      $25,000      52.1x
9     GPT-5.5             $0.0520      $26,000      54.2x
10    GPT-5.5 Pro         $0.3120      $156,000     325.0x

Monthly LLM spend at 500K queries

Gemini 2.0 Flash       #................................... $480
DeepSeek V4 Flash      #................................... $616
Claude 3.5 Haiku       #................................... $4,000
Qwen 3.6 Plus          ##.................................. $6,720
DeepSeek V4 Pro        ##.................................. $7,656
Claude Sonnet 4        ###................................. $15,000
Gemini 3.1 Pro         ####................................ $16,100
Claude Opus 4.7        ######.............................. $25,000
GPT-5.5                ######.............................. $26,000
GPT-5.5 Pro            #################################### $156,000

Per-call cost

Gemini 2.0 Flash       #............................. $0.000960
DeepSeek V4 Flash      #............................. $0.0012
Claude 3.5 Haiku       #............................. $0.0080
Qwen 3.6 Plus          #............................. $0.0134
DeepSeek V4 Pro        #............................. $0.0153
Claude Sonnet 4        ###........................... $0.0300
Gemini 3.1 Pro         ###........................... $0.0322
Claude Opus 4.7        #####......................... $0.0500
GPT-5.5                #####......................... $0.0520
GPT-5.5 Pro            ############################## $0.3120

Embedding model pricing (separate from LLM inference)

Embeddings are billed separately. They are typically a small fraction of total RAG cost, but corpus re-embedding can be non-trivial: a knowledge base that runs to billions of tokens costs hundreds of dollars per full re-embed at commercial rates.

Model                   Vendor              $ / 1M tokens   Notes
text-embedding-3-large  OpenAI              $0.13           Default for new RAG builds. 3,072 dimensions, strong retrieval quality on English-heavy corpora.
embed-multilingual-v3   Cohere              $0.10           100+ languages. Strong on retrieval over multilingual or non-English corpora.
voyage-3                Voyage AI           $0.06           Anthropic-recommended embedding model. Best benchmark scores on technical / code-heavy retrieval as of mid-2026.
BGE-M3                  BAAI / self-hosted  $0.00           Open-weight. Free to self-host. Multilingual and competitive quality vs commercial embeddings.

At 50 tokens per query embedded with OpenAI text-embedding-3-large, 500K queries/month costs roughly $3.25 (25M tokens at $0.13 per 1M), negligible next to LLM inference. Corpus embedding is the bigger line item: a 100M-token knowledge base costs about $13 with 3-large or $0 with self-hosted BGE-M3, and the bill scales linearly with corpus size from there.
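
As a sanity check on the per-query figure, a minimal sketch of the same arithmetic (the constants and names are ours):

    # Per-query embedding cost at production volume.
    EMBED_RATE_PER_M = 0.13        # USD per 1M tokens, text-embedding-3-large
    TOKENS_PER_QUERY = 50
    QUERIES_PER_MONTH = 500_000

    monthly_tokens = TOKENS_PER_QUERY * QUERIES_PER_MONTH          # 25M tokens
    monthly_cost = monthly_tokens / 1_000_000 * EMBED_RATE_PER_M
    print(f"{monthly_tokens / 1e6:.0f}M tokens -> ${monthly_cost:.2f}/month")
    # -> 25M tokens -> $3.25/month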

Which model wins for RAG?

Recommended pick: DeepSeek V4 Pro. At $1.74 / $3.48 per 1M tokens it is the cheapest model with frontier-adjacent retrieval reasoning. On the 500K-query/month workload it lands at roughly $7.7K, about 3.4x cheaper than GPT-5.5 and roughly 20x cheaper than GPT-5.5 Pro for very similar answer quality on grounded RAG tasks.

Runner-up: Gemini 3.1 Pro. When the retriever is loose (you want to stuff 50+ chunks into the context), Gemini 3.1 Pro's 2M-token window is a structural advantage. At $3.50 / $10.50 per 1M tokens it is the natural pick for long-context RAG. Honorable mentions in the mid-tier (GPT-5.5 Mini class): Claude Sonnet 4 and Qwen 3.6 Plus are both reasonable choices.

When to use a cheap model

  • FAQ / knowledge-base queries with short well-defined answers
  • Internal employee search (not customer-facing)
  • High-volume product-search RAG (e-commerce style)
  • Workloads where the retriever does most of the work and the LLM is just a phraser
  • RAG over English-only or single-domain corpora

When to use a frontier model

  • Customer-facing RAG with reputational risk (legal, medical, financial)
  • Multi-step reasoning across retrieved chunks
  • Loose retrieval where the model needs to identify which chunks are relevant
  • Cross-lingual RAG (query in one language, corpus in another)
  • Agentic RAG with tool use and multi-turn refinement

Embedding cost is small but corpus re-embedding is not

Per-query embedding cost at production scale is rounding error: a few dollars a month at 500K queries. The trap is corpus re-embedding: every time you change embedding models or chunking strategy you re-embed the entire knowledge base. For a 100M-token corpus that is about $13 with OpenAI 3-large, $6 with Voyage-3, $10 with Cohere v3, or $0 with self-hosted BGE-M3; for billion-token enterprise corpora the same re-embed runs into the hundreds or thousands of dollars. Plan re-embedding into your budget at the same cadence as model upgrades; a quick way to estimate it is sketched below.
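
A minimal sketch for budgeting a full re-embed, assuming the per-1M rates from the table above (the dictionary and helper function are ours, not any vendor's API):

    # Corpus re-embedding cost: corpus_tokens / 1M * rate_per_1M.
    EMBED_RATES_PER_M = {
        "text-embedding-3-large": 0.13,   # OpenAI
        "embed-multilingual-v3":  0.10,   # Cohere
        "voyage-3":               0.06,   # Voyage AI
        "BGE-M3 (self-hosted)":   0.00,   # open-weight; ignores GPU/ops cost
    }

    def reembed_cost(corpus_tokens: int, rate_per_m: float) -> float:
        return corpus_tokens / 1_000_000 * rate_per_m

    corpus_tokens = 100_000_000   # the 100M-token example corpus from this post
    for model, rate in EMBED_RATES_PER_M.items():
        print(f"{model:24s} ${reembed_cost(corpus_tokens, rate):,.2f}")
    # text-embedding-3-large   $13.00
    # embed-multilingual-v3    $10.00
    # voyage-3                 $6.00
    # BGE-M3 (self-hosted)     $0.00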

Sources

Pricing data sourced from official provider pages and OpenRouter, 2026-05-06. Embedding pricing from OpenAI, Cohere, and Voyage AI public pricing pages. BGE-M3 is open-weight under the MIT license.