Cost of Summarization: AI Model Pricing Compared (May 2026)

Summarization is the canonical long-context task: a knowledge platform takes a 30-page document and condenses it into an executive brief. We price that exact workload across 10 leading LLMs using May 2026 list pricing.

The reference scenario

  • Task: Summarize a 30K token document into a 600 token executive summary
  • Input tokens per call: 30,000
  • Output tokens per call: 600
  • Monthly volume: 5,000 summaries (mid-market knowledge platform)
  • Total tokens / month: 153M

Pricing excludes prompt caching, batch discounts, and committed-use deals; self-hosted open-weight models are out of scope.
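The per-call arithmetic behind every row in the table below is the same: tokens times the per-1M-token rate. A minimal sketch, using the Gemini 3.1 Pro list prices quoted later in this article ($3.50 input / $10.50 output per 1M tokens) — swap in any model's rates:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in USD for a single API call, from per-1M-token list prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# The reference scenario: 30K tokens in, 600 tokens out, 5,000 calls/month.
per_call = call_cost(30_000, 600, 3.50, 10.50)
monthly = per_call * 5_000

print(f"per call:  ${per_call:.4f}")   # $0.1113
print(f"per month: ${monthly:,.2f}")   # $556.50
```

Note that output tokens barely matter here: 600 output tokens cost $0.0063 against $0.1050 for the 30K input tokens, so summarization cost is dominated by the input rate.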

Cost across 10 models, sorted cheapest first

Rank  Model               Per call   Per month   vs cheapest
1     Gemini 2.0 Flash    $0.0032    $16.20      1.0x
2     DeepSeek V4 Flash   $0.0044    $21.84      1.3x
3     Claude 3.5 Haiku    $0.0264    $132        8.1x
4     Qwen 3.6 Plus       $0.0454    $227        14.0x
5     DeepSeek V4 Pro     $0.0543    $271        16.8x
6     Claude Sonnet 4     $0.0990    $495        30.6x
7     Gemini 3.1 Pro      $0.1113    $557        34.4x
8     Claude Opus 4.7     $0.1650    $825        50.9x
9     GPT-5.5             $0.1680    $840        51.9x
10    GPT-5.5 Pro         $1.01      $5,040      311.1x

Monthly spend at 5K summaries

Gemini 2.0 Flash       #................................... $16.20
DeepSeek V4 Flash      #................................... $21.84
Claude 3.5 Haiku       #................................... $132
Qwen 3.6 Plus          ##.................................. $227
DeepSeek V4 Pro        ##.................................. $271
Claude Sonnet 4        ####................................ $495
Gemini 3.1 Pro         ####................................ $557
Claude Opus 4.7        ######.............................. $825
GPT-5.5                ######.............................. $840
GPT-5.5 Pro            #################################### $5,040

Per-call cost (USD)

Gemini 2.0 Flash       #............................. $0.0032
DeepSeek V4 Flash      #............................. $0.0044
Claude 3.5 Haiku       #............................. $0.0264
Qwen 3.6 Plus          #............................. $0.0454
DeepSeek V4 Pro        ##............................ $0.0543
Claude Sonnet 4        ###........................... $0.0990
Gemini 3.1 Pro         ###........................... $0.1113
Claude Opus 4.7        #####......................... $0.1650
GPT-5.5                #####......................... $0.1680
GPT-5.5 Pro            ############################## $1.01

Which model wins for summarization?

For accurate summaries: Gemini 3.1 Pro is our recommended pick. The 2M-token context window handles 30-page documents without chunking, the multilingual fidelity is best-in-class, and at $3.50 / $10.50 per 1M tokens it sits in the middle of the pricing band — not the cheapest, but the best quality-per-dollar at long context. Runner-up: Claude Opus 4.7, which produces tighter, more stylistically controlled prose at roughly 50% higher cost per call.

For high-volume summarization: DeepSeek V4 Pro is the right default. It is roughly 3x cheaper than GPT-5.5 and 2x cheaper than Gemini 3.1 Pro on this workload, with summary quality close enough to frontier on standard documents that the gap rarely shows up in human evaluation. Use DeepSeek V4 Flash for the trivial 60-70% of traffic where the document is short and the summary need is shallow — sub-cent per call at any reasonable volume.
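The Flash-for-trivial-traffic split above can be a one-line router. A minimal sketch — the 5K-token threshold follows the "short or templated documents" guidance below, and the model identifiers are placeholders for whatever client strings your provider uses:

```python
def pick_summarizer(doc_tokens: int, threshold: int = 5_000) -> str:
    """Route by document length alone: cheap tier for short docs,
    pro tier for everything else. Model names are illustrative."""
    return "deepseek-v4-flash" if doc_tokens < threshold else "deepseek-v4-pro"

print(pick_summarizer(2_000))   # deepseek-v4-flash
print(pick_summarizer(30_000))  # deepseek-v4-pro
```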

When to use a cheap model

  • Short or templated documents (under 5K tokens)
  • Internal-only summaries that humans will read once
  • Bulk pre-processing for downstream tasks (classification, retrieval)
  • Languages well-represented in the model's training (English, Mandarin, Spanish)
  • Workloads where you can A/B-test and accept a small quality drop

When to use a frontier model

  • Customer-facing summaries (legal, financial, medical)
  • Long documents (50K+ tokens) where coherence matters
  • Multilingual or low-resource languages
  • Summaries that drive a downstream decision (board memos, M&A briefs)
  • When the cost of a wrong summary >> cost of inference

The cascade pattern saves 70-80%

Most teams running production summarization at scale do not pick one model. They route: cheap model first, then a quality-check classifier decides whether to escalate to a frontier model. On a 5K/month workload the cascade typically delivers 90% of frontier-quality summaries at 20-25% of the frontier cost. We built a calculator for that on the Model-Mixing Cost Savings page.
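The cascade's blended cost is simple expected-value arithmetic: every call pays the cheap model, and the escalated fraction also pays the frontier model. A sketch using this article's DeepSeek V4 Flash and Gemini 3.1 Pro per-call figures; the 20% escalation rate is an assumption for illustration:

```python
def cascade_cost(cheap: float, frontier: float, escalation_rate: float) -> float:
    """Expected per-call cost: cheap model on every call, frontier
    model on the escalated fraction."""
    return cheap + escalation_rate * frontier

blended = cascade_cost(0.0044, 0.1113, 0.20)
print(f"blended: ${blended:.4f} per call "
      f"({blended / 0.1113:.0%} of frontier cost)")
```

At a 20% escalation rate the blended cost lands around $0.027 per call, roughly a quarter of the all-frontier price — which is where the 20-25% figure above comes from. The classifier's own cost is ignored here; in practice it adds a small overhead.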

Pricing data sourced from official provider pages and OpenRouter as of 2026-05-06. Effective production cost will be 1.5-3x higher once you account for system prompts, retries, and priority-tier surcharges — see our per-million-tokens true cost breakdown.