Cheapest LLM for Vision (May 2026)

Multimodal models that accept image input, ranked by output token price. Quality scores from 2026-05-06 document, chart, and scene evals.

Vision capability means a model accepts image input alongside text. The use cases split into roughly four buckets: OCR (extracting text from images), document understanding (charts, tables, scanned forms), scene understanding (what is in this photo), and UI understanding (screenshots, layouts). Each bucket has different quality leaders.

Cheap matters because image tokens are expensive per pixel and add up fast. A single 4K product photo can cost more than 5,000 text tokens of input. High-volume pipelines — receipt OCR, moderation, image-to-caption — burn budget on input even before the model emits a single output token. Picking the cheapest model that still hits your accuracy bar is the only way to make the unit economics work at scale.
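To make that concrete, here is a back-of-envelope sketch of the input-side bill for a receipt-OCR pipeline. The daily volume and prompt length are illustrative assumptions; the ~1,300 tokens per 1024x1024 image and the $0.075/1M input rate (Gemini 2.0 Flash) are the figures used elsewhere in this article.

```python
# Back-of-envelope input cost for a high-volume vision pipeline.
# Volume and prompt size are illustrative assumptions.
IMAGES_PER_DAY = 100_000
TOKENS_PER_IMAGE = 1_300        # typical 1024x1024 image, per this article
PROMPT_TOKENS = 50              # short instruction, e.g. "Extract total + date"
PRICE_PER_M_INPUT = 0.075       # Gemini 2.0 Flash input, $ per 1M tokens

daily_input_tokens = IMAGES_PER_DAY * (TOKENS_PER_IMAGE + PROMPT_TOKENS)
daily_input_cost = daily_input_tokens / 1_000_000 * PRICE_PER_M_INPUT
print(f"{daily_input_tokens:,} input tokens/day -> ${daily_input_cost:,.2f}/day")
```

Even at the cheapest rate in the table, that is roughly $10/day before a single output token; at frontier input prices ($5.00/1M) the same workload would cost about 65x more.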

Ranking — cheapest first

Model                  Input /1M   Output /1M   Quality   Notes
Nemotron 3 Nano Omni   Free        Free         80/100    Self-host. Vision + audio + text in one stack. Open weights.
DeepSeek V4 Flash      $0.14       $0.28        78/100    Vision works for basic scenes; weak on dense documents.
Gemini 2.0 Flash       $0.075      $0.30        82/100    Cheapest input price of any hosted multimodal API. Strong on simple OCR and tagging.
DeepSeek V4 Pro        $1.74       $3.48        88/100    Solid vision at frontier-adjacent quality, fraction of the price.
Qwen 3.6 Plus          $1.40       $5.60        85/100    Strong on document understanding; good multilingual OCR.
Gemini 3.1 Pro         $3.50       $10.50       94/100    Best value for charts, diagrams, and long-form image+text reasoning.
Claude Opus 4.7        $5.00       $25.00       96/100    Best on dense documents: tables, charts, scanned PDFs.
GPT-5.5                $5.00       $30.00       93/100    Strong all-round vision; best for UI screenshots.
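The table's list prices fold into a quick per-request cost sketch. The dictionary keys below are shorthand labels for this article's models, not real API identifiers, and the example workload (a ~1,300-token image plus a 60-token caption) is an illustrative assumption.

```python
# Per-request cost from the table's list prices: $ per 1M tokens (input, output).
# Nemotron is omitted because it is self-hosted (infra cost only).
PRICES = {
    "gemini-2.0-flash":  (0.075, 0.30),
    "deepseek-v4-flash": (0.14, 0.28),
    "qwen-3.6-plus":     (1.40, 5.60),
    "deepseek-v4-pro":   (1.74, 3.48),
    "gemini-3.1-pro":    (3.50, 10.50),
    "gpt-5.5":           (5.00, 30.00),
    "claude-opus-4.7":   (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the table's list prices."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example: caption one ~1,300-token image with a ~60-token caption.
for model in ("gemini-2.0-flash", "claude-opus-4.7"):
    print(f"{model}: ${request_cost(model, 1_300, 60):.6f} per call")
```

Per call the gap looks tiny, but at a million calls it is the difference between roughly $115 and $8,000.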

Cost visualised

Cost per 1M output tokens (lower is better)
DeepSeek V4 Flash      # $0.28
Gemini 2.0 Flash       # $0.30
DeepSeek V4 Pro        ##### $3.48
Qwen 3.6 Plus          ####### $5.60
Gemini 3.1 Pro         ############## $10.50
Claude Opus 4.7        ################################# $25.00
GPT-5.5                ######################################## $30.00

The winner

Gemini 2.0 Flash

Gemini 2.0 Flash at $0.075 in / $0.30 out per 1M tokens is the cheapest multimodal endpoint in production, and quality on everyday vision tasks (tagging, simple OCR, scene understanding) is more than acceptable. For document-heavy work with dense tables or charts, escalate to Gemini 3.1 Pro or Claude Opus 4.7 — but for high-volume image classification, moderation, and basic captioning, Flash is the right default.

Honourable mentions

  • Gemini 3.1 Pro — best value when image quality matters and 2M context is useful for multi-image batches.
  • Claude Opus 4.7 — best for dense document understanding (charts, tables, scanned PDFs). Pay the premium when quality is non-negotiable.
  • Nemotron 3 Nano Omni 30B (self-host) — vision + audio + text in one stack. Permissive license. Right call when you need full modality coverage at infra-only cost.

When to upgrade to a frontier model

  • You need to read tables, charts, or financial documents accurately.
  • Scanned-document OCR accuracy on the cheap tier drops below 92%.
  • Workload involves cross-image reasoning (compare A vs B).
  • You need fine-grained UI element identification (selectors, click targets).
  • Cost of a misread (medical, legal, financial) exceeds the per-call price gap.
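One way to operationalise this checklist is a routing predicate along these lines. The flag names and model labels are illustrative shorthand (not real API identifiers), and only the 92% OCR threshold comes from the checklist above; the priority order is one reasonable reading of this article's recommendations.

```python
# Route a vision request per the upgrade checklist. Flags and model
# names are illustrative shorthand, not real API identifiers.
def pick_vision_model(
    dense_documents: bool = False,          # tables, charts, financial docs
    cheap_tier_ocr_accuracy: float = 1.0,   # measured on your own eval set
    cross_image: bool = False,              # compare image A vs image B
    ui_elements: bool = False,              # selectors, click targets
    high_stakes: bool = False,              # misread cost >> per-call price gap
) -> str:
    if ui_elements:
        return "gpt-5.5"              # article: best for UI screenshots
    if dense_documents or high_stakes:
        return "claude-opus-4.7"      # article: best on dense documents
    if cross_image or cheap_tier_ocr_accuracy < 0.92:
        return "gemini-3.1-pro"       # strong vision at mid-tier price
    return "gemini-2.0-flash"         # cheap default for everything else

print(pick_vision_model())                      # cheap default
print(pick_vision_model(dense_documents=True))  # escalate to the document leader
```

The key point is that escalation is driven by measured accuracy on your own eval set, not by vibes: run the cheap tier first, and only route up when a trigger actually fires.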

FAQ

What is the cheapest free option for vision?

Self-host Nemotron 3 Nano Omni 30B (vision + audio + text, open weights, permissive license). Llava-Next is a lighter alternative if you only need image-to-text. Quality is competitive with mid-tier paid APIs on most natural scenes.

What is the cheapest model with vision API?

Gemini 2.0 Flash at $0.075 in / $0.30 out per 1M tokens, image input included. DeepSeek V4 Flash at $0.14/$0.28 has a slightly cheaper output rate but handles only basic scenes; for document-heavy work on a budget, Qwen 3.6 Plus ($1.40/$5.60) is the cheapest option with strong OCR.

What is the cheapest open-weight vision option?

Nemotron 3 Nano Omni 30B for the broadest modality coverage; Llava-Next for narrower image-only work; or Qwen 3.6 VL self-hosted if you have the GPU budget.

What is the best-value model for production vision?

Gemini 3.1 Pro at $3.50 in / $10.50 out is the sweet spot — 2M context for multi-image batches, strong vision quality across charts/tables/scenes, and pricing well below the OpenAI/Anthropic frontier tier.

What should I watch out for?

Image tokens add up fast. A single 1024x1024 image is ~1300 tokens on most providers; a 4K image is 5000+. High-volume vision pipelines often pay more in image tokens than text. Always check the provider-specific image-token formula before estimating cost.
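As a rough sketch only: many providers bill images tile-by-tile, and the toy formula below reproduces this article's ~1,300-token figure for a 1024x1024 image under an assumed 512-px tile and 325 tokens per tile. Real billing formulas differ per provider (fixed overheads, downscaling caps, detail levels), so treat this as an estimate, not billing truth.

```python
import math

def estimate_image_tokens(width: int, height: int,
                          tile_px: int = 512,
                          tokens_per_tile: int = 325) -> int:
    """Tile-based token estimate. The tile size and per-tile cost are
    illustrative assumptions; check your provider's formula before
    relying on this for cost projections."""
    tiles = math.ceil(width / tile_px) * math.ceil(height / tile_px)
    return tiles * tokens_per_tile

print(estimate_image_tokens(1024, 1024))   # 4 tiles  -> 1300 tokens
print(estimate_image_tokens(3840, 2160))   # 40 tiles -> 13000 tokens
```

Note how a 4K frame costs roughly 10x a 1024px one under this formula: resizing images before upload is often the single biggest cost lever in a vision pipeline.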

All prices from official provider pages and OpenRouter as of 2026-05-06. Vision quality scores from internal Swfte evals on document, chart, and scene benchmarks across 32 catalogued models.