Updated May 15, 2026 · 8 min read

DeepSeek API pricing (May 2026)

TL;DR: DeepSeek is the open-weights frontier line; Apache 2.0 licensed, fully self-hostable, and ~1/8 the price of GPT-5.5 Pro on the hosted API at similar Arena Elo. V4 Pro is $1.74 / $3.48 per 1M tokens. V4 Flash at $0.14 input is approximately the cheapest serious LLM on the market.

Every DeepSeek model and its per-token price

ModelInput / 1MCached input / 1MOutput / 1MContextNotes
DeepSeek V4 Pro$1.74$0.174$3.48256KOpen-weights flagship. Arena 1462 — within 20 Elo of GPT-5.5 at 1/8 the price.
DeepSeek V4$0.50$0.050$1.50128KDefault tier. Excellent value-per-token.
DeepSeek V4 Flash$0.14$0.014$0.28128KCheapest tier. Strong on classification + extraction.
DeepSeek R1 (reasoning)$0.55$0.055$2.19128KReasoning model with open weights. Competitive with o3 on math/code.

All prices in USD per 1 million tokens. Last reviewed 2026-05-15. Provider pricing pages are authoritative, confirm before contracting.

How DeepSeek pricing actually works

DeepSeek API pricing in 2026 is structured for two distinct users, and the hosted endpoint targets developers who want frontier quality at fraction of the cost of GPT-5.5 / Claude, while the open weights target enterprises that need full sovereignty over the model and training data. The four-tier line spans 25× from V4 Flash ($0.14 input) to V4 Pro ($1.74 input).

DeepSeek adoption falls into three clear segments. Cost-sensitive production teams use the hosted V4 endpoint as the default model for bulk workloads. chat, classification, RAG, extraction, and route the hardest 10-20% to Claude or GPT-5.5 via a gateway. Sovereignty-sensitive enterprises (defence, regulated finance, healthcare, government) self-host the Apache 2.0 weights to keep training data and inference inside their network. Research teams fine-tune the open weights on domain corpora for specialised assistants.

DeepSeek is open-weights under Apache 2.0 — fully self-hostable for sovereignty workloads. Cache discount is 90%.

Prompt caching: the 90% discount most teams ignore

DeepSeek’s prompt cache discount matches Anthropic at 90% off. On the hosted API the cache is automatic and operates on prefix matching with a 5-minute TTL. On self-hosted deployments, caching is implemented in the serving framework (vLLM, SGLang, TRT-LLM, Tensorzero) and can be tuned to your workload shape. Combined with the already low headline price, cached DeepSeek V4 lands at effectively $0.05 per 1M input: cheap enough that cost rarely binds production decisions.

On a typical coding agent run that re-sends the same 200K-token codebase across 10 turns, prompt caching reduces effective input cost by 80-90%. The cached-input column in the table above is the right number to plug into a production budget; the headline input rate is the "new conversation" rate, not the steady-state rate.

Batch inference, and half-price overnight

No hosted batch tier today. For batch workloads, the standard path is self-hosting on rented GPUs; DeepSeek V4 runs on 4× 8-GPU H100 nodes ($40-80/hr) and handles 100M+ tokens/day at full utilisation. Above ~70% sustained utilisation, self-hosting beats hosted batch on cost by 3-5×.

Batch + cache stack. The combined effective rate for a cache-warm, batched call is often 5-10% of the headline price. For workloads like nightly eval suites, large-scale classification, document enrichment, and synthetic data generation, batching is free money.

Four real production cost scenarios

WorkloadDetailHeadline costWith cacheWith batch
Chat (DeepSeek V4)1M tokens in, 100K tokens out$0.50 + $0.15 = $0.65$0.05 + $0.15 = $0.20N/A
Coding agent (DeepSeek V4 Pro)200K codebase × 10 turns × 2K out$3.48 + $0.07 = $3.55$0.35 + $0.07 = $0.42N/A
High-volume extraction (DeepSeek V4 Flash)100M in, 5M out$14.00 + $1.40 = $15.40$1.40 + $1.40 = $2.80N/A
Reasoning (DeepSeek R1)500K in, 50K out$0.275 + $0.11 = $0.385$0.028 + $0.11 = $0.138N/A

The routing pattern that cuts DeepSeek spend 60-80%

The dominant pattern for DeepSeek in production: use it as the default workhorse, route to Claude or GPT-5.5 only for the hardest 10-20% of requests. A typical fleet runs 80% DeepSeek V4 / 15% Claude Sonnet 4 / 5% Claude Opus 4.7, and total model spend drops 60-80% versus an all-Claude or all-GPT deployment with imperceptible quality drop on the bulk of traffic.

A typical production fleet settles into a 70/25/5 split. 70% of requests handled by the smallest competent tier, 25% by the mid-tier workhorse, 5% promoted to the flagship. Done well, this cuts model spend 60-80% versus naive single-model use without any measurable quality drop on the bulk of requests.

With an AI gateway in front, the routing rule is one config block: declare a default model, declare promotion triggers, declare a fallback to a second provider for availability. Applications keep using a single OpenAI-compatible endpoint. See Swfte for a managed runtime that bundles the gateway, observability, eval, and per-team cost ceilings.

Enterprise considerations

Enterprise paths for DeepSeek diverge sharply from the hosted Chinese-jurisdiction API. Most regulated enterprises use one of three options. (1) Self-host the open weights inside their VPC or on-prem on rented or owned GPUs. (2) Use a SOC2-certified third-party host. Together AI, Fireworks AI, DeepInfra, Replicate, that wraps DeepSeek in standard US/EU compliance posture. (3) Run DeepSeek through Anthropic’s, AWS Bedrock’s, or Together’s gateway alongside other providers under one contract. Direct use of the hosted DeepSeek API is rare in regulated industries.

  • Prompt caching: Available use it from day one; the headline rate is misleading without it.
  • Batch inference: Not available today.
  • Fine-tuning: Supported on most tiers.
  • On-prem / VPC: Available via Bedrock / Vertex / Azure or direct VPC contract.
  • Zero data retention: Available; default on enterprise contracts.

How DeepSeek compares to the rest of the market

Against OpenAI, DeepSeek V4 Pro is ~1/8 the price of GPT-5.5 Pro at within 20 Arena Elo points: usually the right pick for bulk production. GPT-5.5 retains the edge on hardest reasoning, voice, image generation, and enterprise procurement. Against Claude, DeepSeek wins on price and matches on most general chat / RAG, but Claude leads coding and agent loops. Against Gemini, DeepSeek V4 Flash is roughly the same price as Gemini 2.5 Flash; Gemini wins on multimodal and 2M context. The standard combination: DeepSeek for bulk, Claude for code, GPT-5.5 for voice / images, Gemini for long-context multimodal.

For a full side-by-side, see the API pricing index and the AI model leaderboard for quality / speed / value rankings.

Frequently asked questions about DeepSeek API pricing

What is DeepSeek API pricing in 2026?

DeepSeek V4 Pro is $1.74 per 1M input tokens and $3.48 per 1M output tokens. DeepSeek V4 is $0.50 / $1.50. DeepSeek V4 Flash is $0.14 / $0.28. the cheapest mainstream tier across all frontier providers. DeepSeek R1 (reasoning) is $0.55 / $2.19. Prompt caching at 90% off is available on all tiers.

How does DeepSeek pricing compare to OpenAI and Anthropic?

DeepSeek V4 Pro at $1.74 / $3.48 is roughly 1/8 the cost of GPT-5.5 Pro and ~3× cheaper than Claude Sonnet 4 at similar Arena Elo. DeepSeek V4 Flash at $0.14 input is approximately the cheapest serious LLM on the market.

Are DeepSeek models really comparable to GPT-5.5?

On Arena Elo, yes, DeepSeek V4 Pro sits at 1462, within 20 Elo points of GPT-5.5. On the hardest reasoning benchmarks (AAII, GPQA Diamond), GPT-5.5 and Claude Opus pull ahead. For general chat, classification, extraction, and RAG, DeepSeek V4 is competitive.

What is the open-weights advantage?

DeepSeek V4 is released under Apache 2.0: fully open weights, commercial use allowed, no restrictions on fine-tuning or self-hosting. For sovereignty-sensitive workloads (defence, healthcare, regulated finance) or for teams that need to train on proprietary data without sending it to a vendor, self-hosting DeepSeek is the standard path.

How does DeepSeek's prompt caching work?

DeepSeek offers 90% off the cached input rate; the same depth as Anthropic, the deepest in the industry. Caching is automatic on the hosted API; on self-hosted deployments, you implement caching via the serving framework (vLLM, SGLang, TRT-LLM).

Does DeepSeek offer batch inference?

Not on the hosted API as of mid-2026. For batch use cases, self-hosting on rented GPUs is the standard approach since the cost-per-token is already so low.

Can I fine-tune DeepSeek?

Yes, and fully. The Apache 2.0 weights allow any fine-tuning method, including full-parameter SFT, RLHF, and DPO. Common path: self-host the base model, fine-tune on your proprietary data, deploy on your own GPU fleet. Hosted fine-tuning is also offered through several third-party platforms.

Is DeepSeek safe for enterprise use?

For the hosted API, DeepSeek's data handling and compliance posture lags Anthropic / OpenAI / Google. most regulated enterprises do not use the hosted endpoint directly. The standard production pattern is to self-host the open weights or use a third-party SOC2-certified host (Together AI, Fireworks, DeepInfra) that exposes DeepSeek with enterprise controls.

What is DeepSeek R1 used for?

DeepSeek R1 is the reasoning-tier model, comparable to OpenAI o3, designed for math, complex code generation, and multi-step planning. Priced at $0.55 / $2.19 per 1M, it is ~20× cheaper than o3 at similar capability on the hardest reasoning benchmarks.

How much would self-hosting DeepSeek cost?

DeepSeek V4 (671B parameter mixture-of-experts) requires roughly 4× 8-GPU H100 nodes for low-latency serving: about $40-80/hr on rented GPU cloud, supporting 100M+ tokens/day at full utilisation. Below ~50% utilisation, the hosted API is cheaper. Above ~70% sustained utilisation, self-hosting wins by 3-5×.

Run DeepSeek on a gateway you control

Swfte routes traffic across every major provider, enforces prompt caching, applies per-team budgets, and logs every request for audit. OpenAI-compatible API. Free tier.

Free tier · SOC2 Type II · On-prem / VPC available