Updated May 15, 2026 · 7 min read

Prompt Caching (July 2026)

TL;DR: Prompt caching skips re-processing the stable prefix of a prompt. Anthropic charges ~10% for cache reads (explicit cache_control). OpenAI / Azure are automatic on prompts ≥ 1,024 tokens at ~50% discount. Gemini, DeepSeek, xAI each have their own knob. Healthy production cache hit rate is 60–90%. Below: per-provider pricing, 6 patterns, code, and the gotchas.

Prompt caching by provider

Provider	Cache read	Cache write	Notes
Anthropic (Claude)	~10% of base input	125% of base input (5 min TTL) / 200% (1 hour TTL)	Explicit cache_control on message blocks. Minimum 1,024 tokens (Haiku/Sonnet) or 2,048 (Opus). 5-minute and 1-hour TTLs.
OpenAI (GPT)	~50% of base input (automatic)	Free (no surcharge)	Automatic for prompts ≥ 1,024 tokens. No API knob. Best with stable prefix → variable suffix order.
Google (Gemini)	~25% of base input	Storage charged per-hour	Explicit "context caching" API. Minimum 32k tokens (Gemini 1.5 Pro), lower on newer models. Cache lives by TTL you set.
DeepSeek	~10% of base input (automatic)	Free	Automatic disk-tier cache for repeated prefixes. Among the cheapest cached reads in the market.
xAI (Grok)	~25% of base input	Free	Automatic prefix cache, no API knob.
AWS Bedrock	Provider-passed (e.g. ~10% on Claude)	Provider-passed surcharge	Mirrors Anthropic / others depending on the model family selected.
Azure OpenAI	~50% of base input (automatic)	Free	Mirrors OpenAI semantics on Azure-hosted GPT models.
Swfte gateway	Passes through provider discount	Passes through provider surcharge	Single cache_control surface across providers; emits cache-hit telemetry; auto-rewrites the request to take advantage of provider-specific caching.

6 prompt-caching patterns that actually pay off

Long system prompt

Long instructions / persona / tools list cached once, then re-used across every user message. Single biggest win for chat apps and agents.

Retrieval context

A retrieved corpus that's stable across a session (a book, a codebase, a PDF) cached as the prefix; the user question is the only changing part.

Few-shot examples

A long block of in-context examples for a classification / extraction task. The examples don't change between calls; cache them.

Tool definitions

A long list of function / tool schemas (10+ tools, each with a JSON schema) cached as part of the system prefix.

Conversational history

For long sessions, cache the growing transcript so the model doesn't re-tokenise + re-attend over hundreds of prior turns on each call.

Multi-document RAG

For domain-grounded agents that load the same large reference docs every call (regulatory text, product manuals), cache the reference set as the prefix.

The mechanics — what providers actually cache

When you send a prompt, the provider tokenises it and the model runs a forward pass to compute key-value (KV) states for every layer. Those KV states are the expensive part — they\'re what the model "remembers" about your prompt while generating. Prompt caching is just persisting those KV states keyed by a hash of the prompt prefix, so a subsequent request with the same prefix can skip the forward pass over the cached portion and reuse the saved KV state.

This is why caching only works on the prefix: the KV state for token N depends on tokens 1..N, so the moment your prompt diverges from the cache, everything after the divergence has to be recomputed. The practical implication is to order your prompt from most-stable to most-volatile — system prompt + tool definitions first, retrieval context next, user message last. Reversing that order kills your cache hit rate.

The TTL on the cache is short (5 minutes default for Anthropic, similar order for others) because KV state for a long prompt is memory-heavy — providers can't keep every prefix indefinitely. For sticky workloads, Anthropic offers a 1-hour TTL at a higher write surcharge; for bursty workloads the 5-minute TTL is almost always the right choice.

FAQ

What is prompt caching?

Prompt caching is a technique LLM providers use to skip re-processing the parts of a prompt they've already seen. When a request arrives, the provider checks if a prefix of the prompt matches one they've cached recently; if so, the cached KV state is re-used and the cached portion is billed at a 10–50% discount (and runs faster). It's the single biggest cost / latency optimisation for chat apps, agents, and long-context RAG.

How does prompt caching save money?

Two ways. Cost: cached input tokens are billed at 10–50% of normal input rate — for Anthropic Claude that's 90% off; for OpenAI / Azure / Gemini that's 50–75% off. Latency: cached prefixes skip the prompt-processing step entirely, often shaving 30–80% off time-to-first-token. For an agent with a 50k-token system prompt and a 200-token user message, you bill the 200 tokens at full price and the 50k at the cache rate — a 90%+ saving on a per-call basis.

Anthropic vs OpenAI vs Gemini prompt caching — what's the difference?

Anthropic Claude: explicit — you set cache_control on the message blocks you want cached. 5-minute or 1-hour TTL. Cache writes cost a 25–100% surcharge over base input, cache reads are ~10% of base input. OpenAI: automatic on prompts ≥ 1,024 tokens — no API knob — 50% discount on cached input. Gemini: explicit "context caching" API, minimum size varies by model, charged for cache storage per hour. DeepSeek / xAI: automatic prefix caching, 10–25% read pricing. Pick the model first, then design your prompts to maximise the prefix.

When is prompt caching worth it?

Three signals. (1) Stable prefix ≥ 1,024 tokens. If your system prompt and tool definitions are short, the savings are small. (2) High call volume to the same prefix. The cache amortises across calls — a one-off call doesn't recoup the write surcharge. (3) Latency-sensitive UX. The latency win often matters more than the cost win for chat / agent apps. The rule of thumb: long, stable prefix + ≥ 3 calls within the cache TTL = always worth it.

What are the limits of prompt caching?

Five honest constraints. (1) Cache TTL is short — Anthropic's default is 5 minutes; for sticky workloads use the 1-hour TTL or amortise across burst calls. (2) Cache works on the prefix only — change the system prompt by a single token and you lose the cache. (3) Cache keys are provider-specific and not exposed — you can't inspect or invalidate manually. (4) Cache write surcharge (Anthropic) means short bursts can be net-negative. (5) Some providers don't support caching for all model versions — check before assuming.

How do I implement prompt caching in code?

For Anthropic Claude: set cache_control on the message block you want cached. Example: { role: "system", content: [{ type: "text", text: longInstructions, cache_control: { type: "ephemeral" } }] }. For OpenAI: nothing to do — caching is automatic when the prompt is ≥ 1,024 tokens and the prefix matches a recent request. For Gemini: use the context caching API to create a cached content object, then reference its ID in subsequent generateContent calls. For multi-provider apps: route through a gateway (Swfte, LiteLLM, Portkey) that normalises the cache_control surface across providers.

How do I measure prompt cache hit rate?

Every provider returns cache metadata in the API response. Anthropic: usage.cache_read_input_tokens and cache_creation_input_tokens. OpenAI: usage.prompt_tokens_details.cached_tokens. Gemini: usage_metadata.cached_content_token_count. The KPI is cache-read tokens / total input tokens — a healthy agent app lands at 60–90%. If your hit rate is below 30%, the cause is usually prompt drift (system prompt rebuilt with a timestamp or random ID inside the cached region). Move the volatile parts to the end of the prompt.

Prompt caching vs RAG caching vs response caching — what's the difference?

Prompt caching skips re-processing prompt tokens — the model still runs, but cheaper and faster. RAG caching means caching retrieval results (vector DB hits) so you don't re-query the index. Response caching (semantic / exact) skips the model call entirely when an identical or sufficiently similar request was seen recently. The three stack: response cache catches identical-question scenarios, RAG cache catches identical-retrieval scenarios, prompt cache catches everything else where the prefix is stable. Strong production stacks layer all three.

Get prompt caching across every provider

Swfte's gateway normalises cache_control across Claude, GPT, Gemini, DeepSeek, xAI, Bedrock. One API, cache-hit telemetry, and automatic prompt rewriting to maximise hit rate.

Start free Talk to us

Free tier · OpenAI-compatible API · SOC2 Type II · On-prem available