Glossary

What is prompt caching?

Prompt caching is a server-side optimization where a model provider (Anthropic, OpenAI, Google) stores the processed state of your prompt prefix. The next request that starts with the same prefix skips that work, so you pay a fraction of the cost and get a faster response. Cached reads run up to 90 percent cheaper.

Written by Ragavendra S, Founder of FRE|Nxt Labs. Last updated: April 25, 2026.

In one sentence

Send the same prefix twice, pay for it once.

The longer answer

The economics of repeated prompts

Most production LLM apps send prompts that share huge chunks of content. A support chatbot has the same 3,000-token system prompt on every message. A coding agent re-sends the same tool definitions every turn. A RAG app reuses the same 8 chunks across a multi-turn conversation. Without caching, you pay the model to re-read that same content every single call.

Prompt caching turns that waste into a line item. The provider hashes your prefix, stores the KV cache (the model’s internal state after reading it), and the next matching request starts from that state. For Claude Sonnet 4.6, a cached token read is roughly $0.30 per million versus $3.00 uncached, a 10x discount. Latency drops too because the model skips most of the prefill step.
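The per-token discount above is easiest to see as arithmetic. A quick sketch using the illustrative Sonnet prices from this paragraph (the helper function and token count are ours, not part of any SDK):

```python
# Cost of reading a 5,000-token prefix, cached vs uncached,
# at the illustrative Claude Sonnet prices quoted above.
UNCACHED_PER_M = 3.00  # dollars per million input tokens
CACHED_PER_M = 0.30    # dollars per million cached-read tokens

def input_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of processing `tokens` input tokens at a given rate."""
    return tokens / 1_000_000 * price_per_million

prefix_tokens = 5_000
uncached = input_cost(prefix_tokens, UNCACHED_PER_M)  # 0.015 dollars
cached = input_cost(prefix_tokens, CACHED_PER_M)      # 0.0015 dollars
print(f"uncached: ${uncached:.4f}, cached read: ${cached:.4f}")
```

The same 5,000-token prefix costs a tenth as much on every cache hit, and that multiplier compounds across every call in a busy app.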

The catch is prefix stability. The cache hits only when the beginning of the prompt is byte-for-byte identical. Put volatile content (timestamps, user input, random ids) at the front and you lose every hit. Every production AI system we audit either has prompt caching set up correctly or is leaving 50 to 80 percent of its model spend on the table.

How it works

The 4-step flow

  1. First request writes

    You send a prompt and mark a cache breakpoint after the stable prefix (system prompt, tool definitions, docs). The provider processes the prefix, stores the KV cache, and charges a 25 percent markup on those input tokens.

  2. Follow-up request hits

    A second request arrives whose beginning matches the cached prefix exactly. The provider reuses the stored state. You pay roughly 10 percent of the normal input price for those tokens and full price only for the new tail.

  3. TTL expires

    After 5 minutes of inactivity (or 1 hour with Anthropic's extended cache) the entry is evicted; each cache hit refreshes the TTL. Once it expires, the next request rewrites the cache and pays the write cost again.

  4. You measure and iterate

    Track cache_read_input_tokens vs cache_creation_input_tokens in your logs. A healthy system sees a 60 to 90 percent cache hit rate. Below 40 percent usually means your prefix is not stable.
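The measurement in step 4 can be sketched as a small helper. The field names follow Anthropic's per-response `usage` block; the sample log records are made up for illustration:

```python
def cache_hit_rate(usage_records: list[dict]) -> float:
    """Fraction of cacheable input tokens served from cache.

    Each record is a logged `usage` block containing
    `cache_read_input_tokens` and `cache_creation_input_tokens`.
    """
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    total = reads + writes
    return reads / total if total else 0.0

# Three calls sharing a 3,000-token prefix: one write, then two hits.
logs = [
    {"cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
]
print(f"hit rate: {cache_hit_rate(logs):.0%}")  # 67%: two hits out of three calls
```

Aggregate this per model and per prompt version so a deploy that silently tanks the hit rate shows up the same day.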

When to use prompt caching

  • Multi-turn chat with a stable system prompt.
  • Coding agents with large tool definitions.
  • RAG apps where retrieved chunks repeat across a conversation.
  • Batch processing the same context across many inputs.
  • Any prefix over 1,024 tokens reused within 5 minutes.

When NOT to use prompt caching

  • One-shot calls where the prompt never repeats.
  • Prefixes under the minimum cacheable size (1,024 tokens on Anthropic).
  • Prompts where the first bytes always change (timestamps, random ids).
  • Extremely low-volume apps where cache writes never amortize.

Common mistakes

What we see go wrong

Putting volatile content first

A timestamp or request id in the first 100 tokens kills every cache hit. Push those to the end of the prompt or drop them.

Forgetting the cache breakpoint

On Anthropic you must explicitly add cache_control to mark what to cache. No breakpoint means no cache, even if the prefix is stable.
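A minimal sketch of what the breakpoint looks like in an Anthropic Messages API request body. The model name and prompt text are placeholders; `cache_control` with `{"type": "ephemeral"}` is the real field:

```python
# Everything up to and including the block marked with cache_control
# is eligible for caching; the user turn below it is not.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support agent for Acme...",  # long stable prompt
            "cache_control": {"type": "ephemeral"},  # <-- the breakpoint
        }
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"}
    ],
}
```

Without that one `cache_control` entry the request is valid but nothing is cached, which is why this mistake is invisible until you look at the usage fields.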

Not monitoring cache hit rate

Without dashboards on cache_read vs cache_creation tokens, you cannot tell a 5 percent hit rate from an 80 percent one. Log it per call.

Invalidating the cache on every deploy

A tiny system prompt change wipes every cache. Batch prompt updates and deploy during low traffic so the rebuild cost is contained.

Relying on caching alone to save cost

Caching is layer one. You also need model routing (Haiku for easy tasks, Sonnet for medium, Opus for hard) and smaller context. Caching on top of an already-bloated prompt only masks the problem.

FAQ

Common questions about prompt caching

How much cheaper is a cached prompt?

On Anthropic, a cache read costs about 10 percent of the uncached input price. Cache writes cost 25 percent more than normal input. So a prefix reused even once already saves money (1.25x + 0.10x is well under 2x). After five cache hits you have saved roughly 70 percent on that prefix, and the savings approach 90 percent as reuse grows.
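The break-even arithmetic follows directly from the two multipliers (1.25x to write, 0.10x to read). A sketch:

```python
def relative_cost(n_reads: int, write_mult: float = 1.25,
                  read_mult: float = 0.10) -> float:
    """Cost of 1 cache write + n_reads cache reads on a prefix,
    relative to sending that prefix uncached (1 + n_reads) times."""
    return (write_mult + n_reads * read_mult) / (1 + n_reads)

for n in (1, 2, 5, 20):
    print(f"{n} reuse(s): pay {relative_cost(n):.0%} of the uncached cost")
# 5 reuses: pay ~29% of uncached, i.e. save ~70% on that prefix.
```

The curve flattens toward the 10 percent read price, so high-traffic prompts capture nearly the full discount while a prefix that is never reused costs 25 percent extra.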

How long does the cache last?

Anthropic offers a 5-minute default TTL and a 1-hour option. OpenAI and Google have their own windows. Plan your traffic so related calls land inside the window. If your calls are spaced 10 minutes apart, the 5-minute cache is useless.

Do I need to do anything special to use prompt caching?

On Anthropic you add cache_control breakpoints to the request. On OpenAI it is automatic for prefixes above 1,024 tokens. On Google you create an explicit cached content object. Check your SDK docs, but all three major providers support caching in 2026.

What should I put in the cache?

System prompt, tool definitions, and any large context that does not change between calls (a knowledge base chunk, a full document). Put the user message last and uncached. That way the prefix is stable and gets hit repeatedly.
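That ordering can be sketched as a request layout with two breakpoints, one after the tool definitions and one after the system prompt. The names and text are placeholders; the point is where `cache_control` sits and that the user turn comes last, uncached. Note that caching covers the prompt in API order (tools, then system, then messages), regardless of key order in the dict:

```python
# Stable-to-volatile layout: tools and system are cached,
# the user turn goes last and is never cached.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_kb",  # hypothetical tool
            "description": "Search the Acme knowledge base.",
            "input_schema": {"type": "object"},
            "cache_control": {"type": "ephemeral"},  # breakpoint after tools
        },
    ],
    "system": [
        {
            "type": "text",
            "text": "Long, stable system prompt...",
            "cache_control": {"type": "ephemeral"},  # breakpoint after system
        },
    ],
    "messages": [
        {"role": "user", "content": "The latest user question goes here."},
    ],
}
```

With this shape, editing the user message or appending turns never invalidates the cached tools and system prompt; only deploying a new prompt or tool definition does.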

Can prompt caching break my output?

No. The cache stores the model state after processing the prefix, not the output. The model still generates fresh tokens for every request. You get identical-quality output at a fraction of the cost and latency.

LLM bill too high?

We have cut Anthropic and OpenAI bills 50 to 70 percent for production apps by fixing prompt caching and model routing. 30-min audit to find your biggest leaks.

Book a 30-min call