Fix your prompt caching

Install a skill or read the field note below to see how we apply this pattern in real Claude Code projects.

Verified 2 months ago · 22 min

Install the workflow

Prompt caching is one of those optimizations we all think we enabled "well enough" until costs spike and response times drift. Run this command to install a skill and start from a working baseline instead of rebuilding the setup from scratch.

$ npx frenxt-cables add prompt-caching-fix

Files this command writes (1 file)

.claude/skills/debug-prompt-caching/SKILL.md ← artifact/SKILL.md

Run this skill against a specific endpoint or prompt family so hit-rate recommendations stay actionable.

◉ The week caching silently regressed

We were refactoring how our agent assembled its system prompt, moving from a hardcoded string to a config-driven builder that injected tool descriptions, user role context, and environment metadata at request time. The refactor looked clean. Tests passed. We shipped it on a Tuesday.

Latency moved first. By Wednesday morning, p95 on our busiest path had climbed about 18%. We assumed it was infrastructure noise (there had been Cloudflare hiccups that week) and let it sit.

Cost followed two days later. Thursday's billing snapshot showed input token spend up roughly 40% day-over-day. That's when we looked harder.

The problem was template string interpolation. Our new builder was injecting a generated_at timestamp near the top of the system prompt, before the stable instruction blocks, to help us correlate prompts with log entries. Every request produced a different prefix. The cache never hit.

What made it genuinely hard to debug: latency is noisy enough that an 18% shift doesn't scream "caching broke." And cost lags by a billing day on the Anthropic dashboard, so by the time the cost signal was obvious, we'd been running cache-cold for 48 hours across all traffic.

What we tried

We ran the prompt-caching-fix skill on our three busiest prompt-assembly paths. The skill works by reading the code that builds your system prompt, identifying patterns that break cache key stability, and estimating hit ratio from your API logs if you can supply them.

To install and run it:

npx frenxt-cables add prompt-caching-fix

Then open Claude Code in your project and run:

/prompt-caching-fix

The skill asks for two inputs:

The file (or function) that assembles your system prompt
A sample of recent API request logs, or the cache_creation_input_tokens / cache_read_input_tokens values from your Anthropic usage dashboard

It then walks through the prompt assembly code and flags:

Any variable values injected before the first stable block (timestamps, request IDs, session tokens)
Blocks where content changes between requests even if conceptually stable (metadata objects serialized in non-deterministic order, whitespace-sensitive templating)
Places where a cache_control breakpoint ({"type": "ephemeral"} set on a content block) might allow partial caching of the stable prefix even if the tail changes

The output is a short ranked list of cache-break patterns with the line numbers, estimated hit-ratio impact, and a suggested fix for each.

What happened

We found three preventable issues across our three prompt paths:

Dynamic timestamp too early: the generated_at field was on the third line of our system prompt. Moving it to a comment at the end, or dropping it entirely in favour of a correlation header, recovered the full cache hit ratio on that path.
Changing metadata block: we were injecting a JSON blob of user feature flags. The flags themselves rarely changed, but we serialized a Python dict as-is, and json.dumps follows the dict's insertion order, which was not consistent across requests. Two requests with identical flags could produce different prompt strings. We switched to json.dumps(flags, sort_keys=True) and the problem disappeared.
Over-eager templating whitespace: our Jinja template conditionally added a trailing newline after each injected block, so when a block was absent the whitespace pattern changed. Anthropic's cache matches the prefix exactly, so even a single-character difference breaks the hit. We normalized the template output with .strip() before returning it.
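The second and third fixes can be sketched together. This is a minimal illustration rather than our production builder: render_flags_block and join_blocks are hypothetical helper names, and the one-newline-between-blocks layout is an assumption.

```python
import json


def render_flags_block(flags: dict) -> str:
    # sort_keys makes the output independent of dict insertion order;
    # fixed separators pin the whitespace between items.
    return json.dumps(flags, sort_keys=True, separators=(",", ":"))


def join_blocks(blocks: list) -> str:
    # Drop absent blocks first, then join with exactly one newline, so a
    # missing block cannot change the whitespace around its neighbours.
    present = [b.strip() for b in blocks if b and b.strip()]
    return "\n".join(present)


# Same flags, different insertion order: the serialized strings match.
a = render_flags_block({"feature_x": True, "feature_y": False})
b = render_flags_block({"feature_y": False, "feature_x": True})
```

Joining after filtering, instead of stripping only the final string, also covers whitespace that shifts in the middle of the prompt when an optional block disappears.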

What surprised us: none of these were obvious from reading the code. They all looked fine until we compared the actual serialized prompt strings request-to-request.
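Comparing serialized prompts by eye is slow, so a small helper that reports where two strings first differ is worth having. The sketch below is an assumption about how such a comparison could look; first_divergence and show_divergence are hypothetical names. It reports a character offset, not a token offset.

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # No mismatch in the overlap: either equal, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def show_divergence(a: str, b: str, context: int = 20) -> str:
    """Describe where two prompts diverge, with a short excerpt from each."""
    i = first_divergence(a, b)
    if i == -1:
        return "identical"
    return f"diverges at char {i}: {a[i:i + context]!r} vs {b[i:i + context]!r}"


p1 = "generated_at: 10:00:01\nYou are an agent."
p2 = "generated_at: 10:00:02\nYou are an agent."
report = show_divergence(p1, p2)
```

An early divergence offset, as in this example, means almost none of the prompt is eligible for a cache hit.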

What we learned

Stable at top, volatile at bottom. The Anthropic cache key matches on a prefix. Anything that changes between requests must come after the last stable block. In practice this means your system prompt should look roughly like:
```
[SYSTEM INSTRUCTIONS: never changes]
[TOOL DESCRIPTIONS: changes only on deploy]
[USER ROLE CONTEXT: changes per user, not per request]
----
[REQUEST-SPECIFIC VALUES: timestamps, session data, injected context]
```
The cache breakpoint lives at the ---- line. Everything above it is cached; everything below is not.
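In API terms, the ---- line corresponds to a cache_control marker on the last stable content block. The sketch below assumes the system prompt is passed to the Anthropic Messages API as a list of text blocks; build_system_blocks and the sample strings are hypothetical.

```python
def build_system_blocks(stable_text: str, volatile_text: str) -> list:
    """Split the system prompt into a cacheable prefix and an uncached tail."""
    blocks = [
        {
            "type": "text",
            "text": stable_text,
            # Breakpoint: the prefix up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if volatile_text:
        # Request-specific values go after the breakpoint, with no marker.
        blocks.append({"type": "text", "text": volatile_text})
    return blocks


system = build_system_blocks(
    stable_text="You are a support agent.\n[tool descriptions]\n[user role context]",
    volatile_text="generated_at: 2026-01-01T00:00:00Z",
)
```

The resulting list would be passed as the system argument of a messages.create call. The API also enforces a minimum cacheable prompt length that varies by model, so a very short prefix will not be cached even with the marker in place.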

Treat cache hit ratio as a first-class performance signal. We now log cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens) on every request and alert if it drops below 0.6 on any high-volume path. A single line in your metrics pipeline catches regressions before billing does.
```
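# `usage` is the usage object on a Messages API response; `metrics` is your metrics client.
# The cache fields can be missing or None when caching is not in play, so default to 0.
read = usage.cache_read_input_tokens or 0
created = usage.cache_creation_input_tokens or 0
# Report 0.0 for requests with no cache activity instead of dividing by zero.
hit_ratio = read / (read + created) if (read + created) else 0.0
metrics.gauge("llm.cache_hit_ratio", hit_ratio, tags=["path:agent_main"])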
```

Add a lightweight regression check to CI. We run a smoke test that assembles the system prompt twice with identical inputs and asserts the output strings are identical. It takes about 200ms and has caught three regressions since we added it:

```
def test_prompt_assembly_is_deterministic():
    # Identical inputs must yield identical strings, or the cached prefix will not match.
    p1 = build_system_prompt(user_id="test", flags={"feature_x": True})
    p2 = build_system_prompt(user_id="test", flags={"feature_x": True})
    assert p1 == p2, "Prompt assembly is non-deterministic and will break caching"
```
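Two calls with literally identical inputs in the same process will not expose a key-order problem or a coarse-grained timestamp. A stricter variant, sketched here under assumptions, varies the inputs that should not matter and compares only the stable prefix. build_prompt_sketch is a hypothetical stand-in for a real builder, and the ---- separator is assumed to mark the cache boundary.

```python
import json

SEPARATOR = "\n----\n"  # assumed marker between stable and volatile sections


def build_prompt_sketch(user_role: str, flags: dict, generated_at: str) -> str:
    # Hypothetical stand-in for a real prompt builder.
    stable = "\n".join([
        "You are a support agent.",
        f"User role: {user_role}",
        "Flags: " + json.dumps(flags, sort_keys=True),
    ])
    volatile = f"generated_at: {generated_at}"
    return stable + SEPARATOR + volatile


def stable_prefix(prompt: str) -> str:
    # Everything before the first separator is what the cache should reuse.
    return prompt.split(SEPARATOR, 1)[0]


def test_stable_prefix_ignores_key_order_and_time():
    p1 = build_prompt_sketch("admin", {"feature_x": True, "feature_y": False},
                             "2026-01-01T00:00:00Z")
    p2 = build_prompt_sketch("admin", {"feature_y": False, "feature_x": True},
                             "2026-01-01T00:00:05Z")
    assert p1 != p2  # the volatile tails differ, as expected
    assert stable_prefix(p1) == stable_prefix(p2), "stable prefix changed"
```

Removing sort_keys=True from the stand-in makes this test fail, which is the regression it is meant to catch.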

Prompt caching and semantic caching are different tools. Prompt caching (what this cable covers) works at the token-prefix level. It caches the KV computation for an exact prefix match. Semantic caching (e.g., GPTCache, Redis with embedding similarity) works at the response level and is useful when many requests ask similar questions. They solve different problems and can be used together, but don't mistake one for the other. If you're seeing cache misses on varied queries, prompt caching will not help. You want semantic caching or response memoization.

When this doesn't fit

Low-volume endpoints: prompt caching has a 5-minute TTL by default, refreshed each time the cached prefix is read. If your endpoint gets fewer than a few requests per 5-minute window, the cache will rarely hit regardless of structure. Fix volume first.
Highly dynamic prompts: if every request requires a meaningfully different system prompt, the ROI of caching is low. Focus on caching the stable boilerplate layer and accepting dynamic cost for the rest.
Claude Haiku on low-cost workloads: prompt caching is most valuable for Sonnet and Opus where input costs are higher. For Haiku at low volume, the engineering time to stabilize caching may not recover its cost in savings. Run the math before optimizing.
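"Run the math" can start with a break-even estimate. The sketch below assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x; treat both multipliers as assumptions and check them against current pricing. The function names are hypothetical, and hit_ratio is the token-weighted ratio described above.

```python
# Assumed price multipliers relative to the base input-token price.
CACHE_WRITE_MULTIPLIER = 1.25  # writing a prefix to the 5-minute cache
CACHE_READ_MULTIPLIER = 0.10   # reading a cached prefix


def relative_prefix_cost(hit_ratio: float) -> float:
    """Cost of the cacheable prefix relative to sending it uncached (1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return (hit_ratio * CACHE_READ_MULTIPLIER
            + (1.0 - hit_ratio) * CACHE_WRITE_MULTIPLIER)


def break_even_hit_ratio() -> float:
    """Hit ratio at which caching costs the same as not caching."""
    return ((CACHE_WRITE_MULTIPLIER - 1.0)
            / (CACHE_WRITE_MULTIPLIER - CACHE_READ_MULTIPLIER))
```

Under these assumed multipliers, caching breaks even at a hit ratio of roughly 0.22, and a path sitting at the 0.6 alert threshold pays about 56% of the uncached price for its prefix. Below break-even, cache writes cost more than they save.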

Next: Debug an agent from a LangSmith trace.

Quick answers

What do I get from this cable?

You get a skill plus a dated field note that explains how we use it in real Claude Code workflows.

How much time should I budget?

Typical effort is 22 min. The cable is marked intermediate.

How do I install the artifact?

Run npx frenxt-cables add prompt-caching-fix. The install block shows the files it writes and any prerequisites before you run it.

How fresh is the guidance?

The cable was last verified on 2026-04-15 and includes source links for traceability.
