Fix your prompt caching
Install a skill or read the field note below to see how we apply this pattern in real Claude Code projects.
Prompt caching is one of those optimizations we all think we enabled "well enough" until costs spike and response times drift. Run this command to install a skill and start from a working baseline instead of rebuilding the setup from scratch.
Files this command writes (1 file)
.claude/skills/debug-prompt-caching/SKILL.md ← artifact/SKILL.md
Run this skill against a specific endpoint or prompt family so hit-rate recommendations stay actionable.
What we tried
We ran the prompt-caching-fix skill on our three busiest prompt-assembly paths. The skill works by reading the code that builds your system prompt, identifying patterns that break cache key stability, and estimating hit ratio from your API logs if you can supply them.
To install and run it:
npx frenxt-cables add prompt-caching-fix
Then open Claude Code in your project and run:
/prompt-caching-fix
The skill asks for two inputs:
- The file (or function) that assembles your system prompt
- A sample of recent API request logs, or the cache_creation_input_tokens / cache_read_input_tokens values from your Anthropic usage dashboard
It then walks through the prompt assembly code and flags:
- Any variable values injected before the first stable block (timestamps, request IDs, session tokens)
- Blocks where content changes between requests even if conceptually stable (metadata objects serialized in non-deterministic order, whitespace-sensitive templating)
- Places where a cache_control breakpoint ({"type": "ephemeral"} on a content block) might allow partial caching of the stable prefix even if the tail changes
The output is a short ranked list of cache-break patterns with the line numbers, estimated hit-ratio impact, and a suggested fix for each.
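To make the first kind of flag concrete, here is a minimal sketch of scanning a rendered prompt for values that tend to change on every request. It is not the skill's implementation: the pattern set, the function name find_volatile_values, and the sample prompt are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical patterns for values that usually differ on every request.
VOLATILE_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
}


def find_volatile_values(prompt: str) -> list:
    """Return (line_number, pattern_name, matched_text) tuples in line order."""
    findings = []
    for line_number, line in enumerate(prompt.splitlines(), start=1):
        for name, pattern in VOLATILE_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((line_number, name, match.group(0)))
    return findings


# Illustrative prompt with a timestamp on the third line.
sample_prompt = "\n".join([
    "You are a support agent.",
    "Follow the policies below.",
    "generated_at: 2026-04-15T09:30:00Z",
    "Policy 1: be concise.",
])
findings = find_volatile_values(sample_prompt)
for line_number, name, text in findings:
    print(f"line {line_number}: {name} -> {text}")
```

The lower the line number of a finding, the less of the prompt can be cached, because everything after the first volatile value falls outside the stable prefix.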
What happened
We found three preventable issues across our three prompt paths:
- Dynamic timestamp too early: the generated_at field was on the third line of our system prompt. Moving it to a comment at the end, or dropping it entirely in favour of a correlation header, recovered the full cache hit ratio on that path.
- Changing metadata block: we were injecting a JSON blob of user feature flags. The flags themselves rarely changed, but we were serializing a Python dict, which serializes in insertion order, so key order was not guaranteed to be the same across requests. Two requests with identical flags produced different prompt strings. We switched to json.dumps(flags, sort_keys=True) and the problem disappeared.
- Over-eager templating whitespace: our Jinja template conditionally added a trailing newline after each injected block. When a block was absent, the whitespace pattern changed. Anthropic's cache requires an exact match on the prefix, so even a single character difference breaks the hit. We normalized the template output with .strip() before returning it.
What surprised us: none of these were obvious from reading the code. They all looked fine until we compared the actual serialized prompt strings request-to-request.
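A minimal sketch of that request-to-request comparison, assuming you can capture two rendered prompt strings: it finds the first character where they diverge, which is where the cached prefix ends. The render function is a hypothetical stand-in for a real prompt builder, and the flag names are illustrative.

```python
import json
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings are equal."""
    for i, (ch_a, ch_b) in enumerate(zip(a, b)):
        if ch_a != ch_b:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def render(flags: dict, canonical: bool) -> str:
    # Hypothetical stand-in for a real prompt builder.
    return "You are a helpful agent.\nFlags: " + json.dumps(flags, sort_keys=canonical)


# Same flags, built in a different insertion order.
flags_a = {"feature_x": True, "feature_y": False}
flags_b = {"feature_y": False, "feature_x": True}

naive = first_divergence(render(flags_a, False), render(flags_b, False))
fixed = first_divergence(render(flags_a, True), render(flags_b, True))
print("naive serialization diverges at:", naive)
print("sorted serialization diverges at:", fixed)
```

With unsorted keys the two prompts diverge inside the flags blob even though the flags are equal; with sort_keys=True there is no divergence at all.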
What we learned
- Stable at top, volatile at bottom. The Anthropic cache key matches on a prefix. Anything that changes between requests must come after the last stable block. In practice this means your system prompt should look roughly like:

    [SYSTEM INSTRUCTIONS. Never changes]
    [TOOL DESCRIPTIONS. Changes only on deploy]
    [USER ROLE CONTEXT. Changes per user, not per request]
    ----
    [REQUEST-SPECIFIC VALUES. Timestamps, session data, injected context]

  The cache breakpoint lives at the ---- line. Everything above it is cached; everything below is not.
- Treat cache hit ratio as a first-class performance signal. We now log cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens) on every request and alert if it drops below 0.6 on any high-volume path. A single line in your metrics pipeline catches regressions before billing does.

    # usage is the usage object from the API response; 1e-9 avoids division by zero
    hit_ratio = usage.cache_read_input_tokens / (
        usage.cache_read_input_tokens + usage.cache_creation_input_tokens + 1e-9
    )
    metrics.gauge("llm.cache_hit_ratio", hit_ratio, tags=["path:agent_main"])

- Add a lightweight regression check to CI. We run a smoke test that assembles the system prompt twice with identical inputs and asserts the output strings are identical. It takes about 200 ms and has caught three regressions since we added it:

    def test_prompt_assembly_is_deterministic():
        p1 = build_system_prompt(user_id="test", flags={"feature_x": True})
        p2 = build_system_prompt(user_id="test", flags={"feature_x": True})
        assert p1 == p2, "Prompt assembly is non-deterministic. Will break cache"

- Prompt caching and semantic caching are different tools. Prompt caching (what this cable covers) works at the token-prefix level. It caches the KV computation for an exact prefix match. Semantic caching (e.g., GPTCache, Redis with embedding similarity) works at the response level and is useful when many requests ask similar questions. They solve different problems and can be used together, but don't mistake one for the other. If you're seeing cache misses on varied queries, prompt caching will not help. You want semantic caching or response memoization.
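The stable-at-top layout from the first lesson could be expressed as a list of system content blocks, with the breakpoint marked on the last stable block. This is a sketch, not our production builder: build_system_blocks and its parameters are hypothetical names, and the code only constructs the blocks as plain dicts in the shape the Anthropic Messages API accepts for the system field. No request is sent.

```python
def build_system_blocks(instructions: str, tools_doc: str,
                        role_context: str, request_values: str) -> list:
    """Return system blocks: one cached stable prefix, one uncached tail."""
    # Everything that stays the same across requests goes into one block.
    stable_prefix = "\n\n".join([instructions, tools_doc, role_context])
    return [
        {
            "type": "text",
            "text": stable_prefix,
            # Breakpoint: the prefix up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        },
        # Request-specific values come after the breakpoint and are never cached.
        {"type": "text", "text": request_values},
    ]


# Two requests that differ only in the volatile tail.
blocks_1 = build_system_blocks("Instructions", "Tool docs", "Role: admin",
                               "generated_at: 2026-04-15T09:30:00Z")
blocks_2 = build_system_blocks("Instructions", "Tool docs", "Role: admin",
                               "generated_at: 2026-04-15T09:31:00Z")
print(blocks_1[0] == blocks_2[0])
```

Because the first block is identical in both requests, the second request can read it from cache. Note that the API also enforces a minimum cacheable prompt length, so a very short stable prefix will not be cached regardless of structure.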
When this doesn't fit
- Low-volume endpoints: prompt caching has a 5-minute TTL by default, refreshed each time the cached prefix is read. If your endpoint gets fewer than a few requests per 5-minute window, the cache will rarely hit regardless of structure. Fix volume first.
- Highly dynamic prompts: if every request requires a meaningfully different system prompt, the ROI of caching is low. Focus on caching the stable boilerplate layer and accepting dynamic cost for the rest.
- Claude Haiku on low-cost workloads: prompt caching is most valuable for Sonnet and Opus where input costs are higher. For Haiku at low volume, the engineering time to stabilize caching may not recover its cost in savings. Run the math before optimizing.
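One way to run that math is to compare the cost of the cacheable prefix with and without caching at a given hit ratio. The sketch below assumes cache writes are billed at 1.25x and cache reads at 0.1x the base input-token price; treat both multipliers as assumptions and check current pricing for your model before relying on the result. The function names are hypothetical.

```python
# Assumed multipliers relative to the base input-token price.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def relative_prefix_cost(hit_ratio: float) -> float:
    """Cost of the cacheable prefix relative to sending it uncached (1.0).

    hit_ratio is read tokens / (read tokens + creation tokens).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return (hit_ratio * CACHE_READ_MULTIPLIER
            + (1.0 - hit_ratio) * CACHE_WRITE_MULTIPLIER)


def break_even_hit_ratio() -> float:
    """Hit ratio at which caching costs the same as not caching."""
    return ((CACHE_WRITE_MULTIPLIER - 1.0)
            / (CACHE_WRITE_MULTIPLIER - CACHE_READ_MULTIPLIER))


print(f"relative cost at 0.6 hit ratio: {relative_prefix_cost(0.6):.2f}")
print(f"break-even hit ratio: {break_even_hit_ratio():.3f}")
```

Below the break-even ratio, the write surcharge outweighs the read discount and caching costs more than it saves on that path. Multiply the relative cost by your actual prefix token volume and price to see whether the engineering time pays back.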
Next
- Next: Debug an agent from a LangSmith trace.
Quick answers
What do I get from this cable?
You get a skill plus a dated field note that explains how we use it in real Claude Code workflows.
How much time should I budget?
Typical effort is 22 min. The cable is marked intermediate.
How do I install the artifact?
Run npx frenxt-cables add prompt-caching-fix. The install block shows the files it writes and any prerequisites before you run it.
How fresh is the guidance?
The cable is explicitly last verified on 2026-04-15, and includes source links for traceability.