Karpathy's march of nines (Oct 2025) — why 90%→99.999% is the real AI problem
Read the field note below to see how we apply this pattern in practice.
The Karpathy Playbook #6: The march of nines
Final entry in the Karpathy Playbook — from 2017's Software 2.0 to 2025's reality check.
The interview
"Overall, the models are not there. I feel like the industry is making too big of a jump and is trying to pretend like this is amazing, and it's not."
"This is the decade of agents — not the year of agents."
On reinforcement learning: "Sucking supervision through a straw." When you only reward the final outcome, models get credit for wrong turns. That's noise, not learning.
— Andrej Karpathy on Dwarkesh Patel's podcast, October 2025
The central concept: the march of nines. Getting an AI agent to 90% reliability (one nine) is the easy part. Each subsequent nine — 99%, 99.9%, 99.99% — requires as much work as every nine before it combined. The journey from demo to production is not linear; it is asymptotic.
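The "number of nines" in a reliability figure is just the negative log of the failure rate, which is why the climb is asymptotic: each additional nine means a 10x smaller failure rate. A minimal sketch (the helper name is ours, not Karpathy's):

```python
import math

def nines(reliability: float) -> float:
    """Convert a success rate (0 < reliability < 1) into its 'number of nines'.

    0.9 -> 1.0, 0.99 -> 2.0, 0.999 -> 3.0, and so on.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return -math.log10(1.0 - reliability)

# Each extra nine is a 10x reduction in failures, never reaching 100%.
print(round(nines(0.9), 2))    # 1.0
print(round(nines(0.99), 2))   # 2.0
print(round(nines(0.999), 2))  # 3.0
```

The useful property of this framing is that progress that looks small on a percentage scale (99.9% to 99.99%) is a full order of magnitude of engineering on the nines scale.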
What we heard
This is the bookend to everything in this series. Entries 1–5 were about what LLMs can do: rewrite software, power new products, be programmed in English. Entry 6 is about the thing that separates "works in a demo" from "works for a real customer who is paying for it."
The march of nines is a structural fact of production software. What makes it worse for AI is that each nine costs more than the one before, not less, because LLM failure modes aren't well distributed: the 10% that fails at 90% reliability contains the pathological cases, and fixing them is precisely the work people avoid because it isn't fun.
Karpathy's decade-of-agents framing is the sensible response. The industry is in year one or two of that decade. Companies shipping production AI in 2026 are the ones that took the march of nines seriously in 2024 and 2025.
What we actually do with this
We run a nines audit on every system before calling it production-ready. The format:
| Metric | 1 nine (90%) | 2 nines (99%) | 3 nines (99.9%) | 4 nines (99.99%) |
|---|---|---|---|---|
| Task completion | ✅ | ✅ | ⚠ | ❌ |
| Tool-call success | ✅ | ✅ | ✅ | ⚠ |
| Recovery from tool failure | ✅ | ⚠ | ❌ | ❌ |
| Graceful degradation | ✅ | ⚠ | ❌ | ❌ |
| Cost within budget per session | ✅ | ✅ | ✅ | ⚠ |
A demo usually passes the first column. A production candidate passes the first two. Claiming production readiness without honestly auditing the right-hand columns is how agents end up in the graveyard of half-shipped AI products.
The practical work that closes each gap:
- 1→2 nines: Better prompts, basic evals, fallback paths.
- 2→3 nines: Tool-call retries with exponential backoff, observability on every step, circuit breakers.
- 3→4 nines: Recovery from every known failure mode (not just catching errors — recovering from them), graceful degradation that preserves user value.
- 4→5 nines: Chaos engineering, adversarial testing, cost guardrails that hard-cap instead of soft-alert.
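The 2→3 nines items are ordinary, unglamorous engineering. A minimal sketch of tool-call retries with exponential backoff plus a crude circuit breaker, assuming a generic zero-argument `tool` callable (all names are illustrative, not from any codebase mentioned here; in production you would reach for a library such as `tenacity` instead):

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker has tripped and calls are refused outright."""

def call_with_retries(tool, *, max_attempts=4, base_delay=0.5, breaker=None):
    """Retry a flaky tool call with exponential backoff and jitter.

    `breaker` is an optional dict like {"failures": 0, "threshold": 5};
    once consecutive failures reach the threshold, fail fast instead of
    hammering a downstream service that is already struggling.
    """
    for attempt in range(max_attempts):
        if breaker is not None and breaker["failures"] >= breaker["threshold"]:
            raise CircuitOpen("too many consecutive failures; failing fast")
        try:
            result = tool()
            if breaker is not None:
                breaker["failures"] = 0  # any success resets the breaker
            return result
        except Exception:
            if breaker is not None:
                breaker["failures"] += 1
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the real error
            # Exponential backoff with jitter: ~0.5s, ~1s, ~2s, ...
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
```

The observability item is the part this sketch omits: every attempt, delay, and breaker trip should be logged and counted, or you cannot tell which nine you are actually at.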
Most of our engagements live in the 2→3 nines transition. That is the hardest, least glamorous, most valuable work in production AI — and the work the industry under-invests in because it doesn't make for good demo videos.
Applied: what the nines looked like on InterviewLM
- 1→2 nines: Fixing the persona prompts that worked in scripted demos but drifted under adversarial candidate behaviour.
- 2→3 nines: Adding tool-call retries for the evaluation agent, checkpointing every LangGraph node so a failed transcript could resume from the last successful turn.
- 3→4 nines: Building the cost guardrail that hard-caps a runaway session at $3.00 before it can escalate, plus graceful degradation from Opus to Haiku when the primary provider rate-limits.
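The 3→4 nines items above can be sketched together. The $3.00 hard cap and the Opus-to-Haiku degradation come from the description above; everything else (class names, the dollar-tracking interface) is an illustrative assumption, not the actual InterviewLM implementation:

```python
class BudgetExceeded(Exception):
    """Hard stop: the session has hit its cost ceiling."""

class SessionGuardrail:
    """Hard-caps session spend and degrades to a cheaper model under pressure."""

    def __init__(self, cap_usd=3.00, primary="opus", fallback="haiku"):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.model = primary
        self.fallback = fallback

    def record_cost(self, usd: float) -> None:
        """Track spend per call; raise (don't warn) once the cap is reached."""
        self.spent_usd += usd
        if self.spent_usd >= self.cap_usd:
            # Hard cap, not a soft alert: the runaway session stops here.
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} >= cap ${self.cap_usd:.2f}"
            )

    def on_rate_limit(self) -> str:
        """Graceful degradation: switch to the cheaper model, keep serving."""
        self.model = self.fallback
        return self.model
```

The design point is the exception in `record_cost`: a cap that merely emits a metric is a 2-nines control, because nobody is watching the dashboard at 3 a.m. when the loop runs away.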
The InterviewLM system runs at ~3.5 nines in production. The march from where it started (1.5 nines) to 3.5 took roughly 70% of the total engineering effort, even though it looked like we built 90% of the system in the first 30% of the time. That ratio is the march of nines in practice.
The one thing to steal from this
Before you sign off on an AI feature, write the nines audit above. Be honest about which columns you pass. If you only pass one column, you have a demo. If you pass two, you have a pilot. If you pass three, you have a product. Don't call it production until you know which one it is.
Series complete — The Karpathy Playbook
Six entries, one intellectual arc, one pattern per entry for production AI work:
- Software 2.0 (2017) — draw the line between deterministic and learned
- LLM OS (2023) — architect every agent system with OS vocabulary
- Eureka Labs (2024) — split the artifact from the teacher
- Vibe coding (Feb 2025) — vibe phase fast, production gate slow
- Software 3.0 (Jun 2025) — prompts are source code
- March of nines (Oct 2025) — every nine costs more than the last
Next in the broader Playbooks project: Inside Anthropic with Dario Amodei — Machines of Loving Grace, Claude's constitution, the end-of-exponential debate, and what we build on top of their model spec.
Quick answers
What do I get from this cable?
You get a dated field note that explains how we handle this AI-industry workflow in real Claude Code projects.
How much time should I budget?
Typical effort is about 7 minutes, and the cable is marked intermediate.
How do I install the artifact?
This cable is guidance-only and does not ship an installable artifact.
How fresh is the guidance?
The cable is explicitly last verified on 2026-04-17, and includes source links for traceability.