Harrison Chase: better models alone won't ship your agent (2025): the production gap
Harrison Chase on Production Agents #3: The production gap
Part 3 of 5. Tracing Chase's production agent thinking with FRE|Nxt Labs commentary.
The argument
"Better models alone won't get your AI agent to production."
Harrison Chase (LangChain blog · VentureBeat interview), CEO of LangChain, consistent across 2025 keynotes and interviews
The full argument: the gap between "it works in a notebook" and "it works at scale with real users" is not primarily a model-capability gap. It's an orchestration gap: state management, error recovery, human-in-the-loop checkpoints, memory, and observability.
His framing for production agents: planning, memory, subagents, file systems, context/token management, and code execution. The model is one component. The harness is the product.
What we heard
This is the most important thing to understand about the multi-agent market right now. Every team we talk to has prototypes. Almost none have an agent in production. The failure mode is identical across them: they improved the model, the prompt, or the data, but they never built the orchestration layer.
The model is the CPU. You still need the operating system. (See Karpathy #2, LLM OS, for the same idea from a different angle.)
Chase and Karpathy agree on the diagnosis. They disagree on whether raw capability is the ceiling or the floor (see Amodei #5 and Karpathy #6). Both views converge on the same practical answer: you have to build the harness regardless, and the harness is where most of the production work lives.
What we actually do with this
We have a five-layer production readiness checklist we run against every agent architecture before calling it production-ready. It maps directly to the gaps Chase describes:
| Layer | What we check |
|---|---|
| State management | Can the agent resume from any checkpoint? Does state serialize cleanly? |
| Error recovery | What happens on tool failure? Does the agent retry, escalate, or halt? |
| Human-in-the-loop | Where are the required approval gates? Are they enforced in the graph? |
| Memory | What context persists across sessions? What gets pruned and when? |
| Observability | Is every agent step traced? Can we replay a failed session? |
A prototype that passes none of these is a demo. A system that passes all five is a production candidate. Most systems we audit pass two.
Getting from "passes two" to "passes all five" usually takes 60–70% of the total engineering effort on an engagement. It is also the work least visible in a demo, which is why teams under-budget for it.
The LangGraph connection
We use LangGraph for every multi-agent engagement. The reason is exactly what Chase describes: LangGraph forces you to think in state machines. Every node is explicit. Every edge is a decision. There is no magic routing that works in a demo and fails in production.
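To make "every node is explicit, every edge is a decision" concrete, here is a minimal sketch of the state-machine idea in plain Python. It deliberately does not use the LangGraph API; the node names (`plan`, `draft`, `review`), the `State` fields, and the `run` helper are all hypothetical and only illustrate the shape of the pattern.

```python
from typing import Callable, Dict, List, TypedDict

END = "END"  # sentinel marking the terminal state (hypothetical name)


class State(TypedDict):
    question: str
    draft: str
    approved: bool
    trace: List[str]


def plan(state: State) -> State:
    # Each node takes the full state and returns a new state.
    return {**state, "trace": state["trace"] + ["plan"]}


def draft(state: State) -> State:
    # Stand-in for a model call; a real node would invoke an LLM here.
    text = f"Answer to: {state['question']}"
    return {**state, "draft": text, "trace": state["trace"] + ["draft"]}


def review(state: State) -> State:
    # Stand-in for an approval gate: approve any non-empty draft.
    approved = bool(state["draft"].strip())
    return {**state, "approved": approved, "trace": state["trace"] + ["review"]}


# Every node is registered by name: nothing runs unless it is listed here.
NODES: Dict[str, Callable[[State], State]] = {
    "plan": plan,
    "draft": draft,
    "review": review,
}

# Every edge is a function of the state: routing is a visible decision.
EDGES: Dict[str, Callable[[State], str]] = {
    "plan": lambda s: "draft",
    "draft": lambda s: "review",
    "review": lambda s: END if s["approved"] else "draft",
}


def run(state: State, start: str = "plan", max_steps: int = 10) -> State:
    """Walk the graph until END, with a step budget to stop runaway loops."""
    current, steps = start, 0
    while current != END:
        if steps >= max_steps:
            raise RuntimeError(f"step budget of {max_steps} exceeded at {current!r}")
        state = NODES[current](state)
        current = EDGES[current](state)
        steps += 1
    return state
```

Because routing lives in one table, a reviewer can read every possible path through the agent without running it, and the `trace` field records the path each session actually took.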
The InterviewLM system ran 8 agents across 100+ concurrent sessions at sub-2s p99 latency. None of that was achievable without the orchestration layer. We built the model-swapping logic, the prompt caching, and the evaluation harness on top of LangGraph. The models (Claude Sonnet/Haiku routed by task complexity) were almost interchangeable by the time we finished. The hard work was in the graph.
Applied: the InterviewLM incident log
Over six months of production, the InterviewLM agents have hit:
- 3 LLM provider outages (recovered via model fallback + checkpoint resume)
- 7 tool-call failures (retry with exponential backoff)
- 12 timeout events (graceful degradation to simpler model)
- 1 cost runaway attempt (hard-capped by cost guardrail)
- 0 session losses (all recoverable via LangGraph checkpointer)
None of these would have surfaced in a demo. All of them surfaced in production. All of them were handled by the orchestration layer, not by a better model. Chase's argument, applied.
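Three of the recovery mechanisms in the incident log (retry with exponential backoff, model fallback, and a hard cost cap) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the InterviewLM implementation: `call_with_backoff`, `call_with_fallback`, `CostGuard`, and every parameter name below are hypothetical.

```python
import time
from typing import Callable, List, Sequence


class CostCapExceeded(RuntimeError):
    """Raised when a charge would push spend past the configured cap."""


def call_with_backoff(
    fn: Callable[[], str],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry fn, doubling the wait after each failure (0.5s, 1s, 2s, ...)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def call_with_fallback(
    models: Sequence[Callable[[str], str]],
    prompt: str,
    **retry_options,
) -> str:
    """Try each model in order; move to the next once retries are exhausted."""
    errors: List[Exception] = []
    for model in models:
        try:
            return call_with_backoff(lambda: model(prompt), **retry_options)
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(models)} models failed") from (
        errors[-1] if errors else None
    )


class CostGuard:
    """Hard spend cap: refuses any charge that would exceed cap_usd."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        if self.spent_usd + amount_usd > self.cap_usd:
            raise CostCapExceeded(
                f"charge of {amount_usd} would exceed cap of {self.cap_usd}"
            )
        self.spent_usd += amount_usd
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. In a real graph these wrappers would sit around the node that calls the tool or model, so that a failure becomes a routing decision rather than a crashed session.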
The one thing to steal from this
Before your next demo-to-production scoping conversation, ask: what happens when a tool fails mid-session? If the answer is "the agent errors out and the user has to start over," you have a prototype, not a product. Error recovery is the difference.
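One way to answer that question is to checkpoint after every completed step, so a failed session resumes where it stopped instead of starting over. The sketch below assumes JSON-serializable state and a local file as the checkpoint store; `run_session` and its arguments are hypothetical names, and this is a stand-in for the idea rather than the LangGraph checkpointer itself.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def run_session(steps: List[Step], initial_state: Dict, checkpoint_path: Path) -> Dict:
    """Run steps in order, saving state after each one.

    If a checkpoint already exists, resume from the first step that has
    not completed and ignore initial_state.
    """
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
        state, next_step = saved["state"], saved["next_step"]
    else:
        state, next_step = initial_state, 0

    for index in range(next_step, len(steps)):
        # If this raises, the checkpoint still points at this step,
        # so the next call retries it without redoing earlier work.
        state = steps[index](state)
        checkpoint_path.write_text(
            json.dumps({"state": state, "next_step": index + 1})
        )
    return state
```

Calling `run_session` again with the same `checkpoint_path` after a failure picks up at the failed step. A production version would also want atomic writes and a per-session key, both omitted here for brevity.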
Next in this series
#4. Deep Agents (July 2025). Chase's packaged answer: a harness that comes with planning, memory, subagents, and a virtual filesystem built in. What it gets right and when not to use it.
Quick answers
What do I get from this cable?
You get a dated field note that explains how we handle this ai-industry workflow in real Claude Code projects.
How much time should I budget?
Typical effort is 7 min. The cable is marked intermediate.
How do I install the artifact?
This cable is guidance-only and does not ship an installable artifact.
How fresh is the guidance?
The cable is explicitly last verified on 2026-04-17, and includes source links for traceability.