Harrison Chase: better models alone won't ship your agent (2025) — the production gap
Read the field note below to see how we apply this pattern in practice.
Turn this cable into a shipping system.
We help teams deploy reliable AI workflows with architecture, implementation, and hardening support.
Harrison Chase on Production Agents #3: The production gap
Part 3 of 5 — tracing Chase's production agent thinking with FRE|Nxt Labs commentary.
The argument
"Better models alone won't get your AI agent to production."
— Harrison Chase, CEO of LangChain (LangChain blog · VentureBeat interview); a consistent theme across his 2025 keynotes and interviews
The full argument: the gap between "it works in a notebook" and "it works at scale with real users" is not primarily a model-capability gap. It's an orchestration gap — state management, error recovery, human-in-the-loop checkpoints, memory, and observability.
His framing for production agents: planning, memory, subagents, file systems, context/token management, and code execution. The model is one component. The harness is the product.
What we heard
This is the most important thing to understand about the multi-agent market right now. Every team we talk to has prototypes. Almost none have production. The failure mode is identical across them: they improved the model, the prompt, or the data — but they never built the orchestration layer.
The model is the CPU. You still need the operating system. (See Karpathy #2 — LLM OS for the same idea from a different angle.)
Chase and Karpathy agree on the diagnosis. They disagree on whether raw capability is the ceiling or the floor (see Amodei #5 and Karpathy #6). Both views converge on the same practical answer: you have to build the harness regardless, and the harness is where most of the production work lives.
What we actually do with this
We have a five-layer production readiness checklist we run against every agent architecture before calling it production-ready. It maps directly to the gaps Chase describes:
| Layer | What we check |
|---|---|
| State management | Can the agent resume from any checkpoint? Does state serialize cleanly? |
| Error recovery | What happens on tool failure? Does the agent retry, escalate, or halt? |
| Human-in-the-loop | Where are the required approval gates? Are they enforced in the graph? |
| Memory | What context persists across sessions? What gets pruned and when? |
| Observability | Is every agent step traced? Can we replay a failed session? |
A prototype that passes none of these is a demo. A system that passes all five is a production candidate. Most systems we audit pass two.
The work required to go from "passes two" to "passes all five" is usually 60–70% of the total engineering effort on the engagement. It is also the work that is least visible in a demo, which is why teams under-budget for it.
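To make the first layer concrete, here is a minimal sketch of what "resume from any checkpoint" means in LangGraph, using the in-memory MemorySaver checkpointer. The state schema and node names are invented for illustration; they are not from a client system.

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph

class SessionState(TypedDict):
    question: str
    draft: str

def plan(state: SessionState) -> dict:
    return {"draft": f"plan for: {state['question']}"}

def answer(state: SessionState) -> dict:
    return {"draft": state["draft"] + " -> answered"}

builder = StateGraph(SessionState)
builder.add_node("plan", plan)
builder.add_node("answer", answer)
builder.add_edge(START, "plan")
builder.add_edge("plan", "answer")
builder.add_edge("answer", END)

# Compiling with a checkpointer is what makes every step resumable:
# state is persisted per thread_id after each node runs.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "session-42"}}
graph.invoke({"question": "triage this ticket", "draft": ""}, config)

# The check itself: serialized state must be retrievable and complete.
snapshot = graph.get_state(config)
assert snapshot.values["draft"].endswith("answered")
```

In production the MemorySaver would be swapped for a durable, database-backed checkpointer; the check is the same either way.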
The LangGraph connection
We use LangGraph for every multi-agent engagement. The reason is exactly what Chase describes: LangGraph forces you to think in state machines. Every node is explicit. Every edge is a decision. There is no magic routing that works in a demo and fails in production.
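Here is what "every edge is a decision" looks like as code: a sketch of LangGraph's conditional edges with an invented severity-based routing rule. The node names and the threshold are illustrative only.

```python
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph

class TicketState(TypedDict):
    severity: int
    outcome: str

def triage(state: TicketState) -> dict:
    return {"outcome": "pending"}

def route(state: TicketState) -> Literal["escalate", "resolve"]:
    # The routing rule is visible in the graph, so it fails loudly in
    # tests rather than silently in production.
    return "escalate" if state["severity"] >= 3 else "resolve"

def escalate(state: TicketState) -> dict:
    return {"outcome": "handed to a human"}

def resolve(state: TicketState) -> dict:
    return {"outcome": "auto-resolved"}

builder = StateGraph(TicketState)
builder.add_node("triage", triage)
builder.add_node("escalate", escalate)
builder.add_node("resolve", resolve)
builder.add_edge(START, "triage")
builder.add_conditional_edges("triage", route)
builder.add_edge("escalate", END)
builder.add_edge("resolve", END)
graph = builder.compile()

print(graph.invoke({"severity": 4, "outcome": ""}))  # severity 4 escalates
```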
The InterviewLM system was 8 agents, 100+ concurrent sessions, sub-2s p99 latency. None of that was achievable without the orchestration layer. We built the model-swapping logic, the prompt caching, and the evaluation harness on top of LangGraph. The models (Claude Sonnet/Haiku routed by task complexity) were almost interchangeable by the time we finished — the hard work was in the graph.
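For flavor, a toy sketch of what routing by task complexity can look like. This is not the InterviewLM implementation; the heuristic and the model identifiers are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str       # placeholder id for a Haiku- or Sonnet-class model
    max_tokens: int

def route_by_complexity(prompt: str) -> Route:
    # Toy heuristic: long or explicitly multi-step prompts go to the
    # stronger (slower, costlier) model; everything else stays fast.
    needs_strong = len(prompt) > 2000 or "step by step" in prompt.lower()
    if needs_strong:
        return Route(model="strong-model-id", max_tokens=4096)
    return Route(model="fast-model-id", max_tokens=1024)

print(route_by_complexity("Summarize this sentence."))  # -> fast-model-id
```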
Applied: the InterviewLM incident log
Over six months of production, the InterviewLM agents have hit:
- 3 LLM provider outages (recovered via model fallback + checkpoint resume)
- 7 tool-call failures (retry with exponential backoff)
- 12 timeout events (graceful degradation to simpler model)
- 1 cost runaway attempt (hard-capped by cost guardrail)
- 0 session losses (all recoverable via LangGraph checkpointer)
None of these would have surfaced in a demo. All of them surfaced in production. All of them were handled by the orchestration layer, not by a better model. Chase's argument, applied.
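For reference, a sketch of the two recovery patterns that did most of the work in that log: tool-call retry with exponential backoff, and provider fallback. Attempt counts and delays here are illustrative, not production values.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def retry_with_backoff(call: Callable[[], T], max_attempts: int = 4) -> T:
    # Retry a flaky tool call with exponential backoff plus jitter:
    # roughly 1s, 2s, 4s between attempts.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())
    raise AssertionError("unreachable")

def call_with_fallback(prompt: str, models: Sequence[Callable[[str], str]]) -> str:
    # Try each provider/model in order; an outage falls through to the
    # next entry instead of ending the user's session.
    last_error: Exception | None = None
    for model in models:
        try:
            return retry_with_backoff(lambda: model(prompt), max_attempts=2)
        except Exception as err:
            last_error = err
    raise RuntimeError("all models failed") from last_error
```

The cost guardrail and checkpoint resume sit on top of these; neither is shown here.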
The one thing to steal from this
Before your next demo-to-production scoping conversation, ask: what happens when a tool fails mid-session? If the answer is "the agent errors out and the user has to start over," you have a prototype, not a product. Error recovery is the difference.
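One pattern that turns "the agent errors out" into "the agent recovers": catch the failure at the node boundary and write it into state, so an explicit edge decides what happens next. The names below are illustrative.

```python
from typing import TypedDict

class AgentState(TypedDict):
    result: str
    tool_error: str

def search_tool(query: str) -> str:
    raise TimeoutError("upstream search timed out")  # simulated failure

def tool_node(state: AgentState) -> dict:
    # Catch the failure at the node boundary and record it in state,
    # instead of letting the exception kill the session.
    try:
        return {"result": search_tool("example query"), "tool_error": ""}
    except Exception as err:
        return {"tool_error": str(err)}

def after_tool(state: AgentState) -> str:
    # An explicit edge decides what a failure means: retry, degrade,
    # or hand off to a human. Never "start over".
    return "recover" if state["tool_error"] else "continue"
```

The failure becomes routable state instead of an unhandled exception, so the session survives and the graph, not luck, decides the next step.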
Next in this series
#4 — Deep Agents (July 2025). Chase's packaged answer: a harness that comes with planning, memory, subagents, and a virtual filesystem built in. What it gets right and when not to use it.
Quick answers
What do I get from this cable?
You get a dated field note that explains how we handle this AI-industry workflow in real Claude Code projects.
How much time should I budget?
Typical effort is about 7 minutes, and the cable is marked intermediate.
How do I install the artifact?
This cable is guidance-only and does not ship an installable artifact.
How fresh is the guidance?
The cable was last verified on 2026-04-17 and includes source links for traceability.
More from @frenxt
Anthropic's Responsible Scaling Policy (Sep 2023) — safety as operating procedure
*A five-part series tracing Anthropic's public thinking through Dario Amodei's writing and the company's model spec — one foundational document per entry, each with FRE|Nxt Labs l…*
Anthropic's "brilliant friend" spec — the product voice that defines Claude
*Part 2 of 5 — tracing Anthropic's public thinking with FRE|Nxt Labs production commentary.*
Dario Amodei's Machines of Loving Grace (Oct 2024) — planning against the upside case
*Part 3 of 5 — tracing Anthropic's public thinking with FRE|Nxt Labs production commentary.*