Harrison Chase: better models alone won't ship your agent (2025): the production gap

verified 2 months ago · 7 min

Harrison Chase on Production Agents #3: The production gap

Part 3 of 5. Tracing Chase's production agent thinking with FRE|Nxt Labs commentary.

The argument

"Better models alone won't get your AI agent to production."

Harrison Chase, CEO of LangChain (LangChain blog · VentureBeat interview). He has made this point consistently across 2025 keynotes and interviews.

The full argument: the gap between "it works in a notebook" and "it works at scale with real users" is not primarily a model-capability gap. It's an orchestration gap: state management, error recovery, human-in-the-loop checkpoints, memory, and observability.

His framing for production agents: planning, memory, subagents, file systems, context/token management, and code execution. The model is one component. The harness is the product.

What we heard

This is the most important thing to understand about the multi-agent market right now. Every team we talk to has prototypes. Almost none have anything in production. The failure mode is identical across them: they improved the model, the prompt, or the data, but they never built the orchestration layer.

The model is the CPU. You still need the operating system. (See Karpathy #2, LLM OS, for the same idea from a different angle.)

Chase and Karpathy agree on the diagnosis. They disagree on whether raw capability is the ceiling or the floor (see Amodei #5 and Karpathy #6). Both views converge on the same practical answer: you have to build the harness regardless, and the harness is where most of the production work lives.

What we actually do with this

We have a five-layer production readiness checklist we run against every agent architecture before calling it production-ready. It maps directly to the gaps Chase describes:

Layer	What we check
State management	Can the agent resume from any checkpoint? Does state serialize cleanly?
Error recovery	What happens on tool failure? Does the agent retry, escalate, or halt?
Human-in-the-loop	Where are the required approval gates? Are they enforced in the graph?
Memory	What context persists across sessions? What gets pruned and when?
Observability	Is every agent step traced? Can we replay a failed session?

A prototype that passes none of these is a demo. A system that passes all five is a production candidate. Most systems we audit pass two.
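One way to make the checklist mechanical is to record a pass/fail per layer and derive the verdict from the count. The sketch below is illustrative only: the `Audit` class, the layer identifiers, and the "prototype" label for anything between zero and five are assumptions made for this example, not a tool described in this note.

```python
from dataclasses import dataclass

# The five layers from the table above, as hypothetical identifiers.
LAYERS = (
    "state_management",
    "error_recovery",
    "human_in_the_loop",
    "memory",
    "observability",
)


@dataclass(frozen=True)
class Audit:
    """Pass/fail result per layer for one agent architecture."""

    passed: frozenset

    def score(self) -> int:
        # Count only known layers, so a typo cannot inflate the score.
        return sum(1 for layer in LAYERS if layer in self.passed)

    def gaps(self) -> list:
        # Layers still to be built, in checklist order.
        return [layer for layer in LAYERS if layer not in self.passed]

    def verdict(self) -> str:
        n = self.score()
        if n == 0:
            return "demo"
        if n == len(LAYERS):
            return "production candidate"
        # Assumed label for the middle ground.
        return f"prototype ({n}/{len(LAYERS)} layers)"


audit = Audit(frozenset({"memory", "observability"}))
print(audit.verdict())  # prototype (2/5 layers)
print(audit.gaps())
```

The useful output is `gaps()`, not the verdict: it is the list of work that still has to be scoped.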

The work required to go from "passes two" to "passes all five" is usually 60–70% of the total engineering effort on the engagement. It is also the work that is least visible in demos, which is why teams under-budget for it.

The LangGraph connection

We use LangGraph for every multi-agent engagement. The reason is exactly what Chase describes: LangGraph forces you to think in state machines. Every node is explicit. Every edge is a decision. There is no magic routing that works in a demo and fails in production.
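To show what "every node is explicit, every edge is a decision" means in practice, here is a minimal state machine in plain Python. It deliberately does not use LangGraph's API; the node names (`draft`, `review`), the toy approval rule, and the `run` helper are all hypothetical and exist only to illustrate the shape.

```python
from typing import Callable, Dict

State = Dict[str, object]
END = "END"


def draft(state: State) -> State:
    attempts = state.get("attempts", 0) + 1
    return {**state, "draft": f"answer to {state['question']}", "attempts": attempts}


def review(state: State) -> State:
    # Toy rule (an assumption): approve on the second attempt.
    return {**state, "approved": state["attempts"] >= 2}


# Every node is a named function from state to state.
NODES: Dict[str, Callable[[State], State]] = {"draft": draft, "review": review}

# Every edge is a function from state to the name of the next node.
EDGES: Dict[str, Callable[[State], str]] = {
    "draft": lambda s: "review",
    "review": lambda s: END if s["approved"] else "draft",
}


def run(state: State, start: str = "draft", max_steps: int = 10) -> State:
    node, trace = start, []
    while node != END:
        if len(trace) >= max_steps:
            raise RuntimeError(f"no END after {max_steps} steps: {trace}")
        state = NODES[node](state)
        trace.append(node)
        node = EDGES[node](state)
    return {**state, "trace": trace}


result = run({"question": "What is the production gap?"})
print(result["trace"])  # ['draft', 'review', 'draft', 'review']
```

Because routing lives in `EDGES` rather than in a prompt, the path a session took can be read from the trace and a runaway loop is stopped by `max_steps`.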

The InterviewLM system ran 8 agents across 100+ concurrent sessions at sub-2s p99 latency. None of that was achievable without the orchestration layer. We built the model-swapping logic, the prompt caching, and the evaluation harness on top of LangGraph. The models (Claude Sonnet/Haiku, routed by task complexity) were almost interchangeable by the time we finished. The hard work was in the graph.
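Routing by task complexity can be as simple as a scoring function and a threshold. The sketch below is an assumption about what such a router might look like, not the InterviewLM implementation: the `Task` fields, the weights, and the threshold are invented for illustration, and `"sonnet"` and `"haiku"` are tier labels rather than real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    needs_tools: bool = False
    needs_multi_step_reasoning: bool = False


def complexity(task: Task) -> int:
    """Crude additive score; the weights are illustrative, not tuned."""
    score = 0
    if task.needs_tools:
        score += 1
    if task.needs_multi_step_reasoning:
        score += 2
    if len(task.prompt) > 2000:
        score += 1
    return score


def route(task: Task, threshold: int = 2) -> str:
    """Send complex tasks to the larger tier, everything else to the smaller one."""
    return "sonnet" if complexity(task) >= threshold else "haiku"


print(route(Task("Summarise this answer in one line.")))                 # haiku
print(route(Task("Score this interview.", needs_multi_step_reasoning=True)))  # sonnet
```

Keeping the router a pure function of the task makes the models swappable: changing tiers is a one-line edit, and the decision can be unit-tested without calling a model.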

Applied: the InterviewLM incident log

Over six months of production, the InterviewLM agents have hit:

3 LLM provider outages (recovered via model fallback + checkpoint resume)
7 tool-call failures (retry with exponential backoff)
12 timeout events (graceful degradation to simpler model)
1 cost runaway attempt (hard-capped by cost guardrail)
0 session losses (all recoverable via LangGraph checkpointer)
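Three of the handlers in that log (retry with exponential backoff, fallback to another model, and a hard cost cap) fit in a few dozen lines of plain Python. This is a sketch under stated assumptions: the function names, the `invoke` callback, and the dollar figures are hypothetical, and a real system would also catch narrower exception types than `Exception`.

```python
import time
from typing import Callable, Sequence


class CostCapExceeded(RuntimeError):
    """Raised when a charge would push spend over the hard cap."""


class CostGuard:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        # Refuse before spending, so the cap is never crossed.
        if self.spent_usd + usd > self.cap_usd:
            raise CostCapExceeded(f"cap {self.cap_usd} USD would be exceeded")
        self.spent_usd += usd


def retry_with_backoff(call: Callable[[], str], attempts: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    for attempt in range(attempts):
        try:
            return call()
        except CostCapExceeded:
            raise  # retrying cannot fix a budget problem
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


def call_with_fallback(models: Sequence[str], invoke: Callable[[str], str],
                       **retry_kwargs) -> str:
    """Try each model in order; move on only after its retries are exhausted."""
    last_error = None
    for model in models:
        try:
            return retry_with_backoff(lambda: invoke(model), **retry_kwargs)
        except CostCapExceeded:
            raise
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

The `sleep` parameter is injected so the backoff schedule can be tested without waiting, and the cost cap is deliberately exempt from both retry and fallback: a budget breach should halt the session, not trigger more calls.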

None of these would have surfaced in a demo. All of them surfaced in production. All of them were handled by the orchestration layer, not by a better model. Chase's argument, applied.

The one thing to steal from this

Before your next demo-to-production scoping conversation, ask: what happens when a tool fails mid-session? If the answer is "the agent errors out and the user has to start over," you have a prototype, not a product. Error recovery is the difference.
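The alternative to "start over" is to checkpoint after every completed step and resume from the last one. The sketch below shows the idea with a dict standing in for durable storage. It is not LangGraph's checkpointer API; `run_session`, `ToolError`, and the step functions are hypothetical, and it assumes each step returns a new JSON-serializable state rather than mutating its input.

```python
import json
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


class ToolError(RuntimeError):
    """Stand-in for a tool failing mid-session."""


def run_session(session_id: str, steps: List[Callable[[State], State]],
                store: Dict[str, str], state: Optional[State] = None) -> State:
    """Run steps in order, checkpointing after each; resume if a checkpoint exists."""
    if session_id in store:
        saved = json.loads(store[session_id])
    else:
        saved = {"next_step": 0, "state": state or {}}
    for i in range(saved["next_step"], len(steps)):
        # If this raises, the store still holds the previous checkpoint.
        saved["state"] = steps[i](saved["state"])
        saved["next_step"] = i + 1
        store[session_id] = json.dumps(saved)
    return saved["state"]


tool_down = {"value": True}

def collect(s: State) -> State:
    return {**s, "answers": 3}

def score(s: State) -> State:
    if tool_down["value"]:
        tool_down["value"] = False  # the tool recovers after one failure
        raise ToolError("scoring tool unavailable")
    return {**s, "score": 0.8}

store: Dict[str, str] = {}
try:
    run_session("session-1", [collect, score], store)
except ToolError:
    pass  # the user does not start over; the checkpoint is still in the store

print(run_session("session-1", [collect, score], store))
```

On the second call `collect` is skipped, because the checkpoint records that step 0 already finished. Serializing to JSON on every step also answers the first checklist question early: if the state does not serialize cleanly, the failure appears on the first run, not during an outage.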

Next in this series

#4. Deep Agents (July 2025). Chase's packaged answer: a harness that comes with planning, memory, subagents, and a virtual filesystem built in. What it gets right and when not to use it.

Quick answers

What do I get from this cable?

You get a dated field note that explains how we handle this ai-industry workflow in real Claude Code projects.

How much time should I budget?

Typical effort is 7 min. The cable is marked intermediate.

How do I install the artifact?

This cable is guidance-only and does not ship an installable artifact.

How fresh is the guidance?

The cable was last verified on 2026-04-17 and includes source links for traceability.
