Harrison Chase: better models alone won't ship your agent (2025) — the production gap

Read the field note below to see how we apply this pattern in practice.

Verified publisher: FRE|Nxt Labs · Security: unaudited
Series: Harrison Chase on Production Agents, part 3 of 5 · Difficulty: intermediate · Time: 7 min · Category: ai-industry

Harrison Chase on Production Agents #3: The production gap

Part 3 of 5 — tracing Chase's production agent thinking with FRE|Nxt Labs commentary.


The argument

"Better models alone won't get your AI agent to production."

Harrison Chase, CEO of LangChain (LangChain blog; VentureBeat interview), a position he held consistently across 2025 keynotes and interviews

The full argument: the gap between "it works in a notebook" and "it works at scale with real users" is not primarily a model-capability gap. It's an orchestration gap — state management, error recovery, human-in-the-loop checkpoints, memory, and observability.

His framing for production agents: planning, memory, subagents, file systems, context/token management, and code execution. The model is one component. The harness is the product.


What we heard

This is the most important thing to understand about the multi-agent market right now. Every team we talk to has prototypes. Almost none have production. The failure mode is identical across them: they improved the model, the prompt, or the data — but they never built the orchestration layer.

The model is the CPU. You still need the operating system. (See Karpathy #2 — LLM OS for the same idea from a different angle.)

Chase and Karpathy agree on the diagnosis. They disagree on whether raw capability is the ceiling or the floor (see Amodei #5 and Karpathy #6). Both views converge on the same practical answer: you have to build the harness regardless, and the harness is where most of the production work lives.


What we actually do with this

We have a five-layer production readiness checklist we run against every agent architecture before calling it production-ready. It maps directly to the gaps Chase describes:

| Layer | What we check |
|---|---|
| State management | Can the agent resume from any checkpoint? Does state serialize cleanly? |
| Error recovery | What happens on tool failure? Does the agent retry, escalate, or halt? |
| Human-in-the-loop | Where are the required approval gates? Are they enforced in the graph? |
| Memory | What context persists across sessions? What gets pruned and when? |
| Observability | Is every agent step traced? Can we replay a failed session? |

A prototype that passes none of these is a demo. A system that passes all five is a production candidate. Most systems we audit pass two.

The work required to go from "passes two" to "passes all five" is usually 60–70% of the total engineering effort on an engagement. It is also the work that is least visible in a demo, which is why teams under-budget for it.
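As a hypothetical sketch (the names and structure here are ours, not a shipped tool), the five-layer checklist can be encoded as a simple audit record, with the demo / prototype / production-candidate verdict derived from how many layers pass:

```python
from dataclasses import dataclass, fields

@dataclass
class ReadinessAudit:
    """One boolean per layer of the five-layer checklist (illustrative only)."""
    state_management: bool   # resumes from any checkpoint; state serializes cleanly
    error_recovery: bool     # tool failures retry, escalate, or halt deliberately
    human_in_the_loop: bool  # approval gates enforced in the graph itself
    memory: bool             # cross-session persistence and pruning are defined
    observability: bool      # every step traced; failed sessions replayable

    def verdict(self) -> str:
        passed = sum(getattr(self, f.name) for f in fields(self))
        if passed == 5:
            return "production candidate"
        return "demo" if passed == 0 else f"prototype ({passed}/5 layers)"

# The typical system we audit passes two of five.
typical = ReadinessAudit(True, True, False, False, False)
print(typical.verdict())  # prints "prototype (2/5 layers)"
```

The point of writing it down this explicitly is that "production-ready" stops being a feeling and becomes a countable property of the architecture.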


The LangGraph connection

We use LangGraph for every multi-agent engagement. The reason is exactly what Chase describes: LangGraph forces you to think in state machines. Every node is explicit. Every edge is a decision. There is no magic routing that works in a demo and fails in production.
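The discipline Chase describes can be sketched in plain Python. This is not the LangGraph API, only the shape of thinking it enforces: every node is a named function, and every edge is a value the node explicitly returns.

```python
from typing import Callable

State = dict  # agent state: a dict of keys the nodes agree on

def plan(state: State) -> str:
    # Each node mutates state and returns the label of the next edge.
    state["steps"] = ["fetch", "summarize"]
    return "act"

def act(state: State) -> str:
    state["done"] = bool(state["steps"])
    return "finish" if state["done"] else "plan"

NODES: dict[str, Callable[[State], str]] = {"plan": plan, "act": act}

def run(state: State, entry: str = "plan") -> State:
    node = entry
    while node != "finish":   # no magic routing: only declared edges are followed
        node = NODES[node](state)
    return state
```

Because routing is an explicit return value rather than implicit behavior, a path that works in a demo cannot silently differ from the path taken in production.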

The InterviewLM system was 8 agents, 100+ concurrent sessions, sub-2s p99 latency. None of that was achievable without the orchestration layer. We built the model-swapping logic, the prompt caching, and the evaluation harness on top of LangGraph. The models (Claude Sonnet/Haiku routed by task complexity) were almost interchangeable by the time we finished — the hard work was in the graph.


Applied: the InterviewLM incident log

Over six months of production, the InterviewLM agents have hit:

  • 3 LLM provider outages (recovered via model fallback + checkpoint resume)
  • 7 tool-call failures (retry with exponential backoff)
  • 12 timeout events (graceful degradation to simpler model)
  • 1 cost runaway attempt (hard-capped by cost guardrail)
  • 0 session losses (all recoverable via LangGraph checkpointer)

None of these would have surfaced in a demo. All of them surfaced in production. All of them were handled by the orchestration layer, not by a better model. Chase's argument, applied.
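The backoff-plus-fallback pattern behind those recoveries can be sketched as follows. `ProviderOutage`, `call_with_recovery`, and the simulated providers are hypothetical names for illustration, not InterviewLM code:

```python
import time

class ProviderOutage(Exception):
    """Raised when a model provider is unavailable (simulated below)."""

def call_with_recovery(providers, prompt, retries=3, base_delay=0.01):
    """Retry each provider with exponential backoff, then fall back to the next."""
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(prompt)
            except ProviderOutage:
                time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s...
    raise RuntimeError("all providers exhausted")

def primary(prompt):
    raise ProviderOutage("simulated provider outage")

def fallback(prompt):
    return f"fallback-answer:{prompt}"

print(call_with_recovery([primary, fallback], "q"))  # prints "fallback-answer:q"
```

The key design choice is that recovery lives in the harness, outside any single model call, so swapping providers never touches agent logic.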


The one thing to steal from this

Before your next demo-to-production scoping conversation, ask: what happens when a tool fails mid-session? If the answer is "the agent errors out and the user has to start over," you have a prototype, not a product. Error recovery is the difference.
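One minimal sketch of the checkpoint-resume answer, assuming a dict-backed store (all names here are illustrative, not the LangGraph checkpointer API): persist each completed step so a mid-session tool failure resumes instead of restarting.

```python
checkpoints: dict[str, list[str]] = {}  # session_id -> completed steps

def run_session(session_id, steps, tools):
    done = checkpoints.setdefault(session_id, [])
    for step in steps:
        if step in done:      # completed on a previous attempt: skip on resume
            continue
        tools[step]()         # may raise mid-session
        done.append(step)     # checkpoint only after the step succeeds
    return done

# Simulated tool that fails exactly once.
state = {"fail_once": True}
def summarize():
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("tool failure mid-session")

tools = {"fetch": lambda: None, "summarize": summarize}
try:
    run_session("s1", ["fetch", "summarize"], tools)   # fails at "summarize"
except RuntimeError:
    pass
result = run_session("s1", ["fetch", "summarize"], tools)  # resumes, skips "fetch"
print(result)  # prints "['fetch', 'summarize']"
```

If your agent can pass a test like this, the answer to "what happens when a tool fails mid-session?" is "the user never notices."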


Next in this series

#4 — Deep Agents (July 2025). Chase's packaged answer: a harness that comes with planning, memory, subagents, and a virtual filesystem built in. What it gets right and when not to use it.


Last verified: 2026-04-17. Source links are included for traceability.
