Debug an agent from a LangSmith trace
Install a skill or read the field note below to see how we apply this pattern in real Claude Code projects.
When an agent fails, raw logs usually tell us *what* happened, not *why* it happened in that step sequence. Run this command to install a skill and start from a working baseline instead of rebuilding the setup from scratch.
Files this command writes (1 file)
.claude/skills/analyse-langsmith-trace/SKILL.md ← artifact/SKILL.md
Run this skill with a concrete failing trace URL and one known-good comparison run.
What we tried
To instrument a LangGraph agent for LangSmith, set these environment variables before running your agent:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your_key_here
export LANGCHAIN_PROJECT=your_project_name
That's it for basic tracing. Every LangGraph run will now appear in your LangSmith project dashboard automatically.
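Before a debugging session it can help to confirm that the process can actually see all three variables. The check below is a minimal sketch; the helper name `missing_tracing_vars` is hypothetical and is not part of LangSmith or LangGraph.

```python
import os

# The three variables from the export lines above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_tracing_vars(env=None):
    """Return the names of tracing variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_tracing_vars()
    if missing:
        print("Tracing is not fully configured; missing:", ", ".join(missing))
    else:
        print("Tracing variables are set.")
```

Running this in the same shell that launches the agent catches the common case where the exports were made in a different terminal session.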
For better filtering later, attach metadata to your runs at the call site:
result = agent.invoke(
    input,
    config={
        "metadata": {
            "user_id": user_id,
            "experiment_tag": "v2.3-routing-fix",
            "task_class": "multi_file_refactor",
        }
    },
)
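If several call sites attach metadata, a small builder keeps the keys consistent so later filters do not miss runs because of a typo. This is a sketch only: `build_run_config` is a hypothetical helper, and the three required keys are simply the ones used in the call above.

```python
def build_run_config(user_id, experiment_tag, task_class, extra=None):
    """Build the `config` dict for agent.invoke() with the metadata keys we filter on.

    Hypothetical helper; raises if any of the three required values is empty.
    """
    required = {
        "user_id": user_id,
        "experiment_tag": experiment_tag,
        "task_class": task_class,
    }
    empty = [key for key, value in required.items() if not value]
    if empty:
        raise ValueError("missing metadata values: " + ", ".join(empty))

    metadata = dict(extra or {})
    metadata.update(required)  # required keys win over anything in `extra`
    return {"metadata": metadata}


# Usage at the call site (agent and input come from your own code):
# result = agent.invoke(input, config=build_run_config("u-42", "v2.3-routing-fix", "multi_file_refactor"))
```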
We used a trace-review skill that forces a consistent walkthrough:
- identify the failing run and its expected behavior
- isolate the first diverging step
- inspect tool inputs and outputs at that step
- classify root cause (prompt, tool contract, retrieval, guardrail, routing)
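The "isolate the first diverging step" item can be sketched as a comparison of the step sequence in a failing run against a known-good run. The step shape here (a list of dicts with a `name` key) is an assumption for illustration, not the LangSmith export schema, and `first_divergence` is a hypothetical helper.

```python
def first_divergence(failing_steps, good_steps, key=lambda step: step["name"]):
    """Return the index of the first step where the two runs differ, or None.

    By default only step names are compared, which is enough to expose a run
    that took a different path. Pass another `key` to compare more fields.
    """
    for index, (bad, good) in enumerate(zip(failing_steps, good_steps)):
        if key(bad) != key(good):
            return index
    # Same prefix but different lengths: one run stopped early or ran long.
    if len(failing_steps) != len(good_steps):
        return min(len(failing_steps), len(good_steps))
    return None


good = [{"name": "plan"}, {"name": "search_files"}, {"name": "edit_file"}]
bad = [{"name": "plan"}, {"name": "search_files"}, {"name": "search_files"}]
print(first_divergence(bad, good))  # prints 2
```

The index points at the step whose tool inputs and outputs deserve inspection first.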
What happened
Once we standardized that walkthrough sequence, debugging stopped feeling like archaeology. But we also learned where the process breaks down.
Where Claude Code gets it wrong when reading a trace:
- It tends to focus on the error message at the end of the trace rather than the state that caused it. The error is often a symptom. The cause is usually two or three steps earlier, where the agent received unexpected tool output or made a bad routing decision based on ambiguous context.
- It misses silent retries. LangGraph can retry tool calls internally without surfacing them as explicit nodes. If a tool call fails and retries silently, the trace shows the eventual result but not the intermediate failure. Check whether total_tokens is unexpectedly high for a simple task; that is often the signal that retries happened.
- It over-weights the first node. The natural reading order is top-to-bottom, but the most expensive or most anomalous node is almost never the first one. We now start every trace review at the highest-latency or highest-token node, not the entry point.
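The "unexpectedly high total_tokens" signal can be made concrete by comparing a run against known-good runs of the same task type. A minimal sketch: the function name is hypothetical and the 2x threshold is an arbitrary assumption to tune against your own traces, not a LangSmith default.

```python
from statistics import median


def looks_like_hidden_retries(total_tokens, known_good_totals, factor=2.0):
    """Flag a run whose token total is far above the median of known-good runs.

    A high total does not prove retries happened; it only marks the trace as
    worth a closer look.
    """
    if not known_good_totals:
        raise ValueError("need at least one known-good total to compare against")
    return total_tokens > factor * median(known_good_totals)


baseline = [3000, 3200, 3500]  # made-up totals for one task type
print(looks_like_hidden_retries(9000, baseline))  # prints True
print(looks_like_hidden_retries(3400, baseline))  # prints False
```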
Concrete walkthrough order we now use:
- Open the trace. Check total_tokens and total_latency at the run level. Are they in the expected range for this task type?
- Find the highest-cost or highest-latency node. Start there, not at the top.
- Look at the inputs to that node. Is the state what you'd expect at that point in the workflow?
- Compare against a known-good trace for the same task type. The diff usually shows the divergence immediately.
- Paste the JSON of the diverging node into Claude Code context and ask: "This is a LangGraph node input/output. The agent was supposed to do X. What does this state tell you about why it didn't?"
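Two steps of the walkthrough above lend themselves to a small script: picking the node to start from, and wrapping a node's JSON in the question we ask Claude Code. This is a sketch under stated assumptions: the node shape (`name`, `tokens`, `latency_ms`) is a hypothetical simplification of an exported trace, and both function names are made up for illustration.

```python
import json


def pick_start_node(nodes):
    """Return the node with the largest share of run tokens or run latency."""
    if not nodes:
        raise ValueError("trace has no nodes")
    total_tokens = sum(n["tokens"] for n in nodes) or 1
    total_latency = sum(n["latency_ms"] for n in nodes) or 1

    def cost_share(node):
        # Use shares so tokens and milliseconds are comparable.
        return max(node["tokens"] / total_tokens, node["latency_ms"] / total_latency)

    return max(nodes, key=cost_share)


def diagnosis_prompt(node, expected_behavior):
    """Wrap a node's JSON in the question from the walkthrough."""
    return (
        "This is a LangGraph node input/output. "
        f"The agent was supposed to {expected_behavior}. "
        "What does this state tell you about why it didn't?\n\n"
        + json.dumps(node, indent=2, sort_keys=True)
    )


nodes = [
    {"name": "plan", "tokens": 400, "latency_ms": 900},
    {"name": "retrieve", "tokens": 5200, "latency_ms": 1500},
    {"name": "edit_file", "tokens": 600, "latency_ms": 700},
]
start = pick_start_node(nodes)
print(start["name"])  # prints retrieve
print(diagnosis_prompt(start, "edit only the router module"))
```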
What a good trace looks like versus a bad one at a glance:
A good trace: token counts increase steadily across nodes, tool calls have tight input/output pairs, no node has dramatically higher latency than its neighbours, and the final state matches the task's expected output structure.
A bad trace: one node consumes 60%+ of total tokens (prompt bloat or retrieval dumping too much context), a tool call node has an output that's an error object but the agent continues anyway, or the same tool is called twice with identical inputs, which means the agent is stuck in a loop (look for tool_name repeating with an identical tool_input across two consecutive calls).
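The three bad-trace signals can be checked mechanically before anyone reads the trace. The sketch below assumes a hypothetical flattened node shape (`name`, `tokens`, and optional `tool_name`, `tool_input`, `tool_output`) and treats an "error object" as a dict with an `error` key; neither is the LangSmith schema, so adapt both to your export.

```python
def bad_trace_signals(nodes, token_share_limit=0.6):
    """Return human-readable warnings for the three bad-trace signals."""
    signals = []
    total_tokens = sum(n.get("tokens", 0) for n in nodes)
    for i, node in enumerate(nodes):
        # Signal 1: one node dominates the token budget.
        if total_tokens:
            share = node.get("tokens", 0) / total_tokens
            if share >= token_share_limit:
                signals.append(f"{node['name']}: {share:.0%} of total tokens")

        # Signal 2: a tool returned an error object but the run kept going.
        output = node.get("tool_output")
        is_last = i == len(nodes) - 1
        if isinstance(output, dict) and "error" in output and not is_last:
            signals.append(f"{node['name']}: error output, agent continued")

        # Signal 3: same tool, identical input, two consecutive calls.
        if i > 0 and node.get("tool_name"):
            prev = nodes[i - 1]
            same_tool = node["tool_name"] == prev.get("tool_name")
            same_input = node.get("tool_input") == prev.get("tool_input")
            if same_tool and same_input:
                signals.append(f"{node['name']}: possible loop, repeated identical call")
    return signals


trace = [
    {"name": "plan", "tokens": 300},
    {"name": "search", "tokens": 200, "tool_name": "search_files",
     "tool_input": {"query": "router"}, "tool_output": {"error": "timeout"}},
    {"name": "search", "tokens": 200, "tool_name": "search_files",
     "tool_input": {"query": "router"}, "tool_output": {"matches": []}},
    {"name": "answer", "tokens": 2300},
]
for line in bad_trace_signals(trace):
    print(line)
```

An empty result does not mean the trace is good; it only means none of these three patterns matched.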
What we learned
- Always compare a failing trace against a known-good trace. The diff is almost always more informative than reading either trace in isolation. We keep a folder of reference traces for our three most common task types and paste them side-by-side when debugging.
- Treat routing decisions as first-class failure points. Routing nodes are often written once and forgotten. They rarely have test coverage. A routing condition that evaluates to the wrong branch produces a confident-looking trace that proceeds entirely on the wrong path. No errors, wrong outcome.
- Look for the loop signal. An agent stuck in a loop shows repeated tool calls with identical inputs. In LangSmith, this appears as two consecutive nodes of the same type where the tool_input field is character-for-character identical. The agent is not making progress. It's waiting for a different output from the same call. The fix is almost always in the tool's output schema or the prompt's interpretation of empty/error results.
- Save root-cause notes in the repo. We maintain a debug-notes/ folder with a one-paragraph write-up for each non-obvious bug we've traced. When a similar symptom appears six months later, the note surfaces in Claude Code context search. This has saved us from re-investigating the same routing bug three times.
- Attach metadata before you need it. The experiment_tag and task_class fields feel optional until you're trying to filter 10,000 traces to find the 12 that match a specific workflow variant. Set them from day one.
When this doesn't fit
- Single-turn, stateless calls: LangSmith traces shine for multi-step agent workflows with tool calls and state transitions. For a simple chat completion, just read the API response directly.
- Production traces without sampling: if you're tracing 100% of production traffic, pasting traces into Claude Code context becomes expensive and noisy. Use LangSmith's filter and search to find a representative failing trace first. Tracing everything is fine for storage. The issue is the analysis step, not the collection step.
- When the bug is in the tool, not the agent: traces show what the agent decided. If your tool implementation is wrong, the trace shows correct decisions with wrong outputs. Fix the tool, not the prompt. The signal is that the agent's tool call inputs look correct but the outputs don't match what you'd expect from a working tool.
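One way to test the "bug is in the tool" hypothesis is to call the tool directly, outside the agent, with the exact input recorded in the trace and compare the result with what a working tool should return. A minimal sketch: `tool_is_at_fault` is a hypothetical helper and `count_lines` is a deliberately buggy stand-in, not a real tool.

```python
def tool_is_at_fault(tool_fn, traced_input, expected_output):
    """Replay a traced tool call and report whether the output is wrong.

    Returns (at_fault, actual_output). Only meaningful for tools that are
    deterministic and safe to call again.
    """
    actual = tool_fn(**traced_input)
    return actual != expected_output, actual


def count_lines(text):
    # Deliberately buggy stand-in: counts separators, not lines.
    return text.count("\n")


at_fault, actual = tool_is_at_fault(count_lines, {"text": "a\nb\nc"}, expected_output=3)
print(at_fault, actual)  # prints: True 2
```

If the replay reproduces the wrong output from a correct-looking input, the prompt is not the problem. Skip this approach for tools with side effects such as file writes or external API calls.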
Next
- Write your first skill from scratch.
Quick answers
What do I get from this cable?
You get a skill plus a dated field note that explains how we use it in real Claude Code workflows.
How much time should I budget?
Typical effort is 25 min. The cable is marked intermediate.
How do I install the artifact?
Run npx frenxt-cables add langsmith-trace-debug. The install block shows the files it writes and any prerequisites before you run it.
How fresh is the guidance?
The cable was last verified on 2026-04-15 and includes source links for traceability.