The Problem: One Model to Rule Them All
Most teams deploying LLM-powered applications make a costly mistake: they use a single model for everything. Every request — whether it is fixing a typo or designing a multi-tenant database schema — gets routed to the same expensive frontier model.
Consider the reality of a typical AI coding agent's workload:
- "Fix the typo in line 5" — a trivial task that a fast, cheap model handles in milliseconds
- "Implement the user authentication flow" — standard coding work requiring decent reasoning
- "Design a database schema for a multi-tenant SaaS" — complex architecture that genuinely needs frontier-level intelligence
If you are sending all three of these to Claude Opus, you are overpaying by an order of magnitude on the simple tasks. Haiku can handle quick fixes at roughly 1/10th the cost of Opus, with faster response times to boot.
The Solution: Dynamic Model Routing
The answer is a pattern we call dynamic model routing — letting the agent itself decide which model to use for each turn of the conversation. This is the same concept behind Cursor's popular "Auto Mode," where the editor intelligently picks the right model for each task.
The core idea is a three-tier model hierarchy:
- Low (Haiku) — Quick fixes, typos, simple changes. Fastest and cheapest.
- Medium (Sonnet) — Standard implementation, most coding tasks. The balanced default.
- High (Opus) — Architecture decisions, complex debugging, multi-step planning. Most capable.
The agent starts on Sonnet by default. When it encounters a task that warrants a different tier, it calls a tool to request a model switch. A middleware layer intercepts the next model call and routes it to the requested model. After one use, it resets back to the default.
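The request-then-reset flow above can be sketched independently of any framework. This is a minimal pure-Python illustration of the one-shot mechanic — the `pick_model` helper and dict-based state are ours, not the production API:

```python
DEFAULT_KEY = "sonnet"
LEVEL_TO_MODEL = {"low": "haiku", "medium": "sonnet", "high": "opus"}


def pick_model(state: dict) -> str:
    """Resolve the model for the next call, then clear the request (one-shot)."""
    level = state.get("requested_model_level")
    key = LEVEL_TO_MODEL.get(level, DEFAULT_KEY)
    state["requested_model_level"] = None  # reset after a single use
    return key


state = {"requested_model_level": None}
assert pick_model(state) == "sonnet"  # default tier

state["requested_model_level"] = "high"
assert pick_model(state) == "opus"    # honors the request once
assert pick_model(state) == "sonnet"  # then falls back to the default
```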
Implementation with LangGraph
At FRE|Nxt Labs, we have implemented this pattern in production using LangGraph. The entire implementation is approximately 150 lines of Python across three files.
Step 1: Define the State
Add fields to your LangGraph agent state to track model-switching requests:
```python
from typing import Annotated, Literal

from typing_extensions import TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    """Agent state with model switching support."""

    messages: Annotated[list[BaseMessage], add_messages]

    # Model switching fields
    requested_model_level: Literal["low", "medium", "high"] | None
    requested_model_reason: str | None

    # Track current model (for agent self-awareness)
    current_model: str | None
    current_model_display: str | None
```
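For orientation, a fresh state at the start of a session might look like this. This is a plain-dict sketch with illustrative default values — the field names match `AgentState`, and the Sonnet default mirrors the behavior described above:

```python
# Hypothetical starting state for a new session; field names match AgentState
initial_state = {
    "messages": [],
    "requested_model_level": None,  # no switch requested yet
    "requested_model_reason": None,
    "current_model": "claude-sonnet-4-5-20250929",  # the Sonnet default
    "current_model_display": "Claude Sonnet",
}

assert initial_state["requested_model_level"] is None
```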
Step 2: Create the Model-Switching Tool
The tool uses LangGraph's Command pattern to update state properly:
```python
from typing import Annotated, Literal  # also needed here for the tool signature

from langchain.tools import tool, ToolRuntime
from langchain_core.messages import ToolMessage
from langgraph.types import Command

LEVEL_TO_MODEL = {
    "low": "haiku",
    "medium": "sonnet",
    "high": "opus",
}


@tool
def set_next_model(
    level: Annotated[
        Literal["low", "medium", "high"],
        "Model quality level: 'low' (fast), 'medium' (balanced), 'high' (powerful)",
    ],
    reason: Annotated[str, "Brief explanation of why this model level is needed"],
    runtime: ToolRuntime[None, AgentState],
) -> Command:
    """Switch to a different model quality level for the NEXT response.

    Use this when you need different capability for upcoming work:
    - low: Quick fixes, typos, simple changes (fastest, cheapest)
    - medium: Standard implementation, most coding tasks (DEFAULT)
    - high: Architecture, complex debugging, multi-step planning
    """
    model_name = LEVEL_TO_MODEL.get(level, "sonnet")
    tool_message = ToolMessage(
        content=f"Model set to {model_name} ({level}) for next turn. Reason: {reason}",
        tool_call_id=runtime.tool_call_id,
    )
    return Command(
        update={
            "messages": [tool_message],
            "requested_model_level": level,
            "requested_model_reason": reason,
        }
    )
```
A critical detail: when a tool returns a `Command`, LangGraph requires the update to include a `ToolMessage` answering the pending tool call, or the run will fail. State changes must also go through `Command(update={...})` rather than mutating agent state directly.
Step 3: Build the Routing Middleware
The middleware intercepts every model call and routes to the appropriate model based on state:
```python
from typing import Any

from langchain.agents.middleware import wrap_model_call
from langchain_anthropic import ChatAnthropic

MODEL_DISPLAY_NAMES = {
    "haiku": "Claude Haiku",
    "sonnet": "Claude Sonnet",
    "opus": "Claude Opus",
}

# Cache model clients so repeated switches do not rebuild them
_MODEL_CACHE: dict[str, Any] = {}


def _get_model_for_key(model_key: str) -> ChatAnthropic:
    if model_key in _MODEL_CACHE:
        return _MODEL_CACHE[model_key]
    model_map = {
        "haiku": "claude-haiku-4-5-20251001",
        "sonnet": "claude-sonnet-4-5-20250929",
        "opus": "claude-opus-4-5-20251101",
    }
    model = ChatAnthropic(
        model_name=model_map.get(model_key, model_map["sonnet"]),
        max_tokens=16384,
    )
    _MODEL_CACHE[model_key] = model
    return model


@wrap_model_call
async def model_selection_middleware(request, handler):
    state = getattr(request, "state", None) or {}
    requested_level = state.get("requested_model_level")

    if requested_level in ("low", "medium", "high"):
        level_to_model = {"low": "haiku", "medium": "sonnet", "high": "opus"}
        model_key = level_to_model[requested_level]
        # Clear the request after use (one-shot behavior)
        state["requested_model_level"] = None
        state["requested_model_reason"] = None
    else:
        model_key = "sonnet"

    model = _get_model_for_key(model_key)
    if request.tools:
        model = model.bind_tools(request.tools, parallel_tool_calls=True)
    request.model = model
    return await handler(request)
```
Cost Analysis: Where the 10x Savings Come From
In a typical AI agent workload, tasks distribute roughly as follows:
- 60-70% of tasks are simple — quick fixes, typos, straightforward changes. Handled by Haiku.
- 20-25% of tasks are moderate — standard implementation work. Sonnet handles these well.
- 5-15% of tasks are complex — architecture decisions, multi-step debugging. These genuinely need Opus.
Haiku costs roughly 10x less than Opus per token. By routing the majority of simple tasks to Haiku and reserving Opus for tasks that truly need it, you achieve dramatic cost reductions without sacrificing quality where it matters.
For 1,000 agent turns:
- Without routing: 1,000 turns on Opus = baseline cost of $X
- With routing: 650 turns on Haiku + 250 turns on Sonnet + 100 turns on Opus ≈ $X/10
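The arithmetic behind that estimate can be checked directly. The per-turn unit prices below are illustrative ratios we are assuming, not real list prices: Haiku = 1 unit, with Sonnet about 2x and Opus about 10x that:

```python
# Assumed relative price per turn: Haiku = 1 unit, Sonnet ~2x, Opus ~10x
COST = {"haiku": 1, "sonnet": 2, "opus": 10}

baseline = 1_000 * COST["opus"]  # every turn on Opus
routed = (
    650 * COST["haiku"]    # simple tasks
    + 250 * COST["sonnet"]  # moderate tasks
    + 100 * COST["opus"]    # complex tasks
)

assert baseline == 10_000
assert routed == 2_150  # about a fifth of the all-Opus baseline
```

Note that on per-token price alone this mix yields roughly a 5x reduction; reaching the full 10x additionally depends on the simple tasks that land on Haiku also consuming fewer tokens per turn, which is typical but workload-dependent.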
Production Tips
Fallback Strategies
Always default to a capable middle-tier model (Sonnet). If the routing middleware encounters an unknown state or fails to determine the right model, it should fall back gracefully rather than error out.
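That rule reduces to a single lookup with a default. A minimal sketch — `resolve_model_key` is our illustrative name, not part of the LangGraph API:

```python
DEFAULT_MODEL_KEY = "sonnet"
LEVEL_TO_MODEL = {"low": "haiku", "medium": "sonnet", "high": "opus"}


def resolve_model_key(requested_level) -> str:
    """Map a requested level to a model key, defaulting to Sonnet on anything unexpected."""
    return LEVEL_TO_MODEL.get(requested_level, DEFAULT_MODEL_KEY)


assert resolve_model_key("high") == "opus"
assert resolve_model_key(None) == "sonnet"     # no request yet
assert resolve_model_key("ultra") == "sonnet"  # unknown value: degrade gracefully
```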
One-Shot Reset Behavior
After the requested model is used for one turn, the state resets to None. This prevents a single "high" request from keeping the agent on Opus indefinitely, which would negate the cost savings.
Why Not Switch Immediately?
The model switch takes effect on the next turn rather than immediately for three reasons:
- Message history integrity — Anthropic's API requires tool_use blocks to have matching tool_result blocks
- Reduced complexity — Managing partial responses during mid-turn switches is error-prone
- Better UX — Users do not see incomplete responses that get discarded
When NOT to Use Dynamic Routing
Skip it if your workload is uniformly complex, you are still prototyping and cost is not yet a concern, or your agent only handles a single type of task.
Results
In production implementations, this pattern delivers:
- 10x cost reduction on LLM spend by routing simple tasks to Haiku
- Quality preserved where it matters — complex tasks still get Opus-level reasoning
- Seamless user experience — model selection is transparent to users
- Full observability — logs show which model was selected and why for every turn
- Minimal overhead — approximately 150 lines of Python across 3 files
The key insight is that model switching happens between turns, not mid-generation. This keeps the implementation clean and avoids the complexity of interrupting model inference. Three files, one tool, one middleware — for significant and measurable cost savings.