The Problem: One Model to Rule Them All
Most teams deploying LLM-powered applications make a costly mistake: they use a single model for everything. Every request — whether it is fixing a typo or designing a multi-tenant database schema — gets routed to the same expensive frontier model.
Consider the reality of a typical AI coding agent's workload:
- "Fix the typo in line 5" — a trivial task that a fast, cheap model handles in milliseconds
- "Implement the user authentication flow" — standard coding work requiring decent reasoning
- "Design a database schema for a multi-tenant SaaS" — complex architecture that genuinely needs frontier-level intelligence
If you are sending all three of these to Claude Opus, you are overpaying by an order of magnitude on the simple tasks. Haiku can handle quick fixes at roughly 1/10th the cost of Opus, with faster response times to boot.
The Solution: Dynamic Model Routing
The answer is a pattern we call dynamic model routing — letting the agent itself decide which model to use for each turn of the conversation. This is the same concept behind Cursor's popular "Auto Mode," where the editor intelligently picks the right model for each task.
The core idea is a three-tier model hierarchy:
- Low (Haiku) — Quick fixes, typos, simple changes. Fastest and cheapest.
- Medium (Sonnet) — Standard implementation, most coding tasks. The balanced default.
- High (Opus) — Architecture decisions, complex debugging, multi-step planning. Most capable.
The agent starts on Sonnet by default. When it encounters a task that warrants a different tier, it calls a tool to request a model switch. A middleware layer intercepts the next model call and routes it to the requested model. After one use, it resets back to the default.
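The request-then-reset flow above can be sketched independently of any framework. This is a minimal pure-Python illustration of the one-shot mechanic — the `pick_model` helper and dict-based state are ours, not the production API:

```python
DEFAULT_KEY = "sonnet"
LEVEL_TO_MODEL = {"low": "haiku", "medium": "sonnet", "high": "opus"}


def pick_model(state: dict) -> str:
    """Resolve the model for the next call, then clear the request (one-shot)."""
    level = state.get("requested_model_level")
    key = LEVEL_TO_MODEL.get(level, DEFAULT_KEY)
    state["requested_model_level"] = None  # reset after a single use
    return key


state = {"requested_model_level": None}
assert pick_model(state) == "sonnet"  # default tier

state["requested_model_level"] = "high"
assert pick_model(state) == "opus"    # honors the request once
assert pick_model(state) == "sonnet"  # then falls back to the default
```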
Implementation with LangGraph
At FRE|Nxt Labs, we have implemented this pattern in production using LangGraph. The entire implementation is approximately 150 lines of Python across three files.
Step 1: Define the State
Add fields to your LangGraph agent state to track model-switching requests:
```python
from typing import Annotated, Literal

from typing_extensions import TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    """Agent state with model switching support."""

    messages: Annotated[list[BaseMessage], add_messages]

    # Model switching fields
    requested_model_level: Literal["low", "medium", "high"] | None
    requested_model_reason: str | None

    # Track current model (for agent self-awareness)
    current_model: str | None
    current_model_display: str | None
```
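For orientation, a fresh state at the start of a session might look like this. This is a plain-dict sketch with illustrative default values — the field names match `AgentState`, and the Sonnet default mirrors the behavior described above:

```python
# Hypothetical starting state for a new session; field names match AgentState
initial_state = {
    "messages": [],
    "requested_model_level": None,  # no switch requested yet
    "requested_model_reason": None,
    "current_model": "claude-sonnet-4-5-20250929",  # the Sonnet default
    "current_model_display": "Claude Sonnet",
}

assert initial_state["requested_model_level"] is None
```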
Step 2: Create the Model-Switching Tool
The tool uses LangGraph's Command pattern to update state properly:
```python
from typing import Annotated, Literal  # also needed here for the tool signature

from langchain.tools import tool, ToolRuntime
from langchain_core.messages import ToolMessage
from langgraph.types import Command

LEVEL_TO_MODEL = {
    "low": "haiku",
    "medium": "sonnet",
    "high": "opus",
}


@tool
def set_next_model(
    level: Annotated[
        Literal["low", "medium", "high"],
        "Model quality level: 'low' (fast), 'medium' (balanced), 'high' (powerful)",
    ],
    reason: Annotated[str, "Brief explanation of why this model level is needed"],
    runtime: ToolRuntime[None, AgentState],
) -> Command:
    """Switch to a different model quality level for the NEXT response.

    Use this when you need different capability for upcoming work:
    - low: Quick fixes, typos, simple changes (fastest, cheapest)
    - medium: Standard implementation, most coding tasks (DEFAULT)
    - high: Architecture, complex debugging, multi-step planning
    """
    model_name = LEVEL_TO_MODEL.get(level, "sonnet")
    tool_message = ToolMessage(
        content=f"Model set to {model_name} ({level}) for next turn. Reason: {reason}",
        tool_call_id=runtime.tool_call_id,
    )
    return Command(
        update={
            "messages": [tool_message],
            "requested_model_level": level,
            "requested_model_reason": reason,
        }
    )
```
A critical detail: when a tool returns a `Command`, LangGraph requires the update to include a `ToolMessage` answering the pending tool call, or the run will fail. State changes must also go through `Command(update={...})` rather than mutating agent state directly.
Step 3: Build the Routing Middleware
The middleware intercepts every model call and routes to the appropriate model based on state:
```python
from typing import Any

from langchain.agents.middleware import wrap_model_call
from langchain_anthropic import ChatAnthropic

MODEL_DISPLAY_NAMES = {
    "haiku": "Claude Haiku",
    "sonnet": "Claude Sonnet",
    "opus": "Claude Opus",
}

# Cache model clients so repeated switches do not rebuild them
_MODEL_CACHE: dict[str, Any] = {}


def _get_model_for_key(model_key: str) -> ChatAnthropic:
    if model_key in _MODEL_CACHE:
        return _MODEL_CACHE[model_key]
    model_map = {
        "haiku": "claude-haiku-4-5-20251001",
        "sonnet": "claude-sonnet-4-5-20250929",
        "opus": "claude-opus-4-5-20251101",
    }
    model = ChatAnthropic(
        model_name=model_map.get(model_key, model_map["sonnet"]),
        max_tokens=16384,
    )
    _MODEL_CACHE[model_key] = model
    return model


@wrap_model_call
async def model_selection_middleware(request, handler):
    state = getattr(request, "state", None) or {}
    requested_level = state.get("requested_model_level")

    if requested_level in ("low", "medium", "high"):
        level_to_model = {"low": "haiku", "medium": "sonnet", "high": "opus"}
        model_key = level_to_model[requested_level]
        # Clear the request after use (one-shot behavior)
        state["requested_model_level"] = None
        state["requested_model_reason"] = None
    else:
        model_key = "sonnet"

    model = _get_model_for_key(model_key)
    if request.tools:
        model = model.bind_tools(request.tools, parallel_tool_calls=True)
    request.model = model
    return await handler(request)
```
Cost Analysis: Where the 10x Savings Come From
In a typical AI agent workload, tasks distribute roughly as follows:
- 60-70% of tasks are simple — quick fixes, typos, straightforward changes. Handled by Haiku.
- 20-25% of tasks are moderate — standard implementation work. Sonnet handles these well.
- 5-15% of tasks are complex — architecture decisions, multi-step debugging. These genuinely need Opus.
Haiku costs roughly 10x less than Opus per token. By routing the majority of simple tasks to Haiku and reserving Opus for tasks that truly need it, you achieve dramatic cost reductions without sacrificing quality where it matters.
For 1,000 agent turns:
- Without routing: 1,000 turns on Opus = baseline cost of $X
- With routing: 650 turns on Haiku + 250 turns on Sonnet + 100 turns on Opus ≈ $X/10
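The arithmetic behind that estimate can be checked directly. The per-turn unit prices below are illustrative ratios we are assuming, not real list prices: Haiku = 1 unit, with Sonnet about 2x and Opus about 10x that:

```python
# Assumed relative price per turn: Haiku = 1 unit, Sonnet ~2x, Opus ~10x
COST = {"haiku": 1, "sonnet": 2, "opus": 10}

baseline = 1_000 * COST["opus"]  # every turn on Opus
routed = (
    650 * COST["haiku"]    # simple tasks
    + 250 * COST["sonnet"]  # moderate tasks
    + 100 * COST["opus"]    # complex tasks
)

assert baseline == 10_000
assert routed == 2_150  # about a fifth of the all-Opus baseline
```

Note that on per-token price alone this mix yields roughly a 5x reduction; reaching the full 10x additionally depends on the simple tasks that land on Haiku also consuming fewer tokens per turn, which is typical but workload-dependent.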
Production Tips
Fallback Strategies
Always default to a capable middle-tier model (Sonnet). If the routing middleware encounters an unknown state or fails to determine the right model, it should fall back gracefully rather than error out.
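That rule reduces to a single lookup with a default. A minimal sketch — `resolve_model_key` is our illustrative name, not part of the LangGraph API:

```python
DEFAULT_MODEL_KEY = "sonnet"
LEVEL_TO_MODEL = {"low": "haiku", "medium": "sonnet", "high": "opus"}


def resolve_model_key(requested_level) -> str:
    """Map a requested level to a model key, defaulting to Sonnet on anything unexpected."""
    return LEVEL_TO_MODEL.get(requested_level, DEFAULT_MODEL_KEY)


assert resolve_model_key("high") == "opus"
assert resolve_model_key(None) == "sonnet"     # no request yet
assert resolve_model_key("ultra") == "sonnet"  # unknown value: degrade gracefully
```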
One-Shot Reset Behavior
After the requested model is used for one turn, the state resets to None. This prevents a single "high" request from keeping the agent on Opus indefinitely, which would negate the cost savings.
Why Not Switch Immediately?
The model switch takes effect on the next turn rather than immediately for three reasons:
- Message history integrity — Anthropic's API requires tool_use blocks to have matching tool_result blocks
- Reduced complexity — Managing partial responses during mid-turn switches is error-prone
- Better UX — Users do not see incomplete responses that get discarded
When NOT to Use Dynamic Routing
Skip it if your workload is uniformly complex, you are still prototyping and cost is not yet a concern, or your agent only handles a single type of task.
Results
In production implementations, this pattern delivers:
- 10x cost reduction on LLM spend by routing simple tasks to Haiku
- Quality preserved where it matters — complex tasks still get Opus-level reasoning
- Seamless user experience — model selection is transparent to users
- Full observability — logs show which model was selected and why for every turn
- Minimal overhead — approximately 150 lines of Python across 3 files
The key insight is that model switching happens between turns, not mid-generation. This keeps the implementation clean and avoids the complexity of interrupting model inference. Three files, one tool, one middleware — for significant and measurable cost savings.