You can cut LLM costs 50-70% without quality loss using three layered levers: dynamic model routing to send each request to the cheapest viable model tier (save 40-60%), prompt and token optimization to remove input and output waste (save 20-30%), and semantic plus prompt caching with request batching on repeat and bulk workloads (save 30-50%).
Last updated: May 17, 2026.
Your LLM Bill Is Probably 3-5x Too High
Most teams deploying LLMs in production are overspending by 3-5x. Not because they are using expensive models, but because they are using the wrong model for each request, making redundant API calls, and sending far more tokens than necessary.
At FRE|Nxt Labs, we have optimized LLM costs across multiple production deployments. The playbook is consistent: three levers that together reduce costs by 50-70% without any quality degradation. Here is how.
Lever 1: Dynamic Model Routing (Save 40-60%)
The single biggest cost optimization is using the right model for each request. Most teams default to GPT-4 or Claude Sonnet for everything. But 60-80% of requests do not need a frontier model.
How It Works
Build a routing layer that classifies incoming requests by complexity and routes them to the appropriate model tier:
- Tier 1 (Simple): GPT-4o-mini or Claude Haiku. Handles: simple lookups, formatting, classification, extraction from clean data. Cost: ~$0.15/1M input tokens.
- Tier 2 (Medium): GPT-4o or Claude Sonnet. Handles: summarization, analysis, moderate reasoning, code generation. Cost: ~$2.50/1M input tokens.
- Tier 3 (Complex): GPT-4 or Claude Opus. Handles: complex reasoning, multi-step planning, nuanced writing. Cost: ~$15/1M input tokens.
Building the Router
The router itself can be surprisingly simple. In our production deployments, we use a combination of:
- Heuristic rules: query length, presence of code blocks, request type from API metadata
- Lightweight classifier: a small model (or even a regex-based classifier) that predicts complexity from the query
- Fallback escalation: if a cheap model's response fails quality checks, automatically retry with a more capable model
In a recent engagement, we implemented dynamic model routing with LangGraph that reduced inference costs by 40% while maintaining identical output quality. The key insight: most of the "hard" requests were not actually hard, they just looked that way before classification.
Lever 2: Prompt & Token Optimization (Save 20-30%)
Every token you send to an LLM costs money. Most prompts contain significant waste, verbose instructions, redundant context, and unstructured outputs that consume unnecessary completion tokens.
Reduce Input Tokens
- Compress system prompts: most system prompts can be reduced by 30-50% without losing effectiveness. Remove examples that are redundant, tighten instructions, and use structured formats
- Truncate context: for RAG applications, only include the most relevant chunks. A well-tuned re-ranker selecting the top 3-5 chunks outperforms dumping 20 chunks into the context
- Use structured inputs: JSON or XML-structured inputs are more token-efficient than natural language descriptions of the same information
Reduce Output Tokens
- Structured outputs: use JSON mode or function calling instead of asking for free-form text. This eliminates verbose explanations and filler
- Max tokens limits: set appropriate max_tokens for each use case. A classification task does not need 2,000 tokens of completion
- Response streaming: stream responses and stop generation early when you have what you need
Lever 3: Caching & Batching (Save 30-50% on Repeat Queries)
Semantic Caching
Many LLM applications see significant query repetition. Customer support bots, search systems, and content tools often process queries that are semantically identical but worded differently.
Semantic caching works by:
- Embedding the incoming query
- Checking for similar queries in a vector cache (similarity threshold: 0.95+)
- Returning the cached response if a match is found
- Calling the LLM and caching the response if no match
In production, we have achieved 90%+ cache hit rates on customer-facing Q&A systems. That is a 90% reduction in LLM API calls for those workloads.
Prompt Caching
Both OpenAI and Anthropic now offer prompt caching, where repeated prefixes (like system prompts) are cached server-side at a discount. This is free performance, and the discounts below come from each provider's published pricing documentation:
- Anthropic: 90% discount on cached prompt tokens, per Anthropic's pricing docs
- OpenAI: 50% discount on cached prompt tokens, per OpenAI's pricing docs
Structure your prompts to maximize the shared prefix. Put the system prompt and static instructions first, and variable content (user query, retrieved context) last.
Request Batching
If your workload is not latency-sensitive, batch API calls. Both OpenAI and Anthropic offer batch APIs with 50% discounts. For background processing, evaluation, and bulk generation tasks, batching halves your costs instantly.
Measuring the Impact
You cannot optimize what you cannot measure. Before implementing any optimization, set up per-request cost tracking:
- Track cost per request: input tokens, output tokens, model used, total cost
- Track cost per feature: which product features drive the most LLM spend?
- Track quality metrics: ensure optimizations do not degrade output quality
- Set up alerts: catch cost spikes before they become expensive surprises
Tools like LangSmith make this straightforward, every LLM call is logged with token counts, latency, and cost. If you are not using an observability tool, you are optimizing blind.
The Bottom Line
LLM cost optimization is not a single technique, it is a layered strategy:
- Model routing saves 40-60% by using the right model for each request
- Prompt optimization saves 20-30% by reducing token waste
- Caching and batching saves 30-50% on repeat and bulk workloads
Combined, these strategies consistently deliver 50-70% cost reduction in our client engagements, often within the first month of deployment. The savings typically pay for the optimization engagement within 2-3 months, and then it is pure savings going forward.
If you are spending more than $1K/month on LLM APIs, there is almost certainly significant savings available. The question is not whether you can save. It is how much. If you would rather hand the playbook to a partner, that is exactly what our LLM cost optimization engagements deliver, with per-request cost tracking baked in.
FAQ
How much can I realistically cut my LLM bill?
In our client engagements the combined playbook consistently delivers 50-70% cost reduction, often within the first month of deployment, with no quality degradation. Model routing contributes 40-60%, prompt optimization 20-30%, and caching plus batching 30-50% on repeat and bulk workloads. The levers compound rather than overlap fully.
What is the single highest-impact LLM cost lever?
Dynamic model routing. Most teams default to a frontier model for everything, but 60-80% of requests do not need one. Classifying requests by complexity and routing them to the right tier is the biggest single saving. In a recent engagement this alone reduced inference costs 40% while keeping output quality identical.
What is semantic caching and when does it help?
Semantic caching embeds an incoming query, checks a vector cache for a near-identical prior query above a high similarity threshold, and returns the cached response on a match. It helps workloads with heavy query repetition like support bots and search. In our production work it reached 90%+ hit rates on customer-facing Q&A systems.
Why do I need cost tracking before optimizing?
You cannot optimize what you cannot measure. Before any change, track cost per request (input tokens, output tokens, model, total), cost per feature, and quality metrics, and set alerts for spikes. Without per-request tracking you are optimizing blind and cannot prove that a change cut cost without degrading output.