The Token-Efficient LLM Router — Local↔Cloud Routing and a Budget Ledger
Patterns for keeping agent systems fast, cheap, and inside a cost ceiling.
Problem
As of 2026, the number-one reason production agent systems break down in the field isn’t hallucination or a bad tool call — it’s cost. Agents loop. Retries stack up. A multi-step plan that should have cost a few cents quietly balloons into thousands of model calls. Anyone in this field has probably heard some version of the cautionary tale: an unattended agent burned roughly $4,200 over 63 hours before anyone noticed — not malice, not a model bug, just one infinite loop calling a frontier model on every step.
The naive fix is “use a cheaper model.” That’s too blunt a tool. Cheap models fail on the hard 20%, and the fallback to a frontier model usually happens after you’ve already paid for the failed cheap attempt. The money leaks out through retries.
The better fix is a router — deciding, before spending any tokens, where each request should run (and whether it should run at all). Three parts make this work:
- A deterministic difficulty classifier (zero tokens) — easy work goes local, hard work goes to the cloud.
- Prompt caching — so you’re never billed twice for repeated context.
- A budget ledger + hard ceiling + failure escalation — so a runaway loop hits a wall, not your credit-card limit.
This is the pattern behind my ai-research-lab router. Here’s how to build it.
Approach
1. Classify difficulty without calling a model
The instinct is to ask the LLM itself, “is this hard?” — but that spends tokens on every request, a chicken-and-egg cost. Instead, classify deterministically, from cheap signals:
- Input length and structure
- Presence of code, formulas, or multi-hop reasoning markers
- Required tool calls / external lookups
- An explicit hint from the caller (a
complexityflag passed down from upstream)
A regex-plus-heuristics classifier runs at 0 tokens, in microseconds. It doesn’t need to be perfect — it just needs to be cheap and conservative. When in doubt, escalate, and let the budget ledger be the last line of defense.
2. Routing: easy → local, hard → cloud
Easy requests (summarization, classification, formatting, short Q&A) go to a local Ollama model — marginal cost is effectively zero, and there’s no network latency. Only genuinely complex requests — long-horizon reasoning, code generation, anything the classifier flags high — are allowed to spend cloud tokens.
3. Cache aggressively
A large share of an agent’s token spend is resending the same context: the system prompt, tool schemas, retrieved documents. Prompt caching bills that context once and lets you reference it cheaply afterward. For an agent with a stable system prompt and tool surface, this alone is a large saving.
4. Ledger + ceiling + escalation
Every call, local or cloud, is recorded in the budget ledger. The router checks remaining budget before it calls anything. If a request looks like it would exceed the ceiling, the router rejects or downgrades it rather than silently running it anyway. A local failure escalates to cloud exactly once, and that escalation itself is charged against the budget.
Code sketch
Generic and illustrative — no secrets, no vendor-specific keys.
from dataclasses import dataclass, field
import re
@dataclass
class Ledger:
ceiling_usd: float
spent_usd: float = 0.0
calls: list = field(default_factory=list)
def can_afford(self, est_usd: float) -> bool:
return (self.spent_usd + est_usd) <= self.ceiling_usd
def charge(self, route: str, usd: float):
self.spent_usd += usd
self.calls.append((route, usd))
def classify_difficulty(prompt: str, hint: str | None = None) -> str:
"""Zero-token, deterministic. Returns 'simple' or 'complex'."""
if hint in {"simple", "complex"}:
return hint
signals = 0
if len(prompt) > 2000:
signals += 1
if re.search(r"```|def |class |SELECT |import ", prompt):
signals += 1 # code present
if re.search(r"\b(prove|derive|optimi[sz]e|multi-step|plan)\b", prompt, re.I):
signals += 1 # reasoning marker
if re.search(r"\b(then|after that|finally|step \d)\b", prompt, re.I):
signals += 1 # multi-hop
return "complex" if signals >= 2 else "simple"
# Rough per-call cost (USD); local marginal cost is ~0
COST = {"local": 0.0, "cloud": 0.03}
def route(prompt: str, ledger: Ledger, hint: str | None = None) -> str:
difficulty = classify_difficulty(prompt, hint)
target = "local" if difficulty == "simple" else "cloud"
if not ledger.can_afford(COST[target]):
# Hard ceiling: downgrade or reject, never silently overspend
if target == "cloud" and ledger.can_afford(COST["local"]):
target = "local" # gracefully downgrade
else:
raise BudgetExceeded(f"would exceed ceiling at {ledger.spent_usd:.2f}")
result, ok = call_model(target, prompt) # your model adapter
# Local-first, with one controlled escalation
if not ok and target == "local" and ledger.can_afford(COST["cloud"]):
ledger.charge("local", COST["local"])
result, ok = call_model("cloud", prompt)
target = "cloud"
ledger.charge(target, COST[target])
return result
class BudgetExceeded(Exception):
pass
The essence isn’t the exact heuristics — it’s the shape: a free classifier, a routing decision, a ledger check before spending, and an escalation that itself gets counted against the budget.
Results and trade-offs
What this pattern actually buys you:
- The large majority of requests never touch the cloud. In a typical agent workload, simple, repetitive calls dominate by count. Route those locally, and most of the volume runs at ~0 marginal cost.
- Cost becomes a boundary, not a hope. A runaway loop hits the ledger’s ceiling and stops. “A $4,200 blowup” turns into “the job stopped at its $20 budget and paged me.”
- Latency drops — most of what’s routed locally skips the network round trip.
Trade-offs you accept:
- Classifier errors cost money. A request wrongly sent local as “easy” that then fails pays for the local attempt plus the escalation. Keep the classifier conservative, and measure its precision.
- Local capacity is a real constraint. Without hardware and model management (warm models,
keep_alive), local latency spikes. - The ledger needs accurate cost estimates. If the estimates are garbage, the ceiling is meaningless. Regularly reconcile estimated vs. actual spend.
Lessons
- Decide before you spend. The cheapest token is the one you route around and never use. A zero-token classifier in front of every request is worth more than a slightly cheaper model behind it.
- A budget ceiling is a safety feature, not a finance feature. It’s the breaker that turns a catastrophic runaway into a logged, recoverable stop.
- Local-first is a latency strategy as much as a cost strategy. Treat the cloud not as the default, but as an expensive escalation path.
- Measure routing precision the way you’d measure model accuracy. The router is a model of your workload; get it wrong, and everything downstream pays for it.
What’s next: an adaptive classifier that learns routing thresholds from the ledger’s escalation history — so that instead of hand-tuned heuristics, the local/cloud split retunes itself automatically as the workload shifts.