The Token-Efficient LLM Router — Local↔Cloud Routing and a Budget Ledger

Patterns for keeping agent systems fast, cheap, and inside a cost ceiling.

Problem

As of 2026, the number-one reason production agent systems break down in the field isn’t hallucination or a bad tool call — it’s cost. Agents loop. Retries stack up. A multi-step plan that should have cost a few cents quietly balloons into thousands of model calls. Anyone in this field has probably heard some version of the cautionary tale: an unattended agent burned roughly $4,200 over 63 hours before anyone noticed — not malice, not a model bug, just one infinite loop calling a frontier model on every step.

The naive fix is “use a cheaper model.” That’s too blunt a tool. Cheap models fail on the hard 20%, and the fallback to a frontier model usually happens after you’ve already paid for the failed cheap attempt. The money leaks out through retries.

The better fix is a router — deciding, before spending any tokens, where each request should run (and whether it should run at all). Three parts make this work:

A deterministic difficulty classifier (zero tokens) — easy work goes local, hard work goes to the cloud.
Prompt caching — so you’re never billed twice for repeated context.
A budget ledger + hard ceiling + failure escalation — so a runaway loop hits a wall, not your credit-card limit.

This is the pattern behind my ai-research-lab router. Here’s how to build it.

Approach

1. Classify difficulty without calling a model

The instinct is to ask the LLM itself, “is this hard?” — but that spends tokens on every request, a chicken-and-egg cost. Instead, classify deterministically, from cheap signals:

Input length and structure
Presence of code, formulas, or multi-hop reasoning markers
Required tool calls / external lookups
An explicit hint from the caller (a complexity flag passed down from upstream)

A regex-plus-heuristics classifier runs at 0 tokens, in microseconds. It doesn’t need to be perfect — it just needs to be cheap and conservative. When in doubt, escalate, and let the budget ledger be the last line of defense.

2. Routing: easy → local, hard → cloud

Easy requests (summarization, classification, formatting, short Q&A) go to a local Ollama model — marginal cost is effectively zero, and there’s no network latency. Only genuinely complex requests — long-horizon reasoning, code generation, anything the classifier flags high — are allowed to spend cloud tokens.

3. Cache aggressively

A large share of an agent’s token spend is resending the same context: the system prompt, tool schemas, retrieved documents. Prompt caching bills that context once and lets you reference it cheaply afterward. For an agent with a stable system prompt and tool surface, this alone is a large saving.

4. Ledger + ceiling + escalation

Every call, local or cloud, is recorded in the budget ledger. The router checks remaining budget before it calls anything. If a request looks like it would exceed the ceiling, the router rejects or downgrades it rather than silently running it anyway. A local failure escalates to cloud exactly once, and that escalation itself is charged against the budget.

Code sketch

Generic and illustrative — no secrets, no vendor-specific keys.

from dataclasses import dataclass, field
import re

@dataclass
class Ledger:
    ceiling_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def can_afford(self, est_usd: float) -> bool:
        return (self.spent_usd + est_usd) <= self.ceiling_usd

    def charge(self, route: str, usd: float):
        self.spent_usd += usd
        self.calls.append((route, usd))


def classify_difficulty(prompt: str, hint: str | None = None) -> str:
    """Zero-token, deterministic. Returns 'simple' or 'complex'."""
    if hint in {"simple", "complex"}:
        return hint
    signals = 0
    if len(prompt) > 2000:
        signals += 1
    if re.search(r"```|def |class |SELECT |import ", prompt):
        signals += 1          # code present
    if re.search(r"\b(prove|derive|optimi[sz]e|multi-step|plan)\b", prompt, re.I):
        signals += 1          # reasoning marker
    if re.search(r"\b(then|after that|finally|step \d)\b", prompt, re.I):
        signals += 1          # multi-hop
    return "complex" if signals >= 2 else "simple"


# Rough per-call cost (USD); local marginal cost is ~0
COST = {"local": 0.0, "cloud": 0.03}

def route(prompt: str, ledger: Ledger, hint: str | None = None) -> str:
    difficulty = classify_difficulty(prompt, hint)
    target = "local" if difficulty == "simple" else "cloud"

    if not ledger.can_afford(COST[target]):
        # Hard ceiling: downgrade or reject, never silently overspend
        if target == "cloud" and ledger.can_afford(COST["local"]):
            target = "local"   # gracefully downgrade
        else:
            raise BudgetExceeded(f"would exceed ceiling at {ledger.spent_usd:.2f}")

    result, ok = call_model(target, prompt)   # your model adapter

    # Local-first, with one controlled escalation
    if not ok and target == "local" and ledger.can_afford(COST["cloud"]):
        ledger.charge("local", COST["local"])
        result, ok = call_model("cloud", prompt)
        target = "cloud"

    ledger.charge(target, COST[target])
    return result


class BudgetExceeded(Exception):
    pass

The essence isn’t the exact heuristics — it’s the shape: a free classifier, a routing decision, a ledger check before spending, and an escalation that itself gets counted against the budget.

Results and trade-offs

What this pattern actually buys you:

The large majority of requests never touch the cloud. In a typical agent workload, simple, repetitive calls dominate by count. Route those locally, and most of the volume runs at ~0 marginal cost.
Cost becomes a boundary, not a hope. A runaway loop hits the ledger’s ceiling and stops. “A $4,200 blowup” turns into “the job stopped at its $20 budget and paged me.”
Latency drops — most of what’s routed locally skips the network round trip.

Trade-offs you accept:

Classifier errors cost money. A request wrongly sent local as “easy” that then fails pays for the local attempt plus the escalation. Keep the classifier conservative, and measure its precision.
Local capacity is a real constraint. Without hardware and model management (warm models, keep_alive), local latency spikes.
The ledger needs accurate cost estimates. If the estimates are garbage, the ceiling is meaningless. Regularly reconcile estimated vs. actual spend.

Lessons

Decide before you spend. The cheapest token is the one you route around and never use. A zero-token classifier in front of every request is worth more than a slightly cheaper model behind it.
A budget ceiling is a safety feature, not a finance feature. It’s the breaker that turns a catastrophic runaway into a logged, recoverable stop.
Local-first is a latency strategy as much as a cost strategy. Treat the cloud not as the default, but as an expensive escalation path.
Measure routing precision the way you’d measure model accuracy. The router is a model of your workload; get it wrong, and everything downstream pays for it.

What’s next: an adaptive classifier that learns routing thresholds from the ledger’s escalation history — so that instead of hand-tuned heuristics, the local/cloud split retunes itself automatically as the workload shifts.