The Token-Efficient LLM Router — Local↔Cloud Routing and a Budget Ledger

  • llm
  • cost
  • routing
  • agents

Patterns for keeping agent systems fast, cheap, and inside a cost ceiling.

Problem

As of 2026, the number-one reason production agent systems break down in the field isn’t hallucination or a bad tool call — it’s cost. Agents loop. Retries stack up. A multi-step plan that should have cost a few cents quietly balloons into thousands of model calls. Anyone in this field has probably heard some version of the cautionary tale: an unattended agent burned roughly $4,200 over 63 hours before anyone noticed — not malice, not a model bug, just one infinite loop calling a frontier model on every step.

The naive fix is “use a cheaper model.” That’s too blunt a tool. Cheap models fail on the hard 20%, and the fallback to a frontier model usually happens after you’ve already paid for the failed cheap attempt. The money leaks out through retries.

The better fix is a router: a component that decides, before spending any tokens, where each request should run (and whether it should run at all). Three parts make this work:

  1. A deterministic difficulty classifier (zero tokens) — easy work goes local, hard work goes to the cloud.
  2. Prompt caching: repeated context is billed at a discount instead of at full price on every call.
  3. A budget ledger + hard ceiling + failure escalation — so a runaway loop hits a wall, not your credit-card limit.

This is the pattern behind my ai-research-lab router. Here’s how to build it.

Approach

1. Classify difficulty without calling a model

The instinct is to ask the LLM itself, “is this hard?”, but that spends tokens on every request just to decide whether to spend tokens. Instead, classify deterministically, from cheap signals:

  • Input length and structure
  • Presence of code, formulas, or multi-hop reasoning markers
  • Required tool calls / external lookups
  • An explicit hint from the caller (a complexity flag passed down from upstream)

A regex-plus-heuristics classifier runs at 0 tokens, in microseconds. It doesn’t need to be perfect — it just needs to be cheap and conservative. When in doubt, escalate, and let the budget ledger be the last line of defense.

2. Routing: easy → local, hard → cloud

Easy requests (summarization, classification, formatting, short Q&A) go to a local Ollama model — marginal cost is effectively zero, and there’s no network latency. Only genuinely complex requests — long-horizon reasoning, code generation, anything the classifier flags high — are allowed to spend cloud tokens.
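For the local side, here is a minimal sketch of what an adapter might look like, assuming an Ollama server on its default port and its /api/generate endpoint. The function names, the placeholder model name llama3.2, and the (text, ok) return shape are my own assumptions for illustration, not part of any particular router.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_ollama_request(prompt: str, model: str = "llama3.2",
                         keep_alive: str = "10m") -> dict:
    """Build a non-streaming generate payload (the model name is a placeholder)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,           # one JSON object back instead of a stream
        "keep_alive": keep_alive,  # keep the model loaded between calls
    }

def call_local(prompt: str, timeout_s: float = 30.0) -> tuple[str, bool]:
    """Return (text, ok); any transport or parse error counts as a failure."""
    body = json.dumps(build_ollama_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            text = json.loads(resp.read()).get("response", "")
    except (OSError, ValueError):  # connection errors, timeouts, bad JSON
        return "", False
    return text, bool(text.strip())
```

Treating an empty response as a failure is a deliberately crude success check; a real adapter would probably validate the output against whatever the caller expects.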

3. Cache aggressively

A large share of an agent’s token spend is resending the same context: the system prompt, tool schemas, retrieved documents. With prompt caching, that context is billed at full price when it is first written to the cache and at a reduced rate on later calls that reuse it, for as long as the cache entry lives. For an agent with a stable system prompt and tool surface, this alone is a large saving.
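Provider-side caches generally match on an identical prompt prefix, so caching only pays off if the stable parts come first and serialize the same way every time. The sketch below is vendor-neutral and uses hypothetical helper names; it shows one way to keep the prefix byte-stable and to detect when it drifts.

```python
import hashlib
import json

def build_prompt(system: str, tool_schemas: list[dict], stable_docs: list[str],
                 user_msg: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix); only the prefix can be cached."""
    # sort_keys makes the serialized schemas byte-identical across calls,
    # regardless of the order in which the dicts were built
    tools = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    docs = "\n".join(stable_docs)
    prefix = f"{system}\n\n[tools]\n{tools}\n\n[docs]\n{docs}\n"
    suffix = f"\n[user]\n{user_msg}"   # per-request content goes last
    return prefix, suffix

def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per call; if it changes, expect a cache miss."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint next to each ledger entry makes it easy to spot the day a timestamp or a reordered schema sneaks into the prefix and quietly defeats the cache.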

4. Ledger + ceiling + escalation

Every call, local or cloud, is recorded in the budget ledger. The router checks remaining budget before it calls anything. If a request looks like it would exceed the ceiling, the router rejects or downgrades it rather than silently running it anyway. A local failure escalates to cloud exactly once, and that escalation itself is charged against the budget.
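Checking the budget before a call requires an estimate before the call. A rough sketch, assuming about four characters per token for English text (a crude heuristic, not a tokenizer) and made-up per-million-token prices; the names and numbers are placeholders to be replaced with your provider's actual rates.

```python
import math

# Hypothetical prices in USD per million tokens; substitute real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}
CHARS_PER_TOKEN = 4  # crude heuristic for English text, not a tokenizer

def estimate_cost_usd(prompt: str, max_output_tokens: int) -> float:
    """Pessimistic on output: assumes the model uses its full output allowance."""
    input_tokens = math.ceil(len(prompt) / CHARS_PER_TOKEN)
    return (input_tokens * PRICE_PER_MTOK["input"]
            + max_output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
```

Erring on the high side is the point: an estimate that is too high rejects a call that might have fit, while one that is too low lets the ledger drift past its ceiling.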

Code sketch

Generic and illustrative — no secrets, no vendor-specific keys.

from dataclasses import dataclass, field
import re

@dataclass
class Ledger:
    ceiling_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def can_afford(self, est_usd: float) -> bool:
        return (self.spent_usd + est_usd) <= self.ceiling_usd

    def charge(self, route: str, usd: float):
        self.spent_usd += usd
        self.calls.append((route, usd))


def classify_difficulty(prompt: str, hint: str | None = None) -> str:
    """Zero-token, deterministic. Returns 'simple' or 'complex'."""
    if hint in {"simple", "complex"}:
        return hint
    signals = 0
    if len(prompt) > 2000:
        signals += 1
    if re.search(r"```|def |class |SELECT |import ", prompt):
        signals += 1          # code present
    if re.search(r"\b(prove|derive|optimi[sz]e|multi-step|plan)\b", prompt, re.I):
        signals += 1          # reasoning marker
    if re.search(r"\b(then|after that|finally|step \d)\b", prompt, re.I):
        signals += 1          # multi-hop
    return "complex" if signals >= 2 else "simple"


# Rough per-call cost (USD); local marginal cost is ~0
COST = {"local": 0.0, "cloud": 0.03}

def route(prompt: str, ledger: Ledger, hint: str | None = None) -> str:
    difficulty = classify_difficulty(prompt, hint)
    target = "local" if difficulty == "simple" else "cloud"

    if not ledger.can_afford(COST[target]):
        # Hard ceiling: downgrade or reject, never silently overspend
        if target == "cloud" and ledger.can_afford(COST["local"]):
            target = "local"   # gracefully downgrade
        else:
            raise BudgetExceeded(f"would exceed ceiling at {ledger.spent_usd:.2f}")

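    result, ok = call_model(target, prompt)   # your model adapter; returns (result, ok)
    ledger.charge(target, COST[target])       # record every attempt, including failures

    # Local-first, with exactly one controlled escalation
    if not ok and target == "local" and ledger.can_afford(COST["cloud"]):
        result, ok = call_model("cloud", prompt)
        ledger.charge("cloud", COST["cloud"])  # the escalation counts against the budget too

    # NOTE: result may still be a failed attempt (cloud failed, or escalation was unaffordable)
    return result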


class BudgetExceeded(Exception):
    pass

The essence isn’t the exact heuristics — it’s the shape: a free classifier, a routing decision, a ledger check before spending, and an escalation that itself gets counted against the budget.

Results and trade-offs

What this pattern actually buys you:

  • The large majority of requests never touch the cloud. In a typical agent workload, simple, repetitive calls dominate by count. Route those locally, and most of the volume runs at ~0 marginal cost.
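  • Cost becomes a boundary, not a hope. A runaway loop that spends cloud tokens hits the ledger’s ceiling and stops: “a $4,200 blowup” turns into “the job stopped at its $20 budget and paged me.” A loop that stays on the free local route never trips a dollar ceiling, so cap call counts or wall-clock time for that case.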
  • Latency drops — most of what’s routed locally skips the network round trip.
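To make the ceiling concrete, here is a small self-contained simulation of a runaway loop meeting a $20 budget. It uses a stripped-down ledger with a hypothetical charge_or_stop method and an assumed flat cost of $0.03 per step; it is an illustration of the behavior, not the router itself.

```python
class BudgetExceeded(Exception):
    pass

class Ledger:
    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge_or_stop(self, usd: float) -> None:
        """Check before spending; refuse the call that would cross the ceiling."""
        if self.spent_usd + usd > self.ceiling_usd:
            raise BudgetExceeded(f"stopped at ${self.spent_usd:.2f}")
        self.spent_usd += usd

def run_buggy_agent(ledger: Ledger, cost_per_step: float = 0.03) -> int:
    """Simulate an agent whose loop never terminates on its own."""
    steps = 0
    try:
        while True:                       # the bug: no exit condition
            ledger.charge_or_stop(cost_per_step)
            steps += 1
    except BudgetExceeded:
        return steps                      # this is where you would page someone

ledger = Ledger(ceiling_usd=20.00)
steps = run_buggy_agent(ledger)
print(f"halted after {steps} steps, spent ${ledger.spent_usd:.2f}")
```

The loop is still broken, but the damage is bounded by a number you chose in advance rather than by how long it takes someone to notice.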

Trade-offs you accept:

  • Classifier errors cost money. A request wrongly sent local as “easy” that then fails pays for the local attempt plus the escalation. Keep the classifier conservative, and measure its precision.
  • Local capacity is a real constraint. Without hardware and model management (warm models, keep_alive), local latency spikes.
  • The ledger needs accurate cost estimates. If the estimates are garbage, the ceiling is meaningless. Regularly reconcile estimated vs. actual spend.
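Reconciliation can be as simple as comparing the ledger's estimates with what the provider actually billed for the same calls. A minimal sketch, assuming you can export both as per-call lists; the function name and the 10% tolerance are arbitrary choices for illustration.

```python
def reconcile(estimated_usd: list[float], actual_usd: list[float],
              tolerance: float = 0.10) -> dict:
    """Compare ledger estimates with billed amounts for the same calls."""
    if len(estimated_usd) != len(actual_usd):
        raise ValueError("estimates and actuals must cover the same calls")
    est, act = sum(estimated_usd), sum(actual_usd)
    if est:
        drift = (act - est) / est          # +0.20 means 20% underestimated
    else:
        drift = float("inf") if act else 0.0
    return {"estimated": est, "actual": act, "drift": drift,
            "ok": abs(drift) <= tolerance}
```

A positive drift is the dangerous direction: it means the real spend is running ahead of the ledger, so the effective ceiling is higher than the one you configured.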

Lessons

  • Decide before you spend. The cheapest token is the one you route around and never use. A zero-token classifier in front of every request is worth more than a slightly cheaper model behind it.
  • A budget ceiling is a safety feature, not a finance feature. It’s the breaker that turns a catastrophic runaway into a logged, recoverable stop.
  • Local-first is a latency strategy as much as a cost strategy. Treat the cloud not as the default, but as an expensive escalation path.
  • Measure routing precision the way you’d measure model accuracy. The router is a model of your workload; get it wrong, and everything downstream pays for it.
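One way to put a number on routing precision is to read it off the escalation history: of everything the classifier called simple, how much did the local model actually handle? The record type and function below are hypothetical, a sketch of the measurement rather than a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class RoutedCall:
    predicted: str      # "simple" or "complex", as the classifier labeled it
    escalated: bool     # True if the local attempt failed and went to cloud

def simple_precision(history: list[RoutedCall]) -> float:
    """Share of 'simple' predictions that the local model handled on its own."""
    simple = [c for c in history if c.predicted == "simple"]
    if not simple:
        return 1.0  # nothing routed local, so nothing was misrouted
    return sum(not c.escalated for c in simple) / len(simple)
```

This only captures one kind of error. Requests labeled complex that a local model could have handled never show up as escalations, so measuring that side needs occasional shadow runs or a labeled sample.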

What’s next: an adaptive classifier that learns routing thresholds from the ledger’s escalation history — so that instead of hand-tuned heuristics, the local/cloud split retunes itself automatically as the workload shifts.