Learn how to build LLM query routers that balance cost, speed, and quality. Use the Decide → Probe → Verify → Escalate pattern with rule-based checks, verifiers, and escalation rules, then advance to learned routers with calibration, safety, and logging.
Promise: by the end of this guide, you’ll know how to decide when a query can be handled by a smaller, cheaper model — and when to escalate to a larger one — and you’ll be able to build a working LLM router (rule-based or learned) that delivers strong quality at a fraction of the cost.
If you run everything through your strongest model, you’ll likely get good answers — and an ugly bill. If you run everything through a small model, you’ll move fast — and quietly bleed quality and trust. Query routing is the middle path: a thin layer that triages each request and chooses the right capacity for this input.
Two ideas ground the practice:
Not all queries are equally hard. Many are routine (rewrite this, extract that); a minority are gnarly (ambiguous, multi-step, safety-sensitive). Sending both to the same model is wasteful.
Difficulty can often be inferred. You can estimate it from the input (length, domain, signals of math/code), or from a quick probe (a fast draft + confidence check), and then decide whether to escalate.
Research backs this up: FrugalGPT formalized cascades that route to cheaper or stronger LLMs per-query and reported large cost reductions at comparable quality; RouteLLM turned routing into a learnable decision problem trained on preference data, showing 2×+ cost cuts without quality loss in their evaluations. (arXiv, lmsys.org)
Routing means picking among models (or prompts) for an incoming query. The simplest form is a two-stage cascade:
Try a small model with tight instructions and a self-check.
If the self-check flags uncertainty or policy risk, escalate to a large model.
This is different from per-token speedups like CALM (early exits inside one model) or speculative decoding (a small model proposes tokens the big one verifies). Those speed up a fixed model; routing decides which model to call at all. You can combine them. (arXiv, NeurIPS Papers)
Where you’ll see routing:
Helpdesks & RPA: routine triaged to a small model; tricky or policy-heavy escalated.
Data extraction: straightforward fields to a small extractor; messy scans to a bigger model.
Mixed workloads: general chat, math, code, retrieval — each routed to the best specialist.
Frameworks exist if you prefer a head start (e.g., RouteLLM or LlamaIndex Router), but it’s valuable to understand the core mechanics first. (lmsys.org, docs.llamaindex.ai)
A good router balances a quality bar (what counts as “good enough” for the task) against a cost/latency budget (how much you can spend per query). The mental model:
Decide → Probe → Verify → Escalate (if needed).
Decide (pre-check): From the input alone, estimate difficulty and risk. Signals: domain (legal, medical, code), presence of math or tables, length, user intent (ask vs. task), and safety triggers.
Probe: If the pre-check says “likely easy,” call the small model with a strict format and a self-report of confidence.
Verify: Validate the small model’s output using cheap checks: JSON schema, regexes, invariants (“must include currency code”), retrieval-based grounding, or a tiny verifier model.
Escalate: If verification fails or confidence is low, escalate to the large model. Optionally include the small model’s draft as context to save tokens.
That’s the core pattern. The rest of this guide shows you how to implement it cleanly.
Below is a minimal, production-shaped design you can implement in any stack.
What it does: classifies a query into ROUTINE or COMPLEX using cheap heuristics; attempts the small model; self-checks; escalates only if needed.
Router policy (rule-based v1):
If input contains tell-tale signals of complexity (math operators, code blocks, “legal,” “diagnose,” URLs to crawl, or >600 tokens), skip straight to the large model.
Else, ask the small model for an answer and a JSON self-assessment:
{ "answer": ..., "confidence": 1-5, "needs_escalation": true/false, "reasons": [...] }
Accept small-model results only if confidence ≥ 4, no validation errors, and no safety flags.
Otherwise, escalate.
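The "Decide" pre-check above can be sketched with a few regexes. This is a minimal, illustrative version: the signal patterns and the 600-token cutoff mirror the rules listed here, and the whitespace token estimate is a stand-in for your real tokenizer.

```python
import re

# Signals that suggest a query is COMPLEX; patterns follow the rule-based
# policy above and should be tuned against your own traffic.
COMPLEXITY_PATTERNS = [
    re.compile(r"```"),                          # fenced code blocks
    re.compile(r"\b(legal|diagnose)\b", re.I),   # sensitive-domain keywords
    re.compile(r"https?://\S+"),                 # URLs to crawl
    re.compile(r"\d\s*[-+*/^=]\s*\d"),           # inline math operators
]

def estimate_tokens(text: str) -> int:
    # Crude whitespace proxy; swap in your tokenizer for real counts.
    return len(text.split())

def pre_check(query: str) -> str:
    """Classify from the input alone: 'COMPLEX' or 'ROUTINE'."""
    if estimate_tokens(query) > 600:
        return "COMPLEX"
    if any(p.search(query) for p in COMPLEXITY_PATTERNS):
        return "COMPLEX"
    return "ROUTINE"
```

Anything classified ROUTINE then goes to the small model with the self-check prompt; COMPLEX skips straight to the large model.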
Self-check prompt (drop-in): Use this when you want the small model to both answer and judge whether to escalate.
You are a careful assistant. Answer the user’s request, then self-assess.
Return strict JSON with keys "answer" (string), "confidence" (integer 1–5), "needs_escalation" (boolean), and "reasons" (array of short strings).
Verifier (cheap):
If the task expects structured output, validate JSON schema.
If the task expects facts, require 1–2 citations from an internal index or RAG; reject if missing.
If the task expects math, recompute with a tiny Python sandbox (or a deterministic math tool).
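The three cheap checks above might look like this in practice. It's a sketch under assumptions: the `[source: ...]` citation convention comes from the exercise later in this guide, and the `ast`-based arithmetic evaluator stands in for a real sandboxed tool.

```python
import ast
import json
import operator as op
import re

def verify_schema(raw: str, required=("answer", "confidence")) -> bool:
    # Structured-output check: must parse as JSON and contain required keys.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required)

def verify_citations(answer: str, min_citations: int = 1) -> bool:
    # Fact check: require "[source: ...]" markers (an assumed convention).
    return len(re.findall(r"\[source:[^\]]+\]", answer)) >= min_citations

def verify_math(expression: str, claimed: float) -> bool:
    # Math check: recompute deterministically; simple arithmetic only.
    ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return abs(ev(ast.parse(expression, mode="eval").body) - claimed) < 1e-9
```

Note the checks are deterministic and fast: they should add milliseconds, not model calls.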
What you’ll observe: on mixed corpora, this simple policy often routes 50–80% of traffic to the small model with little quality loss on routine asks. Replacing the heuristics with a learned router usually pushes that share higher. (See FrugalGPT and RouteLLM for empirical reports.) (arXiv)
Rules are great until they’re not. A learned router usually pays for itself quickly on real workloads.
Ingredients for a learned router:
Labeled examples. For each query, record: input text, ground-truth acceptance (pass/fail against your rubric), small-model attempt, verifier pass/fail, and whether the large model fixed it.
Features. Start simple: input length; counts (digits, code fences, math symbols); domain keywords; embedding of the input (cosine sim to known “hard” clusters); small-model confidence; verifier signals (booleans).
Target. Binary: “small model good enough” vs. “escalate.” Optimize for cost-aware utility, not just accuracy.
Training loop: split by user/source to avoid leakage. Train a light classifier (logistic regression, gradient boosting). Calibrate probabilities (Platt or isotonic). Pick a threshold that maximizes Expected Utility = QualityScore − λ·Cost − μ·Latency for your business λ, μ. Re-calibrate regularly.
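The training loop above can be sketched end to end with scikit-learn. Everything here is illustrative: the synthetic data stands in for your labeled traffic, the split should really be by user/source, and the λ and cost constants are placeholders for your own business values.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Toy data: X = per-query features, y = 1 if the small model was good enough.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Random split for brevity; split by user/source in production to avoid leakage.
X_train, X_val, y_train, y_val = X[:300], X[300:], y[:300], y[300:]

# Light classifier with calibrated probabilities (sigmoid = Platt scaling).
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid")
clf.fit(X_train, y_train)
p_small_ok = clf.predict_proba(X_val)[:, 1]

# Pick the routing threshold that maximizes expected utility for your λ
# (cost weight); latency term omitted for brevity. Constants are assumptions.
QUALITY, SMALL_COST, LARGE_COST, LAM = 1.0, 0.1, 1.0, 0.3

def expected_utility(thresh: float) -> float:
    route_small = p_small_ok >= thresh
    # Large model assumed to always meet the quality bar in this toy setup.
    quality = np.where(route_small, y_val * QUALITY, QUALITY)
    cost = np.where(route_small, SMALL_COST, LARGE_COST)
    return float(np.mean(quality - LAM * cost))

best_threshold = max(np.linspace(0.05, 0.95, 19), key=expected_utility)
```

Re-run the calibration and threshold sweep on fresh traffic regularly; both drift as your query mix changes.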
Modern variants: RouteLLM trains routers on preference data (A beats B on prompt X) and shows transfer: the same router can work even when “big” and “small” models change later. That’s powerful operationally. (arXiv)
Where to start if you don’t want to start from scratch: LlamaIndex’s RouterQueryEngine provides plug-ins to route among retrievers/query engines with an LLM classifier under the hood; the same patterns apply when the backends are different-sized models. IBM Research also describes production routing patterns in blog form. (docs.llamaindex.ai, IBM Research)
A mature router doesn’t only care about difficulty; it cares about risk.
Auto-escalate on sensitive domains. Health, legal, finance, or anything with irreversible consequences should default to your highest-quality path, or require human review.
Policy triggers. If the input matches restricted categories or the self-check expresses uncertainty about safety, escalate or stop.
Traceability. Log every routing decision with features and outcomes so auditors can reproduce “why small vs. large.”
Speculative decoding and CALM give you speedups inside a single model. They’re complementary: use them within your small or large model calls to reduce latency while the router reduces calls to the large model in the first place. (arXiv, NeurIPS Papers)
Prompt caching is a blunt, effective lever: if identical or near-duplicate prompts are common, cache normalized prompts/results before you even route.
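A normalized prompt cache is a few lines. This sketch keys on a hash of the lowercased, whitespace-collapsed prompt; real systems would add TTLs, size bounds, and possibly embedding-based near-duplicate matching.

```python
import hashlib
import re

_cache: dict[str, object] = {}

def normalize(prompt: str) -> str:
    # Lowercase and collapse whitespace so near-duplicate prompts collide.
    return re.sub(r"\s+", " ", prompt.strip().lower())

def cached_route(prompt: str, route_fn):
    """Check the cache before routing at all; route_fn is your router entry point."""
    key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = route_fn(prompt)
    return _cache[key]
```

Because the cache sits in front of the router, a hit skips both the routing decision and every model call.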
Specialist models (code, math, vision) are just additional arms your router can choose. The learned router becomes even more valuable as the menu grows.
1) A tiny gating rubric (drop into your service code):
If len(tokens) > 600 → LARGE.
Else call SMALL with the self-check prompt above.
Parse JSON; if confidence ≥ 4 and verifiers pass → ACCEPT.
Else → LARGE (include user query + small draft as context, clearly delimited).
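The four-step rubric above translates into a short cascade function. A sketch under assumptions: `call_small`/`call_large` stand in for your model clients, `call_small` is assumed to return the self-check JSON described earlier, and `verify` is whatever cheap check suits the task.

```python
import json

CONFIDENCE_FLOOR = 4  # from the rubric above

def route(query, call_small, call_large, verify, token_len):
    """Two-stage cascade: try SMALL, escalate to LARGE on doubt."""
    if token_len(query) > 600:
        return {"routed_to": "LARGE", "answer": call_large(query)}
    try:
        report = json.loads(call_small(query))
    except json.JSONDecodeError:
        report = None
    if (report is not None
            and not report.get("needs_escalation", True)
            and report.get("confidence", 0) >= CONFIDENCE_FLOOR
            and verify(report.get("answer", ""))):
        return {"routed_to": "SMALL", "answer": report["answer"]}
    # Escalate: pass the small model's draft as clearly delimited context.
    draft = report.get("answer", "") if report else ""
    prompt = f"{query}\n\n<small_model_draft>\n{draft}\n</small_model_draft>"
    return {"routed_to": "LARGE", "answer": call_large(prompt)}
```

Delimiting the draft explicitly (here with an assumed `<small_model_draft>` tag) keeps the large model from confusing it with the user's request.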
2) A schema to keep outputs predictable (use wherever correctness matters):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["answer", "confidence", "needs_escalation", "reasons"],
  "properties": {
    "answer": { "type": "string" },
    "confidence": { "type": "integer", "minimum": 1, "maximum": 5 },
    "needs_escalation": { "type": "boolean" },
    "reasons": {
      "type": "array",
      "items": { "type": "string", "maxLength": 80 },
      "maxItems": 4
    }
  }
}
3) A learned router’s feature sketch (concise):
x = [ token_len, digit_frac, has_math_ops, has_code_fence,
      domain_legal, domain_med, domain_fin,
      emb_hardness_score, small_confidence,
      schema_ok, retrieval_citations_ok ]
y = 1 if small_model_good_enough else 0
Train, calibrate, set threshold for expected utility. Update weekly.
It routes too much to the large model. Your threshold is conservative or the verifier is over-strict. Start by lowering the required confidence (e.g., from 4 to 3) only for low-risk intents and loosen a brittle schema (e.g., allow optional fields). Monitor quality.
It under-routes and quality drops. Add cheap signals: if digits>10 or math operators present, prefer escalation; for fact questions without citations, treat as “fail” unless the small output includes evidence.
Distribution shift. New content types appear and your learned router drifts. Detect with a simple KL divergence on feature distributions or a rising escalation rate. Retrain on the latest month of traffic.
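The KL-divergence drift check mentioned above can be done over binned feature values. A sketch with assumed parameters: the 0.1 alert threshold and 100-token bin width are placeholders to tune empirically; add-one smoothing avoids division by zero on unseen bins.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, bins) -> float:
    """KL(P||Q) over binned values with add-one smoothing."""
    p_tot = sum(p_counts.values()) + len(bins)
    q_tot = sum(q_counts.values()) + len(bins)
    kl = 0.0
    for b in bins:
        p = (p_counts.get(b, 0) + 1) / p_tot
        q = (q_counts.get(b, 0) + 1) / q_tot
        kl += p * math.log(p / q)
    return kl

def drifted(reference_values, current_values, bin_width=100, threshold=0.1) -> bool:
    """Compare a feature (e.g. token length) this week vs. a reference week."""
    bin_of = lambda v: v // bin_width
    ref = Counter(map(bin_of, reference_values))
    cur = Counter(map(bin_of, current_values))
    bins = set(ref) | set(cur)
    return kl_divergence(cur, ref, bins) > threshold
```

A rising escalation rate is an even cheaper signal; use the KL check to catch shifts that happen before quality visibly drops.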
The small model “passes” but users complain. Add a tiny verifier that’s different from the small generator (diversity helps). For math, use a deterministic tool. For facts, require citations to your internal index and check that the claims exist in the cited text.
Latency spikes from verification. Cap the number of checks per query, or move expensive checks (like multi-doc retrieval) to the large-model path only.
Goal: build a toy router on a dozen mixed queries and see routing decisions.
Create a CSV with columns: query,expected_type (rewrite|extract|math|fact),risk (low|high).
Rule-router:
If risk=high or expected_type in {math, fact} → LARGE.
Else → SMALL with the self-check JSON. Accept only if confidence ≥ 4.
Verification stubs:
For math, recompute with a calculator; mismatch → escalate.
For fact, require at least one citation string "[source: ...]"; missing → escalate.
Expected output (one line):
{ "query": "Summarize this 3-paragraph email for the team", "routed_to": "SMALL", "accepted": true, "confidence": 5, "escalated": false }
Try 3–4 obviously complex queries (multi-step math, policy-sensitive “should I take this medicine?”, multi-file code change requests). You should see most easy rewrites stay small; the hard ones escalate.
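To see decisions before wiring real APIs, the exercise's rule-router can run against a faked small model. Everything here is a toy stand-in: `fake_small_model` always returns a confident self-check, and the output fields follow the expected-output line above.

```python
import json

def fake_small_model(query: str) -> str:
    # Stand-in for a real small-model call so the toy runs offline.
    return json.dumps({"answer": f"(summary of) {query}", "confidence": 5,
                       "needs_escalation": False, "reasons": []})

def rule_route(query: str, expected_type: str, risk: str) -> dict:
    # High risk or math/fact types skip straight to the large model.
    if risk == "high" or expected_type in {"math", "fact"}:
        return {"query": query, "routed_to": "LARGE", "accepted": False,
                "escalated": True}
    report = json.loads(fake_small_model(query))
    accepted = report["confidence"] >= 4 and not report["needs_escalation"]
    return {"query": query, "routed_to": "SMALL", "accepted": accepted,
            "confidence": report["confidence"], "escalated": not accepted}

print(json.dumps(rule_route(
    "Summarize this 3-paragraph email for the team", "rewrite", "low")))
```

Loop this over your CSV rows and eyeball the decisions; then replace the fake with a real client and the verification stubs above.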
N-arm selection: add code-specialist, math-specialist, and a general large model. The router becomes a multi-class classifier with an “abstain/escalate” option.
Staged cascades: small → medium → large, with stricter checks at each stage. FrugalGPT described several cascade strategies (e.g., re-asking, paraphrasing) that can further reduce cost. (arXiv)
Multi-agent settings: routers can arbitrate among agents or tools; early 2025 work explores routing for multi-agent systems if that’s your direction. Not confirmed for mainstream use, but worth tracking. (arXiv)
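The staged small → medium → large cascade generalizes to a simple loop. This sketch assumes a convention (not any library's API) where each stage supplies a model callable returning `(answer, confidence)` and its own acceptance check, letting checks get stricter per stage.

```python
def staged_cascade(query, stages):
    """stages: non-empty list of (name, call_model, accept) ordered
    cheapest-first; each call_model returns (answer, confidence)."""
    name, answer = None, None
    for name, call_model, accept in stages:
        answer, confidence = call_model(query)
        if accept(answer, confidence):
            return name, answer
    # If nothing passed, the strongest stage's answer is the fallback.
    return name, answer
```

FrugalGPT-style refinements (re-asking, paraphrasing before escalation) slot in naturally as extra steps inside the loop.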
Metrics that matter:
Acceptance rate at small model (fraction served without escalation).
Escalation correction rate (how often the large model materially improved the answer).
Cost per accepted answer and p95 latency.
Quality KPI tied to your rubric (exactness, faithfulness, safety incidents).
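The first three metrics fall out of your routing logs directly. A minimal sketch; the log field names (`routed_to`, `accepted`, `cost`, `latency_ms`) are assumptions to adapt to your logging schema.

```python
def routing_metrics(logs: list[dict]) -> dict:
    """Compute acceptance rate, cost per accepted answer, and p95 latency."""
    small = [r for r in logs if r["routed_to"] == "SMALL"]
    accepted = [r for r in logs if r["accepted"]]
    lat = sorted(r["latency_ms"] for r in logs)
    p95 = lat[min(len(lat) - 1, int(0.95 * len(lat)))]
    return {
        "small_acceptance_rate": sum(r["accepted"] for r in small) / max(len(small), 1),
        "cost_per_accepted": sum(r["cost"] for r in logs) / max(len(accepted), 1),
        "p95_latency_ms": p95,
    }
```

Note that total cost (including escalations) is divided by accepted answers only; a router that escalates a lot shows up immediately in that ratio.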
Observability:
Log features, router decision, model outputs, verifier results, human overrides.
Sample 1–5% of “small accepted” for human review each week.
Governance:
Keep a routing policy doc (when to escalate; what counts as “sensitive”).
Provide a user-visible banner or metadata when an answer was escalated (useful for trust and for debugging).
Tooling picks:
If you want a turnkey path, RouteLLM ships training/serving/eval scaffolding for routers; LlamaIndex’s Router can auto-select among engines (extend engines to be “small” vs. “large” models). (arXiv, docs.llamaindex.ai)
Routing is a practical lever: keep routine queries on fast, cheap models, and reserve your heavy artillery for the few that warrant it. The move is simple — Decide → Probe → Verify → Escalate — and it scales from a handful of heuristics to a fully learned, cost-aware policy.
Research like FrugalGPT and RouteLLM shows this isn’t a niche trick; it’s a robust way to ride the quality/cost frontier. Pair routing with in-model speedups (CALM, speculative decoding) and you get both fewer big-model calls and faster ones when you need them. (arXiv)
Start small: wire a self-check JSON, add a couple of verifiers, and escalate on doubt. Then graduate to a learned router trained on your traffic and thresholds. You’ll see costs drop, latency stabilize, and trust go up.
Ship the rule-based router with the self-check JSON and a couple of verifiers; log everything.
Label a week of traffic and train a small logistic/GBM router; calibrate for expected utility.
Evaluate systematically using a held-out set and track cost/latency/quality; iterate the threshold monthly.
FrugalGPT: How to Use LLMs While Reducing Cost and Increasing Performance. arXiv/TMLR. Cascades & routing with strong cost reductions. (arXiv, lchen001.github.io)
Confident Adaptive Language Modeling (CALM). Early-exit decoding for dynamic compute within a model. (arXiv, NeurIPS Papers)
Fast Inference from Transformers via Speculative Decoding. Parallelize decoding using a small proposal model. (arXiv)
LlamaIndex Router Query Engine. Practical router examples and APIs. (docs.llamaindex.ai)
Industry roundup. IBM Research overview of LLM routers in production. (IBM Research)
Optional browsing list: an “awesome” list of routing libraries (RouteLLM, Semantic Router, Unify) can give you more implementation paths. Treat it as a directory, not an authority. (github.com)