
The first generation of LLM routers - the ones built between 2023 and 2025 - solved a single, well-defined problem: pick one of two models per request. RouteLLM, the seminal academic work from UC Berkeley and Anyscale (arxiv.org/abs/2406.18665), framed it as a binary classification task: send this query to GPT-4 or to Mixtral-8x7B. The numbers it produced were genuinely impressive. An 85% reduction in cost while preserving 95% of GPT-4's quality on MT-Bench. A 45% reduction on MMLU. A 35% reduction on GSM8K. A matrix-factorization router that achieved 95% of GPT-4's performance while routing only 26% of calls to GPT-4 - a 48% cost reduction. With augmented training data, the same approach pushed GPT-4 calls down to 14%, making it 75% cheaper than a random baseline.

Those numbers are still cited in 2026, and they should be. But anyone running production traffic in May 2026 has discovered that the single-router architecture - one classifier, two models, one decision per request - leaves an enormous amount of value on the table. The 2026 frontier is composing multiple routing techniques into a single gateway: a cascade for cost, a speculative race for latency, a semantic cache for repetition, and an ensemble of specialized routers - what this post calls Mixture-of-Routers (MoR) - voting on the final destination.

This is a deep technical post for engineers building or evaluating an AI gateway routing layer. We will define MoR formally, walk through five distinct LLM routing techniques for 2026, work the token math for 1M requests/month at April 2026 prices, and end with a decision matrix and a quarterly action list. If you are still running a single LLM router in front of one or two models, the rest of this post is a guided tour of what you are missing.

Why Single-Router LLM Gateways Are 2024 Architecture

A single-router gateway has three structural weaknesses, none of which were obvious in 2024 when the first production deployments shipped, and all of which are now visible at scale.

The first weakness is objective conflict. A single router has to optimize for cost, accuracy, and latency simultaneously, and these three objectives almost never agree. Routing a coding request to a cheap model saves money but risks a wrong answer. Routing a long-context summarization request to a fast model saves latency but blows the context window. The single classifier has to flatten all of these dimensions into one decision, and the dimension that loses depends entirely on how the training labels were weighted.

The second weakness is distribution shift. Routers trained on 2024 traffic - mostly chat - perform poorly on 2026 traffic, which is dominated by agentic tool calls, multi-turn coding sessions, and structured-output requests. The Berkeley team showed that routers trained on Arena-Hard generalize to MT-Bench but degrade on out-of-domain tasks. In production, every customer has a slightly different traffic distribution, and the router has to be retrained continuously to keep up. A single router is a single point of staleness.

The third weakness is no internal hedging. If the cheap model fails - hallucinates, refuses, returns malformed JSON - the single router has no second opinion. The request either succeeds or it does not. Cascades, speculative races, and MoR all build internal hedging into the routing layer itself, which means the gateway can recover from a wrong first decision without the application code ever knowing.

The fix is not a smarter single router. The fix is composition: stack multiple routing techniques as middleware, each optimizing for a different objective, and let the gateway pick the right technique per request class.

The 5 Routing Techniques in 2026

Five techniques have emerged as the building blocks of a modern AI gateway. Each has a different cost-quality-latency profile, and a production gateway typically composes three or four of them.

Technique                     Primary objective              Cost reduction vs static strong model   Added latency                Implementation complexity
Single Router (RouteLLM)      Cost at fixed quality          48-75%                                  +5-15 ms                     Low
Cascade (cheap to strong)     Cost with quality floor        70-80%                                  +0-400 ms (when cascading)   Medium
Speculative-Race              Latency with cost ceiling      -5 to +20% (slightly more expensive)    -30 to -60% (faster)         Medium
Mixture-of-Routers (MoR)      Multi-objective Pareto front   75-85%                                  +10-25 ms                    High
Semantic-Cache-Aware Routing  Cost on repeat traffic         30-95% (depends on hit rate)            -200 to +5 ms                Medium

Sources: RouteLLM (arxiv.org/abs/2406.18665), LMSYS Arena Leaderboard, llm-stats.com/llm-updates, LMCouncil Benchmarks, Swfte production telemetry (April 2026).

The next five sections walk through each technique in detail. The math, prices, and latency numbers are calibrated to April 2026 - the buildfastwithai best AI models leaderboard and llm-stats updates are good cross-references for the prices that follow.

Technique 1: Single Router (RouteLLM Baseline)

The single-router pattern is the academic baseline and still the right starting point for new deployments. A classifier - typically a small fine-tuned BERT or a matrix-factorization model - takes the user prompt as input and outputs a probability that the strong model is needed. A threshold turns that probability into a binary routing decision.

prompt -> [router classifier] -> P(strong_model_needed)
                                       |
                          P >= tau  ----+---- P < tau
                              |              |
                          strong            cheap
                          model             model
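
A minimal sketch of that threshold decision, assuming a hypothetical scorer that returns P(strong_model_needed); the threshold value, model names, and toy scorer below are illustrative, not RouteLLM's actual implementation:

TAU = 0.4  # routing threshold, tuned on held-out traffic

def route_single(prompt: str, score_prompt) -> str:
    """Return the destination model for one request."""
    p_strong = score_prompt(prompt)  # P(strong model is needed), in [0, 1]
    return "strong-model" if p_strong >= TAU else "cheap-model"

# Example: a toy scorer that flags long or code-heavy prompts as hard.
def toy_scorer(prompt: str) -> float:
    hard_signals = sum(kw in prompt.lower() for kw in ("prove", "refactor", "debug"))
    return min(1.0, 0.2 + 0.1 * (len(prompt) // 500) + 0.3 * hard_signals)

print(route_single("Summarize this paragraph.", toy_scorer))                 # cheap-model
print(route_single("Debug and refactor this async scheduler.", toy_scorer))  # strong-model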

RouteLLM's contribution was empirical: across four routing methods (similarity-weighted ranking, matrix factorization, BERT classifier, causal LLM classifier), the matrix-factorization variant hit 95% of GPT-4's MT-Bench score while sending only 26% of queries to GPT-4. That is a 48% cost reduction at near-flagship quality on chat. With augmented training data drawn from a stronger judge, the same router got down to 14% GPT-4 calls - 75% cheaper than random routing at the same quality bar.

The single-router pattern has three limitations that motivated everything that came after it. It optimizes one objective (cost-at-quality), it makes one decision per request (no recovery), and it is brittle to distribution shift (the threshold drifts as traffic changes).

For internal context on how a single-router approach maps to a multi-model deployment, see Intelligent LLM Routing and AI Model Routing Cost Optimization.

Technique 2: Cascade (Cheap to Strong)

A cascade flips the routing question. Instead of asking "which model should answer this," it asks "did the cheap model's answer pass quality gates, and if not, escalate." The pattern is sequential.

prompt -> cheap model -> answer -> [verifier] -> pass? -> return
                                      |
                                    fail
                                      |
                                  mid model -> [verifier] -> pass? -> return
                                                  |
                                                fail
                                                  |
                                              strong model -> return

The verifier is the interesting design choice. Three options dominate in 2026: a self-consistency check (ask the cheap model twice, escalate on disagreement), a small dedicated judge model (DeepSeek V4 Flash at $0.14/$0.28 per million tokens is the popular choice), and a structured-output schema validator (the cheapest verifier - if the JSON is malformed, escalate).
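
A sketch of the cascade loop under those assumptions; call_model and verify are placeholders for whatever client and verifier (judge model, schema check, self-consistency) the gateway actually uses:

TIERS = ["deepseek-v4-flash", "gpt-5.5", "claude-opus-4.7"]

def cascade(prompt: str, call_model, verify) -> tuple[str, str]:
    """Walk the tiers cheap-to-strong; return (model, answer) at the first pass."""
    for model in TIERS[:-1]:
        answer = call_model(model, prompt)
        if verify(prompt, answer):  # e.g. schema validation or a judge-model score
            return model, answer
    # The final tier is trusted unconditionally; there is nothing left to escalate to.
    model = TIERS[-1]
    return model, call_model(model, prompt)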

Cascades shine when the traffic is heavily skewed toward easy requests. If 60% of your traffic is trivial - simple questions, formatting, classification - a cascade will resolve those at the cheap-model price and only escalate the 40% that genuinely need help. The cost reduction compounds because the verifier itself is cheap and the escalation is rare.

The cost of a cascade is latency on the hard tail. A request that escalates twice pays the latency of the cheap model + verifier + mid model + verifier + strong model. That is typically 600-900 ms of added latency on the tail, which is unacceptable for interactive UIs but fine for background agents.

Technique 3: Speculative-Race (Latency Optimized)

Speculative-race is the latency-optimized inverse of a cascade. Instead of running models sequentially, the gateway fires the same prompt at the cheap and strong model simultaneously, and a verifier picks the first acceptable answer.

                    +-> cheap model  ---+
prompt -> dispatch -+                   +-> [verifier] -> first acceptable
                    +-> strong model ---+

The cheap model usually returns first. If its answer passes the verifier, the strong model's call is canceled (or completed and discarded - the cancellation savings depend on the provider). If the cheap answer fails, the strong answer is already most of the way through generation by the time the verifier rejects, so the user sees the strong answer with minimal added latency.
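
A minimal asyncio sketch of the race; call_model and verify are placeholders (call_model is assumed to be an async client), and the model names reuse the ones from the cost section below:

import asyncio

async def speculative_race(prompt: str, call_model, verify) -> str:
    """Fire cheap and strong concurrently, return the first acceptable answer."""
    cheap = asyncio.create_task(call_model("deepseek-v4-flash", prompt))
    strong = asyncio.create_task(call_model("gpt-5.5", prompt))

    cheap_answer = await cheap          # the cheap model usually finishes first
    if verify(prompt, cheap_answer):
        strong.cancel()                 # billing for the cancelled call varies by provider
        return cheap_answer
    return await strong                 # the strong call is already well underway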

Speculative-race trades cost for latency. You are paying for two inferences on every speculatively-raced request, but the user-perceived latency drops by 30-60% versus a serial cascade. The right traffic for speculative-race is interactive: chat UIs, code completion, voice agents - workloads where the p95 latency matters more than the per-request cost.

A 2026 refinement is adaptive speculation: the gateway only speculates when a confidence-prediction model thinks the cheap answer might fail. On easy requests, only the cheap model fires; on borderline requests, both fire. This pulls the average cost back down without giving up the latency win.

Technique 4: Mixture-of-Routers (MoR)

This is the framework this post is named after. The single-router pattern has one classifier making one decision. Mixture-of-Routers has an ensemble of specialized routers, each optimizing one objective, voting on the final destination.

Formal definition

Let R = {r_1, r_2, ..., r_k} be a set of k specialized routers. Each router r_i is a function from a request x to a probability distribution over a candidate model set M = {m_1, ..., m_n}:

r_i: X -> Delta(M)

Each router optimizes a different objective. In the canonical three-router MoR:

  • r_cost minimizes expected dollar cost subject to a quality floor
  • r_accuracy maximizes expected quality subject to a cost ceiling
  • r_latency minimizes expected p95 latency subject to a quality floor

The MoR aggregator combines the per-router distributions into a final routing decision. The simplest aggregator is a weighted vote:

P_final(m | x) = sum_i  w_i(x) * r_i(m | x)

where w_i(x) is the per-request weight assigned to router i. The weights themselves are a function of the request - a long-context request gets higher weight on r_accuracy, an interactive chat request gets higher weight on r_latency, a batch backfill gets higher weight on r_cost.

Worked numerical example

Suppose three routers vote on a request, with two candidate models (Cheap and Strong):

r_cost      ->  P(Cheap) = 0.85,  P(Strong) = 0.15
r_accuracy  ->  P(Cheap) = 0.30,  P(Strong) = 0.70
r_latency   ->  P(Cheap) = 0.90,  P(Strong) = 0.10

A static aggregator with equal weights w = (1/3, 1/3, 1/3) produces:

P_final(Cheap)  = (0.85 + 0.30 + 0.90) / 3 = 0.683
P_final(Strong) = (0.15 + 0.70 + 0.10) / 3 = 0.317

The request routes to Cheap.

A context-aware aggregator that knows this is a coding request shifts the weights to w = (0.2, 0.6, 0.2):

P_final(Cheap)  = 0.2*0.85 + 0.6*0.30 + 0.2*0.90 = 0.530
P_final(Strong) = 0.2*0.15 + 0.6*0.70 + 0.2*0.10 = 0.470

Still Cheap, but the margin is much narrower. A small calibration nudge - say, a rule that escalates to Strong whenever the Cheap margin is under 0.1 - flips the decision. This is the central MoR property: the routing decision is sensitive to context, not just to prompt content, because each router exposes its specialized vote for the aggregator to weight.
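
A small sketch that reproduces the vote arithmetic above, with the margin-based escalation rule included; the router outputs and weights are the illustrative numbers from this example, not trained values:

def aggregate(votes: dict[str, dict[str, float]], weights: dict[str, float]) -> dict[str, float]:
    """P_final(m | x) = sum_i w_i(x) * r_i(m | x)."""
    models = next(iter(votes.values())).keys()
    return {m: sum(weights[r] * votes[r][m] for r in votes) for m in models}

votes = {
    "r_cost":     {"Cheap": 0.85, "Strong": 0.15},
    "r_accuracy": {"Cheap": 0.30, "Strong": 0.70},
    "r_latency":  {"Cheap": 0.90, "Strong": 0.10},
}

equal  = aggregate(votes, {"r_cost": 1/3, "r_accuracy": 1/3, "r_latency": 1/3})
coding = aggregate(votes, {"r_cost": 0.2, "r_accuracy": 0.6, "r_latency": 0.2})
print(equal)   # {'Cheap': 0.683..., 'Strong': 0.316...}
print(coding)  # {'Cheap': 0.53, 'Strong': 0.47}

def pick(dist: dict[str, float], margin_floor: float = 0.1) -> str:
    """Escalate to Strong whenever the Cheap margin is thin."""
    return "Strong" if dist["Cheap"] - dist["Strong"] < margin_floor else "Cheap"

print(pick(equal))   # Cheap  (margin ~0.37)
print(pick(coding))  # Strong (margin 0.06 < 0.1)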

Why MoR beats single-router

Three reasons. First, objective decomposition: each router is trained on its own objective with its own labels, which is a much easier learning problem than a single multi-objective router. Second, per-request reweighting: the gateway can shift weights based on tenant SLAs, time-of-day cost budgets, or A/B test arms, without retraining any router. Third, graceful degradation: if one router goes stale, the other two still produce a sensible vote, and a stale-detection module can drop the bad router's weight to zero until it is retrained.

The cost is implementation complexity. MoR requires k routers in the inference path, k training pipelines, and an aggregator that itself can be tuned. The overhead is typically 10-25 ms per request - small relative to LLM latency but not free.

Technique 5: Semantic-Cache-Aware Routing

The fifth technique sidesteps inference entirely on a fraction of traffic. A semantic cache stores past (prompt, response) pairs keyed by embedding similarity. On a new request, the gateway computes the embedding, queries the cache, and if a sufficiently similar past prompt exists with a still-valid response, returns the cached answer.

prompt -> [embed] -> [vector search] -> sim >= threshold ?
                                           |
                              yes  --------+--------  no
                               |                       |
                         return cached            run router
                                                       |
                                                 store new pair

Semantic caching has a cost-and-latency profile unlike any other technique. A cache hit costs effectively zero (just the embedding + vector search, typically 5-20 ms) and serves the response 200-2000 ms faster than any inference path. A cache miss adds the embedding cost (which is tiny) plus the vector-search latency (a few ms) on top of whatever the router decides next.
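
A minimal sketch of the hit/miss path, using a brute-force cosine scan in place of a real embedding model and ANN index; embed and route_and_call are placeholders:

import math

SIM_THRESHOLD = 0.95  # per-tenant in production; see below

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def handle(prompt: str, embed, cache: list[tuple[list[float], str]], route_and_call) -> str:
    vec = embed(prompt)
    # Cache lookup: return the stored response of the closest past prompt above threshold.
    best = max(cache, key=lambda entry: cosine(vec, entry[0]), default=None)
    if best is not None and cosine(vec, best[0]) >= SIM_THRESHOLD:
        return best[1]                      # cache hit: no inference at all
    response = route_and_call(prompt)       # cache miss: fall through to the router
    cache.append((vec, response))           # store the new pair for future hits
    return response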

The hit-rate distribution on real traffic is bimodal. Public-facing chat sees 5-15% semantic-cache hit rates - users phrase things differently. Internal tools and API integrations see 40-70% hit rates - the same support macros, the same SQL templates, the same JSON schemas appear over and over. The 2026 best practice is to provision a per-tenant semantic cache with a per-tenant similarity threshold, because the right threshold is workload-dependent.

The danger of a semantic cache is stale answers. If a customer's product price changed, the cache must invalidate. The mitigations are TTLs (default 24 hours), explicit invalidation hooks tied to source-of-truth changes, and a "cache off" flag for high-stakes responses (financial, medical, legal).

For a deeper dive on stacking semantic-cache with multi-provider routing in production, see Multi-Provider Routing with Swfte Connect.

Worked Token Math: 1M Requests/Month

This section grounds the techniques in actual dollar figures. The setup: 1,000,000 requests per month, traffic mix 60% trivial / 30% mid-difficulty / 10% complex, average 800 input tokens and 400 output tokens per request. Prices are April 2026, sourced from llm-stats.com/llm-updates and the buildfastwithai May 2026 leaderboard.

Price table (April 2026, per 1M tokens)

Model               Input    Output   Quality tier
Claude Opus 4.7     $15.00   $75.00   Flagship
GPT-5.5             $5.00    $15.00   Strong
Gemini 3.1 Pro      $3.50    $10.50   Strong
DeepSeek V4 Pro     $1.74    $3.48    Mid
DeepSeek V4 Flash   $0.14    $0.28    Cheap

Per-request cost

Cost per request = (800 / 1e6) * input_price + (400 / 1e6) * output_price

Claude Opus 4.7   = 0.0008 * 15.00 + 0.0004 * 75.00 = $0.012    + $0.030    = $0.042
GPT-5.5           = 0.0008 *  5.00 + 0.0004 * 15.00 = $0.004    + $0.006    = $0.010
Gemini 3.1 Pro    = 0.0008 *  3.50 + 0.0004 * 10.50 = $0.0028   + $0.0042   = $0.0070
DeepSeek V4 Pro   = 0.0008 *  1.74 + 0.0004 *  3.48 = $0.00139  + $0.00139  = $0.00279
DeepSeek V4 Flash = 0.0008 *  0.14 + 0.0004 *  0.28 = $0.000112 + $0.000112 = $0.000224
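
The same arithmetic as a small script, using the prices from the table above:

PRICES = {  # (input $/1M tokens, output $/1M tokens), April 2026
    "claude-opus-4.7":   (15.00, 75.00),
    "gpt-5.5":           (5.00, 15.00),
    "gemini-3.1-pro":    (3.50, 10.50),
    "deepseek-v4-pro":   (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
}

def cost_per_request(model: str, in_tokens: int = 800, out_tokens: int = 400) -> float:
    inp, outp = PRICES[model]
    return in_tokens / 1e6 * inp + out_tokens / 1e6 * outp

for m in PRICES:
    print(f"{m:18s} ${cost_per_request(m):.6f}")
# claude-opus-4.7    $0.042000
# gpt-5.5            $0.010000
# gemini-3.1-pro     $0.007000
# deepseek-v4-pro    $0.002784
# deepseek-v4-flash  $0.000224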

Static baselines

If every request goes to one model:

Static Claude Opus 4.7   1,000,000 * $0.042    = $42,000 / month
Static GPT-5.5           1,000,000 * $0.010    = $10,000 / month
Static Gemini 3.1 Pro    1,000,000 * $0.0070   = $7,000  / month
Static DeepSeek V4 Pro   1,000,000 * $0.00279  = $2,790  / month

These are the baselines. Real production traffic at this scale is typically routed, not static, but the baseline answers the question "what does it cost to be safe."

Cascade: DeepSeek Flash to GPT-5.5 to Claude Opus

A three-tier cascade with conservative escalation rates. Assume the verifier is cheap (a small judge model adding $0.0002 per verified request) and the escalation rates are 25% from Flash to GPT-5.5 (verifier rejects 25% of cheap answers) and 30% from GPT-5.5 to Claude Opus (verifier rejects 30% of mid answers on the requests that already escalated).

Stage 1: 1,000,000 * $0.000224 = $224       (every request hits Flash)
Stage 1 verifier: 1,000,000 * $0.0002 = $200
Stage 2: 250,000 * $0.010      = $2,500     (25% escalate to GPT-5.5)
Stage 2 verifier: 250,000 * $0.0002 = $50
Stage 3: 75,000 * $0.042       = $3,150     (30% of those escalate to Opus)

Total cascade cost = $224 + $200 + $2,500 + $50 + $3,150 = $6,124 / month
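
The same cascade arithmetic as a script, using the per-request costs computed earlier:

REQUESTS = 1_000_000
COST = {"flash": 0.000224, "gpt55": 0.010, "opus": 0.042, "verifier": 0.0002}

stage1 = REQUESTS
stage2 = int(stage1 * 0.25)   # 250,000 requests escalate to GPT-5.5
stage3 = int(stage2 * 0.30)   #  75,000 requests escalate to Claude Opus

total = (stage1 * (COST["flash"] + COST["verifier"])
         + stage2 * (COST["gpt55"] + COST["verifier"])
         + stage3 * COST["opus"])
print(f"${total:,.0f} / month")   # $6,124 / month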

That is an 85% reduction versus static Claude Opus 4.7, and a 39% reduction versus static GPT-5.5. The catch is the latency tail: 7.5% of requests pay all three stages of latency.

Mixture-of-Routers

MoR sends each request to the right model on the first try. With three routers voting and a context-aware aggregator, assume the routing distribution looks like:

60% trivial  -> 95% routed to DeepSeek V4 Flash, 5% to Pro
30% mid      -> 70% routed to DeepSeek V4 Pro, 25% to GPT-5.5, 5% to Opus
10% complex  -> 20% routed to GPT-5.5, 50% to Gemini 3.1 Pro, 30% to Opus

The expected per-request cost works out to:

Trivial bucket  (600,000 reqs): 0.95 * $0.000224 + 0.05 * $0.00279
                              = $0.000213 + $0.000140 = $0.000353
                              -> 600,000 * $0.000353 = $211.80

Mid bucket      (300,000 reqs): 0.70 * $0.00279 + 0.25 * $0.010 + 0.05 * $0.042
                              = $0.001953 + $0.0025 + $0.0021 = $0.006553
                              -> 300,000 * $0.006553 = $1,965.90

Complex bucket  (100,000 reqs): 0.20 * $0.010 + 0.50 * $0.0070 + 0.30 * $0.042
                              = $0.002 + $0.0035 + $0.0126 = $0.0181
                              -> 100,000 * $0.0181 = $1,810

Router overhead: 1,000,000 * $0.0001 = $100  (3 small routers, batched)

Total MoR cost = $211.80 + $1,965.90 + $1,810 + $100 = $4,087.70 / month
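
The same bucket arithmetic as a script (the cent-level difference from the hand math above is intermediate rounding):

PER_REQ = {"flash": 0.000224, "pro": 0.00279, "gpt55": 0.010, "gemini": 0.0070, "opus": 0.042}

buckets = {  # bucket: (requests, {model: share of that bucket})
    "trivial": (600_000, {"flash": 0.95, "pro": 0.05}),
    "mid":     (300_000, {"pro": 0.70, "gpt55": 0.25, "opus": 0.05}),
    "complex": (100_000, {"gpt55": 0.20, "gemini": 0.50, "opus": 0.30}),
}

total = sum(n * sum(share * PER_REQ[m] for m, share in mix.items())
            for n, mix in buckets.values())
total += 1_000_000 * 0.0001   # router overhead, 3 small routers batched
print(f"${total:,.2f} / month")   # $4,087.28 / month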

That is a 90% reduction versus static Claude Opus 4.7 and a 33% reduction versus the cascade. The win comes from avoiding the cascade's redundant Stage 1 work on requests that were always going to need GPT-5.5 or Opus.

MoR + Semantic Cache

Layering a 30% semantic-cache hit rate on top of MoR (a reasonable internal-tool number; public chat would be 10-15%):

Cached requests:    300,000 * $0.000020 = $6     (embedding + vector search)
Routed requests:    700,000 * $4,087.70/1,000,000 = $2,861.39
Cache infra cost:   ~$400 / month (vector store + embeddings)

Total MoR+cache = $6 + $2,861 + $400 = $3,267 / month

A 92% reduction versus static Claude Opus 4.7. At this point the gateway's own infrastructure is becoming a meaningful share of total cost, which is the inverse of the 2024 problem.

Cost summary at three traffic mixes

The 60/30/10 mix above is the canonical "consumer chat" mix. Two other mixes worth modeling: an "agentic" mix (30/40/30 - more complex work) and a "support" mix (80/15/5 - heavily skewed easy).

Strategy                     Consumer (60/30/10)   Agentic (30/40/30)   Support (80/15/5)
Static Claude Opus 4.7       $42,000               $42,000              $42,000
Static GPT-5.5               $10,000               $10,000              $10,000
Cascade                      $6,124 (-85%)         $11,400 (-73%)       $2,800 (-93%)
MoR                          $4,088 (-90%)         $9,600 (-77%)        $1,950 (-95%)
MoR + Semantic Cache (30%)   $3,267 (-92%)         $7,500 (-82%)        $850 (-98%)

(All percentages versus static Claude Opus 4.7. Reduction widens on support traffic because the easy bucket dominates and a tiny model handles it.)

ASCII chart: cost reduction

Monthly Cost vs Static Claude Opus 4.7 (1M requests; mix 60/30/10)
Static Claude Opus 4.7    ##################################  $42,000   baseline
Static GPT-5.5            ########                            $10,000   -76%
Cascade                   #####                               $6,124    -85%
Mixture-of-Routers (MoR)  ###                                 $4,088    -90%
MoR + Semantic Cache      ##                                  $3,267    -92%
Source: Swfte MoR benchmarks, April 2026 (illustrative)

Latency Overhead per Technique

Cost is one axis; perceived latency is the other. Each technique has a distinct latency profile - mean, p95, and p99 all matter, but for interactive workloads p95 is the practical SLA.

Technique             Mean overhead                    p95 overhead               p99 overhead                User-visible latency direction
Single Router         +5-15 ms                         +15 ms                     +25 ms                      Slightly worse
Cascade               +0 ms (when Stage 1 succeeds)    +400 ms (one escalation)   +900 ms (two escalations)   Worse on tail
Speculative-Race      -200 to -600 ms                  -300 ms                    -500 ms                     Significantly better
MoR                   +10-25 ms                        +30 ms                     +45 ms                      Slightly worse
Semantic-Cache hit    -200 to -2000 ms                 n/a (hits are ~10 ms)      n/a                         Dramatically better
Semantic-Cache miss   +5-10 ms                         +15 ms                     +25 ms                      Slightly worse

Sources: Swfte production telemetry (April 2026), LMCouncil benchmarks. Numbers vary by region, model provider, and traffic shape.

ASCII chart: latency overhead

p95 Routing Overhead vs Direct Model Call (lower is better)
Speculative-Race (cache miss)  -300 ms  <--- faster
Semantic-Cache hit             ~10 ms total (not overhead - it replaces inference)
Single Router                  +15 ms   #
MoR                            +30 ms   ##
Cascade (1 escalation, p95)    +400 ms  ########################
Cascade (2 escalations, p99)   +900 ms  ##################################################
Source: Swfte MoR benchmarks, April 2026 (illustrative)

The takeaway is that cascades are not interactive. A cascade is a great fit for a backfill job, a batch summarization, or a background agent. It is a poor fit for a chat UI where p95 is a latency SLA. Speculative-race is the interactive answer when you want most of the cost benefit without the tail latency.

Decision Matrix by Workload Type

Choosing a technique is not about which is "best" - the answer depends on the workload. The table below maps common 2026 workloads to recommended techniques.

Workload                         Recommended primary technique        Recommended secondary           Avoid
Consumer chat UI                 Speculative-Race + Semantic Cache    Single Router                   Cascade (tail latency)
Agentic coding (interactive)     Speculative-Race                     MoR                             Cascade
Agentic coding (background)      Cascade                              MoR                             Static strong model
Customer support macros          Semantic Cache + Single Router       Cascade                         Static strong model
Document summarization (batch)   Cascade                              MoR                             Speculative-Race (waste)
Structured-output extraction     Cascade with schema verifier         MoR                             Speculative-Race (waste)
Voice agent (real-time)          Speculative-Race + Semantic Cache    Single Router                   Cascade
RAG / knowledge-base Q&A         MoR + Semantic Cache                 Single Router                   Static strong model
Long-context analysis            MoR (accuracy-weighted)              Cascade                         Speculative-Race (cost)
Compliance / legal review        Static strong + MoR fallback         Cascade with strong verifier    Cheap-only routing
Internal SQL / DSL generation    Cascade with execution verifier      MoR                             Static strong
Translation / localization       MoR + Semantic Cache                 Single Router                   Cascade

The pattern is consistent: interactive workloads prefer speculative-race (latency wins), batch workloads prefer cascade (cost wins), and complex multi-objective workloads prefer MoR. Semantic cache is additive on every workload that has any prompt repetition.

For the broader strategic context on multi-model routing in 2026, see the aithority overview of multi-model routing and the LMSYS Arena leaderboard for current model rankings. For a Swfte-specific analysis of recent leaderboard movement, see LMSYS Arena Leaderboard May 2026.

Implementation Patterns and Gotchas

A handful of implementation details determine whether a routing layer is a 90% cost reduction in production or a 90% cost reduction on a dashboard while the actual bill stays flat. The list below consolidates lessons from two dozen production deployments observed between Q1 2025 and Q2 2026.

Tokenizer drift. Different model families count tokens differently. A prompt that is 800 tokens in GPT-5.5's tokenizer might be 850 in Claude's and 920 in DeepSeek's. The router's cost estimate has to use the actual destination tokenizer, not a single canonical one, or the cost projections will be off by 5-15%.

Output-length skew. Cheap models often produce longer outputs than strong models for the same prompt - they are less concise. The output-token cost on a cheap-routed request can be 30-50% higher than naive math suggests. Calibrate from real traffic, not from input prompts alone.

Verifier asymmetry. The cascade verifier is the most underspecified component in most deployments. A loose verifier (e.g., "any non-empty answer passes") destroys the quality floor. A tight verifier (e.g., schema validation plus three self-consistency passes) inflates verification cost above the savings. Tune the verifier on a held-out set with quality-graded ground truth.

Provider rate-limiting. When the cascade or MoR escalates a burst of requests to the strong model, the strong provider's per-minute rate limit becomes the bottleneck. Provision capacity at the burst rate, not the average rate, and have a fallback to a second strong provider (Gemini 3.1 Pro is the popular alternate to GPT-5.5).

Router model staleness. Routers trained on March 2026 traffic will degrade by July 2026 because the model landscape itself shifts - prices change, new models launch, capability rankings move. Schedule a quarterly retrain. The training data should come from the gateway's own logs, with a quality-judge model providing labels.

Streaming and cancellation. Speculative-race only works if both inference calls can be cancelled. Some providers charge for partial completions; some do not. Read the contracts carefully or the speculative cost will exceed the latency benefit.

Cold-start cost. A new tenant has no semantic-cache history and no router-calibration data. The first 10,000 requests from a new tenant are typically routed to a conservative default (the strong model) until enough traffic accumulates to calibrate. Budget for this.

Observability. Without per-technique logs, the gateway is opaque. At minimum, log per-request: the router's decision probability, the verifier's pass/fail, the cache hit/miss, the destination model, the actual cost, and the actual latency. Aggregate weekly to detect drift.

Composability: Stacking Techniques

The 2026 production gateway is rarely one technique. It is a composition. The Swfte Gateway implements MoR, Speculative Cascade, and Semantic-Cache as composable middleware - each technique is a layer that the request passes through, and the layers are reorderable per route.

A typical production stack looks like this:

request
  -> [auth + tenant resolution]
  -> [semantic cache lookup]      (if hit, return)
  -> [MoR aggregator]              (computes destination model)
  -> [speculative dispatcher]      (if interactive route)
       or [cascade dispatcher]     (if batch route)
  -> [verifier]
  -> [response]
  -> [semantic cache write]
  -> [observability log]

Each middleware has a clear contract: input is a routing context, output is either a final response (cache hit, accepted answer) or a routing decision plus a partial response. The composability makes it possible to A/B test individual techniques without touching the others - turn the semantic cache on for tenant A, leave it off for tenant B, compare cost and quality, decide.
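
A minimal sketch of that contract, assuming a simple context object and short-circuit-on-response semantics; the layer names mirror the stack diagram, but all internals are placeholders rather than the Swfte Gateway's actual interfaces:

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Ctx:
    prompt: str
    tenant: str
    interactive: bool
    destination: Optional[str] = None   # set by the routing layers
    response: Optional[str] = None      # set by cache hit or dispatcher
    log: dict = field(default_factory=dict)

Middleware = Callable[[Ctx], Ctx]

def compose(layers: list[Middleware]) -> Middleware:
    """Run layers in order; stop early once a layer produced a final response."""
    def run(ctx: Ctx) -> Ctx:
        for layer in layers:
            ctx = layer(ctx)
            if ctx.response is not None:
                break   # e.g. a semantic-cache hit short-circuits the rest
        return ctx
    return run

# Per-route stacks are just different orderings of the same layers, e.g.:
# interactive_stack = compose([cache_lookup, mor_aggregate, speculative_dispatch, verify, log_write])
# batch_stack       = compose([cache_lookup, mor_aggregate, cascade_dispatch, verify, log_write])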

This pattern is also why the multi-provider routing post is worth reading alongside this one - composability across providers is the same problem as composability across techniques.

What to Do This Quarter

Seven concrete actions for engineering teams running an LLM gateway in Q2-Q3 2026. These are sequenced from cheapest-to-implement to most-impactful.

  1. Instrument the existing gateway. Before adding any technique, log per-request: input tokens, output tokens, latency, cost, destination model, and a quality proxy (verifier score, user feedback, or judge-model rating). Two weeks of clean logs is more valuable than two months of optimization on bad data.

  2. Add a semantic cache for internal traffic. Internal tools and API integrations have 40-70% repetition rates. A semantic cache typically pays back its infrastructure cost in the first week. Start with a 24-hour TTL and a 0.95 cosine-similarity threshold; tune from there.

  3. Implement a single-router baseline if you do not have one. A RouteLLM-style matrix-factorization router with two destination models (one cheap, one strong) is a 40-60% cost reduction on most traffic. It is also the easiest technique to debug and the foundation for MoR.

  4. Layer a cascade on batch and background routes. For any traffic class that does not have a sub-second SLA, a three-tier cascade with a cheap verifier is an additional 30-50% cost reduction on top of single-routing. Do not put cascades on interactive routes.

  5. Build the second and third routers for MoR. Once the single router is calibrated and the cascade is stable, train a latency-router and an accuracy-router. The aggregator can start as a static weighted vote and evolve to per-request weighting once you have enough labeled data.

  6. Add speculative-race on the interactive routes. With cost already reduced 70-80% from steps 2-4, you have headroom to spend a fraction of it on latency. Speculative-race on chat and voice cuts p95 by 30-50% and is usually a measurable conversion lift.

  7. Schedule a quarterly retrain. Add a calendar item: every 90 days, retrain the routers on the previous quarter's traffic. This is the single most-skipped step in production deployments and the one that quietly erodes savings over time.

The composition pattern - cache, then route, then cascade or speculate, then verify, then log - is the architecture that makes a 90% cost reduction durable rather than a one-time savings. The single-router gateway was the right architecture for 2024. The composed, multi-technique gateway is the right architecture for 2026.
