Evolution Lab | Advanced | 2 minute read | Updated 2026-06-26 UTC

Routing experiments

Design experiments that test whether routers, cascades, and coalitions improve net viability over single-model baselines.

Research status: Experiment design pattern | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 3

The router is part of the organism-like system

A population of specialists is only useful if the router can select the right capability at the right cost. Routing experiments should evaluate the router and specialists together.

Baselines

Compare at least four conditions:

Condition             | Purpose
Single champion       | Simplest production baseline
Static rules          | Interpretable routing without learned policy
Learned router        | Adaptive selection based on task features
Cascade or coalition  | Multiple stages or multiple specialists

Metrics

Measure task quality, abstention quality, p50 and p95 latency, cost per request, escalation rate, wrong-route rate, confidence calibration, and incident rate. Do not accept a router that raises accuracy slightly while making failures harder to explain.
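Several of these metrics can be computed from a per-request log. The following is a minimal Python sketch, assuming each replayed request is stored as a record; the `RequestRecord` fields and the `summarize` function are hypothetical names chosen for illustration, and abstention quality, calibration, and incident rate are left out.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    # All field names are assumptions made for this sketch.
    latency_ms: float
    cost: float
    escalated: bool   # request was handed to a later stage
    routed_to: str    # specialist the router chose
    best_route: str   # specialist a reviewer judged to be the right one
    correct: bool     # final answer was judged correct


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    n = len(records)
    if n < 2:
        raise ValueError("need at least two records to compute percentiles")
    # quantiles(..., n=100) returns the 99 cut points p1..p99.
    cuts = quantiles([r.latency_ms for r in records], n=100, method="inclusive")
    return {
        "quality": sum(r.correct for r in records) / n,
        "p50_latency_ms": cuts[49],
        "p95_latency_ms": cuts[94],
        "cost_per_request": sum(r.cost for r in records) / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "wrong_route_rate": sum(r.routed_to != r.best_route for r in records) / n,
    }
```

Reporting these values side by side for each condition makes it easier to see when a small accuracy gain is paid for with a higher wrong-route rate or a worse latency tail.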

Experiment pseudocode

FUNCTION compare_routing_strategies(task_stream, strategies, policy)
    results <- []
    FOR strategy IN strategies
        replay <- REPLAY_TASK_STREAM(task_stream, strategy)
        score <- SCORE_REPLAY(replay, policy.metrics)
        results.APPEND({strategy: strategy.name, score: score})
    END FOR
    RETURN RANK_BY_NET_VIABILITY(results)
END FUNCTION
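
The pseudocode above could be realized in Python roughly as follows. This is a sketch under stated assumptions: the guide does not define net viability, so `net_viability` here is a placeholder (mean quality minus a weighted mean cost), a single `cost_weight` stands in for `policy.metrics`, and the `Strategy` type and the task and outcome dictionaries are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Task = dict      # hypothetical task shape, e.g. {"difficulty": float}
Outcome = dict   # hypothetical outcome shape: {"correct": bool, "cost": float}


@dataclass
class Strategy:
    name: str
    handle: Callable[[Task], Outcome]


def net_viability(outcomes: list[Outcome], cost_weight: float) -> float:
    # Assumed definition for illustration: mean quality minus weighted mean cost.
    n = len(outcomes)
    quality = sum(o["correct"] for o in outcomes) / n
    cost = sum(o["cost"] for o in outcomes) / n
    return quality - cost_weight * cost


def compare_routing_strategies(task_stream, strategies, cost_weight=0.1):
    # Materialize the stream so every strategy replays identical tasks.
    tasks = list(task_stream)
    results = []
    for strategy in strategies:
        replay = [strategy.handle(task) for task in tasks]
        results.append(
            {"strategy": strategy.name, "score": net_viability(replay, cost_weight)}
        )
    # Highest net viability first.
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Replaying the same materialized task list for every strategy keeps the comparison fair; a live stream consumed once per strategy would give each condition different inputs.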

Wrong-route analysis

Wrong-route cases are especially valuable because they show where capability contracts are ambiguous. Label each wrong route with exactly one cause: classification error, missing capability, insufficient evidence, overload fallback, or contract mismatch.
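One way to keep these labels consistent is a closed enumeration plus a tally. The sketch below is illustrative only; `WrongRouteCause` and `tally_causes` are hypothetical names, and the labels themselves are assumed to come from human review.

```python
from collections import Counter
from enum import Enum


class WrongRouteCause(Enum):
    CLASSIFICATION_ERROR = "classification error"
    MISSING_CAPABILITY = "missing capability"
    INSUFFICIENT_EVIDENCE = "insufficient evidence"
    OVERLOAD_FALLBACK = "overload fallback"
    CONTRACT_MISMATCH = "contract mismatch"


def tally_causes(labels: list[WrongRouteCause]) -> list[tuple[str, int]]:
    # most_common sorts by count, descending; ties keep first-seen order.
    return [(cause.value, count) for cause, count in Counter(labels).most_common()]
```

A tally dominated by contract mismatch points at the capability contracts, while one dominated by classification error points at the router itself.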

Cascade design

A cascade should save cost on easy cases and escalate hard cases. It fails when early stages are overconfident. Always measure the quality of abstention and escalation, not only final answers.
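A two-stage cascade with a confidence threshold can be sketched as below, together with a check that scores the cheap stage's keep-or-escalate decisions separately from final answers. The stage signature, the threshold rule, and both function names are assumptions for illustration, not a prescribed design.

```python
from typing import Callable, NamedTuple

# A stage returns (answer, confidence in [0, 1]); both stages are hypothetical.
Stage = Callable[[str], tuple[str, float]]


class CascadeResult(NamedTuple):
    answer: str
    escalated: bool


def run_cascade(task: str, cheap: Stage, strong: Stage, threshold: float) -> CascadeResult:
    answer, confidence = cheap(task)
    if confidence >= threshold:
        return CascadeResult(answer, escalated=False)
    # The cheap stage abstains, so pay for the strong stage.
    answer, _ = strong(task)
    return CascadeResult(answer, escalated=True)


def abstention_quality(tasks, truth, cheap, threshold):
    """Score the cheap stage's keep/escalate decisions, not the final answers."""
    kept_wrong = 0
    escalated_needlessly = 0
    for task in tasks:
        answer, confidence = cheap(task)
        right = answer == truth[task]
        if confidence >= threshold and not right:
            kept_wrong += 1             # overconfident: should have escalated
        elif confidence < threshold and right:
            escalated_needlessly += 1   # underconfident: strong stage was wasted
    n = len(tasks)
    return {
        "overconfident_rate": kept_wrong / n,
        "needless_escalation_rate": escalated_needlessly / n,
    }
```

The overconfident rate is the failure mode described above: cases the early stage kept even though its answer was wrong, which final-answer accuracy alone can hide when such cases are rare.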

Coalition design

Use a coalition when specialists make largely independent errors, so that combining them adds value. It fails when multiple models repeat the same error, when judge models are weak, or when the latency budget cannot support parallel inference.
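Whether specialists repeat the same error can be checked before building a coalition. The sketch below assumes a simple majority vote in place of a judge model and measures, for each pair of specialists, how often both give the same wrong answer; the function names and data shapes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(answers: list[str]) -> str:
    # Ties resolve to the answer seen first; a real system needs an explicit tie policy.
    return Counter(answers).most_common(1)[0][0]


def shared_error_rate(
    predictions: dict[str, list[str]], truth: list[str]
) -> dict[tuple[str, str], float]:
    """Fraction of tasks on which both members of a pair give the same wrong answer."""
    rates = {}
    for a, b in combinations(sorted(predictions), 2):
        same_wrong = sum(
            pa == pb != t  # chained comparison: pa == pb and pb != t
            for pa, pb, t in zip(predictions[a], predictions[b], truth)
        )
        rates[(a, b)] = same_wrong / len(truth)
    return rates
```

A high shared error rate for a pair suggests that voting between those two specialists will mostly confirm their common mistakes rather than correct them.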

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.