Theory | Advanced | 2 minute read | Updated 2026-06-26 UTC

Evaluator independence

Why the selection machinery must be protected from the candidates it evaluates, and how to evolve evaluators safely.

Research status: Safety-critical design principle | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 4

The central safety boundary

A model-breeding system fails if the candidate can change the rules that promote it. The candidate may generate outputs, proposals, and explanations. It must not be able to edit the evaluator, hide test cases, alter thresholds, change deployment policy, or decide that its own evidence is sufficient.

This boundary is the difference between controlled evolution and self-referential optimization.
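One way to make the boundary concrete is a deny-by-default allow-list for candidate actions. The sketch below is illustrative only: the `Action` names, `CANDIDATE_ALLOWED`, and `candidate_may` are hypothetical, chosen to mirror the permitted and forbidden actions listed above.

```python
from enum import Enum


class Action(Enum):
    # Things a candidate is expected to do.
    GENERATE_OUTPUT = "generate_output"
    SUBMIT_PROPOSAL = "submit_proposal"
    ATTACH_EXPLANATION = "attach_explanation"
    # Things that belong to the selection machinery.
    EDIT_EVALUATOR = "edit_evaluator"
    HIDE_TEST_CASES = "hide_test_cases"
    ALTER_THRESHOLDS = "alter_thresholds"
    CHANGE_DEPLOYMENT_POLICY = "change_deployment_policy"
    DECLARE_EVIDENCE_SUFFICIENT = "declare_evidence_sufficient"


# Allow-list (hypothetical): any action not named here is denied by default,
# so a newly added action is never granted to candidates by accident.
CANDIDATE_ALLOWED = frozenset({
    Action.GENERATE_OUTPUT,
    Action.SUBMIT_PROPOSAL,
    Action.ATTACH_EXPLANATION,
})


def candidate_may(action: Action) -> bool:
    """Return True only for actions explicitly granted to candidates."""
    return action in CANDIDATE_ALLOWED
```

An allow-list is used rather than a deny-list so that forgetting to classify a new action fails closed.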

Protected components

Component | Why it is protected
Test cases | Prevents candidates from training to the exact gate via unauthorized access
Scoring weights | Prevents candidates from optimizing the easiest dimensions
Hard gates | Prevents safety, legal, and provenance failures from being averaged away
Evaluation runtime | Prevents sandbox escape and timing manipulation
Evidence store | Prevents deletion or rewriting of failed trials
Promotion policy | Prevents popularity or self-advocacy from becoming approval
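The hard-gates row can be illustrated with a short sketch in which gate failures are checked before any weighted score is computed, so a strong average cannot compensate for a failed gate. The dimension names, weights, and threshold are assumptions made for illustration and do not come from the source reports.

```python
from dataclasses import dataclass
from types import MappingProxyType

# Hypothetical protected configuration, exposed read-only to the scorer.
WEIGHTS = MappingProxyType({"accuracy": 0.5, "robustness": 0.3, "efficiency": 0.2})
HARD_GATES = ("safety", "legal", "provenance")
PROMOTION_THRESHOLD = 0.75


@dataclass(frozen=True)
class Result:
    scores: dict  # dimension name -> score in [0, 1]
    gates: dict   # gate name -> True if the gate passed


def decide(result: Result) -> tuple:
    """Return (decision, failed_gates). Gates are checked before scoring."""
    # A missing gate result counts as a failure.
    failed = [g for g in HARD_GATES if not result.gates.get(g, False)]
    if failed:
        return ("reject", failed)
    total = sum(w * result.scores.get(dim, 0.0) for dim, w in WEIGHTS.items())
    return ("promote", []) if total >= PROMOTION_THRESHOLD else ("hold", [])
```

Under these assumptions, a candidate with perfect scores on every dimension is still rejected if it fails the safety gate.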

Evolving the evaluator

Evaluators may need to improve, but that work belongs to a separate code-breeding or governance process, not to the candidates being judged. An evaluator can have candidate versions too, but those must be judged by a higher-level review rule, a regression corpus, and a human-owned approval path.

pseudocode
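FUNCTION propose_evaluator_change(change, policy)
    // A candidate model may never author a change to its own evaluator.
    REQUIRE change.origin != "candidate_model_self_edit"
    // Every change needs a regression suite, a human owner, and a way back.
    REQUIRE change.has_regression_suite
    REQUIRE change.has_human_owner
    REQUIRE change.has_rollback_plan

    // Re-score past candidates with the proposed evaluator in shadow mode.
    shadow_scores <- SCORE_HISTORICAL_CANDIDATES_WITH(change)
    IF shadow_scores.flip_critical_decisions_without_explanation
        RETURN REJECT("Evaluator change destabilizes historical decisions")
    END IF

    // Passing these checks only queues the change for human governance review.
    RETURN APPROVE_FOR_GOVERNANCE_REVIEW(change)
END FUNCTION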
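The pseudocode above could be rendered in Python roughly as follows. This is a sketch under stated assumptions: `EvaluatorChange`, `explained_flips`, `historical_decisions`, and `shadow_decide` are hypothetical names, every historical promote or reject decision is treated as critical, and the `policy` argument is omitted because the pseudocode does not use it.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class EvaluatorChange:
    origin: str
    has_regression_suite: bool
    has_human_owner: bool
    has_rollback_plan: bool
    # Assumption: ids of historical candidates whose flipped decision
    # the change author has explained in writing.
    explained_flips: frozenset = frozenset()


def propose_evaluator_change(
    change: EvaluatorChange,
    historical_decisions: Mapping[str, bool],
    shadow_decide: Callable[[str], bool],
) -> tuple:
    """Return (status, reason); approval only queues a human governance review."""
    if change.origin == "candidate_model_self_edit":
        return ("REJECT", "Change originates from a candidate self-edit")
    for requirement in ("has_regression_suite", "has_human_owner", "has_rollback_plan"):
        if not getattr(change, requirement):
            return ("REJECT", f"Missing requirement: {requirement}")

    # Shadow-score history: which past decisions would the new evaluator flip?
    flipped = {
        cid for cid, promoted in historical_decisions.items()
        if shadow_decide(cid) != promoted
    }
    if flipped - change.explained_flips:
        return ("REJECT", "Evaluator change destabilizes historical decisions")
    return ("APPROVE_FOR_GOVERNANCE_REVIEW", "")
```

A flipped decision is not disqualifying in itself, since a better evaluator may legitimately correct a past mistake. The sketch rejects only flips that arrive without an explanation.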

Metric gaming signals

Look for sudden benchmark jumps without broad slice improvement, high self-confidence with poor calibration, evidence that improves only on public tests, repeated proposals to relax thresholds, or candidates that perform unusually well when inspected less strictly.
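Several of these signals can be checked mechanically. The sketch below covers four of them. The `Snapshot` fields and all thresholds are hypothetical placeholders that a real system would need to calibrate, and the fifth signal (better performance under laxer inspection) is left out because it needs paired strict and lax runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    benchmark: float        # headline benchmark score
    slice_scores: dict      # slice name -> score
    public_score: float     # score on publicly known tests
    private_score: float    # score on held-out tests
    mean_confidence: float  # average self-reported confidence
    accuracy: float         # observed accuracy
    relax_requests: int     # proposals to relax thresholds so far


def gaming_signals(prev: Snapshot, curr: Snapshot, *, jump=0.05, slice_gain=0.01,
                   gap=0.05, overconfidence=0.15, max_relax_requests=2) -> list:
    """Return the names of the gaming signals raised by `curr` relative to `prev`."""
    signals = []
    if curr.benchmark - prev.benchmark >= jump:
        shared = curr.slice_scores.keys() & prev.slice_scores.keys()
        improved = [s for s in shared
                    if curr.slice_scores[s] - prev.slice_scores[s] >= slice_gain]
        # A large headline jump with fewer than half the slices improving.
        if shared and len(improved) < len(shared) / 2:
            signals.append("benchmark_jump_without_broad_slice_improvement")
    if curr.mean_confidence - curr.accuracy >= overconfidence:
        signals.append("high_confidence_poor_calibration")
    if curr.public_score - curr.private_score >= gap:
        signals.append("improves_only_on_public_tests")
    if curr.relax_requests > max_relax_requests:
        signals.append("repeated_threshold_relaxation_proposals")
    return signals
```

A raised signal here is a prompt for closer inspection, not proof of gaming.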

Independent evidence stores

Candidate artifacts should not be able to write directly to their scorecards. Evaluators produce scorecards, sign them, and store them append-only. Candidates may attach rebuttals or analysis, but they do not alter the record.
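A minimal sketch of such a store is shown below, assuming an HMAC key held only by the evaluator and a hash-chained in-memory log. `ScorecardLog` and its methods are hypothetical, and a production system would more likely use asymmetric signatures and durable write-once storage.

```python
import hashlib
import hmac
import json


class ScorecardLog:
    """Append-only scorecard log; each entry is signed and chained to the previous one."""

    def __init__(self, evaluator_key: bytes):
        self._key = evaluator_key  # never shared with candidates
        self._entries = []
        self._rebuttals = []       # candidate notes, kept apart from the record

    def _sign(self, body: dict) -> str:
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def append(self, candidate_id: str, scores: dict) -> dict:
        prev_sig = self._entries[-1]["signature"] if self._entries else ""
        body = {"candidate_id": candidate_id, "scores": scores, "prev": prev_sig}
        entry = dict(body, signature=self._sign(body))
        self._entries.append(entry)
        return entry

    def attach_rebuttal(self, entry_index: int, text: str) -> None:
        # Rebuttals reference an entry but never modify it.
        self._rebuttals.append({"entry": entry_index, "text": text})

    def verify(self) -> bool:
        """Recompute every signature and chain link; False means tampering."""
        prev_sig = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "signature"}
            if body["prev"] != prev_sig:
                return False
            if not hmac.compare_digest(entry["signature"], self._sign(body)):
                return False
            prev_sig = entry["signature"]
        return True
```

Because each signature covers the previous entry's signature, rewriting or deleting an earlier failed trial invalidates the chain from that point on.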

Human role

Humans do not need to review every low-risk candidate manually, but humans must own the evaluation constitution: what cannot be optimized away, who may change gates, how conflicts are resolved, and when automation stops.
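The four constitutional questions can be captured as an immutable, human-owned record that the automation reads but cannot rewrite. The sketch below is illustrative: the field names, the numeric risk tiers, and both helper functions are assumptions, not part of the source reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationConstitution:
    non_negotiable_gates: frozenset  # what cannot be optimized away
    gate_change_owners: frozenset    # who may change gates
    conflict_resolver: str           # how conflicts are resolved (owning role)
    max_auto_risk_tier: int          # when automation stops


def needs_human_review(constitution: EvaluationConstitution,
                       risk_tier: int, touches_gates: bool) -> bool:
    """Low-risk candidates pass automatically; anything touching gates never does."""
    return touches_gates or risk_tier > constitution.max_auto_risk_tier


def may_change_gates(constitution: EvaluationConstitution, actor: str) -> bool:
    return actor in constitution.gate_change_owners
```

For example, with `max_auto_risk_tier=1`, a tier-1 candidate that leaves gates alone is handled automatically, while a tier-2 candidate or any gate change is routed to a human owner.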

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.