Architecture | Advanced | 2 minute read | Updated 2026-06-26 UTC

Evaluation sandbox

A controlled environment for testing model descendants without exposing production data or credentials and without compromising evaluator integrity.

Research status: Security architecture pattern | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 4

Why descendants need a sandbox

A candidate is untrusted until evidence says otherwise. It may be low quality, misconfigured, over-permissioned, adversarially supplied, license-tainted, or simply incompatible. The evaluation sandbox exists to learn from candidates without letting them damage the evaluator, production systems, or source data.

Sandbox requirements

| Requirement | Implementation direction |
| --- | --- |
| No production credentials | inject synthetic or scoped credentials only |
| Frozen test sets | read-only evaluator-owned data mounts |
| Network control | default deny with explicit allow-list |
| Filesystem isolation | scratch directory reset after each run |
| Resource limits | CPU, GPU, memory, time, and token caps |
| Audit logs | append-only traces with candidate digest |
| Deterministic replay | pinned seeds, runtimes, containers, and suites |
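One way to keep these requirements from drifting is to express them as a policy object that can be checked before a run starts. The sketch below is illustrative only: `SandboxPolicy`, `ResourceLimits`, the field names, the default caps, and the host `mirror.internal.example` are all assumptions made for this example, not part of any particular sandbox product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceLimits:
    # Hypothetical caps; real values should come from the production target.
    cpu_cores: int = 2
    gpu_count: int = 0
    memory_mb: int = 4096
    wall_time_s: int = 600
    max_tokens: int = 100_000


@dataclass(frozen=True)
class SandboxPolicy:
    # Default deny: an empty allow-list means no outbound hosts at all.
    allowed_hosts: frozenset = frozenset()
    limits: ResourceLimits = field(default_factory=ResourceLimits)
    test_data_read_only: bool = True
    reset_scratch_after_run: bool = True
    seed: int = 0  # pinned so a run can be replayed

    def host_allowed(self, host: str) -> bool:
        """A host is reachable only if it is explicitly listed."""
        return host.lower() in self.allowed_hosts

    def violations(self) -> list:
        """Return the reasons this policy breaks the requirements table."""
        problems = []
        if not self.test_data_read_only:
            problems.append("test data must be mounted read-only")
        if not self.reset_scratch_after_run:
            problems.append("scratch directory must be reset after each run")
        if "*" in self.allowed_hosts:
            problems.append("wildcard host defeats default-deny networking")
        if self.limits.wall_time_s <= 0 or self.limits.memory_mb <= 0:
            problems.append("resource limits must be positive")
        return problems


policy = SandboxPolicy(allowed_hosts=frozenset({"mirror.internal.example"}))
print(policy.host_allowed("mirror.internal.example"))
print(policy.host_allowed("api.example.com"))
print(policy.violations())
```

A policy object like this only records intent. Enforcement still has to happen in the container runtime, network layer, or hypervisor that actually hosts the candidate.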

Evaluation lifecycle

pseudocode
FUNCTION evaluate_candidate_in_sandbox(candidate, suite, policy)
    sandbox <- CREATE_SANDBOX(profile: candidate.risk_tier)
    TRY
        MOUNT_READ_ONLY(sandbox, suite.test_data)
        MOUNT_READ_ONLY(sandbox, candidate.artifacts)
        APPLY_RESOURCE_LIMITS(sandbox, policy.resource_limits)
        DENY_NETWORK_BY_DEFAULT(sandbox)

        result <- RUN_SUITE(sandbox, suite)
        logs <- EXPORT_APPEND_ONLY_LOGS(sandbox)
    FINALLY
        // Tear down even if setup or the suite fails, so no state survives the run.
        DESTROY_SANDBOX(sandbox)
    END TRY

    RETURN SIGN_SCORECARD(candidate.digest, suite.version, result, logs)
END FUNCTION
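The lifecycle can also be sketched in runnable form. The version below is a minimal illustration under strong assumptions: a temporary directory stands in for the sandbox (it is not a security boundary), `run_case` is a hypothetical callable supplied by the evaluator, and the scorecard is signed with an HMAC over its JSON form, which is one possible choice rather than the guide's prescribed scheme.

```python
import hashlib
import hmac
import json
import shutil
import tempfile
from pathlib import Path


def evaluate_candidate(candidate_digest, suite_version, cases, run_case, signing_key):
    """Run each case against a throwaway scratch directory and sign the scorecard.

    `run_case(case, scratch_dir)` is a hypothetical callable returning True on pass.
    """
    scratch = Path(tempfile.mkdtemp(prefix="eval-sandbox-"))
    logs = []
    try:
        passed = 0
        for case in cases:
            ok = bool(run_case(case, scratch))
            passed += int(ok)
            # Every log entry carries the candidate digest for later audit.
            logs.append({"candidate": candidate_digest, "case": case, "passed": ok})
        result = {"passed": passed, "total": len(cases)}
    finally:
        # Teardown runs even when a case raises, so no scratch state survives.
        shutil.rmtree(scratch, ignore_errors=True)

    scorecard = {
        "candidate_digest": candidate_digest,
        "suite_version": suite_version,
        "result": result,
        "log_digest": hashlib.sha256(
            json.dumps(logs, sort_keys=True).encode()
        ).hexdigest(),
    }
    payload = json.dumps(scorecard, sort_keys=True).encode()
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return scorecard, signature


def verify_scorecard(scorecard, signature, signing_key):
    """Recompute the HMAC and compare in constant time."""
    payload = json.dumps(scorecard, sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


key = b"demo-key-not-for-production"
card, sig = evaluate_candidate(
    "sha256:abc123", "v1", ["case-a", "case-b"],
    lambda case, scratch_dir: case == "case-a", key,
)
print(card["result"])
print(verify_scorecard(card, sig, key))
```

Binding the candidate digest, suite version, and log digest into the signed payload means a scorecard cannot be reattached to a different candidate or suite without the signature check failing.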

Avoid evaluator leakage

Do not expose hidden tests, scoring weights, or promotion thresholds to candidates. Public tests can be used for debugging, but promotion should include private or rotating slices owned by the evaluator.
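A rotating private slice can be selected with a keyed hash, so that the subset changes each rotation period and cannot be predicted without the evaluator's secret. The following is a sketch under assumptions: `private_slice`, `candidate_view`, the `epoch` parameter, the 25% default fraction, and the hidden field names are all hypothetical choices for illustration.

```python
import hashlib
import hmac


def private_slice(case_ids, secret, epoch, fraction=0.25):
    """Pick a keyed, epoch-dependent subset of test cases.

    Without `secret`, a candidate cannot predict which cases are in the slice.
    """
    def rank(case_id):
        msg = f"{epoch}:{case_id}".encode()
        return hmac.new(secret, msg, hashlib.sha256).digest()

    # Sorting by keyed digest gives a stable pseudo-random order per epoch.
    ordered = sorted(case_ids, key=rank)
    keep = max(1, round(len(ordered) * fraction))
    return ordered[:keep]


def candidate_view(scorecard):
    """Strip evaluator-only fields before anything is returned to a candidate."""
    hidden = {"scoring_weights", "promotion_threshold", "private_case_ids"}
    return {k: v for k, v in scorecard.items() if k not in hidden}


ids = [f"case-{i}" for i in range(20)]
slice_now = private_slice(ids, b"evaluator-secret", epoch=1)
print(len(slice_now))
print(candidate_view({"passed": 3, "scoring_weights": [0.7, 0.3]}))
```

The same secret and epoch always yield the same slice, which keeps promotion runs replayable while still hiding the selection from candidates.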

Risk-tier profiles

Low-risk candidates may run with local-only tools and synthetic data. Medium-risk candidates need stricter network denial and stronger artifact scanning. High-risk candidates should run offline, with manual review before and after evaluation.
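The tiers above could be encoded as a lookup table so that the sandbox profile follows mechanically from the candidate's tier. In this sketch the names `RiskTier`, `TierProfile`, and `profile_for`, the string labels, and the fallback to the high-risk profile for unknown tiers are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class TierProfile:
    network: str         # "local-only", "deny-all", or "offline"
    artifact_scan: str   # "basic" or "deep"
    manual_review: bool  # review required before and after the run


PROFILES = {
    RiskTier.LOW: TierProfile("local-only", "basic", False),
    RiskTier.MEDIUM: TierProfile("deny-all", "deep", False),
    RiskTier.HIGH: TierProfile("offline", "deep", True),
}


def profile_for(tier_name):
    """Resolve a tier name; unknown or missing tiers fall back to HIGH.

    Failing closed is an assumption of this sketch, not a rule from the guide.
    """
    try:
        tier = RiskTier(str(tier_name).lower())
    except ValueError:
        tier = RiskTier.HIGH
    return PROFILES[tier]


print(profile_for("low"))
print(profile_for("unclassified"))
```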

What not to sandbox away

The sandbox should not hide resource costs. If a candidate works only because the sandbox gives it unrealistic memory, latency, or data cleanliness, the evidence is misleading. Mirror production constraints closely enough that promotion decisions based on sandbox results remain valid.
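A simple parity check can flag sandboxes that are more generous than production before their results are trusted. The sketch below assumes each environment is described as a mapping from a resource name to a numeric cap where larger means looser. The function name `parity_gaps`, the resource keys, and the 10% tolerance are hypothetical.

```python
def parity_gaps(sandbox, production, tolerance=0.10):
    """List resources where the sandbox is looser than production.

    Both arguments map a resource name to its cap (memory in MB, latency
    budget in ms, and so on). Larger numbers mean a looser constraint.
    """
    gaps = []
    for name, prod_cap in production.items():
        box_cap = sandbox.get(name)
        if box_cap is None:
            # A missing cap is treated as unlimited, which is never realistic.
            gaps.append(f"{name}: no sandbox cap set (production cap {prod_cap})")
        elif box_cap > prod_cap * (1 + tolerance):
            gaps.append(f"{name}: sandbox {box_cap} exceeds production {prod_cap}")
    return gaps


production_caps = {"memory_mb": 4096, "latency_ms": 200}
sandbox_caps = {"memory_mb": 16384}
for gap in parity_gaps(sandbox_caps, production_caps):
    print(gap)
```

Data cleanliness does not reduce to a single cap, so a check like this covers only the numeric constraints. Comparing test data against production noise needs a separate review.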

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.