Evaluation sandbox — ModelBreeder.com

Why descendants need a sandbox

A candidate is untrusted until evidence says otherwise. It may be low quality, misconfigured, over-permissioned, adversarially supplied, license-tainted, or simply incompatible. The evaluation sandbox exists to learn from candidates without letting them damage the evaluator, production systems, or source data.

Sandbox requirements

Requirement	Implementation direction
No production credentials	inject synthetic or scoped credentials only
Frozen test sets	read-only evaluator-owned data mounts
Network control	default deny with explicit allow-list
Filesystem isolation	scratch directory reset after each run
Resource limits	CPU, GPU, memory, time, and token caps
Audit logs	append-only traces with candidate digest
Deterministic replay	pinned seeds, runtimes, containers, and suites
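
The audit-log row can be made concrete with a hash-chained log, in which each entry commits to the previous one so that later edits or deletions become detectable. What follows is a minimal sketch under stated assumptions, not a definitive implementation: `AuditLog`, `append`, and `verify` are hypothetical names, and an in-memory list stands in for whatever append-only store the evaluator actually uses.

```python
import hashlib
import json


class AuditLog:
    """In-memory, hash-chained log (hypothetical; a real store would persist entries)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self, candidate_digest: str):
        self.candidate_digest = candidate_digest
        self.entries = []

    @staticmethod
    def _hash(body: dict) -> str:
        # Canonical JSON so the same body always hashes to the same value.
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, event: str) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {
            "candidate_digest": self.candidate_digest,
            "event": event,
            "prev": prev,
        }
        entry = dict(body, hash=self._hash(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute every hash and check each link back to its predecessor.
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != self._hash(body):
                return False
            prev = entry["hash"]
        return True
```

Every entry carries the candidate digest, so a trace can always be tied back to the exact artifact that produced it. Timestamps are left out of this sketch to keep it deterministic.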

Evaluation lifecycle

pseudocode

FUNCTION evaluate_candidate_in_sandbox(candidate, suite, policy)
    sandbox <- CREATE_SANDBOX(profile: candidate.risk_tier)
    TRY
        MOUNT_READ_ONLY(sandbox, suite.test_data)
        MOUNT_READ_ONLY(sandbox, candidate.artifacts)
        APPLY_RESOURCE_LIMITS(sandbox, policy.resource_limits)
        DENY_NETWORK_BY_DEFAULT(sandbox)

        result <- RUN_SUITE(sandbox, suite)
        logs <- EXPORT_APPEND_ONLY_LOGS(sandbox)
    FINALLY
        // Tear down even if setup or the run fails, so no sandbox outlives its evaluation
        DESTROY_SANDBOX(sandbox)
    END TRY

    RETURN SIGN_SCORECARD(candidate.digest, suite.version, result, logs)
END FUNCTION
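
The lifecycle can also be sketched as runnable Python. This is an illustration under stated assumptions rather than the guide's implementation: the `Sandbox` class is a hypothetical stand-in for a real isolation backend, candidates and suites are plain dictionaries, the `signing_key` parameter is an addition, and HMAC-SHA256 stands in for whatever scheme `SIGN_SCORECARD` actually uses. Teardown runs in a `finally` block so a failing suite cannot leave a sandbox behind.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Stand-in for a real isolation backend (hypothetical)."""
    profile: str
    read_only_mounts: list = field(default_factory=list)
    limits: dict = field(default_factory=dict)
    network_allowed: bool = True
    logs: list = field(default_factory=list)
    destroyed: bool = False

    def run_suite(self, suite: dict) -> dict:
        # A real backend would execute the candidate; here each test is a callable.
        self.logs.append(f"run suite {suite['version']}")
        return {name: bool(test()) for name, test in suite["tests"].items()}


def evaluate_candidate_in_sandbox(candidate: dict, suite: dict, policy: dict,
                                  signing_key: bytes) -> dict:
    sandbox = Sandbox(profile=candidate["risk_tier"])
    try:
        sandbox.read_only_mounts += [suite["test_data"], candidate["artifacts"]]
        sandbox.limits = dict(policy["resource_limits"])
        sandbox.network_allowed = False  # default deny
        result = sandbox.run_suite(suite)
        logs = list(sandbox.logs)  # export before teardown
    finally:
        sandbox.destroyed = True  # teardown happens even if the suite raises

    scorecard = {
        "candidate_digest": candidate["digest"],
        "suite_version": suite["version"],
        "result": result,
        "logs": logs,
    }
    # Sign a canonical serialization so the scorecard cannot be altered unnoticed.
    payload = json.dumps(scorecard, sort_keys=True).encode("utf-8")
    scorecard["signature"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return scorecard
```

Binding the signature to both the candidate digest and the suite version means a scorecard cannot be reattached to a different artifact or a different test suite without invalidating it.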

Avoid evaluator leakage

Do not expose hidden tests, scoring weights, or promotion thresholds to candidates. Public tests can be used for debugging, but promotion should include private or rotating slices owned by the evaluator.
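
One possible way to implement rotating private slices is to rank the private test IDs with a keyed hash, so that the selection changes each round and cannot be predicted without the evaluator's secret. The sketch below is illustrative only: `rotating_slice`, `candidate_view`, the `round_id` format, and the scorecard field names are all assumptions, not part of any existing ModelBreeder API.

```python
import hashlib
import hmac


def rotating_slice(test_ids, secret: bytes, round_id: str, k: int):
    """Pick k private test IDs for this round, keyed by an evaluator-held secret."""
    def rank(test_id: str) -> str:
        msg = f"{round_id}:{test_id}".encode("utf-8")
        return hmac.new(secret, msg, hashlib.sha256).hexdigest()

    # Sorting by the keyed hash gives a stable, secret-dependent ordering.
    return sorted(test_ids, key=rank)[:k]


def candidate_view(scorecard: dict) -> dict:
    """Keep only fields that are safe to return to the candidate side."""
    public_fields = {"candidate_digest", "suite_version", "public_pass_rate"}
    return {key: value for key, value in scorecard.items() if key in public_fields}
```

The allow-list in `candidate_view` is deliberate: a new evaluator-only field, such as a threshold or a private result, stays hidden unless someone explicitly marks it public.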

Risk-tier profiles

Low-risk candidates may run with local-only tools and synthetic data. Medium-risk candidates need stricter network denial and stronger artifact scanning. High-risk candidates should run offline, with manual review before and after evaluation.
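
The three tiers can be written down as data so that a sandbox is always created from an explicit profile. The sketch below is one possible encoding, with several assumptions: the names `RiskTier`, `SandboxProfile`, and `profile_for` are hypothetical, the high tier is assumed to scan artifacts at least as strongly as the medium tier, and unknown tier names are assumed to fall back to the strictest profile.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class SandboxProfile:
    network: str          # "local_only", "deny_all", or "offline"
    artifact_scan: str    # "standard" or "strong"
    manual_review: bool   # review before and after evaluation


PROFILES = {
    RiskTier.LOW: SandboxProfile("local_only", "standard", False),
    RiskTier.MEDIUM: SandboxProfile("deny_all", "strong", False),
    # Assumption: high-risk scanning is at least as strong as medium-risk.
    RiskTier.HIGH: SandboxProfile("offline", "strong", True),
}


def profile_for(tier_name: str) -> SandboxProfile:
    """Look up a profile; unknown tiers get the strictest one (assumed policy)."""
    try:
        tier = RiskTier(tier_name)
    except ValueError:
        tier = RiskTier.HIGH
    return PROFILES[tier]
```

Falling back to the high-risk profile means a typo or a missing tier label makes an evaluation stricter, never looser.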

What not to sandbox away

The sandbox should not hide resource costs. If a candidate only works because the sandbox gives it unrealistic memory, latency, or data cleanliness, the evidence is misleading. Mirror production constraints closely enough that a promotion decision based on sandbox results still holds in production.
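
A simple guard against over-generous sandboxes is to compare sandbox caps with production caps before trusting a result. The sketch below assumes every limit is an upper cap expressed in the same unit on both sides; the function name, the limit keys, and the 10% tolerance are hypothetical choices for illustration.

```python
def unrealistic_limits(sandbox_limits: dict, production_limits: dict,
                       tolerance: float = 0.10) -> dict:
    """Return resources where the sandbox is more generous than production.

    A missing sandbox cap is treated as unlimited and is therefore flagged.
    """
    violations = {}
    for name, prod_value in production_limits.items():
        sandbox_value = sandbox_limits.get(name)
        if sandbox_value is None or sandbox_value > prod_value * (1 + tolerance):
            violations[name] = (sandbox_value, prod_value)
    return violations
```

An evaluator could refuse to sign a scorecard whenever this returns a non-empty result, so that scores earned under unrealistic conditions never reach a promotion decision.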

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.

Category	Report	Type	Size
Core synthesis	The Four Fs of AI: Code Breeding, Model Breeding, and the Teleodynamic Convergence of Mutable Small-Model Ecologies	Conceptual synthesis	80.5 KB
Speculative risk scenarios	Aggressive Mutualism: Safety, Governance, and Containment Analysis	Risk analysis	42.0 KB
Speculative risk scenarios	Instrumental Drives in Powerful AI Systems	Risk analysis	42.2 KB
Core synthesis	The 4Fs Framework: Fast, Flexible, Frugal, Federated	Emerging practice	22.5 KB

Related guides

Runtime isolation

Evaluator gates

Architecture

Reference architecture