Why descendants need a sandbox
A candidate is untrusted until evidence says otherwise. It may be low quality, misconfigured, over-permissioned, adversarially supplied, license-tainted, or simply incompatible. The evaluation sandbox exists to learn from candidates without letting them damage the evaluator, production systems, or source data.
Sandbox requirements
| Requirement | Implementation direction |
|---|---|
| No production credentials | inject synthetic or scoped credentials only |
| Frozen test sets | read-only evaluator-owned data mounts |
| Network control | default deny with explicit allow-list |
| Filesystem isolation | scratch directory reset after each run |
| Resource limits | CPU, GPU, memory, time, and token caps |
| Audit logs | append-only traces with candidate digest |
| Deterministic replay | pinned seeds, runtimes, containers, and suites |
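Several rows of the table can be captured in a small policy object. The sketch below is illustrative only: `ResourceLimits`, `SandboxPolicy`, `is_host_allowed`, and `candidate_digest` are hypothetical names, and the fields and units are assumptions rather than anything the source reports prescribe. It shows default-deny networking with an explicit allow-list, explicit resource caps, and a content digest that audit logs can reference.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceLimits:
    """Hypothetical caps; field names and units are assumptions."""
    cpu_cores: float
    memory_mb: int
    wall_clock_seconds: int
    max_tokens: int


@dataclass(frozen=True)
class SandboxPolicy:
    limits: ResourceLimits
    # Default deny: an empty allow-list permits no outbound hosts at all.
    # Entries are assumed to be stored in lowercase.
    allowed_hosts: frozenset = field(default_factory=frozenset)

    def is_host_allowed(self, host: str) -> bool:
        """Allow a host only if it was listed explicitly."""
        return host.lower() in self.allowed_hosts


def candidate_digest(artifact_bytes: bytes) -> str:
    """SHA-256 hex digest that ties log entries to one exact artifact."""
    return hashlib.sha256(artifact_bytes).hexdigest()
```

Freezing the dataclasses means a policy cannot be mutated once a run has started, which makes it easier to replay a run with the same settings.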
Evaluation lifecycle
FUNCTION evaluate_candidate_in_sandbox(candidate, suite, policy)
    sandbox <- CREATE_SANDBOX(profile: candidate.risk_tier)
    TRY
        MOUNT_READ_ONLY(sandbox, suite.test_data)
        MOUNT_READ_ONLY(sandbox, candidate.artifacts)
        APPLY_RESOURCE_LIMITS(sandbox, policy.resource_limits)
        DENY_NETWORK_BY_DEFAULT(sandbox)
        result <- RUN_SUITE(sandbox, suite)
        logs <- EXPORT_APPEND_ONLY_LOGS(sandbox)
    FINALLY
        // Tear the sandbox down even if a step above fails
        DESTROY_SANDBOX(sandbox)
    END TRY
    RETURN SIGN_SCORECARD(candidate.digest, suite.version, result, logs)
END FUNCTION
Avoid evaluator leakage
Do not expose hidden tests, scoring weights, or promotion thresholds to candidates. Public tests can be used for debugging, but promotion should include private or rotating slices owned by the evaluator.
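One way to get a rotating private slice is to derive it from an evaluator-held secret, so the selection is reproducible for the evaluator but unpredictable to candidates. This is a minimal sketch under stated assumptions: `select_private_slice`, `candidate_visible_report`, the `epoch` label, and the default `fraction` are all hypothetical and do not come from the source reports.

```python
import hashlib
import hmac


def select_private_slice(test_ids, secret: bytes, epoch: str, fraction: float = 0.2):
    """Pick a deterministic, evaluator-only subset of tests for one epoch."""
    threshold = int(fraction * 2**64)
    chosen = []
    for test_id in test_ids:
        message = f"{epoch}:{test_id}".encode()
        mac = hmac.new(secret, message, hashlib.sha256).digest()
        # Treat the first 8 bytes as an integer and keep tests under the threshold.
        if int.from_bytes(mac[:8], "big") < threshold:
            chosen.append(test_id)
    return chosen


def candidate_visible_report(public_results: dict) -> dict:
    """Return only public per-test outcomes: no weights, thresholds, or private results."""
    return {"public_results": dict(public_results)}
```

Changing `epoch` rotates the slice without changing the secret, and the candidate-facing report is built from public results only, so nothing about the private slice is passed into it.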
Risk-tier profiles
Low-risk candidates may run with local-only tools and synthetic data. Medium-risk candidates need stricter network denial and stronger artifact scanning. High-risk candidates should run offline, with manual review before and after evaluation.
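The tiers can be written down as data so that sandbox creation reads them from one place. The sketch below is an assumption-laden illustration: `TierProfile`, `PROFILES`, `profile_for`, and the string values are hypothetical, and values the text does not state (such as the scan level for low-risk candidates) are placeholders. Falling back to the high-risk profile for unknown tiers is a design choice of this sketch, not something the guide specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierProfile:
    network: str          # "local_only", "deny_all", or "offline" (assumed labels)
    synthetic_data: bool  # True when only synthetic data is mounted
    artifact_scan: str    # "standard" or "deep" (assumed labels)
    manual_review: bool   # review before and after evaluation


PROFILES = {
    "low": TierProfile("local_only", True, "standard", False),
    # The text does not say which data medium and high tiers see; True is an assumption.
    "medium": TierProfile("deny_all", True, "deep", False),
    "high": TierProfile("offline", True, "deep", True),
}


def profile_for(tier: str) -> TierProfile:
    """Look up a tier; unknown or missing tiers fail closed to the high-risk profile."""
    return PROFILES.get(tier, PROFILES["high"])
```

Failing closed means a typo in a candidate's tier label yields the most restrictive sandbox rather than the least.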
What not to sandbox away
The sandbox should not hide resource costs. If a candidate only works because the sandbox gives it unrealistic memory, latency, or data cleanliness, the evidence is misleading. Mirror production constraints closely enough that a promotion decision based on sandbox results still holds in production.
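A simple pre-run check can flag sandboxes that are more generous than production. This is a minimal sketch, assuming limits are expressed as plain numbers where larger means more generous; `parity_violations`, the key names, and the 10% tolerance are hypothetical choices made for illustration.

```python
def parity_violations(sandbox: dict, production: dict, tolerance: float = 0.10) -> list:
    """List resources where the sandbox is more generous than production."""
    violations = []
    for name, prod_value in production.items():
        sandbox_value = sandbox.get(name)
        if sandbox_value is None:
            # A missing cap is treated as unlimited, which is never realistic.
            violations.append(f"{name}: no sandbox limit set")
        elif sandbox_value > prod_value * (1 + tolerance):
            violations.append(
                f"{name}: sandbox {sandbox_value} exceeds production {prod_value}"
            )
    return violations
```

For example, a sandbox with `{"memory_mb": 8192, "timeout_s": 30}` checked against production limits of `{"memory_mb": 4096, "timeout_s": 30}` would report the memory limit and pass the timeout. The check only covers numeric caps; data cleanliness needs a separate comparison.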
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.