The score is not just accuracy
A candidate descendant should be evaluated by net viability, not by a single benchmark. Accuracy can rise while system viability falls because the candidate adds too much latency, memory, risk, operational complexity, or evaluator fragility. The viability function exists to stop that failure.
A practical score uses normalized deltas against the current production baseline:
| Symbol | Meaning | Direction |
|---|---|---|
| Delta U | Utility or task quality improvement | Higher is better |
| Delta R | Robustness, calibration, and abstention improvement | Higher is better |
| Delta D | Useful behavioral diversity and error decorrelation | Higher is better |
| Delta C | Task, language, modality, or environment coverage | Higher is better |
| Delta M | Memory, storage, and model-loading overhead | Lower is better |
| Delta L | End-to-end latency and tail latency overhead | Lower is better |
| Delta E | Energy, compute, and evaluation cost | Lower is better |
| Delta S | Security, safety, legal, and provenance risk | Lower is better |
| Delta K | Maintenance complexity and coordination burden | Lower is better |
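Read together with the viability pseudocode in this guide, the table suggests a weighted linear form for the score. The expression below is a sketch under that assumption: the weights $w$ and the tilde notation for a normalized delta are introduced here for illustration and do not appear in the source reports.

$$
V \;=\; \underbrace{w_U\,\widetilde{\Delta U} + w_R\,\widetilde{\Delta R} + w_D\,\widetilde{\Delta D} + w_C\,\widetilde{\Delta C}}_{\text{benefits}}
\;-\; \underbrace{\left( w_M\,\widetilde{\Delta M} + w_L\,\widetilde{\Delta L} + w_E\,\widetilde{\Delta E} + w_S\,\widetilde{\Delta S} + w_K\,\widetilde{\Delta K} \right)}_{\text{costs}}
$$

Each "higher is better" dimension enters with a positive sign and each "lower is better" dimension with a negative sign, so a candidate that improves quality but increases latency or risk can still end up with $V < 0$.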
Normalized viability
The score should be dimensionless. Normalize each dimension to a comparable scale before weighting it. Do not let one dimension dominate the decision merely because its units happen to produce larger numbers.
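The guide does not say how NORMALIZE should work, so the following is only one possible sketch: divide each raw delta by a per-dimension reference scale and clip the result to [-1, 1]. The function name `normalize`, the `reference` parameter, and the example numbers are all assumptions made for illustration.

```python
def normalize(delta: float, reference: float) -> float:
    """Map a raw delta to a dimensionless value clipped to [-1, 1].

    `reference` is the size of change treated as "one full unit" for this
    dimension. It is a hypothetical tuning parameter, not part of the guide.
    """
    if reference <= 0:
        raise ValueError("reference scale must be positive")
    return max(-1.0, min(1.0, delta / reference))


# Raw deltas arrive in very different units.
accuracy_delta = 0.02      # +2 points of accuracy, expressed as a fraction
latency_delta_ms = 40.0    # +40 ms of tail latency

# Illustrative reference scales: 5 points of accuracy, 100 ms of latency.
accuracy_norm = normalize(accuracy_delta, reference=0.05)
latency_norm = normalize(latency_delta_ms, reference=100.0)
```

With these assumed scales both deltas land near 0.4, so neither dominates simply because milliseconds are numerically larger than accuracy fractions. Other schemes (z-scores over historical candidates, log scaling for costs) would serve the same purpose.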
FUNCTION viability(candidate, baseline, weights)
    benefits <- 0
    benefits += weights.utility * NORMALIZE(candidate.utility - baseline.utility)
    benefits += weights.robustness * NORMALIZE(candidate.robustness - baseline.robustness)
    benefits += weights.diversity * NORMALIZE(candidate.diversity_contribution)
    benefits += weights.coverage * NORMALIZE(candidate.coverage_gain)
    costs <- 0
    costs += weights.memory * NORMALIZE(candidate.memory_cost - baseline.memory_cost)
    costs += weights.latency * NORMALIZE(candidate.latency_cost - baseline.latency_cost)
    costs += weights.energy * NORMALIZE(candidate.energy_cost - baseline.energy_cost)
    costs += weights.risk * NORMALIZE(candidate.risk_delta)
    costs += weights.complexity * NORMALIZE(candidate.complexity_delta)
    RETURN benefits - costs
END FUNCTION

Hard gates come before arithmetic
Some properties should not be averaged away. If a candidate fails a license gate, exposes a credential path, lacks a rollback target, violates a safety invariant, or uses unapproved data, the candidate fails even if its numeric score is high.
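A minimal Python sketch of this gate-first ordering is shown here. The `Candidate` fields, the `Outcome` enum, and the function names are hypothetical; they mirror the five example gates in the paragraph above and assume the numeric score has already been computed elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REJECT = "reject"
    PROMOTE_WITH_CANARY = "promote_with_canary"
    ARCHIVE_AS_STEPPING_STONE = "archive_as_stepping_stone"
    NO_OP = "no_op"


@dataclass(frozen=True)
class Candidate:
    # Hypothetical boolean facts, one per gate named in the text.
    license_approved: bool
    exposes_credential_path: bool
    has_rollback_target: bool
    violates_safety_invariant: bool
    uses_unapproved_data: bool


def failed_gates(c: Candidate) -> list[str]:
    """Return the names of all hard gates the candidate fails."""
    checks = {
        "license": c.license_approved,
        "credentials": not c.exposes_credential_path,
        "rollback": c.has_rollback_target,
        "safety": not c.violates_safety_invariant,
        "data": not c.uses_unapproved_data,
    }
    return [name for name, ok in checks.items() if not ok]


def decide(c: Candidate, score: float,
           promote_threshold: float, archive_threshold: float) -> Outcome:
    """Check hard gates first; the score is consulted only if all pass."""
    if failed_gates(c):
        return Outcome.REJECT
    if score >= promote_threshold:
        return Outcome.PROMOTE_WITH_CANARY
    if score >= archive_threshold:
        return Outcome.ARCHIVE_AS_STEPPING_STONE
    return Outcome.NO_OP


clean = Candidate(True, False, True, False, False)
no_rollback = Candidate(True, False, False, False, False)
```

Because the gate check returns before any comparison against a threshold, a very high score cannot rescue `no_rollback`. Returning the list of failed gate names, rather than a bare boolean, is a design choice of this sketch that makes rejections easier to audit.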
FUNCTION decision(candidate, baseline, policy)
    IF NOT HARD_GATES_PASS(candidate, policy)
        RETURN REJECT("Hard gate failure")
    END IF
    score <- viability(candidate, baseline, policy.weights)
    IF score >= policy.promote_threshold
        RETURN PROMOTE_WITH_CANARY(candidate, score)
    END IF
    IF score >= policy.archive_threshold
        RETURN ARCHIVE_AS_STEPPING_STONE(candidate, score)
    END IF
    RETURN NO_OP("Insufficient net viability")
END FUNCTION

Thresholds are environment-dependent
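The same candidate can deserve promotion in one environment and rejection in another, so weights and thresholds belong to the deployment policy rather than to the score itself. A browser-edge deployment should heavily penalize memory and tail latency. A batch research workflow may tolerate slower inference if the result improves coverage or robustness. A regulated workflow should overweight provenance, auditability, and conservative rollback.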
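To make the environment dependence concrete, the sketch below scores one hypothetical candidate under two weight profiles. All numbers are invented for illustration, the deltas are assumed to be already normalized, and the profile names `edge_weights` and `batch_weights` are not from the source.

```python
BENEFITS = ("utility", "robustness", "diversity", "coverage")
COSTS = ("memory", "latency", "energy", "risk", "complexity")

# Normalized deltas for one hypothetical candidate (dimensionless).
candidate = {
    "utility": 0.5, "robustness": 0.25, "diversity": 0.0, "coverage": 0.5,
    "memory": 0.5, "latency": 0.75, "energy": 0.25, "risk": 0.0,
    "complexity": 0.25,
}


def viability(deltas: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted benefits minus weighted costs."""
    benefits = sum(weights[k] * deltas[k] for k in BENEFITS)
    costs = sum(weights[k] * deltas[k] for k in COSTS)
    return benefits - costs


# Edge profile: memory and latency count double.
edge_weights = {
    "utility": 1.0, "robustness": 1.0, "diversity": 1.0, "coverage": 1.0,
    "memory": 2.0, "latency": 2.0, "energy": 1.0, "risk": 1.0,
    "complexity": 1.0,
}

# Batch profile: robustness and coverage count double, memory and latency half.
batch_weights = {
    "utility": 1.0, "robustness": 2.0, "diversity": 1.0, "coverage": 2.0,
    "memory": 0.5, "latency": 0.5, "energy": 1.0, "risk": 1.0,
    "complexity": 1.0,
}

edge_score = viability(candidate, edge_weights)    # -1.75
batch_score = viability(candidate, batch_weights)  # 0.875
```

The candidate is identical in both calls. Only the policy changes, and the sign of the score flips with it, which is why a single global threshold would be hard to defend.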
Retention score
Viability also applies to existing modules. A module that was once valuable can become a liability after workload shifts, hardware changes, or better descendants arrive.
FUNCTION retention_score(module, observed_window, policy)
    contribution <- MEASURE_MARGINAL_CONTRIBUTION(module, observed_window)
    burden <- MEASURE_RUNNING_BURDEN(module, observed_window)
    risk <- MEASURE_CURRENT_RISK(module)
    RETURN contribution - burden - risk
END FUNCTION

A module with a negative retention score is not punished. It is retired, compressed, or moved to a cold archive so the population can remain frugal.
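As a rough Python sketch of the same idea, the snippet below assumes the three measurements have already been taken over some observation window and are expressed on a common normalized scale. The `ModuleObservation` type, the module names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleObservation:
    # Hypothetical record of measurements for one module over a window.
    name: str
    contribution: float  # marginal contribution to system quality
    burden: float        # running cost: memory, latency, maintenance
    risk: float          # current security, safety, or provenance risk


def retention_score(obs: ModuleObservation) -> float:
    """Contribution minus burden minus risk, as in the pseudocode above."""
    return obs.contribution - obs.burden - obs.risk


def retirement_candidates(observations: list[ModuleObservation]) -> list[str]:
    """Names of modules whose retention score is negative."""
    return [o.name for o in observations if retention_score(o) < 0]


population = [
    ModuleObservation("reranker", contribution=0.5, burden=0.25, risk=0.0),
    ModuleObservation("legacy_parser", contribution=0.125, burden=0.25, risk=0.125),
]

to_review = retirement_candidates(population)
```

Here `legacy_parser` scores -0.25 and is flagged, while `reranker` scores 0.25 and stays. Whether a flagged module is retired, compressed, or archived is a separate policy decision that this sketch does not model.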
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.