Evaluation automation is phase-gated by comprehension

Type: kb/types/note.md · Status: seedling · Tags: learning-theory, llm-interpretation-errors, evaluation

When an evaluation loop improves score without improving real behavior, the failure is usually not weak search but an objective too weakly grounded in observed failures. In practice, evaluation automation follows a characteristic sequence: comprehension first, specification second, generalization third.

Comprehension is the first gate because it supplies the observations that specification turns into verifiers. Before automation can improve output quality, the system needs direct evidence of real failures, a way to identify concrete failure modes, and a route for turning those failures into discriminative judges.

The three phases

  1. Comprehension: Read outputs directly, observe where and why the system fails, build non-theoretical intuition for failure patterns.
  2. Specification: Convert observations into a failure taxonomy and evaluators, then calibrate those evaluators against manually labeled examples.
  3. Generalization: Run automated optimization against calibrated evaluators with broader input coverage.
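The calibration step in phase 2 can be made concrete with a small sketch: compare an automated judge's verdicts against hand-labeled examples and only accept the judge once agreement clears a threshold. All names here (`calibrate_judge`, `min_agreement`, the toy judge) are illustrative assumptions, not a prescribed implementation.

```python
def calibrate_judge(judge, labeled_examples, min_agreement=0.8):
    """Measure agreement between an automated judge and human labels.

    judge: callable mapping an output to a predicted failure mode (or None).
    labeled_examples: list of (output, human_label) pairs.
    Returns (agreement_rate, passed_calibration).
    """
    agreements = sum(
        1 for output, human_label in labeled_examples
        if judge(output) == human_label
    )
    rate = agreements / len(labeled_examples)
    return rate, rate >= min_agreement

# Usage: a toy judge that flags outputs containing "TODO" as incomplete.
examples = [
    ("done", None),
    ("TODO: finish", "incomplete"),
    ("ok", None),
]
rate, passed = calibrate_judge(
    lambda o: "incomplete" if "TODO" in o else None, examples
)
```

The point of the sketch is the dependency direction: the labeled examples come from the comprehension phase, so the judge is checked against observed reality rather than assumed correct.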

This sequencing matches a practitioner pattern described in one detailed field report: auto-generated tests and judges produced early score gains, then degraded real quality exposed that the objective was wrong. The loop functioned correctly; the objective did not.

Why this is a gate, not a style preference

Skipping comprehension leaves specification unconstrained by observed reality. Skipping specification leaves optimization unconstrained by discriminative checks. Both cases amplify proxy quality rather than task quality.

This is why "more automation" cannot reliably substitute for the early verifier-construction work in cold-start or subjective domains. Automation can help once failure patterns and judges exist, but it cannot safely construct them from zero context.

Meta-Harness shows an important boundary condition. In hard-oracle domains, rich diagnostic access can automate part of comprehension: the proposer can inspect raw execution traces, prior harness code, and scores to infer why candidates failed before writing the next candidate. The phase gate is not "a human must always understand first." It is "optimization needs enough diagnostic access to form a causal failure model before it generalizes." Scores alone do not supply that model, and Meta-Harness's ablation suggests summaries may not preserve it either.

Scope limits

  • In hard-oracle domains (compilers, strict schemas, deterministic tests), comprehension can be shorter or partly automated when the proposer has rich diagnostic traces, not just scalar scores.
  • In soft-oracle domains (writing quality, strategic reasoning, product judgment), comprehension is load-bearing and usually human-led.
  • This claim applies to early and mid-stage system tuning. Mature systems may partially automate parts of comprehension, but only after prior manual cycles have stabilized the taxonomy.

Practical implication

Evaluation pipelines should enforce explicit verifier-construction stage gates before optimization:

  1. Output-read pass completed on diverse inputs
  2. Failure taxonomy written from observed failures
  3. Judges calibrated on a hand-scored mini set
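The three gates above can be enforced mechanically: optimization refuses to run until every gate is recorded as complete. This is a minimal sketch; the gate names and the `run_optimization` stub are assumptions for illustration, not part of the source.

```python
# The three verifier-construction gates, in order.
GATES = (
    "output_read_pass",   # outputs read directly on diverse inputs
    "failure_taxonomy",   # taxonomy written from observed failures
    "judges_calibrated",  # judges checked against a hand-scored mini set
)

def ready_for_optimization(completed):
    """True only when every comprehension/specification gate is done."""
    return all(gate in completed for gate in GATES)

def run_optimization(completed):
    """Refuse to optimize until all gates pass."""
    missing = [g for g in GATES if g not in completed]
    if missing:
        raise RuntimeError(f"blocked: missing gates {missing}")
    # ... proceed with automated optimization against calibrated judges
```

The design choice worth noting is that the check is a hard block, not a warning: skipping a gate silently is exactly how score improvements become weak evidence.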

Without these gates, score improvements are weak evidence of capability improvement.


Relevant Notes: