Evaluation automation is phase-gated by comprehension
Type: note · Status: seedling · Tags: learning-theory, llm-interpretation-errors, evaluation
When an evaluation loop improves score without improving real behavior, the cause is often not weak search but an objective too weakly grounded in observed failures. Evaluation automation in practice follows a characteristic sequence: comprehension first, specification second, generalization third.
Comprehension is the first gate because it supplies the observations that specification turns into verifiers. Before automation can improve output quality, someone must inspect real outputs, identify concrete failure modes, and translate those failures into discriminative judges.
The three phases
- Comprehension: Read outputs directly, observe where and why the system fails, and build intuition for failure patterns from observation rather than theory.
- Specification: Convert observations into a failure taxonomy and evaluators, then calibrate those evaluators against manually labeled examples.
- Generalization: Run automated optimization against calibrated evaluators with broader input coverage.
This sequencing matches a practitioner pattern described in one detailed field report: auto-generated tests and judges produced early score gains, then real quality degraded, exposing that the objective was wrong. The loop functioned correctly; the objective did not.
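The calibration step in the Specification phase can be made concrete with a minimal sketch. It assumes a judge is a function that returns True when it flags an output as failing, and that a small hand-labeled set exists. The names `calibrate`, `is_discriminative`, `hedging_judge`, the example outputs, and the 0.8 floor are all hypothetical illustrations, not taken from the field report.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Calibration:
    true_positive_rate: float  # share of human-labeled failures the judge flags
    true_negative_rate: float  # share of human-labeled passes the judge clears


def calibrate(
    judge: Callable[[str], bool],
    labeled: Sequence[tuple[str, bool]],
) -> Calibration:
    """Compare a judge's verdicts with human labels (True = output fails)."""
    failures = [text for text, is_failure in labeled if is_failure]
    passes = [text for text, is_failure in labeled if not is_failure]
    if not failures or not passes:
        raise ValueError("need at least one labeled failure and one labeled pass")
    tpr = sum(judge(t) for t in failures) / len(failures)
    tnr = sum(not judge(t) for t in passes) / len(passes)
    return Calibration(tpr, tnr)


def is_discriminative(cal: Calibration, floor: float = 0.8) -> bool:
    # The floor is an assumed threshold; pick one that suits the domain.
    return cal.true_positive_rate >= floor and cal.true_negative_rate >= floor


# Hypothetical judge for one observed failure mode: answers that never commit.
def hedging_judge(output: str) -> bool:
    return "it depends" in output.lower()


# Hypothetical hand-labeled mini set: (output text, human says it fails).
labeled_set = [
    ("It depends on many factors.", True),
    ("Well, it depends.", True),
    ("Use a B-tree index on user_id.", False),
    ("Restart the worker, then clear the queue.", False),
]

cal = calibrate(hedging_judge, labeled_set)
always_flag = calibrate(lambda _output: True, labeled_set)
print(cal, is_discriminative(cal))
print(always_flag, is_discriminative(always_flag))
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that flags everything from looking acceptable on a failure-heavy labeled set.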
Why this is a gate, not a style preference
Skipping comprehension leaves specification unconstrained by observed reality. Skipping specification leaves optimization unconstrained by discriminative checks. Both cases amplify proxy quality rather than task quality.
This is why "more automation" cannot reliably substitute for the early verifier-construction work in cold-start or subjective domains. Automation can help once failure patterns and judges exist, but it cannot safely assume them from zero context.
Scope limits
- In hard-oracle domains (compilers, strict schemas, deterministic tests), comprehension can be shorter because failure is already legible.
- In soft-oracle domains (writing quality, strategic reasoning, product judgment), comprehension is load-bearing and usually human-led.
- This claim applies to early and mid-stage system tuning. Mature systems may partially automate parts of comprehension, but only after prior manual cycles have stabilized the taxonomy.
Practical implication
Evaluation pipelines should enforce explicit verifier-construction stage gates before optimization:
- Output-read pass completed on diverse inputs
- Failure taxonomy written from observed failures
- Judges calibrated on a hand-scored mini set
Without these gates, score improvements are weak evidence of capability improvement.
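One way to enforce the three gates above is a precondition check that refuses to start optimization until each gate is recorded as met. The sketch below is illustrative only: `VerifierGates`, `start_optimization`, and the default thresholds (30 outputs, 3 input categories) are assumptions introduced here, not values from any source.

```python
from dataclasses import dataclass


@dataclass
class VerifierGates:
    outputs_read: int = 0              # outputs inspected by hand
    input_categories_covered: int = 0  # distinct input types among them
    taxonomy_entries: int = 0          # failure modes written from observation
    judges_calibrated: bool = False    # judges checked against a hand-scored set

    def unmet(self, min_outputs: int = 30, min_categories: int = 3) -> list[str]:
        """Return the gates that are not yet satisfied (thresholds are assumed)."""
        missing = []
        if (self.outputs_read < min_outputs
                or self.input_categories_covered < min_categories):
            missing.append("output-read pass on diverse inputs")
        if self.taxonomy_entries == 0:
            missing.append("failure taxonomy from observed failures")
        if not self.judges_calibrated:
            missing.append("judges calibrated on a hand-scored set")
        return missing


def start_optimization(gates: VerifierGates) -> str:
    """Block the optimization loop while any verifier-construction gate is open."""
    missing = gates.unmet()
    if missing:
        raise RuntimeError("optimization blocked; unmet gates: " + "; ".join(missing))
    return "optimization started"


cold_start = VerifierGates()
ready = VerifierGates(
    outputs_read=40,
    input_categories_covered=4,
    taxonomy_entries=6,
    judges_calibrated=True,
)
print(cold_start.unmet())
print(start_optimization(ready))
```

The check only records that the manual work happened; it cannot judge whether the taxonomy is any good, which is why the calibration gate is listed separately.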
Relevant Notes:
- spec-mining-as-codification — grounds: converting observed failures into reusable evaluators is spec mining
- specification-strategy-should-follow-where-understanding-lives — extends: this is the evaluation-specific case where understanding emerges through observation, not upfront
- the-boundary-of-automation-is-the-boundary-of-verification — narrows: identifies an intra-loop boundary where optimization depends on prior verifier construction
- oracle-strength-spectrum — frames (provisional — target is speculative): the three phases can be read as a local oracle-hardening sequence before heavy automation
- error-correction-works-above-chance-oracles-with-decorrelated-checks — enables: calibration ensures judges have discriminative signal before amplification
- Ingest: Improving AI Skills with autoresearch & evals-skills — evidence: practitioner report where one team saw automation improve only after manual comprehension, taxonomy design, and judge calibration