Evaluation automation is phase-gated by comprehension

Type: kb/types/note.md · Status: seedling · Tags: learning-theory, llm-interpretation-errors, evaluation

When an evaluation loop improves score without improving real behavior, the failure is usually not weak search but an objective too weakly grounded in observed failures. In practice, evaluation automation follows a characteristic sequence: comprehension first, specification second, generalization third.

Comprehension is the first gate because it supplies the observations that specification turns into verifiers. Before automation can improve output quality, the system needs direct evidence of real failures, a way to identify concrete failure modes, and a route for turning those failures into discriminative judges.

The three phases

  1. Comprehension: Read outputs directly, observe where and why the system fails, build non-theoretical intuition for failure patterns.
  2. Specification: Convert observations into a failure taxonomy and evaluators, then calibrate those evaluators against manually labeled examples.
  3. Generalization: Run automated optimization against calibrated evaluators with broader input coverage.
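The calibration step in phase 2 can be made concrete with a small sketch: compare an automated judge's verdicts against hand-labeled examples and only accept the judge once agreement clears a threshold. All names here (`calibrate_judge`, `min_agreement`, the toy judge) are illustrative assumptions, not a prescribed implementation.

```python
def calibrate_judge(judge, labeled_examples, min_agreement=0.8):
    """Measure agreement between an automated judge and human labels.

    judge: callable mapping an output to a predicted failure mode (or None).
    labeled_examples: list of (output, human_label) pairs.
    Returns (agreement_rate, passed_calibration).
    """
    agreements = sum(
        1 for output, human_label in labeled_examples
        if judge(output) == human_label
    )
    rate = agreements / len(labeled_examples)
    return rate, rate >= min_agreement

# Usage: a toy judge that flags outputs containing "TODO" as incomplete.
examples = [
    ("done", None),
    ("TODO: finish", "incomplete"),
    ("ok", None),
]
rate, passed = calibrate_judge(
    lambda o: "incomplete" if "TODO" in o else None, examples
)
```

The point of the sketch is the dependency direction: the labeled examples come from the comprehension phase, so the judge is checked against observed reality rather than assumed correct.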

This sequencing matches a practitioner pattern described in one detailed field report: auto-generated tests and judges produced early score gains, then degraded real quality exposed that the objective was wrong. The loop functioned correctly; the objective did not.

Why this is a gate, not a style preference

Skipping comprehension leaves specification unconstrained by observed reality. Skipping specification leaves optimization unconstrained by discriminative checks. Both cases amplify proxy quality rather than task quality.

This is why "more automation" cannot reliably substitute for the early verifier-construction work in cold-start or subjective domains. Automation can help once failure patterns and judges exist, but it cannot safely construct them from zero context.

Meta-Harness shows an important boundary condition. In hard-oracle domains, rich diagnostic access can automate part of comprehension: the proposer can inspect raw execution traces, prior harness code, and scores to infer why candidates failed before writing the next candidate. The phase gate is not "a human must always understand first." It is "optimization needs enough diagnostic access to form a causal failure model before it generalizes." Scores alone do not supply that model, and Meta-Harness's ablation suggests summaries may not preserve it either.

Scope limits

  • In hard-oracle domains (compilers, strict schemas, deterministic tests), comprehension can be shorter or partly automated when the proposer has rich diagnostic traces, not just scalar scores.
  • In soft-oracle domains (writing quality, strategic reasoning, product judgment), comprehension is load-bearing and usually human-led.
  • This claim applies to early and mid-stage system tuning. Mature systems may partially automate parts of comprehension, but only after prior manual cycles have stabilized the taxonomy.

Practical implication

Evaluation pipelines should enforce explicit verifier-construction stage gates before optimization:

  1. Output-read pass completed on diverse inputs
  2. Failure taxonomy written from observed failures
  3. Judges calibrated on a hand-scored mini set
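The three gates above can be enforced mechanically: optimization refuses to run until every gate is recorded as complete. This is a minimal sketch; the gate names and the `run_optimization` stub are assumptions for illustration, not part of the source.

```python
# The three verifier-construction gates, in order.
GATES = (
    "output_read_pass",   # outputs read directly on diverse inputs
    "failure_taxonomy",   # taxonomy written from observed failures
    "judges_calibrated",  # judges checked against a hand-scored mini set
)

def ready_for_optimization(completed):
    """True only when every comprehension/specification gate is done."""
    return all(gate in completed for gate in GATES)

def run_optimization(completed):
    """Refuse to optimize until all gates pass."""
    missing = [g for g in GATES if g not in completed]
    if missing:
        raise RuntimeError(f"blocked: missing gates {missing}")
    # ... proceed with automated optimization against calibrated judges
```

The design choice worth noting is that the check is a hard block, not a warning: skipping a gate silently is exactly how score improvements become weak evidence.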

Without these gates, score improvements are weak evidence of capability improvement.


Relevant Notes: