Evaluation
Type: index · Status: current
What works, what doesn't, and what needs testing: empirical observations about KB operations, prompt design, and techniques from other systems.
Notes
- cludebot — techniques from cludebot worth borrowing; the richest trajectory-to-lesson loop reviewed so far
- prompt-ablation-converts-human-insight-to-deployable-framing — methodology for testing prompt framings
- systematic-prompt-variation-serves-verification-and-diagnosis-not-explanatory-reach-testing — controlled variation as a family of methods: decorrelating checks, measuring brittleness, and distinguishing both from Deutsch-style reach review
- brainstorming-how-to-test-whether-pairwise-comparison-can-harden-soft-oracles — experimental ladder for comparing scalar and pairwise judges before treating pairwise ranking as a stronger soft oracle (a minimal sketch of the scalar-vs-pairwise contrast follows this list)
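
A minimal sketch of the contrast the last note is about, not the note's actual protocol: a scalar judge scores each output in isolation, a pairwise judge only decides which of two outputs is better, and the lowest rung of an experimental ladder checks each against a small human-labelled preference set before treating either as a soft oracle. All names and the toy judges are hypothetical.

```python
# Hypothetical sketch: compare a scalar judge and a pairwise judge against
# a small set of human-labelled preferences. A real run would replace the
# toy judges with LLM calls.

from typing import Callable, List, Tuple

ScalarJudge = Callable[[str], float]        # output -> score
PairwiseJudge = Callable[[str, str], int]   # (a, b) -> 0 if a preferred, else 1

def scalar_agreement(judge: ScalarJudge, prefs: List[Tuple[str, str, int]]) -> float:
    """Fraction of labelled pairs where the higher-scored output matches the label."""
    hits = sum((judge(a) >= judge(b)) == (label == 0) for a, b, label in prefs)
    return hits / len(prefs)

def pairwise_agreement(judge: PairwiseJudge, prefs: List[Tuple[str, str, int]]) -> float:
    """Fraction of labelled pairs where the pairwise verdict matches the label."""
    hits = sum(judge(a, b) == label for a, b, label in prefs)
    return hits / len(prefs)

if __name__ == "__main__":
    # Toy stand-ins: "longer is better". Only meant to show the two interfaces.
    def toy_scalar(text: str) -> float:
        return float(len(text))

    def toy_pairwise(a: str, b: str) -> int:
        return 0 if len(a) >= len(b) else 1

    gold = [("thorough answer", "ok", 0), ("meh", "careful, sourced answer", 1)]
    print("scalar agreement:  ", scalar_agreement(toy_scalar, gold))
    print("pairwise agreement:", pairwise_agreement(toy_pairwise, gold))
```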
Other tagged notes
- Elicitation requires maintained question-generation systems — Four elicitation strategies ordered by user expertise required, composable into review architectures with maintenance loops that prevent ossification
- Evaluation automation is phase-gated by comprehension — Optimization loops require manual error analysis and judge calibration before automation can improve behavior rather than just the score
- Knowledge storage does not imply contextual activation — Distinguishes stored knowledge (retrievable on direct probe) from contextually activated knowledge (brought to bear during task execution without being directly queried); formalizes the activation gap and the expertise gap