Spec mining is codification's operational mechanism

Type: note · Status: current · Tags: learning-theory

Codification says knowledge hardens into repo artifacts — tests, specs, conventions. But where do those artifacts come from? One answer: you mine them from observed behavior.

The pattern

  1. Watch the system do tasks (or watch humans do tasks the system will do).
  2. Identify repeated micro-actions: parsing dates, normalising names, mapping intents to actions, detecting escalation triggers.
  3. Extract those regularities into deterministic artifacts: functions, schema rules, unit tests, checkers.
  4. Re-run with these constraints in place. The system becomes more reliable without weight updates.
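A minimal sketch of steps 2–4, assuming the mined regularity is date normalisation (one of the micro-actions listed above); the function and formats are illustrative, not from the source:

```python
import datetime
import re

# Hypothetical mined spec: observed runs show the system reliably
# normalises dates like "3 Jan 2024" or "2024-01-03" to ISO format.
# Step 3 extracts that regularity into a deterministic helper.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def normalise_date(text: str) -> str:
    """Deterministic replacement for a behaviour the model repeated."""
    text = text.strip().lower()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m:
        y, mo, d = map(int, m.groups())
        return datetime.date(y, mo, d).isoformat()
    m = re.fullmatch(r"(\d{1,2})\s+([a-z]{3})[a-z]*\s+(\d{4})", text)
    if m:
        d, mon, y = m.groups()
        return datetime.date(int(y), MONTHS[mon], int(d)).isoformat()
    raise ValueError(f"outside mined spec: {text!r}")

# Step 3's other artifact: unit tests that pin the regularity down.
assert normalise_date("3 Jan 2024") == "2024-01-03"
assert normalise_date("2024-01-03") == "2024-01-03"
```

Step 4 then routes date handling through this helper instead of the model, so that slice of behaviour becomes deterministic.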

This is codification as compilation: the system distills stochastic regularities into deterministic code. The output is an inspectable substrate — reviewable, testable, revertable artifacts rather than opaque weight updates. Inspectability is what makes mined specs falsifiable: you can test them under distribution shift and relax them back if they break.

The same pattern appears at the methodology level: the maturation trajectory from instruction to script is spec mining applied to methodology rather than system behavior. The codification trigger ("a pattern has emerged from repeated execution") is the same observation step.

Why this matters for the bitter lesson boundary

The bitter lesson boundary says calculators survive scaling because the spec is the problem. Spec mining manufactures new calculators by discovering specs that were implicit in behavior. Each mined spec converts a piece of the blurry zone into the calculator regime.

This connects to the oracle strength spectrum: spec mining moves components from the soft/delayed-oracle regime toward the hard-oracle regime. A pattern that was only checkable by "does the output look right?" becomes checkable by "does this match the extracted rule?" Each mined spec is also a new oracle that error correction can amplify through decorrelated checks. The progression: mine a spec, creating an oracle whose true-positive rate exceeds its false-positive rate (TPR > FPR), then amplify it through decorrelated repetition. This design philosophy ("out-evaluate, not out-implement") is what the harness engineering as cybernetics thread calls "externalizing system-specific judgment."
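The amplification step can be sketched with a binomial majority vote, assuming the decorrelated checks are independent (exactly the condition the text names); the rates are made up for illustration:

```python
from math import comb

def majority_fire_prob(p: float, k: int) -> float:
    """Probability that a strict majority of k independent checks fire,
    given each fires with probability p (binomial tail)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range((k // 2) + 1, k + 1))

# Hypothetical weak oracle: fires on real defects 70% of the time (TPR)
# but also fires on good outputs 30% of the time (FPR).
tpr, fpr = 0.7, 0.3
for k in (1, 5, 15):
    # Majority voting widens the TPR/FPR gap as k grows,
    # provided the checks really are decorrelated.
    print(k, round(majority_fire_prob(tpr, k), 3),
          round(majority_fire_prob(fpr, k), 3))
```

The gap only widens if TPR > FPR to begin with, which is why the spec must be mined first: amplification cannot rescue an oracle that is no better than chance.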

Concrete workflow

For an agentic system:

  1. Cluster failure modes from production logs.
  2. For the top clusters, ask: is there a deterministic rule that would have caught this?
  3. If yes → write a verifier or deterministic helper (codify).
  4. If no → the failure mode stays in the learned regime, but you now have a regression test (partial codification).
  5. Repeat. The calculator surface grows monotonically.
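The clustering step can be sketched as follows; the log lines and the masking heuristic are hypothetical, standing in for whatever production emits:

```python
import re
from collections import Counter

# Hypothetical production failure lines (input to the clustering step).
failures = [
    "ToolError: date '31/02/2024' not parseable",
    "ToolError: date '2024-13-01' not parseable",
    "EscalationMissed: refund over limit not flagged",
    "ToolError: date 'next tuesday' not parseable",
]

def signature(line: str) -> str:
    # Mask the specifics so lines cluster by shape, not by payload.
    return re.sub(r"'[^']*'", "'<ARG>'", line)

clusters = Counter(signature(f) for f in failures)
top, count = clusters.most_common(1)[0]
print(top, count)  # the cluster to interrogate next: is a rule possible?
```

For the top cluster here, the answer is yes (date validity is deterministically checkable), so it gets a verifier; the escalation miss may stay in the learned regime with only a regression test.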

The Codex team's harness engineering report (Lopopolo, 2026) documents this workflow at production scale. Early on, 20% of engineering time (Fridays) went to manually cleaning "AI slop" — observing failure patterns. The team then codified those observations into structural tests and linter rules whose error messages teach the fix, and finally automated the observation step itself with background cleanup agents that scan for drift and open refactoring PRs. The progression — manual observation, extracted rules, automated monitoring — is the spec mining loop completing.

Risks

  • Mining specs from observed behavior can encode biases or accidents as rules. The mined spec might be a "vision feature" — a plausible theory that scale will eventually outperform.
  • Mitigation: mined specs should be falsifiable. If they break under distribution shift or metamorphic testing, they're candidates for relaxing, not permanent codification. The relaxing-signals note identifies the specific indicators (paraphrase sensitivity, distribution-shift brittleness) that reveal a mined spec was a vision feature, not a calculator.
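A minimal paraphrase-sensitivity check, with a deliberately surface-keyed rule and made-up inputs to show the failure the mitigation is looking for:

```python
# Metamorphic check on a mined rule: the verdict should be invariant
# under meaning-preserving rewrites. If it flips, the "spec" was keyed
# to surface form, which is a relaxing signal, not a calculator.

def mined_rule(text: str) -> bool:
    # Hypothetical mined spec: "escalate if the user mentions a refund".
    return "refund" in text.lower()

variants = [
    "I want a refund for this order.",
    "Please give me my money back for this order.",  # paraphrase
]

verdicts = {mined_rule(v) for v in variants}
if len(verdicts) > 1:
    print("paraphrase-sensitive: candidate for relaxing, not codification")
```

Here the rule flips on the paraphrase, so it fails the check: the regularity was lexical, not semantic, and should be demoted rather than permanently codified.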

Open questions

  • What's the right threshold for codifying a mined pattern? Too early and you lock in a vision feature; too late and you miss easy reliability wins.
  • Can spec mining be automated? LLMs could propose candidate rules from failure clusters, then validation suites confirm or reject them. The automating KB learning note explores a related version: the "boiling cauldron" mutations (extract, relink, synthesise) are spec mining applied to knowledge structure rather than system behavior.

Sources:

  • Lopopolo (2026). Harness engineering: leveraging Codex in an agent-first world — production-scale spec mining: manual failure observation → structural tests → automated cleanup agents.

Relevant Notes: