Codify-versus-LLM decision heuristics

Type: kb/types/note.md · Status: current · Tags: learning-theory

Should this component be deterministic code or an LLM call? The KB has accumulated heuristics for this question from several angles. This note collects them.

Four lenses on the same decision

The KB offers at least four framings. They often agree in practice but ask different questions, and it's not obvious they reduce to a single criterion.

1. Spec completeness — is the spec a definition or a theory?

The fixed-artifact distinction draws the line. Exact specs fully capture the problem — the specification of multiplication is multiplication. Deterministic code is pure win. Proxy theories approximate the problem — "detect edges" was a plausible theory of what seeing requires, not a definition. The component can satisfy its local spec and still fail to compose into the target capability.

Confidence signals:

  • Is correctness fully specifiable? (definition → codify)
  • Is the spec a definition or a proxy metric? (proxy → leave for LLM)
  • Are failures local or compositional? (compositional → the specs are probably theories)
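The distinction can be made concrete in code. Below is a minimal sketch (function names and the toy scoring scheme are illustrative, not from the KB): an exact spec, where the check is the definition of correctness, next to a proxy spec, where the check is only a theory about the target capability.

```python
def is_correct_multiply(fn) -> bool:
    # Exact spec: the specification of multiplication IS multiplication.
    # Passing this exhaustive check (over a small grid) means being correct there.
    return all(fn(a, b) == a * b for a in range(-5, 6) for b in range(-5, 6))

def edge_proxy_score(predicted, reference) -> float:
    # Proxy spec: pixel agreement with a reference edge map. A detector can
    # score well here and still fail to compose into the target capability.
    agree = sum(1 for p, r in zip(predicted, reference) if p == r)
    return agree / len(reference)
```

The first check leaves no room for compositional surprise; the second can be satisfied locally while the larger system still fails.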

2. Oracle strength — how cheaply can you verify correctness?

The oracle strength spectrum turns the binary into a gradient:

| Oracle      | Verification                                       | Codification fitness                                                  |
|-------------|----------------------------------------------------|-----------------------------------------------------------------------|
| Hard        | Exact, cheap, deterministic (tests, types, schema) | Natural codification candidate                                        |
| Soft        | Proxy score (BLEU, rubrics, heuristic checks)      | Partial — codify the proxy, leave the judgment                        |
| Interactive | Feedback available (user edits, preference pairs)  | Extract deterministic rules from the feedback over time (spec mining) |
| Delayed     | Signal arrives later (user churn, bug surfaces)    | Resist codification until signal accumulates                          |
| No oracle   | Vibes and anecdotes                                | Leave for LLM + human review                                          |
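The table is itself close to codifiable. A minimal sketch (the labels and return strings are illustrative, not an API from the KB):

```python
# Routing table for the oracle-strength spectrum above.
ORACLE_FITNESS = {
    "hard":        "codify",                 # exact, cheap, deterministic check
    "soft":        "codify-proxy-only",      # codify the proxy, leave the judgment
    "interactive": "mine-rules-over-time",   # extract rules from accumulated feedback
    "delayed":     "wait-for-signal",        # resist codification until signal accumulates
    "none":        "llm-plus-human-review",  # vibes and anecdotes only
}

def codification_fitness(oracle: str) -> str:
    """Map an oracle classification to a codification decision."""
    return ORACLE_FITNESS[oracle]
```

The hard part, of course, is classifying the oracle in the first place; that judgment stays with a human or an LLM.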

3. Interpretation space — does the spec admit one valid output or many?

The underspecification framing asks a different question. "Parse this YAML" has one correct output. "Refactor for readability" admits extract-helpers, rename-variables, restructure-control-flow, add-comments — all valid, qualitatively different.

Under this lens, codification is fundamentally about committing to one interpretation from a space the spec admits. Constraining narrows the space; codification collapses it to a point by crossing into executable code. The risk isn't wrong code — it's wrong commitment. If the problem genuinely has many valid interpretations, codifying one loses the others.
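One way to picture this lens, as a sketch: treat a spec as the set of outputs it admits. The toy table of variable names below is purely illustrative.

```python
import json

def valid_parses(src: str) -> list:
    # "Parse this JSON": exactly one correct output per input —
    # a singleton interpretation space.
    return [json.loads(src)]

def valid_names(role: str) -> list:
    # "Name this variable readably": several qualitatively different
    # valid outputs (toy table, purely illustrative).
    return {"loop counter": ["i", "idx", "index"]}.get(role, [])

def codified_name(role: str) -> str:
    # Codification collapses the space to a single point — and loses the rest.
    return valid_names(role)[0]
```

Collapsing a singleton space costs nothing; collapsing a genuinely plural one is the "wrong commitment" risk.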

4. Pattern stability — has this emerged across enough runs?

The codification definition adds a temporal dimension: "codify when a pattern has emerged across enough runs that you can confidently commit." The spec mining workflow operationalizes this — watch the system, identify regularities, extract deterministic checks.

This lens treats codification as empirical. You don't decide a priori what to codify. You observe what the LLM does repeatedly the same way, and extract that. The agent always lowercases filenames and replaces spaces with underscores, so you extract sanitize_filename(). The pattern emerged; the codification followed.
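The extracted artifact from that example is tiny, which is typical of mined patterns. A sketch of `sanitize_filename()` as described:

```python
def sanitize_filename(name: str) -> str:
    """Deterministic rule mined from repeated LLM behavior:
    lowercase the name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")
```

Once extracted, the rule no longer costs tokens and can never drift run to run.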

The operational strategy is progressive constraining: start underspecified for flexibility, commit to precise semantics as patterns stabilize.

Do the lenses reduce to one?

The four correlate but may not share a root:

  • Spec completeness is about the nature of the problem
  • Oracle strength is about how you check the output
  • Interpretation space is about how many valid answers exist
  • Pattern stability is about temporal evidence

A tempting reduction: spec completeness → single valid interpretation → cheap verification → stable pattern. If the spec fully captures the problem, there's one answer, you can verify cheaply, and the pattern is trivially stable. This makes spec completeness look foundational and the others downstream.

But the chain breaks at the edges. Some problems have cheap verification yet admit multiple valid outputs — sorting has a unique answer, but "good variable names" has several valid options even with a testable rubric. Some problems have stable patterns despite weak oracles — the LLM always formats dates the same way, extractable as code, even though "good formatting" is loosely specified.

There's also a directionality question. The oracle-strength lens treats verification cost as the driver: you codify because you can verify. The interpretation-space lens inverts this: you can verify because the spec admits only one output. "2+2=4" is verifiable because arithmetic has one answer, not because we built a test for it. Which is cause and which is consequence?

The KB doesn't yet have a settled answer. This is worth investigating rather than prematurely closing.

Quick-reference checklist

The lenses above explain why these heuristics work. This section distills them into decision prompts.

Codify when:

  • The spec fully captures the problem — there's one correct answer
  • You can write a test that fully specifies correct behavior
  • The LLM does the same thing every time — the pattern has stabilized
  • The spec describes what (output properties) rather than how (process steps)
  • The operation is being re-discovered by the LLM on every run at token cost

Leave for LLM when:

  • The spec is a theory about the problem, not a definition of it
  • Correctness requires human judgment or proxy scores
  • The problem genuinely admits multiple valid interpretations
  • The pattern hasn't stabilized — you're still learning what the right behavior is
  • The constraint encodes how rather than what — process rather than outcome

Reverse a codification (relax) when:

  • Brittleness under paraphrase or reordering (relaxing signals)
  • Isolation-vs-integration gap — unit tests pass but integration fails
  • Growing exception lists and special cases
  • Distribution shift breaks the codified component
  • Composition failure — individually sound components don't compose into the target capability (strongest signal, most expensive to discover)
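The checklists can be condensed into a single decision prompt. A hedged sketch, with made-up field names standing in for the signals above:

```python
from dataclasses import dataclass

@dataclass
class ComponentSignals:
    spec_is_definition: bool       # one correct answer, fully testable
    pattern_stable: bool           # LLM does the same thing every run
    constrains_process: bool       # encodes "how" rather than "what"
    compositional_failures: bool   # unit tests pass, integration fails

def decide(s: ComponentSignals) -> str:
    # Relaxing signals dominate: they indicate a wrong commitment already made.
    if s.compositional_failures or s.constrains_process:
        return "relax"
    # Codify only when the spec is a definition AND the pattern has emerged.
    if s.spec_is_definition and s.pattern_stable:
        return "codify"
    return "leave-for-llm"
```

The ordering encodes a bias from the note: composition failure is the strongest signal, so it overrides everything else.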

The hybrid case

Most real components are hybrids — part exact spec, part proxy theory. The practical move is to extract exact-spec subproblems into code and leave the rest for LLM.

The deterministic validation note is a worked example: most checks in /validate are hard-oracle (frontmatter structure, enum matching, link resolution → script) while the remaining few are soft-oracle (description quality, composability → stays in LLM skill).
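A sketch of what the hard-oracle half of such a split might look like. The field names, enum values, and rules below are hypothetical, not the KB's actual `/validate` implementation:

```python
REQUIRED_FIELDS = {"type", "status", "tags"}          # hypothetical schema
STATUS_ENUM = {"current", "draft", "superseded"}      # hypothetical enum

def hard_oracle_checks(frontmatter: dict) -> list:
    """Deterministic checks — frontmatter structure and enum matching.
    These belong in a script; each has an exact, cheap oracle."""
    errors = []
    missing = REQUIRED_FIELDS - frontmatter.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if frontmatter.get("status") not in STATUS_ENUM:
        errors.append(f"invalid status: {frontmatter.get('status')!r}")
    return errors

# Soft-oracle checks (description quality, composability) stay in the LLM skill.
```

Everything the script can reject deterministically never reaches the LLM, so the token cost is paid only for judgment calls.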

AgeMem, an RL-trained memory management agent, shows the same split. Its memory operations (Add, Delete, Retrieve) are exact-spec artifacts: their specs fully capture what they do. But the composition policy (when to use which) is a proxy theory that benefits from RL training.

Three common mistakes

  1. Over-codifying. Encoding "always decompose agents into three phases" as a hard rule when it's a theory about what works. Process constraints are relaxing candidates — they encode how rather than what.

  2. Under-codifying. Running everything through an LLM, including checks where deterministic code would be faster, cheaper, and perfectly reliable. The validation example costs real tokens for zero gain on the hard-oracle checks.

  3. Static allocation. Treating the code/LLM split as a one-time design decision rather than a continuous cycle of codification and relaxing as understanding evolves.


Relevant Notes: