Programming practices apply to prompting

Type: note · Status: speculative

Much of what we do in llm-do and this knowledge base is applying established programming practices to prompts, documents, and LLM workflows. The transfers are practical and actionable. Two properties of LLM-based systems — semantic underspecification and execution indeterminism — make some practices harder than their traditional-programming originals, but in distinct ways.

Practices we apply

Typing. We assign types to documents to mark what operations they afford — a claim can be verified, a spec can be implemented, instructions can be followed. This is the same practice as typing values in code: the type determines valid operations. The verifiability criterion ensures types do real work — a type that doesn't enable specific operations is noise.
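As a minimal sketch of this idea (the type names and `check_operation` helper are hypothetical, not part of any llm-do API): a document's type is a lookup key into the set of operations it affords, and anything outside that set is rejected.

```python
# Hypothetical mapping from document type to the operations it affords.
AFFORDANCES = {
    "claim": {"verify"},
    "spec": {"implement", "verify"},
    "instructions": {"follow"},
}

def check_operation(doc_type: str, operation: str) -> bool:
    """Return True if this operation is valid for this document type."""
    return operation in AFFORDANCES.get(doc_type, set())

print(check_operation("claim", "verify"))     # True
print(check_operation("claim", "implement"))  # False: a claim affords no implementation
```

A type with an empty or catch-all affordance set would fail the verifiability criterion: it enables no specific operation, so it does no work.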

Progressive compilation. We stabilise LLM behaviour into code as patterns emerge — the same move as compiling: freezing a flexible representation into a rigid, efficient one. agentic systems interpret underspecified instructions frames this explicitly. The verifiability gradient maps the spectrum from prompt tweaks through evals to deterministic modules. Unlike compilation, stabilisation is projection from an underspecified spec — the natural-language specification admits multiple valid implementations, and committing to code means choosing one interpretation and fixing it in a language with precise semantics. Indeterminism adds noise on top (different runs may surface different interpretations), but the deeper operation is resolving the semantic ambiguity. The same pattern applies to methodology enforcement — written instructions compile into skills, then hooks, then scripts — with the added insight that not all methodology should complete the trajectory.
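One way to picture stabilisation as projection (a sketch with hypothetical function names; the `llm_call` parameter stands in for any model invocation): the deterministic module freezes one interpretation of an underspecified instruction like "extract the date", and the flexible interpreter remains the fallback for inputs the frozen interpretation doesn't cover.

```python
import re

def extract_date_stabilised(text: str):
    """Compiled path: one fixed interpretation ('date' means an ISO yyyy-mm-dd)."""
    m = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)
    return m.group(0) if m else None

def extract_date(text: str, llm_call=None):
    """Prefer the frozen interpretation; fall back to the flexible interpreter."""
    result = extract_date_stabilised(text)
    if result is not None:
        return result
    if llm_call is not None:
        # The LLM may resolve the ambiguity differently on each run.
        return llm_call(f"Extract the date from: {text}")
    return None

print(extract_date("Shipped on 2024-03-15."))  # 2024-03-15
```

The regex is the committed choice: it resolves the ambiguity ("which date formats count?") once, in a language with precise semantics, where the fallback path re-resolves it on every run.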

Testing. We test prompts and templates the way we test code. But LLM-based systems are harder to test, and the two phenomena create different challenges. Execution indeterminism means the same input produces different outputs across runs — you need statistical testing over distributions, not assertion equality. Semantic underspecification means the spec itself admits multiple valid interpretations — you need to test whether the instructions are sufficiently constraining, not just whether individual outputs look right. The first challenge requires running N times and checking the distribution; the second requires inspecting the spec for ambiguity, consistency, and sufficient constraint. The text testing pyramid sketches what this looks like concretely: deterministic checks at the base, LLM rubric grading in the middle, corpus compatibility at the top.
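The lower two tiers of that pyramid can be sketched as a fail-fast pipeline (function names are illustrative; the `grader` callable is a hypothetical stand-in for an LLM rubric-grading call):

```python
def deterministic_checks(output: str) -> bool:
    """Base tier: cheap, exact assertions on structure."""
    return output.strip() != "" and len(output) < 2000 and "TODO" not in output

def rubric_grade(output: str, rubric: str, grader=None) -> bool:
    """Middle tier: an LLM grades the output against a rubric.
    `grader` is a hypothetical callable returning 'pass' or 'fail'."""
    if grader is None:
        return True  # tier skipped when no grader is wired in
    return grader(rubric, output) == "pass"

def run_pyramid(output: str, rubric: str, grader=None) -> bool:
    # Fail fast on the cheap tier before spending on the expensive one.
    return deterministic_checks(output) and rubric_grade(output, rubric, grader)
```

The ordering is the point: deterministic checks are free and catch gross failures, so rubric grading (and, above it, corpus compatibility) only runs on outputs that already pass the base.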

Version control. We version prompts, templates, and knowledge artifacts in git, treating them as source code. Storing a specific LLM output resolves the underspecification to a fixed interpretation — freezing one concrete value from the space the spec admits. Versioning the spec matters because regeneration is a new projection from the same underspecified spec — potentially a different interpretation, not a deterministic rebuild.

Design for testability. Crystallisation chooses repo artifacts as the substrate specifically because they're inspectable — any agent can diff, test, and verify them. Testability as a design property, applied to LLM output.

The hard cases

The difficulty is relative to the comparison domain. Compared to code (precise semantics), prompts are harder to test and verify. But compared to natural language specifications interpreted by humans, the challenges are familiar — legal drafting has centuries of methodology for managing the same underspecification. Execution indeterminism is genuinely novel; underspecification is not. The real difference is that humans accumulate judgment through experience — a developer reads the wiki, internalizes it, eventually transcends it. Agents are permanently stateless: always a newbie, always needing the skill loaded. The practices transfer, but what's a pedagogical convenience for humans is architectural necessity for agents.

Testing is the clearest example: LLM-based systems double the testing surface, and the two halves come from different phenomena.

Indeterminism doubles the test runs. The same prompt produces different outputs across runs due to sampling, so you can't assert equality — you test the distribution. This requires statistical techniques (run N times, check pass rates, set confidence thresholds) where traditional code needs a single assertion.
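A minimal version of that statistical loop (the cycling fake generator simulates sampling noise and stands in for a real LLM call; `passes` is a placeholder predicate):

```python
from itertools import cycle

def passes(output: str) -> bool:
    """Deterministic predicate applied to each sampled output."""
    return "summary" in output.lower()

def pass_rate(generate, n: int = 20) -> float:
    """Run the indeterministic generator n times and measure the pass rate."""
    return sum(passes(generate()) for _ in range(n)) / n

# Simulated generator: 9 conforming outputs per 10, standing in for sampling noise.
samples = cycle(["Summary: ..."] * 9 + ["I refuse."])
rate = pass_rate(lambda: next(samples), n=50)
print(f"pass rate: {rate:.0%}")  # 90%: passes an 80% gate, fails a 95% one
```

The single assertion of traditional testing becomes a threshold on a rate, and the threshold itself is a design decision: how much sampling noise the workflow tolerates.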

Underspecification doubles the test targets. In deterministic code there's no gap between what the code says and what it does, so you only test outputs. With natural-language specs, the instructions admit multiple valid interpretations — you need to test the instructions themselves (are they consistent? unambiguous? sufficiently constraining?) as well as the outputs. A prompt that consistently produces unwanted behavior isn't exhibiting noise; it's exhibiting a valid interpretation you didn't intend. The fix is rewriting the spec, not retrying.
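A toy structural check on the spec itself (the term lists and `lint_spec` helper are hypothetical illustrations, not a real lint): it inspects the instructions for contradiction and underconstraint without running a single generation.

```python
# Hypothetical spec lint: checks the instructions, not the outputs.
CONTRADICTORY_PAIRS = [("always", "never"), ("must", "optional")]
VAGUE_TERMS = ["appropriate", "as needed", "etc."]

def lint_spec(spec: str) -> list[str]:
    issues = []
    lowered = spec.lower()
    for a, b in CONTRADICTORY_PAIRS:
        if a in lowered and b in lowered:
            issues.append(f"possible contradiction: '{a}' vs '{b}'")
    for term in VAGUE_TERMS:
        if term in lowered:
            issues.append(f"underconstrained term: '{term}'")
    return issues

print(lint_spec("Always cite sources. Never cite sources. Format as appropriate."))
```

Real spec testing needs semantic judgment a keyword scan can't provide, but the shape is the same: the test target is the instructions, and a finding means rewriting the spec, not rerunning the model.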

The two phenomena compound: you're testing an underspecified specification executed by an indeterministic engine. Each requires different techniques — statistical testing for indeterminism, structural analysis for underspecification — and conflating them leads to misdiagnosis.

Why the practices transfer

Both domains solve the same problems: making behaviour predictable, making systems composable, making artifacts verifiable. The underlying concepts (type theory, compilation, contracts) explain why a practice works in both settings. Thalo demonstrates the endpoint: a system that built a full compiler (Tree-Sitter grammar, LSP, 27 validation rules) for knowledge management, taking typing and testing to their logical extreme. Crystallisation systematises these transfers — the accumulated prompt adjustments, output post-processing, and workflow changes that every deployed system accumulates are exactly these programming practices applied informally. The motivation is practical — these are things we do, not abstractions we admire.

Open questions

  • What other programming practices haven't been applied yet but could be? (Code review for prompts? Dependency injection for context? Refactoring patterns?)
  • Where do the practices break down — which ones mislead when applied to systems with underspecified instructions?
  • Can we develop prompt-native practices that have no programming equivalent?

Relevant Notes:

Topics: