Systematic prompt variation serves verification and diagnosis, not explanatory-reach testing

Type: note · Status: seedling · Tags: evaluation, llm-interpretation-errors, learning-theory

"Vary something and observe what changes" appears across multiple methodological contexts, but the underlying operations are distinct. This note groups the two main uses of controlled prompt variation as analysis:

  • verification — vary the prompt to create less-correlated checks or judges
  • diagnosis — vary the prompt while holding task semantics fixed to measure brittleness

Both vary what the model sees. Deutsch's explanatory-reach test does something different: it varies the explanation itself — change a premise and ask whether the conclusion changes predictably. That tests whether an idea captures causal structure. Prompt variation tests the behavior of the interpreter under alternative framings.

Verification: prompt variation as decorrelation machinery

In error correction works with above-chance oracles and decorrelated checks, "vary the prompt" is a way to manufacture independent signal from a soft oracle. A single model with a single framing shares the same bias across repetitions — naive voting just amplifies it. Rephrasing the question, changing framing, or applying metamorphic transformations breaks some of that correlation and makes aggregation more meaningful.

The primary success criterion here is not invariance but less-correlated signal. Disagreement across variants can be useful — the point is to avoid shared failure modes. Some verification methods, especially metamorphic checks, also use invariance as part of the signal ("if the answer changes under an equivalent transformation, something is wrong"). But the distinctive role of prompt variation in this section is that it creates multiple probes that do not all fail for the same reason.
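The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not a method from the source notes: `ask` stands in for a model call, and the `toy_ask` stub and its variants are invented to show how one biased framing gets outvoted by less-correlated ones.

```python
from collections import Counter

def vote_across_variants(question, variants, ask):
    """Pose the same check under several framings and aggregate.

    `ask(prompt) -> answer` is a hypothetical model call (any callable
    here). `variants` are prompt templates that rephrase the check so
    the probes do not all share one framing's bias."""
    answers = [ask(v.format(q=question)) for v in variants]
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return answer, agreement

# Toy stand-in for a soft oracle: biased under one framing,
# correct under the others.
def toy_ask(prompt):
    return "no" if prompt.startswith("Strictly") else "yes"

variants = [
    "Is the following claim correct? {q}",
    "Strictly speaking, is this true? {q}",
    "A colleague asserts: {q}. Do you agree?",
]
answer, agreement = vote_across_variants("2+2=4", variants, toy_ask)
# The biased "Strictly" framing is outvoted; agreement < 1 records
# that the probes disagreed, which is itself useful signal.
```

Note that a single framing repeated three times would have returned the same (possibly wrong) answer three times; the variants are what make the vote informative.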

Diagnosis: prompt variation as brittleness measurement

In operational signals that a component is a relaxing candidate, paraphrase and reordering tests are not trying to create independent judges. They ask whether a component is stable under semantically equivalent surface changes. The PromptSE ingest makes this concrete: emotion and personality prompt variants preserve the task while changing expression style, so performance shifts are interpreted as prompt sensitivity, not as evidence from multiple judges.

The success criterion here is invariance. If the task is unchanged, large output swings indicate the system is tracking surface cues instead of underlying structure — a diagnostic signal that a component is overfit to prompt format rather than task specification.

Deutsch's reach test varies the explanation, not the prompt

The reach note uses "can you vary it?" in a different sense. Deutsch's test asks: Can you change a premise in the explanation? Can you predict what changes in the conclusion? Does that reveal causal structure that transfers beyond the original case?

This is a quality test for ideas, not for model behavior. The desired result is structured sensitivity: if the explanation captures mechanism, changing one premise should change downstream predictions in an intelligible way. Neither stability under paraphrase nor decorrelated disagreement is the goal.

The three operations separate cleanly:

| Operation | What is varied | What is held fixed | What counts as success |
| --- | --- | --- | --- |
| Verification via prompt variation | framing of the check | the candidate answer and evaluation target | less-correlated signal |
| Diagnosis via prompt variation | surface form of the task | task semantics | stable behavior under equivalent variants |
| Reach testing | premises in the explanation | the standards of criticism | predictable downstream changes |

All three use controlled variation to learn something invisible from a single run, but the interpretation logic differs. Treating them as one method obscures what each result means.

Prompt ablation is adjacent but distinct

Prompt ablation converts human insight into deployable agent framing is a fourth nearby use: vary prompt framing against a known target to find which framing reliably elicits the desired reasoning. This is closest to optimization/search. It uses a hard target like verification, but the goal is not decorrelation. It measures behavioral robustness like diagnosis, but only relative to one human-verified finding. Prompt ablation selects a framing — it is not testing reach or classifying brittleness.
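The search character of prompt ablation can be made concrete with a sketch. All names here are hypothetical: `ask` stands in for a model call, `cases` for a small set of human-verified (input, expected) pairs, and `toy_ask` for a model that only reasons correctly under one framing.

```python
import re

def select_framing(framings, cases, ask):
    """Score each candidate framing by accuracy against known-good
    targets and return the best one. Unlike verification, the goal is
    selection, not decorrelation; unlike diagnosis, robustness is
    measured only relative to the verified cases."""
    def accuracy(framing):
        hits = sum(ask(framing.format(x=x)) == y for x, y in cases)
        return hits / len(cases)
    return max(framings, key=accuracy)

# Toy model: answers correctly only when the framing asks for
# step-by-step reasoning; otherwise falls back to a biased default.
def toy_ask(prompt):
    n = int(re.search(r"\d+", prompt).group())
    if "step by step" in prompt:
        return "even" if n % 2 == 0 else "odd"
    return "odd"

framings = [
    "Is {x} even or odd?",
    "Think step by step, then answer: is {x} even or odd?",
]
cases = [("4", "even"), ("7", "odd"), ("10", "even")]
best = select_framing(framings, cases, toy_ask)
```

The output is a framing to deploy, not a brittleness score or an aggregated verdict, which is what marks this as optimization rather than verification or diagnosis.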

Why the distinction matters

Without this separation, the results of prompt variation are easy to misread:

  • A diagnostic test could be mistaken for evidence aggregation, when disagreement actually signals brittleness.
  • A verification setup could be misjudged as instability, when disagreement is exactly what creates independent signal.
  • A reach test could be reduced to paraphrase robustness, when the point is to vary the mechanism and predict the consequences.

The common meta-pattern is controlled variation as an epistemic tool. The object of variation determines the epistemic role:

  • vary the prompt → learn about model robustness or judge correlation
  • vary the explanation → learn whether the idea has mechanistic reach

Relevant Notes: