Execution indeterminism is a property of the sampling process

Type: note · Status: seedling · Tags: llm-interpretation-errors

LLMs sample from probability distributions over tokens. The same prompt can produce different outputs across runs. This is a property of the execution engine — conceptually simpler than underspecification, and theoretically eliminable via deterministic decoding (temperature=0).

In practice, true determinism is hard to guarantee (floating-point non-determinism, batching effects, infrastructure changes) and may not be desirable — temperature > 0 helps explore reasoning paths, enables self-consistency techniques, and avoids degenerate repetitive outputs. All deployed systems exhibit indeterminism.

Why this matters as a distinct claim

Indeterminism is engineering noise — variation in how a chosen interpretation is executed, not variation in which interpretation is chosen. At temperature=0, the LLM still picks one interpretation from the space the spec admits; you just get the same one every time. This is why lowering temperature alone doesn't solve the "wrong interpretation" problem — it eliminates variation without ensuring the remaining interpretation is the one you wanted.
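A toy example, with entirely hypothetical numbers, makes the distinction concrete: imagine the model's distribution over readings of an underspecified instruction, and observe what greedy decoding does and does not fix.

```python
# Hypothetical distribution over interpretations of the underspecified
# instruction "sort the users". The numbers are illustrative, not measured.
interpretations = {
    "sort by name, ascending": 0.55,  # the model's modal reading
    "sort by signup date":     0.30,
    "sort by user id":         0.15,  # suppose THIS is what the user meant
}

def greedy_pick(dist: dict[str, float]) -> str:
    """Temperature-0 decoding: deterministically the modal interpretation."""
    return max(dist, key=dist.get)

runs = {greedy_pick(interpretations) for _ in range(100)}
# runs == {"sort by name, ascending"}: zero variance across runs,
# yet the single surviving interpretation is still not the intended one.
```

Determinism shrinks the output set to one element; it does nothing to move that element toward the interpretation the spec author had in mind.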

Counterintuitively, indeterminism obscures the deeper issue of underspecification. Because outputs vary across runs, people attribute the variation to randomness ("it's stochastic") and reach for familiar tools: temperature tuning, retries, sampling strategies. This framing avoids confronting the real difference from traditional programming: the specification language doesn't have precise semantics.

The remedy is sampling control: temperature adjustment, deterministic decoding, best-of-N selection. These address run-to-run variation but leave both underspecification and interpretation error untouched.
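Best-of-N by majority vote (the core of self-consistency) can be sketched as follows. The `generate` callable and `fake_generate` stand-in are hypothetical placeholders for a sampled LLM call, not a real API.

```python
import collections
import random

def self_consistency(generate, prompt: str, n: int, rng) -> str:
    """Best-of-N by majority vote: sample n completions, keep the most common.

    This controls run-to-run variance, but the vote is taken over the
    model's own interpretation distribution -- an underspecified prompt
    still elects whichever interpretation the model favors, intended or not.
    """
    answers = [generate(prompt, rng) for _ in range(n)]
    return collections.Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a temperature>0 LLM call: answers vary by run.
def fake_generate(prompt: str, rng: random.Random) -> str:
    return rng.choices(["42", "forty-two"], weights=[0.7, 0.3], k=1)[0]

majority = self_consistency(fake_generate, "what is 6 * 7?", n=25, rng=random.Random(0))
```

The vote suppresses sampling noise, which is exactly why it is a remedy for indeterminism and not for underspecification: it converges on the modal answer, not the intended one.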


Relevant Notes:

Sources:

  • Ma et al. (2026). Prompt Stability in Code LLMs — cleanest empirical separation of indeterminism from underspecification: by varying prompt framing (emotion/personality) while holding task constant, they isolate the effect of interpretation choice from run-to-run sampling noise