Scheduler-LLM separation exploits an error-correction asymmetry
Type: note · Status: seedling
The bounded-context orchestration model separates symbolic scheduling from bounded LLM calls. This note develops a conjecture about why the separation works, grounded in observations about error correction.
LLMs fail at bookkeeping
ConvexBench demonstrates the failure directly: when LLMs track compositional depth (pure bookkeeping), F1 collapses from 1.0 to 0.2 at depth 100 despite only 5,331 tokens of context. The window wasn't full; the bookkeeping failed silently. Moving depth tracking to a symbolic layer restores F1 to 1.0 at all depths.
This is surprising. We expect machines to handle counting and state tracking reliably. Bookkeeping is what machines are good at — or so we assume.
Humans fail at bookkeeping too
But actually, humans exhibit the same failure. We cannot multiply large numbers in our heads, execute Towers of Hanoi algorithms mentally, or track deep recursion without external aids. We reach for pen and paper. Not because we can't reason — we can do sophisticated single-step judgments — but because our mental operations lack reliable intermediate state. The pen and paper provide checkable, correctable intermediate states that the mind does not.
The parallel is exact: LLMs, like humans, are powerful per-step reasoners that fail at extended bookkeeping. Both need an external substrate for reliable multi-step state tracking.
Symbolic systems work because they restore signals to discrete states
Why do symbolic systems — pen and paper, digital computers — succeed where minds and LLMs fail? Because they restore signals to discrete states at each step. A transistor doesn't need to be perfectly accurate — it just needs to be close enough that the signal can be snapped back to 0 or 1. Each operation has few valid states, so the system can detect and correct deviations before they propagate. Explicit error-correcting mechanisms like ECC memory, checksums, and parity bits are additional layers on top of this basic discretisation — they handle the cases where even the snap-to-discrete step might fail.
This is so fundamental we forget it's there. The reliability of digital systems isn't a property of the components — it's a property of the discrete-state restoration that happens at every step. Remove that restoration and digital systems become as unreliable as the analog components they're built from.
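The restoration argument can be made concrete with a toy simulation. This is illustrative only: the uniform noise model, the ±0.2 noise bound, and the 0.5 decision threshold are assumptions chosen for the sketch, not claims about real hardware.

```python
import random

def transmit(bit, steps, noise=0.2, restore=True, rng=None):
    """Pass a bit through `steps` noisy analog stages.

    Each stage perturbs the signal; with restore=True the signal is
    snapped back to the nearest valid state in {0, 1} after every
    stage, the way digital logic re-discretises voltages at each gate.
    """
    rng = rng or random.Random(0)
    signal = float(bit)
    for _ in range(steps):
        signal += rng.uniform(-noise, noise)  # analog imperfection
        if restore:
            # Restoration works because the per-step error (at most 0.2)
            # stays inside the decision margin (0.5): "close enough" to
            # snap back before errors can propagate.
            signal = 1.0 if signal >= 0.5 else 0.0
    return signal

restored = transmit(1, steps=1000, restore=True)   # stays exactly 1.0
drifted = transmit(1, steps=1000, restore=False)   # random walk away from 1.0
```

Raise the per-step noise above the decision margin and restoration fails too: snapping only helps when each individual step is close enough to correct, which is exactly the "close enough" condition in the transistor example above.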
MAKER shows LLMs can do bookkeeping with error correction
MAKER (Meyerson et al., 2025) is the key evidence. It achieves zero errors over 1,048,575 Towers of Hanoi steps using LLMs — by constraining each step's output to a small space (which disk to move where) and applying first-to-ahead-by-k voting across independent samples. This is exactly the same error-correction principle that makes digital systems work: limit the output space, run redundant checks, detect and discard outliers.
MAKER proves that LLMs can do bookkeeping reliably — when the output space is constrained and error correction is applied. The capability is there. What's normally missing is the error correction.
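The voting rule can be sketched in a few lines. This is a schematic of the general first-to-ahead-by-k scheme, not MAKER's actual implementation; the function name and the sample sequence are illustrative.

```python
from collections import Counter

def first_to_ahead_by_k(sample, k=2, max_samples=100):
    """Draw independent samples until one answer leads all others by k votes.

    `sample` is a zero-argument callable returning one candidate answer
    (e.g. one LLM call proposing the next move). Works because the output
    space is small and answers can be compared with a hard equality check.
    """
    votes = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        (top, top_n), *rest = votes.most_common()
        runner_up = rest[0][1] if rest else 0
        if top_n - runner_up >= k:
            return top
    return None  # no decision within budget: escalate or flag

# Deterministic demonstration: "A" reaches a 3-vote lead on the fifth sample.
seq = iter(["A", "B", "A", "A", "A"])
winner = first_to_ahead_by_k(lambda: next(seq), k=3)
```

The scheme assumes the samples are independent and errors are decorrelated; a systematic bias shared by all samples defeats the vote, which is why the constrained output space matters.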
But without that error correction, even clean separation degrades exponentially: per-step errors compound across steps. And without clean separation, per-step reliability degrades with accumulated context, making error correction progressively more expensive. Both are necessary. (The synthesis-is-not-error-correction distinction matters here: MAKER uses voting, not synthesis.)
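The exponential degradation is easy to quantify. The step count below is MAKER's; the 99.99% per-step accuracy is an assumed figure for illustration.

```python
import math

# Per-step success probability p compounds to p**n over n chained steps.
n = 1_048_575   # MAKER's Towers of Hanoi step count
p = 0.9999      # assumed: a 99.99%-reliable uncorrected step

chance_all_correct = p ** n                       # vanishingly small (~3e-46)
steps_to_coin_flip = math.log(0.5) / math.log(p)  # step count where success hits 50%
```

Even a 99.99%-reliable step gives essentially zero chance of a clean million-step run, and drops to a coin flip after roughly 7,000 steps. Per-step reliability cannot be pushed high enough by itself; some restoration or voting mechanism has to reset the error at each step.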
Semantic operations admit only weaker, more expensive error correction
Bookkeeping admits strong, cheap error correction because exact next states are available. A counter has one correct next value; a list append has one correct result. Hard oracles (exact equality checks) make voting trivially effective.
Rich semantic operations are not uncorrectable — the error correction framework explicitly allows softer checks like metamorphic tests, judge models, and cross-document consistency. But these oracles are weaker (a smaller TPR − FPR gap), more expensive (each check costs an LLM call), and harder to decorrelate (LLMs share systematic biases). The result is that semantic error correction, where it works at all, requires bespoke techniques tailored to the specific task — there are no general methods analogous to the discrete-state restoration that makes symbolic bookkeeping universally reliable.
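The cost of weaker oracles can be sketched with a toy model: treat each check as an independent binary verdict that is correct with probability q, and count how many checks a majority vote needs to reach a target reliability. The independence assumption is exactly the decorrelation that the note flags as hard to achieve for LLM judges, and the 99.9% target is chosen for illustration.

```python
from math import comb

def majority_correct(q, m):
    """Probability that a strict majority of m independent checks,
    each correct with probability q, reaches the right verdict."""
    return sum(comb(m, k) * q**k * (1 - q)**(m - k)
               for k in range(m // 2 + 1, m + 1))

def checks_needed(q, target=0.999, cap=10_001):
    """Smallest odd number of checks whose majority hits `target`."""
    for m in range(1, cap, 2):
        if majority_correct(q, m) >= target:
            return m
    return None

hard = checks_needed(0.999)  # near-perfect oracle: one check suffices
strong = checks_needed(0.9)  # good soft oracle: a handful of checks
weak = checks_needed(0.6)    # oracle near the TPR ≈ FPR boundary: hundreds
```

As q approaches 0.5 (no TPR − FPR gap), the required redundancy blows up — and when each check is itself an LLM call, that redundancy is paid in full at semantic-call prices. This is the cost gradient the conjecture below describes.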
This asymmetry in oracle strength and cost is why the separation matters. Not because semantic work can't be error-corrected, but because bookkeeping error correction is so much cheaper that mixing the two forces bookkeeping onto the same weak, expensive stochastic substrate as semantics — wasting resources on reliability that a symbolic machine provides for free.
Separation enables error correction; substrate follows from cost
The RLM architecture provides a striking limit case. The LLM writes whole programs — about as sophisticated as a single-step operation gets. Yet the call stack for recursion still runs in the REPL, not in the LLM. Even an LLM powerful enough to write correct recursive programs delegates execution bookkeeping to the symbolic layer.
Once bookkeeping is separated, you could run it on LLMs with voting (as MAKER does) — but that buys you nothing. A symbolic machine does the same work cheaper, faster, and deterministically correct without needing redundancy. The separation is the essential move; the choice of substrate is just a cost optimization.
The conjecture, stated
The effectiveness of separating symbolic scheduling from bounded LLM calls reflects an asymmetry in oracle strength and error-correction cost:
- Bookkeeping has narrow exact states → symbolic substrates or hard-oracle voting make reliability cheap
- Semantic work may still be checkable, but usually only with soft oracles and harder decorrelation — reliability is expensive and has a lower ceiling
- Mixing them forces bookkeeping onto the same weak stochastic substrate as semantics, paying high costs for reliability that a symbolic machine provides for free
- Separation lets each kind of work use the strongest available reliability mechanism — hard oracles for bookkeeping, soft oracles (where worthwhile) for semantics
The boundary is not a hard possible/impossible line but a cost gradient: as the output space grows richer and oracles grow softer, error correction becomes progressively more expensive relative to the reliability gained.
Status and scope
This is conjectural. The evidence (ConvexBench, MAKER, RLM, the human parallel) is consistent, but a precise characterization of the boundary — exactly which operations are "constrained enough" — remains open. The bounded-context orchestration model and its predictions stand independently of why the asymmetry exists; this note offers a candidate explanation, not a dependency.
Relevant Notes:
- bounded-context-orchestration-model — foundation: the scheduling model whose effectiveness this note explains
- error-correction-works-above-chance-oracles-with-decorrelated-checks — foundation: the general theory of error correction (TPR > FPR, decorrelated checks) this note applies to the scheduling boundary
- synthesis-is-not-error-correction — extends: MAKER's success depends on voting (error correction), not synthesis; the aggregation operation must match the decomposition
- rlm-achieves-the-clean-scheduler-model-but-opts-out-of-accumulation — evidence: even LLMs powerful enough to write recursive programs delegate execution bookkeeping to the symbolic layer
- llm-mediated-schedulers-are-a-degraded-variant-of-the-clean-model — consequence: the degraded variant fails because it mixes bookkeeping with semantic operations, defeating error correction on both
- context-efficiency-is-the-central-design-concern-in-agent-systems — context: the complexity dimension of context cost is related but distinct; this note identifies error correction as the mechanism beneath the complexity problem
Topics: