Scheduler-LLM separation exploits an error-correction asymmetry
Type: note · Status: seedling
The bounded-context orchestration model separates symbolic scheduling from bounded LLM calls. This note develops a conjecture about why the separation works, grounded in observations about error correction.
LLMs fail at bookkeeping
ConvexBench demonstrates the failure directly: when LLMs track compositional depth (pure bookkeeping), F1 collapses from 1.0 to 0.2 at depth 100 despite only 5,331 tokens of context. The window wasn't full; the bookkeeping failed silently. Moving depth tracking to a symbolic layer restores F1 to 1.0 at all depths.
This is surprising. We expect machines to handle counting and state tracking reliably. Bookkeeping is what machines are good at — or so we assume.
Humans fail at bookkeeping too
But actually, humans exhibit the same failure. We cannot multiply large numbers in our heads, execute Towers of Hanoi algorithms mentally, or track deep recursion without external aids. We reach for pen and paper. Not because we can't reason — we can do sophisticated single-step judgments — but because our mental operations lack reliable intermediate state. The pen and paper provide checkable, correctable intermediate states that the mind does not.
The parallel is exact: LLMs, like humans, are powerful per-step reasoners that fail at extended bookkeeping. Both need an external substrate for reliable multi-step state tracking.
Symbolic systems work because they restore signals to discrete states
Why do symbolic systems — pen and paper, digital computers — succeed where minds and LLMs fail? Because they restore signals to discrete states at each step. A transistor doesn't need to be perfectly accurate — it just needs to be close enough that the signal can be snapped back to 0 or 1. Each operation has few valid states, so the system can detect and correct deviations before they propagate. Explicit error-correcting mechanisms like ECC memory, checksums, and parity bits are additional layers on top of this basic discretisation — they handle the cases where even the snap-to-discrete step might fail.
This is so fundamental we forget it's there. The reliability of digital systems isn't a property of the components — it's a property of the discrete-state restoration that happens at every step. Remove that restoration and digital systems become as unreliable as the analog components they're built from.
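The restoration argument can be made concrete with a toy simulation. This is illustrative only: the uniform noise model, the ±0.2 noise bound, and the 0.5 decision threshold are assumptions chosen for the sketch, not claims about real hardware.

```python
import random

def transmit(bit, steps, noise=0.2, restore=True, rng=None):
    """Pass a bit through `steps` noisy analog stages.

    Each stage perturbs the signal; with restore=True the signal is
    snapped back to the nearest valid state in {0, 1} after every
    stage, the way digital logic re-discretises voltages at each gate.
    """
    rng = rng or random.Random(0)
    signal = float(bit)
    for _ in range(steps):
        signal += rng.uniform(-noise, noise)  # analog imperfection
        if restore:
            # Restoration works because the per-step error (at most 0.2)
            # stays inside the decision margin (0.5): "close enough" to
            # snap back before errors can propagate.
            signal = 1.0 if signal >= 0.5 else 0.0
    return signal

restored = transmit(1, steps=1000, restore=True)   # stays exactly 1.0
drifted = transmit(1, steps=1000, restore=False)   # random walk away from 1.0
```

Raise the per-step noise above the decision margin and restoration fails too: snapping only helps when each individual step is close enough to correct, which is exactly the "close enough" condition in the transistor example above.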
MAKER shows LLMs can do bookkeeping with error correction
MAKER (Meyerson et al., 2025) is the key evidence. It achieves zero errors over 1,048,575 Towers of Hanoi steps using LLMs — by constraining each step's output to a small space (which disk to move where) and applying first-to-ahead-by-k voting across independent samples. This is exactly the same error-correction principle that makes digital systems work: limit the output space, run redundant checks, detect and discard outliers.
MAKER proves that LLMs can do bookkeeping reliably — when the output space is constrained and error correction is applied. The capability is there. What's normally missing is the error correction.
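The voting rule can be sketched in a few lines. This is a schematic of the general first-to-ahead-by-k scheme, not MAKER's actual implementation; the function name and the sample sequence are illustrative.

```python
from collections import Counter

def first_to_ahead_by_k(sample, k=2, max_samples=100):
    """Draw independent samples until one answer leads all others by k votes.

    `sample` is a zero-argument callable returning one candidate answer
    (e.g. one LLM call proposing the next move). Works because the output
    space is small and answers can be compared with a hard equality check.
    """
    votes = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        (top, top_n), *rest = votes.most_common()
        runner_up = rest[0][1] if rest else 0
        if top_n - runner_up >= k:
            return top
    return None  # no decision within budget: escalate or flag

# Deterministic demonstration: "A" reaches a 3-vote lead on the fifth sample.
seq = iter(["A", "B", "A", "A", "A"])
winner = first_to_ahead_by_k(lambda: next(seq), k=3)
```

The scheme assumes the samples are independent and errors are decorrelated; a systematic bias shared by all samples defeats the vote, which is why the constrained output space matters.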
But without that error correction, even clean separation degrades exponentially: per-step errors compound across steps. And without clean separation, per-step reliability degrades with accumulated context, making error correction progressively more expensive. Both are necessary. (The synthesis-is-not-error-correction distinction matters here: MAKER uses voting, not synthesis.)
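The exponential degradation is easy to quantify. The step count below is MAKER's; the 99.99% per-step accuracy is an assumed figure for illustration.

```python
import math

# Per-step success probability p compounds to p**n over n chained steps.
n = 1_048_575   # MAKER's Towers of Hanoi step count
p = 0.9999      # assumed: a 99.99%-reliable uncorrected step

chance_all_correct = p ** n                       # vanishingly small (~3e-46)
steps_to_coin_flip = math.log(0.5) / math.log(p)  # step count where success hits 50%
```

Even a 99.99%-reliable step gives essentially zero chance of a clean million-step run, and drops to a coin flip after roughly 7,000 steps. Per-step reliability cannot be pushed high enough by itself; some restoration or voting mechanism has to reset the error at each step.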
Semantic operations admit only weaker, more expensive error correction
Bookkeeping admits strong, cheap error correction because exact next states are available. A counter has one correct next value; a list append has one correct result. Hard oracles (exact equality checks) make voting trivially effective.
Rich semantic operations are not uncorrectable — the error correction framework explicitly allows softer checks like metamorphic tests, judge models, and cross-document consistency. But these oracles are weaker (a smaller TPR − FPR gap), more expensive (each check costs an LLM call), and harder to decorrelate (LLMs share systematic biases). The result is that semantic error correction, where it works at all, requires bespoke techniques tailored to the specific task — there are no general methods analogous to the discrete-state restoration that makes symbolic bookkeeping universally reliable.
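The cost of weaker oracles can be sketched with a toy model: treat each check as an independent binary verdict that is correct with probability q, and count how many checks a majority vote needs to reach a target reliability. The independence assumption is exactly the decorrelation that the note flags as hard to achieve for LLM judges, and the 99.9% target is chosen for illustration.

```python
from math import comb

def majority_correct(q, m):
    """Probability that a strict majority of m independent checks,
    each correct with probability q, reaches the right verdict."""
    return sum(comb(m, k) * q**k * (1 - q)**(m - k)
               for k in range(m // 2 + 1, m + 1))

def checks_needed(q, target=0.999, cap=10_001):
    """Smallest odd number of checks whose majority hits `target`."""
    for m in range(1, cap, 2):
        if majority_correct(q, m) >= target:
            return m
    return None

hard = checks_needed(0.999)  # near-perfect oracle: one check suffices
strong = checks_needed(0.9)  # good soft oracle: a handful of checks
weak = checks_needed(0.6)    # oracle near the TPR ≈ FPR boundary: hundreds
```

As q approaches 0.5 (no TPR − FPR gap), the required redundancy blows up — and when each check is itself an LLM call, that redundancy is paid in full at semantic-call prices. This is the cost gradient the conjecture below describes.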
This asymmetry in oracle strength and cost is why the separation matters. Not because semantic work can't be error-corrected, but because bookkeeping error correction is so much cheaper that mixing the two forces bookkeeping onto the same weak, expensive stochastic substrate as semantics — wasting resources on reliability that a symbolic machine provides for free.
Separation enables error correction; substrate follows from cost
The RLM architecture provides a striking limit case. The LLM writes whole programs — about as sophisticated as a single-step operation gets. Yet the call stack for recursion still runs in the REPL, not in the LLM. Even an LLM powerful enough to write correct recursive programs delegates execution bookkeeping to the symbolic layer.
Once bookkeeping is separated, you could run it on LLMs with voting (as MAKER does) — but that buys you nothing. A symbolic machine does the same work cheaper, faster, and deterministically correct without needing redundancy. The separation is the essential move; the choice of substrate is just a cost optimization.
The conjecture, stated
The effectiveness of separating symbolic scheduling from bounded LLM calls reflects an asymmetry in oracle strength and error-correction cost:
- Bookkeeping has narrow exact states → symbolic substrates or hard-oracle voting make reliability cheap
- Semantic work may still be checkable, but usually only with soft oracles and harder decorrelation — reliability is expensive and has a lower ceiling
- Mixing them forces bookkeeping onto the same weak stochastic substrate as semantics, paying high costs for reliability that a symbolic machine provides for free
- Separation lets each kind of work use the strongest available reliability mechanism — hard oracles for bookkeeping, soft oracles (where worthwhile) for semantics
The boundary is not a hard possible/impossible line but a cost gradient: as the output space grows richer and oracles grow softer, error correction becomes progressively more expensive relative to the reliability gained.
Status and scope
This is conjectural. The evidence (ConvexBench, MAKER, RLM, the human parallel) is consistent, but a precise characterization of the boundary — exactly which operations are "constrained enough" — remains open. The bounded-context orchestration model and its predictions stand independently of why the asymmetry exists; this note offers a candidate explanation, not a dependency.
Relevant Notes:
- bounded-context-orchestration-model — foundation: the scheduling model whose effectiveness this note explains
- error-correction-works-above-chance-oracles-with-decorrelated-checks — foundation: the general theory of error correction (TPR > FPR, decorrelated checks) this note applies to the scheduling boundary
- synthesis-is-not-error-correction — extends: MAKER's success depends on voting (error correction), not synthesis; the aggregation operation must match the decomposition
- rlm-achieves-the-clean-scheduler-model-but-opts-out-of-accumulation — evidence: even LLMs powerful enough to write recursive programs delegate execution bookkeeping to the symbolic layer
- llm-mediated-schedulers-are-a-degraded-variant-of-the-clean-model — consequence: the degraded variant fails because it mixes bookkeeping with semantic operations, defeating error correction on both
- context-efficiency-is-the-central-design-concern-in-agent-systems — context: the complexity dimension of context cost is related but distinct; this note identifies error correction as the mechanism beneath the complexity problem
Topics: