LLM context is composed without scoping

Type: kb/types/note.md · Tags: computational-model

An LLM's context is assembled by concatenating system prompts, skill bodies, user messages, and tool outputs into a single token stream. Everything is global: every token is visible to every other token, with no way to say "this binding is local to this skill" or "this tool output should not influence instruction interpretation."

This is not even dynamic scoping (name bindings resolved through the call stack rather than the source structure), which at least maintains a stack with push and pop. Flat concatenation is the homoiconic medium (instructions and data share one representation) with no structure imposed on top, yet it produces dynamic scoping's pathologies — and the Lisp analogy still clarifies them:

Spooky action at a distance. An early turn subtly biases a later response. The LLM has no mechanism to mark a binding as out of scope — once something enters the log, it influences everything downstream. This is the three-space memory claim's "operational debris pollutes search" failure mode, restated as a scoping problem.

Name collision. "Table" meant an HTML element in turn 3 but a database table in turn 12, and the model conflates them. A flat log has no scope boundaries to disambiguate — every use of a term sits in one namespace.

Inability to reason locally. You cannot predict what a sub-task will do by reading its prompt alone; its behavior depends on the entire accumulated history. This is the defining problem of dynamic scope: the meaning of a name depends on the call stack, not the definition site.

The capture problem

Flat concatenation creates a composition-specific problem: capture. A skill says "summarize the document." The document contains "don't summarize this section, skip it." The data-level use of "summarize" captures the instruction-level meaning. This is a hygiene failure that leads to prompt injection — the same problem Scheme's hygienic macros (macros that rewrite code without accidentally capturing names from the call site) solve for code generation.

Within-frame hygiene

Within a single context, the only scoping mechanisms available are weak conventions:

Role markers (system/user/assistant/tool in chat APIs) — primitive structural separation, but the LLM still sees all roles in one attention pass
Delimiters and quoting — XML tags, markdown fences, explicit "the following is data, not instructions" markers — conventional, not enforced
Ordering conventions — system prompt first, then context, then user message — exploits primacy/recency effects but provides no isolation

These are the LLM equivalent of coding conventions in a language without a module system. They help, but they cannot prevent capture — and they cannot disable non-selective semantic integration: prompt semantics the task contract does not license still steer generation, because every token shares one global attention field.

Non-selective semantic integration

"Spooky action at a distance" is measurable, not only architectural. GSM-DC varies synthetic distractor count in math word problems and finds power-law error growth — the clean control where irrelevant material is semantically inert noise. Gonen et al. varies injected concepts in completion prompts and finds Leak-Rate well above chance even when the concept is task-irrelevant (semantic leakage). Lampinen et al. varies belief-congruence on logic tasks. These studies are not independent interference axes; they stress the same flat-context failure under different doses and task grains. Benchmark labels (noise, association, content bias) describe what each experiment varied, not separate mechanisms requiring separate mitigations.

The realistic case — semantically linked material that should not govern the task — is what agent workflows encounter. Context contamination below compliance reasoning is that failure at agent dose: fine-grained stance drift despite expressed refusal. Counter-instructions can bias against integration; they cannot remove tokens from the window or make a scope boundary binding.

What flat context buys

Flat logs have a real upside: implicit communication. When a user says "use a more formal tone" in turn 5, the effect propagates to later turns without re-parameterizing. This ambient influence is what makes flat context ergonomic at single-call granularity. The design question is not whether to have the upside, but where to contain it.

The architectural response

The scoping problem is prose-specific. Symbolic artifacts (code, schemas, types) inherit scoping from their interpreter; distributed-parametric artifacts do not expose this kind of local prose scope question. Prose has nothing to inherit: no modules, no lexical scope, no interpreter-enforced boundaries. Scope can only be imposed architecturally.

At invocation time this surfaces as a design choice — flat (parent context) or bounded (sub-agent frame) — same representational form, same substrate, same authority path, different context-efficiency profile. Flat pays the full volume and complexity cost and risks contamination; bounded trades an interface cost for isolation.

Sub-agents are the canonical architectural move: code outside the LLM constructs a fresh flat context, the LLM sees only that, and the scope lives in the orchestration code rather than in the LLM itself.

This is one specialization of the general constraining argument in agentic systems interpret underspecified instructions — enforcement is the qualitative reason to move a property to code, distinct from the quantitative reasons (cost, latency, reliability). The error-profile version is scheduler-llm-separation exploits an error-correction asymmetry: bookkeeping has catastrophic error cost on the semantic substrate (the LLM) and zero error cost on the symbolic substrate (the surrounding code). Scope is bookkeeping, so it belongs on the symbolic side.

Empirical validation comes from ConvexBench (Liu et al., 2026), a benchmark for recognizing convexity in deeply composed symbolic functions: LLMs collapse from F1=1.0 to F1≈0.2 at depth 100, even though the total token count (~5,331) is trivial relative to the context window. The failure is compositional reasoning depth, not token capacity — each recursive step conditions on an expanding history that dilutes attention on the current step. Pruning to retain only direct dependencies at each sub-step (one clean frame per call) recovers F1=1.0 at all depths.

Sources: - Anthropic (2025). Effective context engineering for AI agents — recommends sub-agents return 1,000–2,000 token summaries; the tens of thousands of tokens each sub-agent explores stay out of the caller's window. Validates the lexically scoped frames pattern. - Yang et al. (2025). GSM-DC — power-law reasoning degradation under synthetic distractor count; the inert-noise control regime for non-selective integration. - Gonen et al. (2024/2025). Semantic leakage in language models — control/test Leak-Rate metric; instruction-tuned models leak more. - Lampinen et al. (2024). Content effects on reasoning tasks — belief-congruent content shifts logic-task accuracy across model families.

Relevant Notes:

llm context is a homoiconic medium — amplifies: the medium provides no structural boundaries, so scoping must be imposed by architecture
agent orchestration needs coordination guarantees, not just coordination channels — extends: scoping is one coordination guarantee family; without it, flat context fails by contamination rather than by inconsistency or amplification
three-space memory separation predicts measurable failure modes — exemplifies: the failure modes (search pollution, identity scatter, insight trapping) are symptoms of flat scoping applied to memory
agentic systems interpret underspecified instructions — grounds: prose has no deterministic interpreter, so scope guarantees — like other interpreter-enforced semantics — must be imposed via the constraining move to code; sub-agents are that move applied to scope
unified calling conventions enable bidirectional refactoring — existing approximation: llm-do's per-agent system prompts and arguments are frame-local context
codification — enables: frame boundaries are interface points where return values can be progressively typed
instruction specificity should match loading frequency — grounds: the loading hierarchy is a form of binding-time analysis for what's in scope
agent statelessness makes routing architectural, not learned — exemplifies: the routing tier separation is lexical scoping in practice
instructions are typed callables — enables: type signatures on skills are frame interfaces — declaring what bindings a sub-agent receives
agent statelessness means the context engine should inject context automatically — mechanism: automatic context injection constructs lexically scoped frames
topology, isolation, and verification form a causal chain for reliable agent scaling — extends: argues that scope isolation is the second prerequisite in a dependency chain, manufacturing the atomic units that verification needs
axes of artifact analysis — refines: the flat/bounded invocation choice is a prose-form refinement, orthogonal to the substrate/form/lineage/authority record but only applicable inside prose
scheduler-llm-separation exploits an error-correction asymmetry — grounds: scoping is bookkeeping, and bookkeeping belongs in the symbolic substrate — sub-agents are the canonical offload of prose-scoping to code
context contamination operates below an agent's compliance reasoning — exemplifies: non-selective integration at agent dose — stance drift despite detection and refusal
prose has no reliable dereference, so a declared fact must be reinforced where it applies — contrasts: the under-reach dual of this note's over-reach — distinct facets of underspecification (no boundaries vs no resolution), so neither grounds the other. But the flatness compounds the decay: with no boundaries the declared fact must win attention against the entire global stream, deepening its failure to reach a distant point of use

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search