Context efficiency is the central design concern in agent systems

Type: note · Status: current · Tags: computational-model, foundations

In traditional systems, the scarce resources are compute, memory, storage, and bandwidth; algorithmic complexity is the dominant cost model. In agent systems, the scarce resource is context — the finite window of tokens the agent can attend to. Context is not just another resource. It is the only channel through which an agent receives instructions, understands its task, accesses knowledge, and reasons toward action. A CPU has registers, cache, RAM, disk, and network as separate tiers. An LLM has one context window. Everything competes for the same space.

Context is the lowest-degree-of-freedom resource in agent systems: unitary within each inference call, impossible to tier at the attention level (though system architecture can build tiers around it), and hard to expand without architectural change. This is an application of solve low-degree-of-freedom subproblems first to avoid blocking better designs — optimize the tightest constraint before others, or later choices will be forced into low-quality tradeoffs.

The binding constraint is soft degradation, not hard token limits — established in agent context is constrained by soft degradation, not hard token limits. Hard limits are visible but rarely binding; the model degrades before hitting them. This note operationalizes that premise as a cost model and a set of architectural responses.

Anthropic's engineering team has converged on the same framing, defining context engineering as "strategies for curating and maintaining the optimal set of tokens during LLM inference" and describing context as "a critical but finite resource" with an attention budget that "every token depletes" (Anthropic, 2025). Independent practitioner evidence comes from OpenAI's Codex team: shipping 1M lines of agent-generated code required a 100-line AGENTS.md as a router with pointers to deeper docs — "a map, not a manual" — because the bottleneck was not model capability but the structure of the environment (tools, feedback, and constraints), of which context structure is a central component (Lopopolo, 2026).

One property of the medium intensifies this scarcity: natural language has underspecified semantics with no enforced boundaries — not between instructions and data (homoiconicity), not between scopes, not between priority levels. Extra context doesn't just waste space — it can dilute instructions, contaminate scopes, and distort interpretation.

Prior work

Scarce attention as a central design constraint is well-established:

  • Attention economics (Simon, 1971) — "a wealth of information creates a poverty of attention." The context window is the literal implementation of this.
  • Working memory (Miller, 1956; Cowan, 2001) — limited capacity where everything competes for slots. Context windows are working memory for agents.
  • Information overload (Toffler, 1970) — too much information degrades decision quality, not just slows it.

The shared mechanism is structural: both human working memory and LLM context windows are fixed-capacity buffers where all content competes for influence on the next output. What's specific to agent systems is the unitary channel (one context window, no separate tiers), the hard token limit, and the interaction between volume and complexity.

TODO: This survey is from the agent's training data, not systematic. Revisit with deep search — Paulsen partially answers the degradation-curve question with task-dependent MECW measurements, but the broader attention-economics / working-memory literature and the optimal-loading-strategy question remain open.

Volume and complexity

The soft bound operates across two dimensions — volume (how many tokens) and complexity (how hard they are to use) — decomposed in agent context is constrained by soft degradation, not hard token limits. The dimensions are distinguishable but not fully separable; reducing volume often reduces complexity as a side effect. Most architectural responses affect both, but each has a primary target.

Practitioner evidence confirms the volume dimension as a primary concern: Koylan's Personal Brain OS reduced token usage for voice-only tasks by 40% by splitting merged modules into isolated scopes (Koylan, 2026) — a pure volume intervention. The key point for this note: large windows do not remove complexity costs, and raw token count alone does not predict usable context.
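The claim that raw token count alone does not predict usable context can be sketched as a toy cost model. Everything here is an illustrative assumption — the function names, weights, and functional form are invented for this note, not measured:

```python
def context_cost(tokens: int, indirection_depth: int, scope_count: int) -> float:
    """Toy model of usable-context degradation along two dimensions:
    volume (raw token count) and complexity (indirection chains,
    competing scopes). All weights are illustrative assumptions."""
    volume_cost = tokens / 1000                        # linear in volume
    complexity_cost = 2 ** indirection_depth + 0.5 * scope_count
    return volume_cost + complexity_cost

# Two contexts with identical token counts differ sharply in cost:
flat = context_cost(tokens=50_000, indirection_depth=0, scope_count=1)
chained = context_cost(tokens=50_000, indirection_depth=5, scope_count=4)
assert chained > flat  # same volume, far higher complexity
```

The exponential term encodes the note's later point that a five-level indirection chain is costly regardless of window size: growing `tokens` capacity changes `volume_cost`, not `complexity_cost`.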

Growing windows address volume but not complexity

Nominal context windows have grown at roughly 30x per year since mid-2023 (Epoch AI, 2025). This addresses volume but does nothing for complexity. A five-level indirection chain is equally costly whether the window is 8K or 2M tokens.

Even for volume, the gains are partial. Context demand grows with task ambition — richer tool outputs, longer histories, more complex instructions. This is a Jevons paradox: efficiency gains get absorbed by expanding use cases.

Architectural responses

Context scarcity drives most architectural patterns in agent system design. Each response below is tagged with its primary target:

  • Frontloading and partial evaluation (primarily complexity) — pre-compute static parts so the agent receives answers instead of procedures to derive them
  • Progressive disclosure (both) — the instruction specificity principle matches instruction specificity to loading frequency; directory-scoped types load only when working in that directory. Reduces volume directly and complexity as a side effect — fewer loaded instructions means less scope contamination and fewer competing directives
  • Context management (primarily volume) — compaction, observation masking, and sub-agent delegation manage accumulation in long-running tasks (JetBrains Research, 2025)
  • Sub-agent isolation (both) — sub-agents provide lexically scoped frames with only what the caller explicitly passes, addressing volume and complexity simultaneously
  • Navigation design (primarily volume) — agents navigate by deciding what to read next; prose-as-title and retrieval-oriented descriptions let the agent decide "don't follow this" without loading the target
  • Instruction notes over data dumps (primarily complexity) — frontload the caller's judgment about which documents matter and what question to answer, rather than passing raw material
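The sub-agent isolation pattern above can be sketched in a few lines. The function names (`llm_call`, `run_subagent`) and the stub inference call are hypothetical, standing in for a real agent framework; the point is the calling convention, not an implementation:

```python
def llm_call(system: str, user: str) -> str:
    # Stand-in for a real inference call (hypothetical): returns a
    # bounded description so the sketch stays self-contained.
    return f"[{system}] answered using {len(user)} chars of context"

def run_subagent(task: str, passed_context: list[str]) -> str:
    """Sub-agent isolation: the child sees only what the caller
    explicitly passes -- a lexically scoped frame, not the parent's
    full history. That bounds volume and complexity at once."""
    prompt = "\n\n".join(passed_context)  # no ambient inheritance
    return llm_call("focused sub-agent", f"{prompt}\n\nTask: {task}")

# The parent curates: a summary plus one relevant file,
# never the whole transcript.
full_history = ["msg"] * 10_000          # large parent context, not passed
curated = ["summary of prior work", "contents of parser module"]
result = run_subagent("fix the tokenizer bug", curated)
```

The design choice is that isolation is enforced by the function signature: there is no way for the sub-agent to reach `full_history`, so context frugality is structural rather than a matter of discipline.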

If context is the only fundamental scarce resource, then the natural computational model is symbolic scheduling over bounded LLM calls: exact bookkeeping lives in code, while bounded context is reserved for semantic judgment.

Context efficiency should be evaluated at design time, not treated as an optimization to apply later. Architectural choices — what loads when, what gets frontloaded, where sub-agent boundaries go — determine context efficiency structurally and are hard to retrofit.


Sources:

  • Anthropic (2025). Effective context engineering for AI agents.
  • JetBrains Research (2025). Cutting through the noise: smarter context management for LLM-powered agents.
  • Epoch AI (2025). LLMs now accept longer inputs, and the best models can use them more effectively.
  • Liu et al. (2023). Lost in the middle: how language models use long contexts.
  • Liu et al. (2026). ConvexBench: Can LLMs recognize convex functions? — empirical evidence that compositional depth, not token count, drives reasoning degradation.
  • Paulsen (2025). Context Is What You Need — The Maximum Effective Context Window — convergent evidence: MECW << MCW across 11 models, but tasks confound volume with LLM-hard exact enumeration, making this a volume × task-difficulty finding rather than pure volume degradation.
  • Lopopolo (2026). Harness engineering: leveraging Codex in an agent-first world — independent practitioner convergence on context-as-scarce-resource from a 1M-LOC agent-generated codebase.
  • Koylan (2026). Koylanai Personal Brain OS — 40% token reduction from module isolation demonstrates volume-dimension context efficiency.

Relevant Notes: