Context efficiency is the central design concern in agent systems
Type: note · Status: current · Tags: computational-model, foundations
In traditional systems, the scarce resources are compute, memory, storage, and bandwidth; algorithmic complexity is the dominant cost model. In agent systems, the scarce resource is context — the finite window of tokens the agent can attend to. Context is not just another resource. It is the only channel through which an agent receives instructions, understands its task, accesses knowledge, and reasons toward action. A CPU has registers, cache, RAM, disk, and network as separate tiers. An LLM has one context window. Everything competes for the same space.
Context is the lowest-degree-of-freedom resource in agent systems: unitary within each inference call, impossible to tier at the attention level (though system architecture can build tiers around it), and hard to expand without architectural change. This is an application of solve low-degree-of-freedom subproblems first to avoid blocking better designs — optimize the tightest constraint before others, or later choices will be forced into low-quality tradeoffs.
The binding constraint is soft degradation, not hard token limits — established in agent context is constrained by soft degradation, not hard token limits. Hard limits are visible but rarely binding; the model degrades before hitting them. This note operationalizes that premise as a cost model and set of architectural responses.
Anthropic's engineering team has converged on the same framing, defining context engineering as "strategies for curating and maintaining the optimal set of tokens during LLM inference" and describing context as "a critical but finite resource" with an attention budget that "every token depletes" (Anthropic, 2025). Independent practitioner evidence comes from OpenAI's Codex team: shipping 1M lines of agent-generated code required a 100-line AGENTS.md that acts as a router with pointers to deeper docs ("a map, not a manual"), because the bottleneck was not model capability but the structure of the environment (tools, feedback, and constraints), of which context structure is a central component (Lopopolo, 2026).
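A router-style AGENTS.md might look like the hypothetical sketch below (all file paths and commands are illustrative, not from the Codex repo). Each entry is a one-line pointer the agent can choose to follow, so unfollowed branches cost almost no tokens:

```markdown
# AGENTS.md — a map, not a manual (hypothetical sketch)

## Quick commands
- Build: `make build` · Test: `make test`

## Where to look (load only what the task needs)
- API conventions → docs/api-style.md
- Database schema and migrations → docs/schema.md
- Release process → docs/release.md
```

The design point is that the top-level file stays around 100 lines regardless of how large the documentation tree grows; depth is paid for only on the paths the agent actually takes.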
One property of the medium intensifies this scarcity: natural language has underspecified semantics with no enforced boundaries — not between instructions and data (homoiconicity), not between scopes, not between priority levels. Extra context doesn't just waste space — it can dilute instructions, contaminate scopes, and distort interpretation.
Prior work
Scarce attention as a central design constraint is well-established:
- Attention economics (Simon, 1971) — "a wealth of information creates a poverty of attention." The context window is the literal implementation of this.
- Working memory (Miller, 1956; Cowan, 2001) — limited capacity where everything competes for slots. Context windows are working memory for agents.
- Information overload (Toffler, 1970) — too much information degrades decision quality, not just slows it.
The shared mechanism is structural: both human working memory and LLM context windows are fixed-capacity buffers where all content competes for influence on the next output. What's specific to agent systems is the unitary channel (one context window, no separate tiers), the hard token limit, and the interaction between volume and complexity.
TODO: This survey is from the agent's training data, not systematic. Revisit with deep search — Paulsen partially answers the degradation-curve question with task-dependent MECW measurements, but the broader attention-economics / working-memory literature and the optimal-loading-strategy question remain open.
Volume and complexity
The soft bound operates across two dimensions — volume (how many tokens) and complexity (how hard they are to use) — decomposed in agent context is constrained by soft degradation, not hard token limits. The dimensions are distinguishable but not fully separable; reducing volume often reduces complexity as a side effect. Most architectural responses affect both, but each has a primary target.
Practitioner evidence confirms the volume dimension as a primary concern: Koylan's Personal Brain OS reduced token usage for voice-only tasks by 40% by splitting merged modules into isolated scopes (Koylan, 2026) — a pure volume intervention. The key point for this note: large windows do not remove complexity costs, and raw token count alone does not predict usable context.
Growing windows address volume but not complexity
Nominal context windows have grown at roughly 30x per year since mid-2023 (Epoch AI, 2025). This addresses volume but does nothing for complexity. A five-level indirection chain is equally costly whether the window is 8K or 2M tokens.
Even for volume, the gains are partial. Context demand grows with task ambition — richer tool outputs, longer histories, more complex instructions. This is a Jevons paradox: efficiency gains get absorbed by expanding use cases.
Architectural responses
Context scarcity produces most architectural patterns in agent system design. Each response below is tagged with its primary target:
- Frontloading and partial evaluation (primarily complexity) — pre-compute static parts so the agent receives answers instead of procedures to derive them
- Progressive disclosure (both) — the instruction specificity principle matches instruction specificity to loading frequency; directory-scoped types load only when working in that directory. Reduces volume directly and complexity as a side effect — fewer loaded instructions means less scope contamination and fewer competing directives
- Context management (primarily volume) — compaction, observation masking, and sub-agent delegation manage accumulation in long-running tasks (JetBrains Research, 2025)
- Sub-agent isolation (both) — sub-agents provide lexically scoped frames with only what the caller explicitly passes, addressing volume and complexity simultaneously
- Navigation design (primarily volume) — agents navigate by deciding what to read next; prose-as-title and retrieval-oriented descriptions let the agent decide "don't follow this" without loading the target
- Instruction notes over data dumps (primarily complexity) — frontload the caller's judgment about which documents matter and what question to answer, rather than passing raw material
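Progressive disclosure can be made concrete with a small sketch. Assuming a hypothetical convention of one `.agent.md` instruction file per directory, the loader below pulls in only the files on the path from the repository root to the working directory; sibling scopes contribute zero tokens:

```python
from pathlib import Path

INSTRUCTION_FILE = ".agent.md"  # hypothetical per-directory instruction file


def load_scoped_instructions(root: Path, working_dir: Path) -> list[str]:
    """Load instructions only from directories on the path root -> working_dir.

    Sibling directories never contribute, so their instructions cost no
    context tokens when the agent works elsewhere (the volume saving), and
    fewer loaded directives means less scope contamination (the complexity
    saving).
    """
    chain = [root]
    for part in working_dir.relative_to(root).parts:
        chain.append(chain[-1] / part)
    texts = []
    for directory in chain:  # most general scope first
        candidate = directory / INSTRUCTION_FILE
        if candidate.is_file():
            texts.append(candidate.read_text())
    return texts
```

With a `root/.agent.md` and an `api/.agent.md`, working in `api/` loads both, while a sibling `ui/.agent.md` is never read.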
If context is the only fundamental scarce resource, then the natural computational model is symbolic scheduling over bounded LLM calls: exact bookkeeping lives in code, while bounded context is reserved for semantic judgment.
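The symbolic-scheduling split can be sketched in a few lines. In this minimal toy (the `judge` callable, per-call budget, and context-selection rule are all illustrative assumptions, not a prescribed design), counting, ordering, and budgeting live in ordinary code, and the bounded call receives only a curated slice of context for semantic judgment:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scheduler:
    """Exact bookkeeping in code; semantic judgment behind one bounded call.

    `judge` stands in for a bounded LLM call: any function that takes a task
    and a small curated context and returns a verdict. The scheduler never
    relies on the model for counting, ordering, or budget enforcement.
    """
    judge: Callable[[str, list[str]], str]
    char_budget_per_call: int = 2000  # hypothetical per-call context budget
    log: list[tuple[str, str]] = field(default_factory=list)

    def run(self, tasks: list[str], notes: dict[str, str]) -> dict[str, str]:
        results = {}
        for task in tasks:  # exact bookkeeping: a plain loop, not a prompt
            # Curate context symbolically: pass only notes whose keys appear
            # in the task text, each truncated to the per-call budget.
            context = [notes[k][: self.char_budget_per_call]
                       for k in notes if k in task]
            verdict = self.judge(task, context)  # semantic judgment only
            self.log.append((task, verdict))
            results[task] = verdict
        return results
```

Swapping `judge` for a real model call changes nothing structural: the window never sees the queue, the log, or unrelated notes, only the curated slice for the current decision.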
Context efficiency should be evaluated at design time, not treated as an optimization to apply later. Architectural choices — what loads when, what gets frontloaded, where sub-agent boundaries go — determine context efficiency structurally and are hard to retrofit.
Sources:
- Anthropic (2025). Effective context engineering for AI agents.
- JetBrains Research (2025). Cutting through the noise: smarter context management for LLM-powered agents.
- Epoch AI (2025). LLMs now accept longer inputs, and the best models can use them more effectively.
- Liu et al. (2023). Lost in the middle: how language models use long contexts.
- Liu et al. (2026). ConvexBench: Can LLMs recognize convex functions? — empirical evidence that compositional depth, not token count, drives reasoning degradation.
- Paulsen (2025). Context Is What You Need — The Maximum Effective Context Window — convergent evidence: MECW << MCW across 11 models, but tasks confound volume with LLM-hard exact enumeration, making this a volume × task-difficulty finding rather than pure volume degradation.
- Lopopolo (2026). Harness engineering: leveraging Codex in an agent-first world — independent practitioner convergence on context-as-scarce-resource from 1M LOC agent-generated codebase.
- Koylan (2026). Koylanai Personal Brain OS — 40% token reduction from module isolation demonstrates volume-dimension context efficiency.
Relevant Notes:
- solve low-degree-of-freedom subproblems first to avoid blocking better designs — application: this note treats context as the lowest-degree-of-freedom resource and derives architecture priorities from that constraint
- agent context is constrained by soft degradation, not hard token limits — grounds: establishes the binding-constraint premise and two-dimensions decomposition this note operationalizes as architectural responses
- frontloading spares execution context — mechanism: the most direct response to complexity-dimension context cost
- indirection is costly in LLM instructions — mechanism: the cost model that makes indirection expensive in context but free in code
- instruction specificity should match loading frequency — application: progressive disclosure as a response to volume-dimension context cost
- LLM context is composed without scoping — foundation: flat context means everything competes; sub-agents are the scoping response
- agents navigate by deciding what to read next — application: navigation design as volume-saving strategy
- directory-scoped types are cheaper than global types — application: type system designed around context economy
- generate instructions at build time — application: build-time generation as frontloading applied to skill templates
- effective context is task-relative and complexity-relative not a fixed model constant — sharpens: makes explicit why usable context cannot be treated as a single per-model number
- LLM context is a homoiconic medium — intensifies: instructions and data compete as equal tokens with no priority mechanism
- agentic systems interpret underspecified instructions — intensifies: extra context distorts interpretation, not just wastes space
- Minimum Viable Ontology / Domain Maps — exemplifies: MVO is distillation under context-efficiency pressure — compress domain knowledge into the smallest vocabulary that fits the context window
- Harness Engineering (Lopopolo, 2026) — exemplifies: "give Codex a map, not a 1,000-page instruction manual" is independent practitioner convergence on context scarcity as the binding constraint
- Harness Engineering as Cybernetics (@odysseus0z, 2026) — grounds: frames context-efficient agent runtime design as feedback-loop calibration from control theory — the shift from direct code production to sensor-actuator design is the same shift this note identifies as moving from capability to context structure
- The Anatomy of an Agent Harness (Vtrivedy10, 2026) — exemplifies: derives runtime components (filesystem, sandbox, context management, skills) by working backwards from model limitations — instantiates the architectural responses this note describes abstractly with concrete primitives (compaction, tool-call offloading, progressive tool loading)
- Epiplexity (Bates et al., 2026) — grounds: formalizes the complexity dimension of context cost — epiplexity quantifies structural accessibility under computational bounds, giving theoretical backing to "how hard the tokens are to use"
- AgeMem (Yu et al., 2025) — exemplifies: RL-trained STM operations (Retrieve/Summary/Filter) achieve 3-5% token reduction while maintaining performance — empirical evidence that learned context management can outperform heuristic approaches