Context efficiency is the central design concern in agent systems

Type: note · Status: current

In traditional systems, the scarce resources are compute, memory, storage, and bandwidth, and algorithmic complexity is the dominant cost model. In agent systems, the scarce resource is context — the finite window of tokens the agent can attend to. Context is not just another resource. It is the only channel through which an agent receives instructions, understands its task, accesses knowledge, and reasons toward action. A CPU has registers, cache, RAM, disk, and network as separate tiers. An LLM has one context window. Everything competes for the same space.

This is also an application of solve low-degree-of-freedom subproblems first to avoid blocking better designs. When context is both unitary and hard to expand, it is the tightest design constraint; optimizing for context first prevents later choices from being forced into low-quality tradeoffs.

Anthropic's engineering team has converged on the same framing, defining context engineering as "strategies for curating and maintaining the optimal set of tokens during LLM inference" and describing context as "a critical but finite resource" with an attention budget that "every token depletes" (Anthropic, 2025).

One property of the medium intensifies this scarcity: natural language has underspecified semantics with no enforced boundaries — not between instructions and data (homoiconicity), not between scopes, not between priority levels. Extra context doesn't just waste space — it can dilute instructions, contaminate scopes, and distort interpretation.

Two dimensions of context cost

Context efficiency is not only about how many tokens are in the window. It is also about what those tokens demand of the model. Conflating the two leads to architectural mistakes.

Volume: how many tokens

More tokens dilute attention. The "lost in the middle" finding (Liu et al., 2023) established primacy and recency bias. Anthropic calls this context rot — degradation in recall and reasoning as the window fills. The resource doesn't just run out; it degrades before it runs out.

Complexity: how hard the tokens are to use

Traditional systems execute instructions at constant cost. LLMs pay interpretation overhead proportional to context complexity. Giving an agent a procedure costs more than giving it the answer that procedure would have produced. Every layer of indirection costs context and interpretation overhead on every read.

ConvexBench (Liu et al., 2026) provides direct evidence: LLMs verifying composed functions collapse from F1=1.0 to F1≈0.2 at depth 100, despite using only 5,331 tokens — far below context limits. Scoped recursion (pruning history to retain only direct dependencies) recovers F1=1.0 at all depths, confirming the degradation is caused by flat accumulation, not the reasoning task itself.

The interaction

The two dimensions are not independent. High volume amplifies complexity costs. But they vary independently too: ConvexBench shows complexity-driven collapse at trivial token counts, and a long context of simple facts degrades differently from a short context of intricate procedures.

Growing windows address volume but not complexity

Nominal context windows have grown at roughly 30x per year since mid-2023 (Epoch AI, 2025). This addresses volume but does nothing for complexity. A five-level indirection chain is equally costly whether the window is 8K or 2M tokens.

Even for volume, the gains are partial. Context demand grows with task ambition — richer tool outputs, longer histories, more complex instructions. This is a Jevons paradox: efficiency gains get absorbed by expanding use cases.

Architectural responses

Context scarcity produces most architectural patterns in agent system design. Each responds to one or both dimensions:

  • Frontloading and partial evaluation (complexity) — pre-compute static parts so the agent receives answers instead of procedures to derive them
  • Progressive disclosure (volume) — the context loading strategy matches instruction specificity to loading frequency; directory-scoped types load only when working in that directory
  • Context management (volume) — compaction, observation masking, and sub-agent delegation manage accumulation in long-running tasks (JetBrains Research, 2025)
  • Sub-agent isolation (both) — sub-agents provide lexically scoped frames with only what the caller explicitly passes, addressing volume and complexity simultaneously
  • Navigation design (volume) — agents navigate by deciding what to read next; prose-as-title and retrieval-oriented descriptions let the agent decide "don't follow this" without loading the target
  • Instruction notes over data dumps (complexity) — frontload the caller's judgment about which documents matter and what question to answer, rather than passing raw material

If context is the only fundamental scarce resource, then the natural computational model is symbolic scheduling over bounded LLM calls: exact bookkeeping lives in code, while bounded context is reserved for semantic judgment.

Context efficiency should be evaluated at design time, not treated as an optimisation to apply later. Architectural choices — what loads when, what gets frontloaded, where sub-agent boundaries go — determine context efficiency structurally and are hard to retrofit.


Sources: - Anthropic (2025). Effective context engineering for AI agents. - JetBrains Research (2025). Cutting through the noise: smarter context management for LLM-powered agents. - Epoch AI (2025). LLMs now accept longer inputs, and the best models can use them more effectively. - Liu et al. (2023). Lost in the middle: how language models use long contexts. - Liu et al. (2026). ConvexBench: Can LLMs recognize convex functions? — empirical evidence that compositional depth, not token count, drives reasoning degradation.

Relevant Notes:

Topics: