Effective context is task-relative and complexity-relative, not a fixed model constant

Type: note · Status: seedling · Tags: computational-model, foundations

How much context an LLM can actually use is not a fixed property of the model. It depends on the task and on the prompt's effective difficulty for the model. A model may handle a large window for one task shape and fail at a much smaller window for another. Two prompts at similar token counts may consume very different amounts of effective budget — one compositionally shallow and cleanly framed, the other requiring deep structured reasoning or burying the relevant information in a harder-to-use presentation.

Three independent sources converge on this:

Volume varies by task type. Paulsen (2025) measures Maximum Effective Context Window (MECW) across 11 frontier models and finds it far smaller than advertised limits. Crucially, the threshold shifts by problem type. This undercuts the common simplification that a model has one stable "usable context length."

Complexity can dominate volume. ConvexBench (Liu et al., 2026) shows performance collapsing with compositional depth at just 5,331 tokens — far below nominal limits — then recovering when recursive steps get focused local frames. Token count alone does not determine whether a prompt is usable.
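The structural contrast behind that recovery can be sketched as a toy: a deeply composed task presented as one monolithic prompt versus decomposed so each step sees only its local inputs. This mimics the shape of "focused local frames" as described above; it is not ConvexBench's actual protocol, and the function names are invented.

```python
# Toy illustration of "focused local frames": the same depth-d composition
# presented monolithically (the model must unwind all of it at once) vs.
# decomposed into per-step frames that each name only their own input.

def monolithic(depth: int) -> str:
    """One prompt carrying the full nested composition."""
    expr = "x"
    for i in range(depth):
        expr = f"f{i}({expr})"
    return expr

def local_frames(depth: int) -> list[str]:
    """One small prompt per step: each frame references only the previous result."""
    frames = []
    prev = "x"
    for i in range(depth):
        frames.append(f"compute y{i} = f{i}({prev})")
        prev = f"y{i}"
    return frames

print(monolithic(4))        # f3(f2(f1(f0(x))))
print(local_frames(4)[-1])  # compute y3 = f3(y2)
```

The point of the sketch: total information is identical, but the per-call frame in the decomposed version stays constant-size while the monolithic prompt grows with depth.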

Irrelevant context degrades effective capacity at fixed task difficulty. GSM-DC (Yang et al., 2025) constructs math problems as symbolic DAGs, then injects distractor nodes while holding the solution path fixed. Error scales as a power law with distractor count, and the exponent grows with reasoning depth (from 0.11 at depth 2 to 0.49 at depth 5). This is the clean empirical regime the first open question below asks for: volume (distractor count) varies independently of task difficulty (solution path unchanged), and the degradation is measurable. Critically, distractors degrade both reasoning path selection and arithmetic execution independently — two distinct channels through which irrelevant context reduces effective capacity.
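The reported scaling can be made concrete with a toy numerical sketch. Only the two quoted exponents (0.11 at depth 2, 0.49 at depth 5) come from the note; the prefactor and the interpolation between depths are invented for illustration and are not GSM-DC's fitted model.

```python
# Toy model of power-law degradation: error ~ c * n^alpha in distractor
# count n, with the exponent alpha increasing with reasoning depth.
# The prefactor c and the linear depth->alpha interpolation are assumptions.

def error_rate(n_distractors: int, depth: int, c: float = 0.01) -> float:
    """Hypothetical error ~ c * n^alpha(depth), capped at 1.0."""
    # Linear interpolation between the two quoted exponents.
    alpha = 0.11 + (0.49 - 0.11) * (depth - 2) / (5 - 2)
    return min(1.0, c * n_distractors ** alpha)

for depth in (2, 5):
    for n in (10, 100, 1000):
        print(depth, n, round(error_rate(n, depth), 4))
```

Even this crude version reproduces the qualitative claim: at shallow depth, adding distractors barely moves the error; at depth 5 the same volume of irrelevant context is an order of magnitude more costly.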

The synthesis is that effective context is relational: model choice matters, task type matters, and prompt difficulty changes the effective cost of a prompt. This is weaker and cleaner than treating MECW as a single parameterized scalar MECW(model, task_type, complexity). In the bounded-context orchestration model, this note interprets that relationship more naturally as a task-shaped cost measure ||P||_t ≤ M — the cost norm depends on what you're asking the model to do.
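One way to see what a task-shaped cost measure would even mean is to mock one up. Everything below — the feature set, the weights, the budget M — is invented; this is the relational claim made mechanical, not a measured model, and it speaks to the second open question below (whether ||·||_t can be made predictive rather than merely explanatory).

```python
# Sketch of a task-shaped cost measure ||P||_t <= M: the same prompt
# consumes different amounts of effective budget depending on the task.
# All weights, features, and the budget are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Prompt:
    tokens: int             # raw volume
    depth: int              # compositional reasoning depth required
    distractor_frac: float  # fraction of context irrelevant to the task

def cost(p: Prompt, task_weights: dict) -> float:
    """Hypothetical ||P||_t: volume scaled up by complexity and noise."""
    w = task_weights
    return p.tokens * (1 + w["depth"] * p.depth) * (1 + w["noise"] * p.distractor_frac)

M = 100_000  # effective budget for some (model, task) pair — illustrative

shallow = Prompt(tokens=50_000, depth=1,  distractor_frac=0.1)
deep    = Prompt(tokens=5_000,  depth=10, distractor_frac=0.5)

retrieval = {"depth": 0.05, "noise": 0.5}   # complexity barely matters
reasoning = {"depth": 1.0,  "noise": 2.0}   # complexity dominates

print(cost(shallow, retrieval) <= M, cost(deep, reasoning) <= M)  # -> True False
```

The deep prompt is a tenth the token count of the shallow one, yet under the reasoning-task norm it blows the budget — which is exactly the pattern ConvexBench and GSM-DC report: raw token count underdetermines usability.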

This sharpens the context-efficiency note's two-axis model. Volume and complexity are not independent benchmarks you read off a spec sheet. In the interpretation developed here, they are two dimensions along which prompts consume bounded effective budget. Long-context claims and raw token counts hide that dependence, which is why architectural responses must manage scope, framing, and decomposition — not just chase larger windows.

Caveats. The evidence is convergent, not final. Paulsen does not cleanly isolate pure volume from task difficulty (counting, sorting, and filtering are themselves LLM-hard). ConvexBench does not measure a joint volume-complexity surface — it shows complexity alone can dominate. Position, framing, and scope contamination may also affect usable context, and this note treats those as part of prompt difficulty rather than as separately measured variables. The claim should stay qualitative: effective context is not a fixed per-model constant, and any theory that treats it as one is too coarse.

Open Questions

  • ~~Is there a clean empirical regime where volume can be varied while task difficulty and compositional complexity stay mostly fixed?~~ Answered by GSM-DC: distractor count varies while the solution DAG is fixed. The regime is synthetic (math word problems), but the isolation is clean.
  • Can the task-shaped cost measure ||·||_t be made concrete enough for useful prediction, or is it mainly explanatory?
  • Which natural-language tasks exhibit the same complexity-dominant collapse that ConvexBench shows in symbolic reasoning?

Sources:

  • Liu et al. (2026). ConvexBench: Can LLMs recognize convex functions? — complexity can dominate context usability even at trivial token counts.
  • Paulsen (2025). Context Is What You Need — The Maximum Effective Context Window — MECW is much smaller than MCW and varies by problem type.
  • Yang et al. (2025). GSM-DC: How Is LLM Reasoning Distracted by Irrelevant Context? — irrelevant context degrades effective capacity at fixed task difficulty, with power-law error scaling and dual-channel degradation.

Relevant Notes: