Ingest: Towards a Science of Scaling Agent Systems
Type: kb/sources/types/ingest-report.md
Source: towards-a-science-of-scaling-agent-systems.md Captured: 2026-03-08 From: https://arxiv.org/pdf/2512.08296
Classification
Type: scientific-paper -- arXiv preprint with formal task/system definitions, 180 controlled configurations, mixed-effects modeling, statistical tests, and out-of-sample validation on a later model family. Domains: multi-agent-systems, agent-architecture, scaling-laws, coordination Author: Kim et al. are researchers from Google Research, Google DeepMind, and MIT; the author signal is strong because the paper reports a controlled cross-model, cross-benchmark evaluation rather than a single-system demonstration.
Summary
This paper derives quantitative scaling principles for agent systems by comparing single-agent and four multi-agent coordination topologies across three LLM families and four agentic benchmarks. Its main contribution is not that multi-agent systems win or lose in general, but that coordination benefit is task-contingent: centralized or decentralized coordination helps decomposable analysis and web-navigation tasks, while every tested multi-agent variant degrades sequential planning. The paper identifies three mechanisms: tool-heavy tasks pay a coordination tax, multi-agent gains fade once a single-agent baseline is already above roughly 45%, and topology changes error amplification, with independent synthesis amplifying errors far more than centralized verification. For this KB, the paper is strongest as empirical evidence for context-efficiency, decomposition, and coordination-guarantee claims.
Connections Found
The connect pass found the strongest fit in the KB's agent-orchestration and error-correction cluster. It grounds synthesis-is-not-error-correction: Kim et al.'s Independent topology reports 17.2x error amplification under synthesis-only aggregation, which is direct evidence that merging outputs is not the same operation as voting or adjudication. It extends error-correction-works-above-chance-oracles-with-decorrelated-checks by quantifying how verification-bearing topologies reduce error amplification, with centralized coordination at 4.4x versus independent at 17.2x. It exemplifies agent-orchestration-needs-coordination-guarantees-not-just-coordination-channels because topology names matter only when they imply a guarantee such as adjudication or verification. It grounds bounded-context-orchestration-model and extends decomposition-heuristics-for-bounded-context-scheduling by showing when extra bounded calls help and when coordination overhead consumes the benefit. It also grounds context-efficiency-is-the-central-design-concern-in-agent-systems through measured overhead from 58% to 515% and success-per-token collapse, and extends the speculative topology-isolation-and-verification-form-a-causal-chain-for-reliable-agent-scaling without settling its causal-chain claim.
Extractable Value
-
Aggregation operation must match decomposition structure -- The highest-reach lesson is not "multi-agent hurts"; it is that redundant whole-task calls synthesized into one output amplify errors, while decomposed or verified calls can help. This is already partly captured by synthesis-is-not-error-correction, but the ingest should preserve the experimental anchor. [quick-win]
-
Error amplification ratios by topology -- Independent 17.2x, decentralized 7.8x, centralized 4.4x, hybrid 5.1x. These figures sharpen the error-correction and coordination-guarantee notes by turning "verification matters" into a measured topology effect. [quick-win]
-
Coordination overhead as context cost -- The 58%-515% overhead range and success-per-token collapse translate directly into the KB's context-efficiency frame: coordination consumes the same scarce channel as task reasoning, so extra agents can reduce useful work even when raw compute rises. [quick-win]
-
Decomposition stopping conditions need task features -- The paper's 45% single-agent baseline threshold, tool count effects, and PlanCraft degradation give empirical handles for when a scheduler should stop decomposing or avoid multi-agent coordination. This is high reach if adapted into scheduler heuristics rather than treated as a universal numeric threshold. [experiment]
-
Sub-agent capability may matter more than orchestrator capability -- The heterogeneity result suggests a lightweight scheduler can be paired with stronger bounded calls, matching the KB's scheduler/LLM separation direction. This is promising but model-family-specific in the paper, so it should remain a reference until independently replicated. [just-a-reference]
-
Message density saturation gives a coordination budget -- Performance plateaus near the reported message-density point, implying that "more communication" has a cost-benefit curve rather than monotonic value. This could become an operational budget for coordination channels. [experiment]
Limitations (our opinion)
The paper should not be over-read as a general verdict on multi-agent systems. Its most surprising result is that controlled multi-agent systems are often worse despite more calls and more coordination; the simpler account is that the tested systems frequently spent context on communication without adding a strong adjudication primitive. The central claim is fairly hard to vary for the tested configurations because prompts, tools, budgets, and topologies were controlled, but it becomes much easier to vary when generalized to "multi-agent systems" as a whole.
The largest missing configuration is voting with deliberate decorrelation. The paper mentions aggregation methods but the Independent topology is synthesis-only, so it tests error-preserving merge behavior more than error-correcting ensemble behavior. It also does not test systematic prompt perturbation, metamorphic checks, hard per-step oracles, or the maximal decomposition pattern seen in MAKER. That means the 17.2x Independent amplification is excellent evidence against naive synthesis, not evidence against redundant calls paired with voting.
The benchmark set is broad but still only four task families. The 45% single-agent threshold and architecture-selection accuracy may be fit to this benchmark mixture, model set, and task scoring style. Treat the threshold as a heuristic candidate for scheduler design, not a stable constant.
The paper also leaves the topology/isolation/verification causal question unresolved. Centralized and decentralized topologies differ in multiple ways at once: message pattern, verification opportunity, overhead, and information exposure. The results support the claim that topology matters, but they do not isolate whether topology creates benefits through decomposition, scoping, verification, or some interaction among them.
Recommended Next Action
Update decomposition-heuristics-for-bounded-context-scheduling: add a section on stopping conditions for decomposition, using Kim et al.'s task-contingent results to argue that multi-agent scheduling should consider single-agent baseline strength, tool count, decomposability, and whether the aggregation operation is synthesis, voting, or verification.