Ingest: Large Language Model Agents Are Not Always Faithful Self-Evolvers

Type: kb/sources/types/ingest-report.md

Source: large-language-model-agents-are-not-always-faithful-self-evolvers.md

Captured: 2026-06-15

From: https://arxiv.org/html/2601.22436v3

Classification

Type: scientific-paper -- arXiv preprint (cs.CL) with explicit methodology, controlled causal interventions, multiple frameworks/backbones/environments, and a falsifiable empirical claim about agent behavior. This ingest covers v3, which retains v2's structure and findings while expanding the backbone set from 10 to 13 models (adding GPT-5.2, Gemini-3-Pro, Claude-Sonnet-4.6), which strengthens the cross-scale, cross-vendor generality of the result.

Domains: agent-memory, distillation, verification, llm-agents

Author: Multi-author research team (Zhao, Wang, Zhang, Deng, Zhao, Che, Qin, Liu) working on self-evolving-agent benchmarks. Worth attending to because they evaluate four established frameworks (ExpeL, Dynamic Cheatsheet, ReasoningBank, G-Memory) under causal intervention rather than proposing a new memory design: a diagnosis paper, not a sales pitch.

Summary

The paper asks whether self-evolving LLM agents actually depend on the experience they store, or merely carry it around. It defines experience faithfulness as the causal dependence of behavior on provided experience, then tests it by perturbing two memory forms, raw trajectories (Empty / Shuffle / Irrelevant) and condensed summaries (Empty / Corrupt / Irrelevant / Filler), and checking whether downstream behavior changes. Across four frameworks, 13 LLM backbones, and 9 environments, the central result is a strong asymmetry: agents reliably depend on raw experience (perturbing it sharply degrades performance) but largely ignore condensed experience (perturbing or corrupting it barely moves performance), even when condensed memory is the only guidance provided. The gap holds across offline/online paradigms, single- and multi-agent settings, and model scales.

The paper's second research question (RQ2) traces the asymmetry to three causes: (1) condensed summaries are often too semantically vague to steer behavior; (2) internal processing bias overweights the current trajectory over retrieved memory, shown via layer-wise Integrated Gradients attribution in which condensed-experience segments get consistently low attribution; (3) some task regimes (knowledge-intensive QA) are solvable from pretrained priors, so external experience never becomes causally necessary.
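The perturbation conditions named in the summary can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the word-scramble used for Corrupt and the length-matched placeholder used for Filler are assumptions made here, since this report does not record how the paper constructs each condition.

```python
import random
from typing import List


def perturb_raw(steps: List[str], mode: str,
                other_task_steps: List[str], seed: int = 0) -> List[str]:
    """Return a perturbed copy of a raw trajectory (a list of step strings)."""
    if mode == "empty":
        return []
    if mode == "shuffle":
        shuffled = list(steps)  # copy so the caller's trajectory is untouched
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if mode == "irrelevant":
        # Swap in a trajectory recorded for an unrelated task.
        return list(other_task_steps)
    raise ValueError(f"unknown raw perturbation: {mode}")


def perturb_condensed(summary: str, mode: str,
                      other_task_summary: str, seed: int = 0) -> str:
    """Return a perturbed version of a condensed (summarised) experience."""
    if mode == "empty":
        return ""
    if mode == "corrupt":
        # Assumption: corruption is modelled as scrambling word order.
        words = summary.split()
        random.Random(seed).shuffle(words)
        return " ".join(words)
    if mode == "irrelevant":
        return other_task_summary
    if mode == "filler":
        # Assumption: filler is a content-free placeholder of equal word count.
        return " ".join(["lorem"] * len(summary.split()))
    raise ValueError(f"unknown condensed perturbation: {mode}")
```

Each perturbed memory would then be handed to the agent in place of the intact one, with everything else held fixed, so that any change in behavior can be attributed to the memory itself.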

Connections Found

The KB already integrates this paper deeply — the companion connect report found a reverse-edge-only result (the snapshot authors no outbound edges; the signal is which notes should point at it). Three notes already cite the paper by bare arXiv URL and should upgrade to labelled local evidence edges: distillation (uses it as the warning case that compressed experience can stay plausible yet lose behavioral influence), evaluate-memory-by-effects (direct support for "evaluate by effects, not existence" plus the WITH/WITHOUT method), and activate-behavior-changing-memory (its Behavioral Faithfulness section already names this paper). Three new evidence candidates: knowledge-storage-does-not-imply-contextual-activation (the §5.2 IG result is a clean instance of "context-to-action failure" — knowledge visible in context but not integrated — and is the strongest new fit), continual-learning-open-problem-is-behaviour-not-knowledge (accumulation succeeds but behaviour-change does not), and a weaker evolving-understanding-needs-re-distillation-not-composition. Two cross-collection edges were deferred to a future agent-memory-systems-scoped run: the reviews of Dynamic Cheatsheet and Agent Workflow Memory, both systems this paper benchmarks and finds unfaithful to condensed experience. Net: this source is the KB's central empirical anchor for the distillation/activation/evaluate-by-effects cluster — v3 reconfirms it at larger backbone scale.

Extractable Value

1. Faithfulness-by-intervention as a standard evaluation for distilled/memory artifacts -- perturb (corrupt, shuffle, replace, blank) a stored memory and measure whether downstream behavior changes; if it does not, the artifact is present but inert. High reach: tests causal operativeness, not mere presence. Already partly codified in evaluate-memory-by-effects; v3 widens the empirical base to 13 backbones. [experiment]
2. Behavioral faithfulness as a missing quality criterion for distillation -- a summary is not good because it is concise or plausible; it must preserve enough structure to steer later behavior. This is the sharpest single takeaway for distillation: brevity-optimized condensation predictably strips behavioral utility. [quick-win]
3. IG attribution as direct evidence for "storage does not imply activation" -- §5.2's layer-wise Integrated Gradients result (condensed experience consistently low attribution; current trajectory dominates later layers) is a mechanism-level instance of knowledge-storage-does-not-imply-contextual-activation. New relative to that seedling, which currently leans on behavioral evidence only. [deep-dive]
4. Condensation is a systematic activation-killer (synthesis candidate) -- distillation, activation, and evaluate-by-effects notes independently lean on this paper; together they imply a higher-order claim the KB may not yet name explicitly: read-back is not activation, and compression-for-brevity reduces the very behavioral influence memory is stored to produce. The pieces exist; the named mechanism may not. [deep-dive]
5. Keep raw traces and condensed summaries separate, choose by task -- the paper suggests condensed guidance is not a drop-in substitute for replayable raw detail; some tasks need traces, others tolerate distillation, and knowledge-intensive tasks need neither. Operational warning for any KB learning loop that condenses by default. [experiment]
6. Design condensation for actionable specificity, not brevity -- the strongest mechanism claim is that vague heuristics lose to local context and pretrained priors; the v3 error table quantifies the dominant failure as "distraction from task goal" (up to ~79% on smaller models). Condensation methods should optimize for causal uptake. [experiment]
7. Empirical citation upgrade -- standardise the three existing arXiv-URL citations onto the local snapshot path and refresh v2 -> v3, consolidating the version drift in one place. [just-a-reference]
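The faithfulness-by-intervention check from the list above can be reduced to a small measurement: run the same agent on the same tasks with intact and with perturbed memory, and compare success rates. The sketch below is illustrative only. The `Agent` signature, the two toy agents, and the 0.05 inertness threshold are all assumptions introduced here, not values or interfaces taken from the paper.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: an agent takes (task, memory) and reports success.
Agent = Callable[[str, str], bool]


def success_rate(agent: Agent, tasks: Sequence[str],
                 memory: Dict[str, str]) -> float:
    """Fraction of tasks solved when each task is paired with its memory."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(agent(task, memory[task]) for task in tasks) / len(tasks)


def faithfulness_gap(agent: Agent, tasks: Sequence[str],
                     intact: Dict[str, str],
                     perturbed: Dict[str, str]) -> float:
    """Drop in success rate caused by swapping intact memory for perturbed."""
    return (success_rate(agent, tasks, intact)
            - success_rate(agent, tasks, perturbed))


def is_inert(gap: float, threshold: float = 0.05) -> bool:
    """Assumed rule of thumb: a near-zero gap means the memory is not used."""
    return abs(gap) < threshold


# Toy agents standing in for a real LLM agent (purely illustrative).
def memory_dependent_agent(task: str, memory: str) -> bool:
    return task in memory  # succeeds only if the memory mentions the task


def memory_ignoring_agent(task: str, memory: str) -> bool:
    return True  # solves the task from "priors", never reads memory


tasks = ["open-door", "find-key"]
intact = {t: f"hint for {t}" for t in tasks}
blank = {t: "" for t in tasks}  # the Empty perturbation

dependent_gap = faithfulness_gap(memory_dependent_agent, tasks, intact, blank)
ignoring_gap = faithfulness_gap(memory_ignoring_agent, tasks, intact, blank)
```

Note that the memory-ignoring toy agent scores perfectly in both conditions: a high score with memory present says nothing about whether the memory is operative, which is why the comparison against a perturbed condition is the informative quantity.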

Limitations (our opinion)

This is editorial judgment. The strongest result is the raw-vs-condensed asymmetry; the strongest narrative is the three-cause explanation, and those are not equally well supported. The paper does not test condensation strategies designed explicitly for behavioral actionability, so the honest reading is "current summary forms often fail," not "compression as such fails" — a distinction the distillation note should preserve rather than overclaim. The benchmark mix tilts toward research environments and knowledge-intensive QA, which matters because the task-dependence cause partly reduces to "the model already knows enough"; that may transfer unevenly to software engineering or KB curation, where pretrained priors are weaker and raw traces costlier to carry. The internal-bias evidence rests on Integrated Gradients attribution, which is suggestive rather than definitive (attribution methods are notoriously sensitive to baselines), and the IG analysis runs on a single open-weight family (Qwen3), so the v3 expansion to 13 backbones strengthens the behavioral asymmetry but not the mechanistic claim. Finally, the paper measures whether memory is used, not whether it is good: an unfaithful summary can still raise scores, and a faithful memory can still be wrong. In KB terms this is a necessary but not sufficient complement to the verification/oracle notes — faithfulness checks that a memory is operative, not that it is correct.
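For reference, the baseline sensitivity noted above is visible in the standard definition of Integrated Gradients. The formula below is the textbook form for a model output F, input x, and baseline x'; the paper's layer-wise variant may differ in detail, and this report does not record which baseline it uses.

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha ,
\qquad
\sum_i \mathrm{IG}_i(x) \;=\; F(x) - F(x')
```

Both the integration path and the total attribution F(x) - F(x') depend on the choice of x', so a low attribution score for condensed-experience segments is a statement relative to a baseline rather than an absolute measure of use.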

Recommended Next Action

Upgrade the three existing bare-URL citations to labelled local evidence edges pointing at this snapshot and refresh them from v2 to v3 — in distillation, evaluate-memory-by-effects, and activate-behavior-changing-memory — then add a new evidence edge from knowledge-storage-does-not-imply-contextual-activation anchored on the §5.2 IG-attribution result. Defer the cross-collection edges (Dynamic Cheatsheet, Agent Workflow Memory reviews) to a connect run scoped to agent-memory-systems/COLLECTION.md. The "condensation is a systematic activation-killer" synthesis (Extractable Value 4) is the candidate worth a brainstorm, but it is a follow-on, not this step.
