Ingest: Trajectory-Informed Memory Generation for Self-Improving Agent Systems

Type: scientific-paper

Source: trajectory-informed-memory-generation-self-improving-agents.md Captured: 2026-03-13 From: https://arxiv.org/html/2603.10600v1

Classification

Type: scientific-paper — peer-reviewed-track preprint with explicit methodology, controlled evaluation against baselines on a standard benchmark (AppWorld), ablation studies, and academic citations.

Domains: agent-memory, deploy-time-learning, trajectory-analysis, distillation

Author: Gaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, Gegi Thomas — IBM Research. Credibility comes from institutional backing and work on enterprise agent infrastructure; this is their first paper in the agent memory space specifically.

Summary

The paper presents a three-phase pipeline for agent self-improvement: (1) analyze completed execution trajectories to extract "tips" — categorized as strategy (from successes), recovery (from failures), and optimization (from inefficient successes) — at both task and subtask granularity; (2) store, generalize, and consolidate tips through LLM-based merging; (3) retrieve relevant tips at runtime via either cosine similarity or LLM-guided selection and inject them into the agent prompt before reasoning. Evaluated on the AppWorld benchmark with a ReAct-style GPT-4 agent, the best configuration (subtask-level tips with LLM-guided retrieval) achieves +14.3 percentage points on scenario goal completion for held-out tasks, with the largest gains on the most complex tasks (difficulty-3: +28.5 pp SGC, a 149% relative increase). The key finding is that granularity matters — subtask-level tips outperform task-level because subtask patterns recur across tasks — and retrieval strategy matters for consistency across task variants.

Connections Found

The /connect discovery identified 8 genuine connections. The paper lands squarely in the KB's learning theory and agent memory neighborhood:

Contrasts with AgeMem analysis (memory-management-policy-is-learnable-but-oracle-dependent): Both learn from execution trajectories but extract to different substrates. AgeMem learns a memory management policy (when to store/retrieve) encoded in opaque model weights. This paper learns content (what tips to extract) encoded in inspectable natural-language artifacts. Both depend on task-completion oracles. The paper is a complementary data point: memory management is learnable when oracles exist, and the output substrate determines the inspectability-automation trade-off.

Exemplifies the automating-KB-learning problem (automating-kb-learning-is-an-open-problem): The paper demonstrates a closed learning loop (execute -> analyze -> extract -> inject -> improved execution) that the KB's "boiling cauldron" aspires to, with an important qualifier: it works because it has a clear oracle (task completion). The KB lacks such an oracle.

Extends the comparative review (agentic-memory-systems-comparative-review): Adds a 12th system with a distinctive position — developer-managed external service agency model, no link structure between tips, and two-strategy retrieval. The tip consolidation step edges toward automated synthesis of operational knowledge, which is notable given the review's finding that "everyone automates extraction, nobody automates synthesis."

Exemplifies deploy-time learning (deploy-time-learning-the-missing-middle): Tips are durable, inspectable artifacts that persist across sessions and improve behavior without weight updates — precisely the "missing middle" between prompt engineering and fine-tuning.

Exemplifies distillation (distillation): Trajectory-to-tip extraction is distillation — compressing full execution traces into focused operational guidance. The subtask-level outperforming task-level is evidence that distillation granularity matters.

Also exemplifies: learning-is-not-only-about-generality (tips are low-reach accumulation), constraining-during-deployment-is-continuous-learning (tips constrain future behavior), and three-space-agent-memory-maps-to-tulving-taxonomy (tips map cleanly to procedural/operational space — none are semantic or episodic).

Extractable Value

Subtask granularity outperforms task granularity for tip transfer — concrete evidence that finer-grained distillation transfers better because subtask patterns recur across tasks. Directly relevant to how we think about distillation granularity in notes and skills. [just-a-reference]
Three tip categories (strategy/recovery/optimization) as a taxonomy for operational knowledge — this categorization by trajectory outcome (success/failure/inefficiency) could inform how we categorize lessons learned in workshop reviews or log entries. [quick-win]
LLM-guided retrieval dramatically outperforms cosine similarity for consistency — retrieval strategy drives scenario goal completion more than tip content quality. The gap is largest when task variants require the same underlying strategy but have different surface descriptions. Relevant to our retrieval design. [just-a-reference]
Tip consolidation via LLM-based merging as proto-synthesis — the paper's consolidation step (clustering semantically similar tips, then LLM-merging them) is the closest any surveyed system comes to automated synthesis. Worth tracking as a potential mechanism for the "boiling cauldron." [deep-dive]
Difficulty-3 tasks show the largest gains (+28.5 pp SGC) — complex multi-step tasks benefit disproportionately from accumulated operational knowledge. This suggests memory systems' value is non-linear with task complexity, which has implications for when deploy-time learning investments pay off. [just-a-reference]
The inspectability advantage over AgeMem is concrete — both systems learn from trajectories with similar oracle requirements, but this paper's output is diffable, searchable, and composable. This is a direct data point for the inspectable-substrate argument, and the paper shows no performance penalty for inspectability. [quick-win]

Limitations (our opinion)

Single benchmark, single model family. All evaluation uses AppWorld with GPT-4 in a ReAct configuration. The authors acknowledge this ("evaluation across additional model families as future work"), but the limitation is deeper than they state: AppWorld is an API-calling benchmark with deterministic success criteria. The tip categories (strategy/recovery/optimization) may reflect AppWorld's structure rather than a general taxonomy of operational knowledge. Whether these tips help on tasks requiring creative problem-solving, ambiguous goals, or multi-turn negotiation is unknown.

No comparison to simpler alternatives. The paper compares against a no-tip baseline and dismisses generic memory systems (Mem0, Letta) as storing "conversational facts." But it does not compare against simpler experience-based alternatives: few-shot examples from successful trajectories, chain-of-thought summaries of past tasks, or even direct trajectory appending. The three-phase pipeline may be over-engineered relative to simpler approaches that would capture similar signal. The "simpler account" check: could the gains come primarily from having any relevant prior experience in context, regardless of the categorization and consolidation machinery?

Tip consolidation is underdeveloped. The consolidation step — which the connection report flags as proto-synthesis — is described briefly and not ablated. We do not know whether consolidated tips outperform unconsolidated ones, whether the merging loses critical specificity, or whether the consolidation introduces errors. The most theoretically interesting component is the least evaluated.

No tip lifecycle or retirement. The paper describes tip extraction and consolidation but says nothing about what happens when tips become stale, contradictory, or counterproductive. In a continuously learning system, tip accumulation without retirement will eventually degrade performance as irrelevant tips compete for context space. The automating-kb-learning note's "Retire" mutation is absent here.

Oracle dependency is not discussed as a limitation. The paper treats task/scenario goal completion as a natural evaluation metric without acknowledging that this oracle is what makes the entire pipeline possible. Systems operating in domains without clear task-completion signals — open-ended knowledge work, creative tasks, advisory roles — cannot adopt this framework as-is. The memory-management-policy note's central insight (oracle dependency as the real bottleneck) applies equally here but goes unacknowledged.

No analysis of tip quality or failure modes. What do bad tips look like? Do incorrect tips (extracted from trajectories that succeeded for the wrong reasons) actively harm performance? The paper reports averages but no analysis of when tips hurt — understanding failure modes would be more valuable than the headline improvements.

No public source code. As of 2026-03-13, no code repository has been released. Web searches for the paper's title, arXiv ID, and author names yield no GitHub or other code hosting links. This limits reproducibility and makes it impossible to verify implementation details (e.g., exact consolidation prompts, retrieval thresholds).

Recommended Next Action

Update memory-management-policy-is-learnable-but-oracle-dependent: add a "Complementary approaches" section noting this paper as a contrast case — same input (trajectories), same oracle dependency, but inspectable output substrate (tips vs weights). The comparison sharpens the inspectability argument: this paper achieves comparable learning-from-trajectories gains without sacrificing inspectability, which strengthens the case that the substrate trade-off is real but that inspectable substrates are competitive. A single paragraph with a link to this ingest report would suffice.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search