Ingest: Adaptation of Agentic AI
Type: kb/sources/types/ingest-report.md
Source: adaptation-of-agentic-ai-survey-post-training-memory-skills.md
Captured: 2026-04-27
From: https://arxiv.org/html/2512.16301v3
Classification
Type: scientific-paper -- arXiv v3 survey/preprint with a formal taxonomy, extensive literature review, citations, comparative tables, and evaluation recommendations, but no original experiments of its own.
Domains: agentic-adaptation, learning-theory, agent-memory, evaluation-methodology
Author: Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, and Jiawei Han. The multi-institution author list includes established AI, ML, NLP, security, and data-mining researchers, so the paper is worth treating as a field-map signal rather than a single-system report.
Summary
The paper surveys adaptation in agentic AI under a four-paradigm taxonomy: A1 adapts the agent from tool-execution feedback, A2 adapts the agent from final-output or holistic rewards, T1 trains agent-agnostic tools, and T2 adapts tools under supervision from a fixed agent. Its most relevant contribution for this KB is not any one method but the organizing frame: post-training, memory, skill libraries, retrievers, planners, subagents, and tool ecosystems are all adaptation surfaces. The survey also argues that evaluation must be paradigm-aware, component-counterfactual, and dynamics-aware, because endpoint success rates hide data efficiency, forgetting, co-adaptation instability, safety regression, and tool-vs-agent attribution.
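As an aside for this note, a minimal sketch of the taxonomy as a data structure makes the two axes explicit. The A1/A2/T1/T2 labels follow the paper, but the field names, enums, and entries below are our own illustration, not the survey's formalism or code.

```python
# Hedged sketch: the A1/A2/T1/T2 labels come from the survey; the field
# names and entries below are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from enum import Enum


class Locus(Enum):
    AGENT = "agent"   # the agent itself is updated
    TOOL = "tool"     # tools / environment components are updated


class Signal(Enum):
    EXECUTION = "tool-execution feedback"          # step-level, from tool calls
    OUTCOME = "final-output or holistic reward"
    AGENT_AGNOSTIC = "agent-agnostic supervision"
    FROZEN_AGENT = "feedback from a fixed agent"


@dataclass
class Paradigm:
    name: str
    locus: Locus
    signal: Signal


TAXONOMY = [
    Paradigm("A1", Locus.AGENT, Signal.EXECUTION),
    Paradigm("A2", Locus.AGENT, Signal.OUTCOME),
    Paradigm("T1", Locus.TOOL, Signal.AGENT_AGNOSTIC),
    Paradigm("T2", Locus.TOOL, Signal.FROZEN_AGENT),
]
```

Reading the taxonomy as locus times signal this way also shows what it leaves out: there is no field for which artifact substrate holds the learned result, which is the gap the extractable-value items below return to.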
Connections Found
The connection report found a dense learning-theory and memory cluster. The source compares with "Continual learning's open problem is behaviour, not knowledge", "Deploy-time learning is the missing middle", and "Treat continual learning as substrate coevolution": the paper gives a mainstream ML taxonomy for agent/tool optimization, while the KB gives an artifact-substrate and role taxonomy that better captures readable repo artifacts. It evidences "Memory management policy is learnable but oracle-dependent" and "Agent memory is a crosscutting concern, not a separable niche" by classifying memory according to update mechanism rather than treating it as one pluggable subsystem. It compares with "Skills derive from methodology through distillation" and "Skills are instructions plus routing and execution policy": the paper covers acquisition and reuse of skill libraries, while the KB covers methodology distillation and harness binding. It also evidences "The boundary of automation is the boundary of verification" and extends "Reliability dimensions map to oracle-hardening stages" with adaptation-specific evaluation requirements.
Extractable Value
- Optimization locus and supervision signal are not enough; artifact substrate is the missing axis. High reach. A1/A2/T1/T2 cleanly separates agent-vs-tool and execution-vs-output signals, but it hides whether the durable learned result lives in weights, external tools, memory stores, prose policies, or symbolic artifacts. This is exactly where the KB's substrate-coevolution frame adds value. [quick-win]
- Memory and skills are adaptation mechanisms, not side modules. High reach. The survey's surprisingly useful move is to place adaptive memory, skill libraries, retrievers, and subagents inside the same adaptation map as post-training. This supports the KB claim that memory crosses storage, context engineering, and learning, while adding a broader literature vocabulary for tool-side adaptation. [quick-win]
- T2 names a common pattern: keep the agent fixed and adapt its environment. High reach. The T2 category captures retriever tuning, search subagents, memory-update modules, and skill libraries trained from frozen-agent feedback (a minimal loop is sketched after this list). That maps cleanly onto many practical agent systems where changing the model is expensive but changing the surrounding artifacts or tools is cheap. [experiment]
- Adaptation evaluation needs component counterfactuals. High reach. The paper's strongest evaluation prescription is to hold the agent fixed while swapping tools, or to ablate tool use while holding the agent fixed, so gains can be attributed to the adapted component instead of being read off endpoint task success (see the second sketch after this list). This sharpens the KB's oracle/evaluation notes with a concrete reporting standard. [quick-win]
- Endpoint metrics erase adaptation dynamics. Medium-high reach. Data efficiency, retention-set performance, safety-performance trajectories, and co-adaptation stability are different questions from final accuracy. This extends reliability evaluation from static "does it work?" checks to "how did it learn, what did it forget, and what failure window appeared while learning?" [experiment]
- Skill libraries have two separable questions: how skills are acquired and how skills are invoked. Medium reach. The survey focuses on acquisition routes -- demonstration, reflection, exploration, RL, and programmatic skill induction -- while the KB focuses on discovery, invocation, execution policy, and methodology provenance. Keeping those questions separate prevents "skill" from becoming a vague label. [quick-win]
- The simpler account is oracle strength plus modular update cost. The four paradigms sound like a full taxonomy, but many trade-offs reduce to two simpler forces: where the strongest feedback signal exists, and which component can be updated cheaply without breaking the rest of the system. The taxonomy is useful when it preserves those mechanisms; it becomes easier to vary when used as field-labeling alone. [just-a-reference]
Limitations (our opinion)
This is a survey, not a primary empirical result. Its comparative claims depend on heterogeneous papers with different backbones, tasks, budgets, and evaluation protocols. Treat the taxonomy and literature map as valuable; treat quantitative cross-paradigm comparisons as prompts for follow-up, not settled evidence.
The A1/A2/T1/T2 taxonomy under-describes readable artifact learning. It sees memory and skills, but mostly through ML/tool-adaptation categories. Commonplace's central mechanisms -- maintained notes, instructions, schemas, tests, indexes, and skills as repo artifacts -- fit awkwardly unless an additional artifact-substrate axis is added.
The central claim is moderately easy to vary. One could redraw the same literature by feedback granularity, training time vs deploy time, component ownership, or artifact inspectability and still recover many of the same recommendations. The hard-to-vary part is narrower: adaptation decisions depend on what signal is available, what component is cheap to update, and whether evaluation can attribute gains to that component.
The evaluation section is stronger as an agenda than as a solved method. Counterfactual component swaps assume components are approximately separable, but adapted agents may change behavior when tools change. Living benchmarks need generated tasks that remain solvable, discriminative, and non-degenerate; the paper acknowledges this but does not solve the verifier problem.
The safety discussion is necessarily tied to a fast-moving landscape. The paper names unsafe exploration, reward hacking, safety regression, and parasitic tool adaptation, but these risks depend on current protocol ecosystems and threat models. Use it for risk categories, not for durable prevalence estimates.
Recommended Next Action
Write a note titled "Agentic adaptation taxonomies need an artifact-substrate axis" connecting to treat-continual-learning-as-substrate-coevolution.md, continual-learning-open-problem-is-behaviour-not-knowledge.md, memory-management-policy-is-learnable-but-oracle-dependent.md, and skills-derive-from-methodology-through-distillation.md. It would argue that A1/A2/T1/T2 usefully classify optimization locus and supervision signal, but agent-operated KB methodology also needs to classify the durable learned artifact: weights, prose, symbolic code, external tools, memory stores, and skill libraries have different inspectability, update cost, verification surfaces, and maintenance loops.