h t t p s : / / a r x i v . o r g / h t m l / 2 6 0 1 . 0 1 8 8 5 v 1

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Type: academic-paper

Author: Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu (Alibaba Group, Wuhan University) Source: https://arxiv.org/html/2601.01885v1 Date: 2025

Abstract

This paper introduces AgeMem, a unified framework that integrates long-term memory (LTM) and short-term memory (STM) management directly into LLM agent policies. Rather than treating memory as external components, the system exposes memory operations as tool-based actions, enabling agents to autonomously decide when and how to store, retrieve, update, summarize, or discard information.

Key Contributions

Unified Memory Framework: A single system managing both LTM and STM through learnable, tool-based actions instead of heuristic pipelines
Three-Stage Progressive RL Strategy: Agents first acquire LTM storage capabilities, then learn STM context management, finally coordinating both under full task settings
Step-wise GRPO Mechanism: Addresses sparse and discontinuous rewards from memory operations through group-normalized advantage broadcasting across all trajectory timesteps

Core Technical Approach

Memory Management Tools

The system provides six primary tools: - LTM operations: Add, Update, Delete - STM operations: Retrieve, Summary, Filter

Problem Formulation

The agent observes state s_t = (C_t, ℳ_t, 𝒯) comprising conversation context, long-term memory store, and task specification. Actions span both language generation and memory operations, optimized through policy π_θ.

Three-Stage Trajectory Structure

Each trajectory τ = (τ^(1), τ^(2), τ^(3)) includes: - Stage 1: Casual interaction where agents identify salient information for LTM storage - Stage 2: Context reset with distractors testing STM management through filtering/compression - Stage 3: Formal task requiring coordinated retrieval and reasoning

Reward Function Design

The composite reward combines: - Task completion reward (R_task): LLM-based judgment of answer correctness - Context management reward (R_context): Compression efficiency, preventive actions, information preservation - Memory management reward (R_memory): Storage quality, maintenance operations, semantic relevance - Penalty terms: Discourage context overflow and excessive interaction rounds

Experimental Results

Evaluation across five benchmarks (ALFWorld, SciWorld, PDDL, BabyAI, HotpotQA):

Performance gains: 23.52-49.59% improvement over no-memory baselines
Memory quality: Higher semantic relevance of stored memories versus baselines
Context efficiency: 3.1-5.1% token reduction while maintaining task performance
RL contribution: 8.53-8.72 percentage point improvements from training strategy

Notable Findings

The paper demonstrates that: - Unified management outperforms independent LTM/STM optimization - Tool usage increases meaningfully post-training (Add operations 0.92→1.64 on Qwen2.5-7B) - Multi-component rewards accelerate convergence better than task-only signals - The system generalizes across diverse task types and model backbones

Limitations

The authors acknowledge that the fixed tool set, while effective, could support finer-grained control. Broader evaluation across additional task domains could strengthen empirical understanding.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search