xMemory
Type: ../types/agent-memory-system-review.md · Status: current · Tags: trace-derived
xMemory is HU-xiaobai's research code for "Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation." The repository implements a dialogue-memory pipeline for LoCoMo-style long conversations: buffer messages, cut episode boundaries, summarize episodes, extract semantic facts through prediction-correction, cluster facts into theme summaries, build auxiliary kNN and hierarchy files, then retrieve top-down with coverage selection and entropy-gated expansion. It is a real trace-derived memory system, but it is research code rather than a packaged service: the checked-in harness is LoCoMo-specific, PerLTQA is mentioned in the README but not implemented as a parallel evaluator in this checkout, and the graph export is mostly an inspection artifact rather than the retrieval engine.
Repository: https://github.com/HU-xiaobai/xMemory
Reviewed commit: 375ae1495095aa14a39eb169f83737f4779391c6
Last checked: 2026-05-16
Core Ideas
Dialogue messages are buffered until an LLM boundary detector or size limit creates an episode. The public facade exposes add_messages, flush, wait_for_semantic, search, and hierarchy-update helpers over MemorySystem (facade.py). MemorySystem.add_messages(...) turns incoming dictionaries into Message objects, appends them to a per-user MessageBuffer, optionally runs smart boundary detection, and creates episodes from the previous buffer when the detector says to split (memory_system.py, message_buffer.py). The detector is not a learned classifier in the repo; it is an LLM JSON prompt over conversation history and the new message, with confidence recorded but not used as a threshold (boundary_detector.py, prompts.py).
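The buffering behavior described above can be illustrated with a minimal sketch. It is not the repository's code: `BufferSketch`, the `BoundaryDetector` signature, and `max_size` are hypothetical names, and the detector is an injected callable standing in for the LLM JSON prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    role: str
    content: str


# Hypothetical signature: (history, new_message) -> True when a new episode should start.
BoundaryDetector = Callable[[List[Message], Message], bool]


@dataclass
class BufferSketch:
    detector: BoundaryDetector
    max_size: int = 4  # assumed size limit, not the repo's default
    pending: Dict[str, List[Message]] = field(default_factory=dict)

    def add_message(self, user_id: str, raw: dict) -> List[List[Message]]:
        """Append one message and return any episodes closed by this call."""
        msg = Message(role=raw["role"], content=raw["content"])
        buf = self.pending.setdefault(user_id, [])
        closed: List[List[Message]] = []
        # Split before appending: the previous buffer becomes the episode.
        if buf and self.detector(buf, msg):
            closed.append(list(buf))
            buf.clear()
        buf.append(msg)
        # The size limit acts as a fallback cut when the detector never fires.
        if len(buf) >= self.max_size:
            closed.append(list(buf))
            buf.clear()
        return closed
```

Passing the detector in as a callable keeps the sketch testable without an LLM; in the reviewed code the equivalent decision comes from a prompt over the conversation history and the new message.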
Episodes retain both an LLM summary and the raw source messages. EpisodeGenerator.generate_episode(...) asks an LLM for a title, narrative content, and timestamp, then stores the cleaned original messages in the Episode.original_messages field (episode_generator.py, episode.py). EpisodeStorage persists per-user JSONL files under storage_path/episodes, while Chroma stores title+content vectors for episode search (episode_storage.py, chroma_search.py). Raw dialogue messages therefore remain available as source evidence, but the default answer prompt sees episode summaries unless retrieval marks an episode for raw-message expansion.
Semantic facts are generated by prediction-correction rather than direct top-k summarization alone. After an episode is created, an event handler schedules async semantic generation. If prediction-correction is enabled, xMemory retrieves relevant existing semantic statements from Chroma, predicts what the new episode should contain from those statements, compares that prediction with the raw original messages, and extracts persistent facts from the gap (memory_system.py, semantic_generator.py, prediction_correction_engine.py). Those facts are stored as SemanticMemory JSONL rows with source episode IDs and confidence, then indexed in Chroma. A duplicate check compares embeddings against same-type existing memories, but the default semantic_similarity_threshold is 1, so only near-identical embedding matches are filtered unless config changes it (semantic.py, config.py).
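To see why a threshold of 1 filters almost nothing, consider a minimal sketch of an embedding duplicate check. The function names and the `>=` comparison are assumptions for illustration; the repository's actual check may differ in detail.

```python
import math
from typing import Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_duplicate(
    candidate: Sequence[float],
    existing: Iterable[Sequence[float]],
    threshold: float = 1.0,  # mirrors the default reported in this review
) -> bool:
    """True when any existing embedding is at least `threshold` similar (assumed >=)."""
    return any(cosine(candidate, vec) >= threshold for vec in existing)
```

With the default, a paraphrased fact whose embedding has cosine similarity of about 0.8 to an existing fact passes through as new; lowering the threshold, for example to 0.75, would filter it.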
Theme memory is an online aggregation layer over semantic facts. ThemeManager persists themes/{user_id}_themes.jsonl, local theme vectors, and optional Chroma theme entries. New semantic facts attach to a theme when centroid similarity clears DEFAULT_THEME_ATTACH_THRESHOLD = 0.62; a lenient floor at 0.52 can still attach a fact to the nearest theme if that theme is not full; oversize or heterogeneous themes split when their size exceeds MAX_THEME_SIZE = 12 or average intra-theme similarity drops below MIN_INTRA_SIM = 0.72; and very similar themes merge around MERGE_THRESHOLD = 0.78 if the combined size stays within the limit (memory_hierarchy.py). The scoring function combines sparsity and semantic cohesion over theme subgraphs, so the hierarchy is not only a display tree; it is a derived selection structure over semantic facts.
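The attach rule can be sketched as a small decision function. The constant values come from this review; the function name, the argument shapes, the `>=` comparisons, and the choice to let a strong match attach even to a full theme (leaving the split step to handle it) are assumptions.

```python
import math
from typing import Dict, Optional, Sequence

ATTACH_THRESHOLD = 0.62  # DEFAULT_THEME_ATTACH_THRESHOLD
LENIENT_FLOOR = 0.52
MAX_THEME_SIZE = 12


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def choose_theme(
    fact_vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    sizes: Dict[str, int],
) -> Optional[str]:
    """Return the theme id to attach to, or None to signal 'start a new theme'."""
    if not centroids:
        return None
    best_id = max(centroids, key=lambda tid: cosine(fact_vec, centroids[tid]))
    sim = cosine(fact_vec, centroids[best_id])
    if sim >= ATTACH_THRESHOLD:
        return best_id  # strong match
    if sim >= LENIENT_FLOOR and sizes.get(best_id, 0) < MAX_THEME_SIZE:
        return best_id  # weak match, allowed only while the nearest theme has room
    return None
```

A fact with similarity around 0.55 to its nearest centroid therefore joins that theme while it holds fewer than 12 facts, and seeds a new theme once it is full.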
The adaptive retrieval path is top-down and coverage-oriented. The LoCoMo search script loads themes, semantic facts, episode summaries, semantic kNN JSON, and locally recomputed embeddings from the filesystem. It first embeds the question, takes a semantic pool, maps those semantics to candidate themes, selects representative themes with greedy coverage plus query score, restricts semantic candidates through selected themes, then selects representative semantic facts using semantic kNN coverage (xMemory_search_framework.py). Episode candidates come from source episodes behind selected facts plus vector-similar episodes. This is the code-level version of "decoupling and aggregation": the retrieval unit is not one flat top-k passage list, but a chain from theme representatives to semantic representatives to episodes.
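A minimal sketch of greedy coverage selection shows how this differs from flat top-k. It assumes each candidate "covers" itself plus its kNN neighbors and that the marginal coverage is added to a query score with a hypothetical `coverage_weight`; the repository's actual objective may be weighted differently.

```python
from typing import Dict, Iterable, List, Set


def greedy_cover(
    candidates: Iterable[str],
    neighbors: Dict[str, List[str]],
    query_score: Dict[str, float],
    k: int,
    coverage_weight: float = 1.0,  # hypothetical weighting, not from the repo
) -> List[str]:
    """Pick up to k representatives, each time maximizing new coverage plus query score."""
    remaining: Set[str] = set(candidates)
    covered: Set[str] = set()
    selected: List[str] = []

    def gain(cid: str) -> float:
        newly_covered = ({cid} | set(neighbors.get(cid, []))) - covered
        return coverage_weight * len(newly_covered) + query_score.get(cid, 0.0)

    while remaining and len(selected) < k:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        covered |= {best} | set(neighbors.get(best, []))
        remaining.discard(best)
    return selected
```

In a toy case where `a` neighbors both `b` and `c`, and `d` is isolated, the selector picks `a` and then `d`, even if `b` has the highest query score, because `b` adds no new coverage once `a` is chosen. The same routine could be applied first to themes and then to semantic facts restricted to the selected themes.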
Entropy gates decide when to include episodes and raw messages. _estimate_entropy(...) builds a prompt from selected themes, semantic facts, and candidate episodes, then estimates answer negative log likelihood through the local HF LLM client's logprob methods. _search_adaptive_hier(...) compares entropy before and after candidate episodes, keeps episodes whose information gain clears configured thresholds, and marks only high-gain episodes with expand_original=True; hydration then loads original_messages only for those marked episodes and only up to the configured expansion limit (xMemory_search_framework.py). This gives raw dialogue messages a late, uncertainty-gated authority rather than always forcing them into the prompt.
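The gating logic can be sketched independently of the LLM. In this illustration the entropy values are supplied by the caller (in the reviewed code they come from logprob-based estimates); `keep_gain`, `expand_gain`, and `max_expand` are hypothetical parameter names, scoring each episode independently is an assumption, and applying the expansion limit at marking time is a simplification of the hydration step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GatedEpisode:
    episode_id: str
    gain: float
    expand_original: bool


def gate_episodes(
    base_entropy: float,
    entropy_with: Dict[str, float],  # entropy after adding each candidate episode
    keep_gain: float,
    expand_gain: float,
    max_expand: int,
) -> List[GatedEpisode]:
    """Keep episodes whose information gain clears keep_gain; flag top ones for raw expansion."""
    kept = []
    for episode_id, entropy in entropy_with.items():
        gain = base_entropy - entropy  # positive when the episode reduces uncertainty
        if gain >= keep_gain:
            kept.append((episode_id, gain))
    kept.sort(key=lambda item: item[1], reverse=True)

    gated: List[GatedEpisode] = []
    expanded = 0
    for episode_id, gain in kept:
        expand = gain >= expand_gain and expanded < max_expand
        expanded += int(expand)
        gated.append(GatedEpisode(episode_id, gain, expand))
    return gated
```

Only entries with `expand_original=True` would then have their `original_messages` loaded into the prompt; the rest contribute their episode summaries.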
The storage layout is mixed JSONL, Chroma, local kNN, and GEXF. The README describes a memory directory with chroma_db, episodes, graph, semantic, semantic-knn, and themes, while the code currently writes episodes, semantic, semantic_knn, themes/vector, graphs/{user_id}_memory_graph.gexf, and graphs/{user_id}_memory_graph_embeddings.npy under storage_path (README.md, memory_system.py, memory_hierarchy.py). The GEXF graph records message/episode/semantic/theme nodes and cross-level edges, but the adaptive retriever does not read that GEXF file; it reads theme JSONL/vector files, semantic JSONL, episode JSONL, and semantic kNN JSON. evaluation/gexf_view.py is an analysis utility for graph summaries, not the main retrieval engine.
The evaluation harness is mostly LoCoMo-shaped. evaluation/locomo/add.py converts LoCoMo conversations into timestamped messages, constructs memories, waits for semantic generation, updates themes, updates the hierarchy graph, and verifies episode files. evaluation/locomo/xMemory_search_framework.py answers LoCoMo QA items, records memories and token stats, and supports baseline versus adaptive_hier. evals.py computes F1 and BLEU over result files, while generate_scores.py aggregates category means (add.py, xMemory_search_framework.py, evals.py, generate_scores.py). The README claims experiments on LoCoMo and PerLTQA, but this checkout contains no PerLTQA directory or script; PerLTQA is only named as an external dataset link in the README.
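For readers unfamiliar with the metric, a common token-overlap F1 for QA answers looks like the sketch below. This is the generic formulation with simple whitespace tokenization, offered as an assumption about the kind of score evals.py reports; the script's own normalization may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "the cat sat" against a reference of "the cat" has precision 2/3 and recall 1, which gives an F1 of 0.8.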
Comparison with Our System
| Dimension | xMemory | Commonplace |
|---|---|---|
| Primary purpose | Dialogue-memory retrieval for long-conversation QA | Agent-operated methodology KB with durable notes, references, instructions, reviews, and indexes |
| Raw substrate | Dataset dialogue messages retained inside per-user episode JSONL | Git-tracked markdown artifacts and source snapshots |
| Distilled artifacts | Episode summaries, semantic facts, theme summaries | Notes, ADRs, source reviews, instructions, skills, schemas, generated indexes |
| Derived indexes | Chroma episode/semantic collections, semantic kNN JSON, theme vector files, GEXF graph export | Authored links, generated directory indexes, review reports, validation outputs |
| Retrieval | Baseline BM25/vector/hybrid search or LoCoMo adaptive hierarchy with coverage and entropy gates | Agent-driven lexical search, links, indexes, type contracts, and task-specific skills |
| Behavioral authority | Retrieval scripts rank and select memory context; selected facts/themes/episodes advise the answer prompt | Knowledge artifacts advise; system-definition artifacts instruct, route, validate, enforce, or configure |
| Lineage | Semantic facts point to episode IDs; episodes retain original messages; themes list semantic IDs | Source URLs, reviewed commits, frontmatter status, archives, validation, and git history |
xMemory is stronger than commonplace as an experimental adaptive retrieval stack. It explicitly separates raw dialogue, episode summaries, facts, themes, vector indexes, kNN adjacency, and graph exports, then tests a retrieval policy that prefers high-level diverse representatives before spending tokens on episodes and raw messages. Commonplace has no comparable entropy-gated retriever over generated theme/fact hierarchies.
Commonplace is stronger as a governed knowledge system. xMemory's generated facts and themes have useful local lineage, but they do not carry review state, prompt-version provenance, contradiction handling, invalidation rules, or promotion boundaries. That is a reasonable tradeoff for benchmark QA; it would be weak if the same generated facts were promoted into durable instructions or methodology claims.
The important artifact split is behavioral authority. Raw dialogue messages and episode summaries are knowledge artifacts when consumed as evidence or context. Semantic facts and theme summaries are also knowledge artifacts as stored content, but the adaptive retriever gives them ranking and routing authority over what reaches the answer prompt. The threshold constants, config files, prompts, retrieval scripts, and entropy gates are system-definition artifacts: they configure extraction, boundary detection, clustering, selection, and expansion behavior. Chroma embeddings, kNN files, and vector arrays are derived ranking substrates rather than canonical knowledge.
Read-back: both. Callers can search memory, and the adaptive QA harness injects selected themes, facts, episodes, and raw messages into the answer prompt.
Borrowable Ideas
Use high-level representatives before low-level expansion. Ready as an evaluation pattern. xMemory's theme -> semantic -> episode path is a good testbed for commonplace search: start from compact summaries or claim clusters, then expand only when a downstream uncertainty or coverage signal justifies it.
Make raw trace expansion explicitly gated. Ready for review and workshop contexts. xMemory's expand_original flag is a clear boundary between summarized evidence and raw trace evidence. Commonplace could use a similar distinction when deciding whether to load full source snapshots, review logs, or prior agent transcripts.
Treat kNN and GEXF as different artifacts. Ready as a caution. xMemory usefully separates semantic kNN JSON used by retrieval from GEXF graph export used for analysis. Commonplace should keep generated navigational aids similarly honest: an inspectable graph is not automatically an active retrieval engine.
Use prediction-correction for candidate fact extraction. Worth testing in low-authority surfaces. The predict-then-extract-gap prompt can reduce redundant fact creation, but it depends on LLM judgment and weak duplicate thresholds. In commonplace it belongs first in generated reports or candidate notes, not directly in library instructions.
Borrow the hierarchy, not the packaging. The code has useful mechanisms but no pyproject.toml, no tests in the checkout, and no installable package metadata beyond environment.yml; MemoryConfig.__post_init__() requires OPENAI_API_KEY even when the LoCoMo scripts pass a local HF LLM (environment.yml, config.py). The borrowable part is the retrieval decomposition, not the project shell.
Trace-derived learning placement
Trace source. xMemory qualifies as trace-derived learning. The source traces are LoCoMo-style dialogue sessions: speaker turns, timestamps, image captions, search queries, and QA evidence in the dataset. add.py turns those traces into message dictionaries with preserved dataset metadata before feeding them into the memory system (add.py, message.py).
Extraction. Extraction happens in stages. Boundary detection splits the stream into episode-sized units. Episode generation distills messages into a title/content/timestamp record while preserving original messages. Semantic generation either performs per-episode extraction or, by default, prediction-correction: retrieve old facts, predict the new episode, compare the prediction with the original messages, and extract missing persistent facts. Theme construction then groups semantic facts by embedding centroid, summarizes each group, and splits or merges groups using size, similarity, sparsity, and semantic-cohesion criteria.
Storage substrate. Raw messages are stored inside episode JSONL rows. Episode summaries persist under storage_path/episodes/{user_id}_episodes.jsonl and are also indexed in Chroma. Semantic facts persist under storage_path/semantic/{user_id}_semantic.jsonl, Chroma semantic collections, and semantic kNN JSON. Theme summaries persist under storage_path/themes/{user_id}_themes.jsonl, themes/vector/{user_id}_embeddings.npy, and themes/vector/{user_id}_theme_ids.json, with optional Chroma theme upserts. The hierarchy graph persists as GEXF plus a NumPy embedding matrix under storage_path/graphs, but adaptive retrieval does not consume the GEXF.
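The per-user layout above can be written down as a small path map, which is handy when checking a memory directory after a benchmark run. The helper name and dictionary keys are hypothetical; only the path patterns are taken from this review, and the semantic kNN file is omitted because its exact filename is not stated here.

```python
from pathlib import PurePosixPath
from typing import Dict

# Per this review, the adaptive retriever does not consume the GEXF export.
NOT_READ_BY_ADAPTIVE_RETRIEVAL = {"graph_gexf"}


def expected_artifacts(storage_path: str, user_id: str) -> Dict[str, PurePosixPath]:
    """Map hypothetical artifact names to the per-user paths described in the review."""
    root = PurePosixPath(storage_path)
    return {
        "episodes": root / "episodes" / f"{user_id}_episodes.jsonl",
        "semantic": root / "semantic" / f"{user_id}_semantic.jsonl",
        "themes": root / "themes" / f"{user_id}_themes.jsonl",
        "theme_embeddings": root / "themes" / "vector" / f"{user_id}_embeddings.npy",
        "theme_ids": root / "themes" / "vector" / f"{user_id}_theme_ids.json",
        "graph_gexf": root / "graphs" / f"{user_id}_memory_graph.gexf",
        "graph_embeddings": root / "graphs" / f"{user_id}_memory_graph_embeddings.npy",
    }
```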
Representational form. Raw dialogue and episode summaries are prose traces. Semantic facts and theme summaries are prose knowledge artifacts with symbolic IDs, timestamps, source episode lists, and JSONL structure. Chroma collections, local .npy arrays, and embedding caches are distributed-parametric retrieval substrates. Semantic kNN JSON and GEXF are symbolic graph/index artifacts. The adaptive retrieval script is symbolic system-definition code over those artifacts.
Lineage. The best lineage path is raw message -> episode ID -> semantic fact source_episodes -> theme semantic_ids -> retrieval result metadata. That is enough to trace a selected fact or theme back to episodes and often back to original messages. It is not enough to regenerate or audit every derived artifact: the memory rows do not record extraction prompt versions or the identity of the LLM that constructed them, and the code does not persist split/merge decisions, entropy scores, or exact source spans for each fact.
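Walking that lineage path backward from a theme is a pair of dictionary lookups. The sketch below assumes rows have already been loaded from JSONL into dictionaries keyed by ID; the field names `semantic_ids`, `source_episodes`, and `original_messages` follow this review, while the function name and the output shape are hypothetical.

```python
from typing import Any, Dict, List


def trace_theme(
    theme: Dict[str, Any],
    facts_by_id: Dict[str, Dict[str, Any]],
    episodes_by_id: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Follow theme -> semantic facts -> source episodes -> original messages."""
    trail: List[Dict[str, Any]] = []
    for semantic_id in theme.get("semantic_ids", []):
        fact = facts_by_id.get(semantic_id)
        if fact is None:
            continue  # dangling reference: the fact row is missing
        for episode_id in fact.get("source_episodes", []):
            episode = episodes_by_id.get(episode_id)
            messages = episode.get("original_messages", []) if episode else []
            trail.append(
                {"semantic_id": semantic_id, "episode_id": episode_id, "messages": messages}
            )
    return trail
```

The trail reaches whole episodes rather than spans, which matches the limitation noted above: a fact can be tied to its source episode, but not to the exact messages that support it.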
Behavioral authority. Raw messages, episodes, semantic facts, and theme summaries are knowledge artifacts when they provide evidence, reference, or context. They become behavior-shaping at prompt time because ANSWER_PROMPT tells the answering model to use provided memories. The stronger system-definition artifacts are the prompts, config, thresholds, search scripts, coverage selector, entropy estimator, and expansion gate: those decide what memory is created, clustered, ranked, selected, expanded, and evaluated.
Scope. The implementation is per-user and per-benchmark storage under a configured memory directory. The learned contents do not become repo-wide rules, tests, or model weights. The mechanism is generalizable as a dialogue-memory architecture, but the checked-in evaluation path is scoped to LoCoMo.
Timing. Construction is staged offline/benchmark-time: add conversations, wait for semantic generation, update themes, update graph, then run QA retrieval. Retrieval-time adaptation is online for each question: it chooses themes, facts, episodes, and raw-message expansion from the already-built memory.
Survey placement. On the trace-derived learning survey, xMemory is a trace-to-hierarchy retrieval system. It strengthens the survey distinction between trace retention, distilled knowledge artifacts, and system-definition retrieval surfaces: the behavior change comes less from "having memories" than from a selection policy over layers with different cost and authority.
Curiosity Pass
The README's retrieval claim is more complete than the package surface. The adaptive hierarchy exists, but it lives in evaluation/locomo/xMemory_search_framework.py, not behind the main xMemory.search(...) facade. The facade's search method still calls MemorySystem.search_all(...), which uses conventional baseline methods unless the LoCoMo script chooses adaptive_hier.
The GEXF graph is easier to overstate than the code warrants. It stores a useful message/episode/semantic/theme graph for inspection, but the adaptive retriever reconstructs its own hierarchy from JSONL, kNN JSON, and embeddings. Calling xMemory "graph retrieval" would blur an exported graph with the active retrieval engine.
The most interesting authority move is the raw-message expansion gate. Raw messages are high-evidence but expensive; xMemory gives them late authority only when entropy reduction justifies it. That is a better design lesson than simply increasing top-k.
The weakest engineering boundary is reproducibility. LLM prompts create episodes, facts, predictions, and theme summaries, but construction rows do not appear to carry prompt versions or model identities. Retrieval records do save answer token stats and LLM identity metadata, but generated memory artifacts remain hard to audit after the fact.
The PerLTQA claim should be treated as paper/README scope, not checked-in code scope. The repository links the PerLTQA dataset, but this checkout's runnable evaluator is LoCoMo-shaped.
What to Watch
- Whether adaptive hierarchy retrieval moves from the LoCoMo script into the public xMemory API.
- Whether a PerLTQA harness appears in source rather than only in the README claim.
- Whether construction artifacts start recording LLM identity, prompt versions, entropy/information-gain scores, and split/merge histories.
- Whether GEXF becomes an active retrieval input or remains an analysis/export artifact.
- Whether tests and package metadata appear; the current checkout is difficult to treat as a reusable library.
Bottom Line
xMemory is a code-grounded example of dialogue traces becoming a hierarchy of retained artifacts: raw messages, episode summaries, semantic facts, theme summaries, embeddings, kNN indexes, and graph exports. Its strongest lesson for commonplace is not a new storage substrate, but a retrieval contract: decouple behavior-shaping evidence into layers, select diverse high-level representatives first, and expand to raw traces only when the expected answer uncertainty drops enough to justify the context cost.
Relevant Notes:
- Trace-derived learning techniques in related systems - places: xMemory is a trace-to-hierarchy retrieval system with entropy-gated raw trace expansion.
- Axes of artifact analysis - exemplifies: xMemory's raw messages, episode summaries, semantic facts, themes, embeddings, kNN JSON, GEXF graph, and retrieval scripts have different substrates and authority.
- Knowledge artifact - distinguishes: xMemory's messages, episodes, semantic facts, and theme summaries advise by evidence and context.
- System-definition artifact - distinguishes: xMemory's prompts, thresholds, config, coverage selector, entropy gate, and evaluation scripts configure and route behavior.
- Activate Behavior-Changing Memory Before The Mistake - compares-with: xMemory's retrieval policy tries to activate compact high-level memory before answer generation, then selectively expands evidence.