MemPalace
Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived
MemPalace is a Python memory system by milla-jovovich and Ben Sigman (v3.1.0, April 2026) that mines project files and conversation exports into a local ChromaDB "palace" and a sidecar SQLite knowledge graph, then exposes both through a CLI, 19 MCP tools, Claude Code stop/precompact hooks, and a 4-layer wake-up stack. The repo now leans hard on a single honest claim ("store everything verbatim, then make it findable with cheap metadata filters") and has pruned most of the grander marketing around AAAK and contradiction detection — a public retraction note sits at the top of the README pointing out where the original launch framing overreached. The interesting architectural bets in the current code are: ChromaDB drawers as the real substrate, a write-ahead log over every MCP write, per-agent diaries in Chroma as a second append-only memory class, and a palace-repair mode that treats the operational index as rebuildable from chroma.sqlite3.
Repository: https://github.com/milla-jovovich/mempalace
Core Ideas
Verbatim ChromaDB drawers are the only substrate the benchmark relies on. miner.py and convo_miner.py chunk source files (800-char chunks with 100-char overlap) or conversation exchange pairs into short documents, then palace.upsert() them into a single mempalace_drawers collection with wing, room, source_file, and chunk_index metadata. Every retrieval path — searcher.search_memories, layers.py L2/L3, MCP mempalace_search, even duplicate-check — queries that one collection and returns the stored text unchanged. The 96.6% LongMemEval headline explicitly comes from this raw mode; no synthesized memory object exists in the read path.
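The chunking-plus-metadata shape described above can be sketched in a few lines. This is illustrative only: the function names, the drawer dict layout, and the commented-out upsert call are assumptions, not MemPalace's actual miner.py API; only the 800/100 chunk geometry and the wing/room/source_file/chunk_index metadata keys come from the review.

```python
# Sketch of the drawer shape: 800-char chunks with 100-char overlap, each
# carrying wing/room/source metadata. Names here are illustrative.

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def make_drawers(text: str, wing: str, room: str, source_file: str) -> list[dict]:
    return [
        {
            "document": chunk,  # stored verbatim; retrieval returns it unchanged
            "metadata": {
                "wing": wing,
                "room": room,
                "source_file": source_file,
                "chunk_index": i,
            },
        }
        for i, chunk in enumerate(chunk_text(text))
    ]

drawers = make_drawers("x" * 2000, "project_acme", "auth", "notes.md")
# With real ChromaDB this would feed collection.upsert(documents=..., metadatas=..., ids=...)
```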
The palace taxonomy is real but mostly realized as ChromaDB where filters. Wings (project/person) and rooms (topic slugs) are populated at mine time and enforced at query time through {"$and": [{"wing": wing}, {"room": room}]} filters in every layer's retrieval code. palace_graph.py scans all drawer metadata, keeps rooms whose name appears in two or more wings, and reports them as "tunnels" — that is the only structural derivation beyond filtering. Halls exist as a fixed vocabulary (hall_facts, hall_events, hall_discoveries, hall_preferences, hall_advice) but I found them populated only in mempalace_diary_write (hard-coded hall_diary), not in mining; palace_graph.build_graph reads a hall key that is present in diaries but absent in ordinary drawers. The README's wings/rooms/halls/closets/drawers/tunnels vocabulary is currently (code-wise) wings + rooms + diary-halls + repeated-room-tunnels.
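Both structural mechanisms above are small enough to sketch: the where-filter shape comes straight from the review, while the tunnel derivation (rooms whose name appears in two or more wings) is reimplemented here over plain dicts; the helper names and sample data are hypothetical, not palace_graph.py's code.

```python
# Illustrative sketch: the wing+room metadata filter used at query time,
# and the "tunnel" derivation over drawer metadata. Names are hypothetical.

def room_filter(wing: str, room: str) -> dict:
    # Shape of the ChromaDB metadata filter applied in every layer's retrieval.
    return {"$and": [{"wing": wing}, {"room": room}]}

def find_tunnels(drawer_metadatas: list[dict]) -> dict[str, set]:
    """Rooms that appear under two or more wings are reported as tunnels."""
    rooms_to_wings: dict[str, set] = {}
    for meta in drawer_metadatas:
        rooms_to_wings.setdefault(meta["room"], set()).add(meta["wing"])
    return {room: wings for room, wings in rooms_to_wings.items() if len(wings) >= 2}

metas = [
    {"wing": "project_a", "room": "auth"},
    {"wing": "project_b", "room": "auth"},
    {"wing": "project_a", "room": "billing"},
]
tunnels = find_tunnels(metas)  # only "auth" spans two wings
```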
AAAK is a lossy structured summarizer, not a storage format. dialect.py is a 1075-line regex + keyword table that extracts entities, topic words, one key sentence, emotion codes, and flags from plain text into pipe-delimited zettels. The module file docstring now explicitly says "AAAK is NOT lossless compression. The original text cannot be reconstructed from AAAK output." The README retraction (April 7, 2026) concedes the same point and publishes LongMemEval evidence that AAAK mode regresses from 96.6% to 84.2% versus raw mode. mempalace compress still writes results into a separate mempalace_compressed collection; the main mempalace_drawers store remains raw text. AAAK survives as a diary style and as a bundled spec the MCP server ships to agents through mempalace_status.
The knowledge graph is a real second substrate with invalidation but no auto-extraction. knowledge_graph.py is a SQLite store with entities and triples tables (subject, predicate, object, valid_from, valid_to, confidence, source_closet, source_file) plus indexes and WAL mode. query_entity(as_of=...) filters on validity windows, invalidate() sets valid_to, timeline() returns chronological facts. The promise is Zep-like temporal facts, local. What it does not have is automatic extraction: fact_checker.py is gone (removed from the source tree; only the README still mentions it, now flagged as "not currently called automatically"), and the graph is only written through explicit mempalace_kg_add tool calls plus a seed_from_entity_facts bootstrap. The graph mechanism is honest and narrow — agent-driven triple writes with real temporal semantics — not an automatic fact-mining loop.
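The temporal-triple pattern is simple enough to show end to end. This is a minimal sqlite3 sketch of the mechanism described above, not knowledge_graph.py's actual schema or function signatures: validity is a half-open window, invalidation closes the window rather than deleting the row, and as-of queries filter on it.

```python
# Minimal sketch of the temporal-triple pattern: valid_from/valid_to windows,
# explicit invalidation, as-of queries. Schema and names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject TEXT, predicate TEXT, object TEXT,
        valid_from TEXT, valid_to TEXT, confidence REAL
    )
""")

def add_triple(s, p, o, valid_from, confidence=1.0):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?, ?, NULL, ?)",
                 (s, p, o, valid_from, confidence))

def invalidate(s, p, o, valid_to):
    # Close the validity window instead of deleting the fact.
    conn.execute("""UPDATE triples SET valid_to = ?
                    WHERE subject = ? AND predicate = ? AND object = ?
                      AND valid_to IS NULL""",
                 (valid_to, s, p, o))

def query_entity(s, as_of):
    return conn.execute("""
        SELECT predicate, object FROM triples
        WHERE subject = ? AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
    """, (s, as_of, as_of)).fetchall()

add_triple("ben", "works_at", "acme", "2025-01-01")
invalidate("ben", "works_at", "acme", "2026-01-01")
add_triple("ben", "works_at", "globex", "2026-01-01")
```

An as-of query in mid-2025 sees acme; the same query in mid-2026 sees globex, with the superseded fact retained for timeline reads.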
Every MCP write is journaled to a write-ahead log before execution. mcp_server.py maintains ~/.mempalace/wal/write_log.jsonl with restrictive perms (0700 dir, 0600 file) and prepends a _wal_log entry before add_drawer, delete_drawer, kg_add, kg_invalidate, and diary_write commit. The stated purpose is audit and poisoning forensics; there is no replay or rollback tool yet. This is the clearest new integrity move in the v3.1 code: writes are cheap to trace after the fact even though the main store has no history of its own. Sanitization at the tool boundary (sanitize_name, sanitize_content, sanitize_query) and an explicit "system prompt contamination" mitigation in the search handler complement the WAL.
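The journal-before-write pattern is small, as the review notes. Here is a sketch under stated assumptions: the directory layout mirrors the one described (a WAL dir with 0700/0600 permissions, a JSONL log), but the entry fields and function names are invented for illustration and this example writes to a temp dir rather than ~/.mempalace/wal.

```python
# Sketch of journal-before-write: append a JSONL record with restrictive
# permissions before the real mutation runs. Field names are illustrative.
import json, os, time, tempfile
from pathlib import Path

wal_dir = Path(tempfile.mkdtemp()) / "wal"   # stand-in for ~/.mempalace/wal
wal_dir.mkdir(mode=0o700)
wal_file = wal_dir / "write_log.jsonl"

def wal_log(tool: str, args: dict) -> None:
    entry = {"ts": time.time(), "tool": tool, "args": args}
    with open(wal_file, "a") as f:
        f.write(json.dumps(entry) + "\n")
    os.chmod(wal_file, 0o600)

def add_drawer(text: str, wing: str) -> None:
    wal_log("add_drawer", {"wing": wing, "chars": len(text)})  # journal first
    # ... the actual upsert into the store would happen here ...

add_drawer("some memory", "project_acme")
entries = [json.loads(line) for line in wal_file.read_text().splitlines()]
```

Because the journal line lands before the mutation, a crash mid-write leaves a record of intent; with no replay tool, that record is forensic rather than transactional, exactly as the review observes.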
Per-agent diaries ride on the same Chroma collection as a second class of append-only memory. tool_diary_write stores each entry as a drawer in wing_<agent>/room=diary/hall=hall_diary with type: diary_entry metadata. tool_diary_read pulls them back by wing+room filter, sorts by timestamp, returns the last N. There is no revision, synthesis, or promotion — the diary is a per-agent append log inside the same ChromaDB the palace uses, differentiated only by metadata. The README frames this as "Letta-without-the-subscription" agent memory; in code it is a timestamped drawer namespace.
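The diary-as-metadata-namespace claim can be made concrete with an in-memory stand-in for the Chroma collection; the drawer dicts and helper below are invented for illustration, keeping only the wing/room/type metadata split and the filter-sort-tail read described above.

```python
# Sketch of diary entries as ordinary drawers distinguished only by metadata,
# with reads that filter by wing+room, sort by timestamp, and return the last N.

drawers = [
    {"text": "fixed auth bug", "meta": {"wing": "wing_coder", "room": "diary",
                                        "type": "diary_entry", "timestamp": 3}},
    {"text": "started refactor", "meta": {"wing": "wing_coder", "room": "diary",
                                          "type": "diary_entry", "timestamp": 1}},
    {"text": "unrelated drawer", "meta": {"wing": "project_a", "room": "auth",
                                          "timestamp": 2}},
]

def diary_read(agent_wing: str, last_n: int = 10) -> list[str]:
    entries = [d for d in drawers
               if d["meta"].get("wing") == agent_wing
               and d["meta"].get("room") == "diary"]
    entries.sort(key=lambda d: d["meta"]["timestamp"])
    return [d["text"] for d in entries[-last_n:]]

diary_read("wing_coder", last_n=2)
```

Nothing in the read path revises or synthesizes; the function returns raw entries in time order, which is the whole lifecycle.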
Heuristic memory-type extraction runs without LLMs on --extract general convos. general_extractor.py classifies paragraphs into five types (decision, preference, milestone, problem, emotional) using regex marker sets, a tiny positive/negative sentiment dictionary, a code-line filter, and a length bonus. A disambiguation step reclassifies resolved problems as milestones and positive-affect problems as emotional. The output feeds the convo miner but goes into the same Chroma drawers. This path is the main place where trace input is materially transformed rather than just chunked — but the transformation is classification-into-labels, not synthesis.
The repair tool treats the operational index as rebuildable from chroma.sqlite3. repair.py offers scan/prune/rebuild modes to recover from HNSW index bloat after the duplicate-add bug that plagued early v3. rebuild extracts all drawers, drops the collection, re-upserts with correct HNSW settings, and backs up only chroma.sqlite3. This matters for architecture reading: MemPalace treats ChromaDB's sqlite file as the canonical substrate and the vector index as a rebuildable derived view — closer to the "files + derived index" posture than pure ChromaDB opacity.
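The "canonical substrate plus rebuildable derived view" posture can be shown with a toy inverted index standing in for the HNSW index; everything here is illustrative, not repair.py's code. The point is structural: the store is what gets backed up, the index is what gets dropped and re-derived wholesale.

```python
# Sketch of canonical store + rebuildable derived index. A toy inverted
# index stands in for ChromaDB's HNSW index; all names are illustrative.

def build_index(store: dict[str, str]) -> dict[str, set]:
    """Derived view: regenerable in full from the canonical store."""
    index: dict[str, set] = {}
    for doc_id, text in store.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def rebuild(store: dict[str, str], index: dict[str, set]) -> dict[str, set]:
    # Drop the possibly bloated or corrupt index and re-derive it from the
    # canonical store; nothing from the old index survives.
    index.clear()
    return build_index(store)

store = {"d1": "auth bug fixed", "d2": "auth design note"}
index = build_index(store)
index["garbage"] = {"stale"}          # simulate index bloat/corruption
index = rebuild(store, index)
```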
Comparison with Our System
| Dimension | MemPalace | Commonplace |
|---|---|---|
| Primary substrate | Verbatim ChromaDB drawers + SQLite temporal KG | Markdown notes, links, instructions, workshop artifacts in git |
| Main unit | Exchange-pair or 800-char drawer with wing/room metadata | Claim-titled typed note with explicit link semantics |
| Retrieval model | Wing/room where filter + embedding similarity over raw text | Description-led search + authored indexes + read-next decisions |
| Knowledge transformation | Chunk + metadata-tag by default; optional AAAK summary sidecar; heuristic 5-class labels on general convos | Author, connect, validate, review; mature from workshop to library |
| Mutable facts | Dedicated temporal KG with valid_from/valid_to and explicit invalidate | Usually in notes; narrow operational stores only when warranted |
| Agent surface | 19 MCP tools + Palace Protocol in every status call + stop/precompact hooks + agent diaries | Instructions, skills, commands, validation/review gates |
| Trace-to-durable loop | Convos mined to drawers; general mode heuristic-classifies into 5 buckets; diaries append-only; no synthesis or retirement | Workshop layer exists as theory, with deliberate (mostly manual) promotion into library |
| Write integrity | JSONL write-ahead log, input sanitization, duplicate-check-before-add, duplicate-dedup utility, HNSW repair tool | Git history as WAL; validate-and-review as integrity layer |
MemPalace's strongest position is "make a full conversational corpus locally searchable with cheap metadata filters and durable agent journaling," with real ops plumbing (WAL, repair, sanitization, dedup). Commonplace's strongest position is "turn experience into inspectable, composable knowledge with explicit semantics and a maturation discipline." The deepest divergence is still the same as in v3.0: MemPalace pushes structure into runtime behavior (metadata filters, retrieval collections, MCP protocol, write-ahead log, repair), while commonplace pushes structure into the artifacts (claim titles, descriptions, link semantics, note types, library/workshop split).
A second divergence is sharper now. MemPalace's v3.1 code has obviously been through security and stability review — SQL-free input sanitization, shell-injection fixes on hooks, query sanitizer for prompt contamination, repair module for a real production bug, tests for every module. That is the grammar of a memory service, even one running locally. Commonplace's integrity grammar is instead git plus validate/review on authored artifacts.
Trace-derived learning placement. MemPalace qualifies as trace-derived because it mines conversations and agent sessions into durable local artifacts, but the learning is narrow and mostly non-transformative. The trace source is three streams: conversation exports normalized through normalize.py into a standard transcript (five formats recognized, including Claude, ChatGPT, Slack, and plain text); Claude Code session transcripts captured through the stop hook, which prompts the host agent to save selected current-session content at a 15-message interval; and agent-written diary strings passed directly through MCP. Trigger boundaries are per-exchange for convo mining, every 15 human messages (or pre-compaction) for the stop hook, and per-session (manual) for diaries. The extraction step varies by path: convos are chunked by exchange and labelled only by wing/room (no transformation), --extract general adds regex-classified memory types, and the stop hook and diary paths are host-agent-authored writes under the Palace Protocol rather than automatic extraction. No LLM-backed oracle evaluates what becomes signal in the miner/convo-miner paths; those paths either store verbatim or apply pure regex. The promotion target is always an inspectable artifact — a ChromaDB drawer (with sidecar SQLite for triples and JSONL WAL for writes); no weights, no compiled state. Scope is per-task / per-session: nothing aggregates across sessions, and there is no synthesis, scoring, retirement, or decay. Timing is online: convos can be mined at any time, the stop and precompact hooks fire during live sessions, and diaries are written during sessions; there is no offline reprocessing step.
On the survey's two axes: axis 1 places MemPalace in single-session extension for its stop-hook and diary paths (it runs inside a host agent, prompts the host agent to save selected current-session content, and writes back into markdown-like drawers), but the convo-mining path leans toward a lightweight cross-agent aggregator — normalize.py handles five different transcript formats from different runtimes, closer to what cass-memory does, though MemPalace lacks a discovery layer and leaves format selection to the user. Axis 2 places MemPalace firmly in symbolic artifact learning at the minimal-structure end: drawers are raw text, general labels are five flat categories, diaries are timestamped strings, KG triples are the only structured records. It neither strengthens nor weakens a survey claim; it confirms the single-session / minimal-artifact cell with a more developed integrity layer (WAL, sanitization) than most neighbours, and does not warrant a new subtype.
Borrowable Ideas
Journal every write through a JSONL write-ahead log before it touches the substrate. Ready now. The _wal_log pattern in mcp_server.py is small (a dozen lines), cheap, and gives a real forensic trail for memory poisoning or audit review. If commonplace adds any subsystem with tool-driven mutation (review system sweeps, skill-driven note edits, automated linkers), a parallel WAL is worth mirroring — especially since the library layer's "history" currently lives in git commits that are coarser than per-write records.
Treat the operational vector index as rebuildable from the underlying sqlite file. Ready now. repair.py backs up only chroma.sqlite3 and rebuilds the HNSW index around it. That reinforces the "files + derived index" posture commonplace already prefers — and is worth citing explicitly when we add any derived indexing layer (FTS, embeddings). The insight is that canonical substrate plus rebuild instructions beats opaque database state.
Ship the agent contract as a tool response, not a prompt. Ready now. mempalace_status returns the full Palace Protocol and AAAK spec on every call, so any agent that discovers the server gets the operational rules as data. For any commonplace subsystem with strong usage expectations (review, fix, or workshop promotion flows), returning the protocol inline with the first tool response is more reliable than burying it in system prompts.
Keep a dedicated heuristic extractor as a no-LLM baseline for convo-to-memory pipelines. Needs a use case first. general_extractor.py is a legitimate baseline: cheap, deterministic, inspectable, and it gives a lower bound for trace-to-artifact extraction quality. If we ever add automated mining to the workshop layer, starting with regex+sentiment classifiers before an LLM path would keep the costly oracle out of the critical path and make the LLM's marginal value measurable.
Separate the temporal fact store from the main corpus, narrowly. Needs a use case first. The SQLite KG is a clean sidecar for facts that need valid_from/valid_to semantics and direct invalidation. If commonplace needs operational state with temporal windows (deprecation dates, decision lifetimes, review schedules), the pattern of a narrow rebuildable SQLite sidecar — rather than shoehorning validity into note frontmatter — transfers well.
Retract loudly when marketing outran the code. Ready now (as practice, not as code). The April 7, 2026 README note owns four specific overreaches (tokenizer math, "30x lossless," "+34% palace boost," contradiction detection wiring), publishes the regression numbers, and lists the fix plan. That posture is worth imitating whenever a commonplace claim turns out to rest on convention rather than mechanism — and it is the honest counterpart to our own curiosity-pass discipline.
Curiosity Pass
The v3.1 code is substantially more honest than v3.0. The most interesting move between last review and now is not a new feature but a concession: dialect.py's own docstring now contradicts the old "lossless" framing, fact_checker.py has been removed from the codebase, and the README opens with a public retraction. The remaining system is smaller and more defensible. A year-over-year review of the same system rarely shows that arc clearly.
"Palace structure = +34% retrieval" is a metadata-filter result in plain clothes. The retraction note concedes this: wing+room filtering against a ChromaDB where clause is a standard feature, not a novel retrieval mechanism. When the README still says "wings and rooms aren't cosmetic — they're a 34% retrieval improvement," it is measuring the value of having any structural prior at all, not the value of the palace metaphor specifically. The same lift would come from any categorical tags correlated with relevance.
The knowledge graph is less powerful than the README's Zep comparison suggests. knowledge_graph.py has real temporal validity and a clean query surface, but its only writers are MCP tool calls and a small bootstrap seed. Without automatic extraction from drawers — and with fact_checker.py now absent — the graph is closer to "a local triple store with time windows that an agent can explicitly write to" than to Zep-style automatic entity-relationship maintenance. It is valuable exactly for operator-curated facts; it does not currently learn a graph.
The 19 MCP tools are a wide surface on a narrow substrate. 7 read, 2 write, 5 KG, 3 graph, 2 diary. The value of the surface is the protocol guidance it ships (query before asserting, invalidate before replacing, diary after session); the underlying operations remain "semantic search over drawers" and "write to SQLite." Judging the system by tool count would overstate its transformation machinery.
Diaries are workshop artifacts dressed as library memory. A diary is an append-only, untyped, uncurated per-agent log that never synthesizes or retires. That is a workshop-layer shape, not a library one. The README's "each agent is a specialist lens on your data" framing holds only if someone reads the diaries back — and diary_read returns raw entries sorted by timestamp. This is a fine place to start for an agent scratchpad, but the "memory" label hides a missing revision lifecycle.
Live ops hardening is genuinely new. Write-ahead log, input sanitization, query sanitizer, HNSW repair, duplicate-check-before-add, dedup utility, HNSW bloat fix, sqlite-only backup path, WAL-mode SQLite. Reading the commit log since v3.0 shows the project has absorbed real bug reports and responded with defensive code rather than more features. That is the strongest signal of maturity in v3.1.
What to Watch
- Whether automatic extraction into the knowledge graph actually lands — a wired fact_checker replacement would move KG maintenance from "explicit writes only" to real trace-derived fact mining.
- Whether halls become populated outside diaries — if mining starts emitting hall_facts/hall_events/... metadata on ordinary drawers, the palace graph becomes more than a repeated-room-name detector.
- Whether closets gain any real representation — in code they are still plain-text summaries pointed at by diagrams; AAAK-as-closet remains a proposal in Issue #30.
- Whether the write-ahead log grows a replay or rollback mode, which would turn the audit trail into a real integrity primitive.
- Whether cross-session aggregation appears (summarizing many diaries into one synthesized artifact, retiring stale triples, or promoting recurring drawers) — that is what is needed to move from storage-first to learning-first.
- Whether the benchmark story stays clean as hybrid/reranked modes get published — the retraction note commits to labelling raw vs aaak vs rooms modes in benchmark documentation; drift would undo the trust the retraction bought.
- Whether the convo miner's format coverage expands (five formats today) into enough heterogeneity to treat MemPalace as a cross-agent session aggregator rather than a single-source mining tool.
Relevant Notes:
- trace-derived learning techniques in related systems — extends: MemPalace v3.1 is a single-session-extension / minimal-symbolic-artifact instance with an unusually developed write-integrity layer but no cross-session synthesis
- files-not-database — contrasts: MemPalace treats ChromaDB's sqlite file as canonical substrate with a rebuildable HNSW index, which is the "operational database as derived index" variant of the same pattern
- a-functioning-kb-needs-a-workshop-layer-not-just-a-library — sharpens: diaries and --extract general classifications are append-only workshop-shaped artifacts with no library promotion path
- browzy.ai — compares: both keep durable local artifacts with a derived retrieval layer, but browzy compiles a wiki and MemPalace stays raw-drawer-first; both now publish public retractions when marketing outran the code
- ClawVault — compares: both run a local-first vault over a typed memory structure with scored or metadata-tagged entries, but ClawVault's observation ledger and weekly reflection implement a maintenance lifecycle MemPalace lacks