Agent Memory Systems

A survey of external agent memory systems — how AI agents store, retrieve, and maintain knowledge across sessions and tasks. We track knowledge bases, context-engineering layers, structured note-taking tools, and trajectory-learning loops, reading their source code wherever it is available.

Choosing or designing one? Start with the comparative review, which maps the landscape along six architectural dimensions, then browse the repo-backed reviews. Each one reads the actual code and reports what a system does, not what its README claims.

We track these systems not just to borrow ideas but to watch how they evolve. Convergence across independent projects is a stronger signal than any single design argument.

Coverage

Two coverage tiers. Systems with open-source repos get the deep path: clone the repo, read the code, write a review note here. Systems known only from a README or paper get the lightweight path: snapshot a single page into kb/sources/, run /ingest, and optionally add a standard note under lightweight/ when the system needs a stable place in this collection.

Browse the roster:

Repo-backed reviews — systems with open-source repos, reviewed from the code
Lightweight coverage — paper- or README-grounded systems with no inspectable repo
Collection directory index — top-level files in this collection (the cross-cutting notes below), with links into the two subdirectories above

Cross-cutting reads:

Comparative review — synthesises across both tiers
Adaptation survey review candidates — maps the agentic-adaptation survey's memory and skill systems to existing reviews and likely additions
Trace-derived learning techniques in related systems — broadens the comparison to artifact-learning and weight-learning systems fed by live traces
Thalo type comparison — detailed type mapping against the commonplace document types

Patterns Across Systems

Most systems here (ours, Ars Contexta, Thalo, ClawVault, Agent-Skills) independently converge on:

Filesystem over databases — plain text, version-controlled, no lock-in
Progressive disclosure — load descriptions at startup, full content on demand
Start simple — architectural reduction outperforms over-engineering
Trace-derived learning — the note on trace-derived learning techniques in related systems broadens the comparison beyond pi-adjacent session mining to artifact-learning and weight-learning systems fed by live traces and trajectories
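
The first two patterns above (filesystem storage and progressive disclosure) can be sketched together. This is a minimal illustration, not code from any reviewed system: it assumes each note is a plain-text `.md` file whose first non-empty line is its description, and the function names `build_index` and `load_full` are hypothetical.

```python
import tempfile
from pathlib import Path


def read_description(path: Path) -> str:
    """Return the first non-empty line of a note (assumed to be its description)."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                return line.strip()
    return ""


def build_index(root: Path) -> dict[str, str]:
    """Startup step: map note name -> one-line description; bodies are not kept."""
    return {p.stem: read_description(p) for p in sorted(root.glob("*.md"))}


def load_full(root: Path, name: str) -> str:
    """On-demand step: read the whole note only when it is actually needed."""
    return (root / f"{name}.md").read_text(encoding="utf-8")


def demo() -> tuple[dict[str, str], str]:
    """Build a throwaway two-note store, index it, then load one note in full."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "storage.md").write_text(
            "Notes on storage models\n\nLong body ...\n", encoding="utf-8"
        )
        (root / "governance.md").write_text(
            "Notes on governance gates\n\nLong body ...\n", encoding="utf-8"
        )
        index = build_index(root)
        body = load_full(root, "storage")
        return index, body
```

Only the index would sit in the agent's context at startup; a full note is read when its description looks relevant. Because the store is a directory of text files, it can be version-controlled and inspected with ordinary tools.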

The divergences are more revealing:

Storage model — several systems use a database as the operational substrate: Cognee uses a poly-store (graph + vector + relational with pluggable backends), Siftly uses SQLite, CrewAI uses LanceDB by default with optional Qdrant Edge, Hindsight uses PostgreSQL+pgvector, Zikkaron uses SQLite with FTS5+sqlite-vec, and SAGE uses SQLite+BadgerDB (personal) or PostgreSQL+pgvector (multi-node). The others keep files as the primary storage interface. OpenViking occupies a novel middle position: it presents a filesystem interface (viking:// URIs, ls/read/find operations) but the substrate is AGFS + vector index, so the filesystem is a metaphor, not the mechanism. Cludebot uses Supabase (PostgreSQL+pgvector) for its full mode but also offers a local JSON file store, the closest a database-first system gets to filesystem-first. Cognee, Hindsight, CrewAI, Zikkaron, SAGE, and Cludebot in its full mode are the furthest from filesystem-first: their memories are opaque database records, not readable files
System boundary — CocoIndex sits one layer below most systems here: it is an incremental engine for maintaining derived vector/graph/relational projections, not a primary knowledge medium. That makes it more relevant to our "operational layer beneath the KB" question than to the note/link semantics question directly
Agent-facing UX — Napkin is the clearest example of treating CLI output itself as part of the memory architecture: hidden scores, match-only snippets, and next-step hints are all tuned for model behavior rather than human browsing. Most other systems focus on storage and retrieval internals but leave the interaction layer human-shaped
Packaging unit — most systems distribute concerns across multiple files (notes, configs, scripts, indexes), but o-o pushes the opposite extreme: each document is a self-contained polyglot file carrying rendering, update contract, shell dispatch, source cache, and changelog. That maximizes portability and local inspectability at the cost of modularity and inter-document structure
Grounding discipline — cognitive psychology (Ars Contexta) vs programming theory (commonplace, Thalo) vs empirical operational patterns (Agent-Skills)
Formalization level — custom DSL (Thalo) vs YAML conventions (commonplace) vs prose instructions (Agent-Skills)
Governance stance — most systems treat governance as advisory (instructions the agent should follow); Decapod enforces governance with hard gates (validation must pass, VERIFIED requires proof-plan); SAGE enforces with cryptographic gates (signed transactions, validator quorum, RBAC clearance levels) — two very different enforcement models, both structurally enforced rather than instructed
Access control — SAGE has structured multi-agent RBAC (clearance levels, domain-scoped permissions, on-chain agent identity); Cognee has relational ACLs with tenant isolation and per-dataset permissions; most other systems either have no access control or rely on filesystem permissions
Cross-agent knowledge transfer — most systems are single-agent or agent-agnostic; cass-memory is the first reviewed system to make cross-agent session mining a first-class feature, indexing logs from Claude Code, Cursor, Codex, Aider, and others into a shared playbook
Runtime self-modification — most frameworks have fixed agent topology defined at build time; OpenSage is the first reviewed system where agents can create subagents and scaffold new tools at runtime, though without quality gates on the created artifacts
Self-referentiality — only our KB is simultaneously a knowledge system and a knowledge base about knowledge systems
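
The governance-stance contrast above (advisory instructions versus hard gates) can be made concrete with a small sketch. Everything here is hypothetical and written for illustration; it is not Decapod's or SAGE's API. The only details taken from the text are that validation must pass and that a VERIFIED status requires a proof plan.

```python
from dataclasses import dataclass


class GateError(Exception):
    """Raised when a hard gate refuses a state change."""


@dataclass
class Note:
    title: str
    body: str
    proof_plan: str = ""
    status: str = "DRAFT"


def validate(note: Note) -> list[str]:
    """Return a list of validation problems (empty means the note is valid)."""
    problems = []
    if not note.title.strip():
        problems.append("missing title")
    if not note.body.strip():
        problems.append("empty body")
    return problems


def mark_verified_advisory(note: Note) -> None:
    """Advisory stance: nothing is checked; the agent is trusted to follow instructions."""
    note.status = "VERIFIED"


def mark_verified_gated(note: Note) -> None:
    """Hard-gate stance: the status change is refused unless every check passes."""
    problems = validate(note)
    if not note.proof_plan.strip():
        problems.append("VERIFIED requires a proof plan")
    if problems:
        raise GateError("; ".join(problems))
    note.status = "VERIFIED"
```

In the advisory version an agent that ignores its instructions still ends up with a VERIFIED note. In the gated version the same mistake surfaces as an error and the note stays in DRAFT.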

Open Questions

Does convergence on filesystem-first indicate a durable pattern, or a phase that will be outgrown?
Should high-volume ingestion in a file-first KB adopt a small operational database layer for stage state and indexing?
Will the programming-theory grounding produce better systems than the psychology grounding, or will they converge?
Are there systems we're missing that take a fundamentally different approach?
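
To make the second question concrete, here is one shape a small operational layer could take. It is a sketch under stated assumptions, not a proposal from any reviewed system: the table name `ingest_state` and the stage names `snapshotted` and `ingested` are hypothetical, and the files under kb/sources/ remain the source of truth while SQLite only tracks where each file is in the pipeline.

```python
import sqlite3


def open_state(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the stage-state table; the notes themselves stay on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingest_state ("
        "path TEXT PRIMARY KEY, stage TEXT NOT NULL)"
    )
    return conn


def set_stage(conn: sqlite3.Connection, path: str, stage: str) -> None:
    """Record the current pipeline stage for one source file (one row per path)."""
    conn.execute(
        "INSERT OR REPLACE INTO ingest_state (path, stage) VALUES (?, ?)",
        (path, stage),
    )
    conn.commit()


def pending(conn: sqlite3.Connection, stage: str) -> list[str]:
    """List the files currently sitting at a given stage, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM ingest_state WHERE stage = ? ORDER BY path", (stage,)
    )
    return [row[0] for row in rows]
```

Deleting the database would lose only the bookkeeping, never the knowledge, which is the property that separates an operational layer from a database-first design.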
