Agent Memory Systems

A survey of external agent memory systems — how AI agents store, retrieve, and maintain knowledge across sessions and tasks. We track knowledge bases, context-engineering layers, structured note-taking tools, and trajectory-learning loops, reading their source code wherever it is available.

Choosing or designing one? Scan the comparison table: one row per system, with a plain-English description plus the handful of fields that actually discriminate. Then read the comparison (the existing 129-system synthesis, not the refreshed 140-row matrix) and browse the repo-backed reviews under reviews/. Each review reads the actual code and reports what a system does, not what its README claims.

We track these systems not just to borrow ideas but to watch how they evolve. Convergence across independent projects is a stronger signal than any single design argument.

How we review

Every review classifies a system's retained behavior-shaping artifacts in one shared vocabulary, so independent systems can be set side by side on the same terms. The vocabulary, and the activation distinction the reviews turn on, come from these theory notes:

The review type spec is the operational contract that turns these notes into the fixed review sections and the backticked lead tokens the matrix is parsed from.

Coverage

Two coverage tiers. Systems with inspectable implementations get the deep path: clone the repo, read the code, write a review note here. Systems known only from a README, paper, spec, or non-implementation repo get the lightweight path: snapshot the source into kb/sources/, run /ingest, and optionally add a standard note under lightweight/ when the system needs a stable place in this collection.

Browse the roster:

  • Repo-backed reviews (reviews/) — systems with open-source repos, reviewed from the code; the comparison table is the curated entry point

  • Lightweight coverage — paper- or README-grounded systems with no inspectable repo

Cross-cutting reads:

Comparison matrix

systems.csv is the machine-generated comparison matrix — one row per code-grounded review (source-tier: code-grounded, in reviews/), one column per comparison axis plus sample-origin metadata. It is rebuilt from source artifacts, not hand-maintained:

  • Generated by scripts/build_systems_matrix.py: scans reviews/ and extracts each review's comparison data (doc-grounded reviews under lightweight/ are excluded), then derives found_karpathy_llm_wiki_gist from exact GitHub repository URL matches in the ASISAS-2026 experiment's consolidated Karpathy gist core file. Re-run after reviews or triage artifacts change — python3 scripts/build_systems_matrix.py.
  • Analysed by scripts/analyze_matrix.py: reports per-column fill, value entropy, and redundancy to decide which columns earn a place in a human-readable table.
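
The fill and entropy measures mentioned above can be illustrated with a minimal sketch. This is not the code in scripts/analyze_matrix.py; the function name column_stats and the four-row sample are assumptions made for illustration, and redundancy between columns is left out.

```python
import csv
import io
import math
from collections import Counter


def column_stats(rows, column):
    """Return (fill, entropy_bits) for one column of a list of dict rows."""
    values = [(row.get(column) or "").strip() for row in rows]
    filled = [v for v in values if v]
    # Fill: share of rows with a non-empty value.
    fill = len(filled) / len(values) if values else 0.0
    counts = Counter(filled)
    total = len(filled)
    # Shannon entropy over the non-empty values, in bits.
    entropy = 0.0
    if total:
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return fill, entropy


# Made-up sample standing in for systems.csv (hypothetical rows).
SAMPLE = """system,storage_substrate,read_back_direction
a,files,pull
b,files,push
c,sqlite,
d,vector,pull
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column in ("storage_substrate", "read_back_direction"):
    fill, entropy = column_stats(rows, column)
    print(f"{column}: fill={fill:.2f} entropy={entropy:.2f} bits")
```

A column with low fill or near-zero entropy tells a reader little when choosing between systems, which is the kind of signal that could argue for leaving it out of a human-readable table.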

Extractable columns come straight from each review, so the matrix stays in sync with the prose:

  • Backticked lead tokens written in the review body where the finding is reached — storage_substrate (files/repo/sqlite/rdbms/vector/graph/kv/in-memory/prompt-registry/model-weights/service-object), representational_form (prose/symbolic/parametric), read_back_direction (pull/push/both), Read-back signal, write agency, curation operations, and trace-derived sub-fields. The token leads its own justifying sentence, so value and reasoning can't drift. The convention lives in the review type spec.
  • trace_derived from the review's trace-derived frontmatter tag.
  • found_karpathy_llm_wiki_gist from exact GitHub repository URL matches in the consolidated Karpathy gist core. The core has 51 members: 50 code-grounded rows represented in this matrix and one doc-grounded lightweight review outside systems.csv.
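
As a rough illustration of lead-token extraction, the sketch below assumes a token is written as a backticked `axis: value` pair at the start of a line. That shape, the regular expression, and the function name extract_lead_tokens are all hypothetical; the actual convention is defined in the review type spec, and the real extraction lives in scripts/build_systems_matrix.py.

```python
import re

# Hypothetical token shape: a backticked `axis: value` pair opening a line.
LEAD_TOKEN = re.compile(r"^\s*`([a-z_]+):\s*([^`]+?)\s*`", re.MULTILINE)


def extract_lead_tokens(review_text):
    """Map each axis name to the value of its first lead token."""
    tokens = {}
    for axis, value in LEAD_TOKEN.findall(review_text):
        # Keep the first occurrence; later mentions do not override it.
        tokens.setdefault(axis, value)
    return tokens


# Made-up review excerpt; only line-leading tokens should be picked up.
REVIEW = """\
`storage_substrate: files` Notes live as Markdown in the repository.
The loader mentions `sqlite` only in a comment.
`read_back_direction: pull` The agent searches on demand.
"""

print(extract_lead_tokens(REVIEW))
```

Because the token and its justifying sentence sit on the same line, a parser of this kind reads the value from exactly the place where a reviewer would edit the reasoning.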

Remaining columns are hand-classified candidates the script lists but leaves empty; the analyzer flags them as too-sparse until filled. When populating a compound axis (e.g. the trace-derived sub-fields), record the raw observed value first and normalise it into a harness-agnostic vocabulary — the normalisation step is itself the test of whether the category generalises across systems.
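
One way to keep the raw observation and its normalised label side by side is sketched below. The mapping and every label in it are invented for illustration and do not come from the real matrix; the point is only that values with no label surface explicitly instead of being forced into a category.

```python
# Hypothetical mapping from harness-specific raw observations to a
# harness-agnostic label; none of these labels come from the real matrix.
NORMALISE = {
    "claude code jsonl session log": "session-transcript",
    "cursor chat history": "session-transcript",
    "tool-call trajectory": "action-trajectory",
}


def normalise(raw):
    """Return (raw, normalised); normalised is None when no label fits."""
    return raw, NORMALISE.get(raw.strip().lower())


observations = ["Claude Code JSONL session log", "Cursor chat history", "git reflog"]
records = [normalise(raw) for raw in observations]
# Raw values that found no label are the ones that test the vocabulary.
unmapped = [raw for raw, label in records if label is None]
print(records)
print("needs a vocabulary decision:", unmapped)
```

An unmapped value is a prompt to decide whether the vocabulary needs a new label or whether the category fails to generalise across systems.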

Consumption rule: a human comparison table is for choosing a system, so it covers code-grounded reviews only. Lightweight reviews (doc-only or spec-only, lower authority) stay outside the generated code-grounded matrix and table until promoted to an inspected implementation review.

Current mature chooser fields are storage_substrate, representational_form, trace_derived, read_back_direction, and the Read-back signal one-hots. Pushing shipped static documentation is baseline context, not memory read-back. For system choice, the useful distinction is whether retained memory is pull-only, coarse-pushed, identifier-targeted, or inferred from the current content.

Patterns Across Systems

Most systems here (ours, Ars Contexta, Thalo, ClawVault, Agent-Skills) independently converge on:

  • Filesystem over databases — plain text, version-controlled, no lock-in

  • Progressive disclosure — load descriptions at startup, full content on demand

  • Start simple — architectural reduction outperforms over-engineering

  • Trace-derived learning: "trace-derived learning techniques in related systems" broadens the comparison beyond pi-adjacent session mining to include artifact-learning and weight-learning systems fed by live traces and trajectories
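
The progressive-disclosure pattern in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not any reviewed system's implementation: the NoteIndex class is hypothetical, and it assumes each note's first line serves as its description.

```python
import tempfile
from pathlib import Path


class NoteIndex:
    """Hold one-line descriptions in memory; read full notes only on request."""

    def __init__(self, root):
        self.root = Path(root)
        self.descriptions = {}
        # Startup cost: one line per note, not the whole file.
        for path in sorted(self.root.glob("*.md")):
            with path.open(encoding="utf-8") as handle:
                self.descriptions[path.name] = handle.readline().strip()

    def load(self, name):
        """Return the full text of one note, read from disk on demand."""
        return (self.root / name).read_text(encoding="utf-8")


# Demo with two made-up notes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "storage.md").write_text(
        "Why files beat databases here\n\nLong body...\n", encoding="utf-8"
    )
    Path(tmp, "traces.md").write_text(
        "Mining session traces\n\nLong body...\n", encoding="utf-8"
    )
    index = NoteIndex(tmp)
    print(index.descriptions)       # cheap overview available at startup
    print(index.load("traces.md"))  # full content only when asked for
```

The descriptions are small enough to sit in an agent's context from the start, while full note bodies are paid for only when a description looks relevant.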

The divergences are more revealing:

  • Storage model — Cognee uses a poly-store (graph + vector + relational with pluggable backends), Siftly uses SQLite, CrewAI uses LanceDB by default with optional Qdrant Edge, Hindsight uses PostgreSQL+pgvector, Zikkaron uses SQLite with FTS5+sqlite-vec, and SAGE uses SQLite+BadgerDB (personal) or PostgreSQL+pgvector (multi-node) as operational substrates, while the others keep files as the primary storage interface. OpenViking occupies a novel middle position: it presents a filesystem interface (viking:// URIs, ls/read/find operations) but the substrate is AGFS + vector index — filesystem as metaphor, not mechanism. Cludebot uses Supabase (PostgreSQL+pgvector) for its full mode but also offers a local JSON file store that is the closest a database-first system gets to filesystem-first. Cognee, Hindsight, CrewAI, Zikkaron, Cludebot, and SAGE are the furthest from filesystem-first: memories are opaque database records, not readable files

  • System boundary — CocoIndex sits one layer below most systems here: it is an incremental engine for maintaining derived vector/graph/relational projections, not a primary knowledge medium. That makes it more relevant to our "operational layer beneath the KB" question than to the note/link semantics question directly

  • Agent-facing UX — Napkin is the clearest example of treating CLI output itself as part of the memory architecture: hidden scores, match-only snippets, and next-step hints are all tuned for model behavior rather than human browsing. Most other systems focus on storage and retrieval internals but leave the interaction layer human-shaped

  • Packaging unit — most systems distribute concerns across multiple files (notes, configs, scripts, indexes), but o-o pushes the opposite extreme: each document is a self-contained polyglot file carrying rendering, update contract, shell dispatch, source cache, and changelog. That maximizes portability and local inspectability at the cost of modularity and inter-document structure

  • Grounding discipline — cognitive psychology (arscontexta) vs programming theory (Commonplace, thalo) vs empirical operational patterns (Agent-Skills)

  • Formalization level — custom DSL (thalo) vs YAML conventions (Commonplace) vs prose instructions (Agent-Skills)

  • Governance stance — most systems treat governance as advisory (instructions the agent should follow); Decapod enforces governance with hard gates (validation must pass, VERIFIED requires proof-plan); SAGE enforces with cryptographic gates (signed transactions, validator quorum, RBAC clearance levels) — two very different enforcement models, both structurally enforced rather than instructed

  • Access control — SAGE has structured multi-agent RBAC (clearance levels, domain-scoped permissions, on-chain agent identity); Cognee has relational ACLs with tenant isolation and per-dataset permissions; most other systems either have no access control or rely on filesystem permissions

  • Cross-agent knowledge transfer — most systems are single-agent or agent-agnostic; cass-memory is the first reviewed system to make cross-agent session mining a first-class feature, indexing logs from Claude Code, Cursor, Codex, Aider, and others into a shared playbook

  • Runtime self-modification — most frameworks have fixed agent topology defined at build time; OpenSage is the first reviewed system where agents can create subagents and scaffold new tools at runtime, though without quality gates on the created artifacts

  • Self-referentiality — only our KB is simultaneously a knowledge system and a knowledge base about knowledge systems

Open Questions

  • Does convergence on filesystem-first indicate a durable pattern, or a phase that will be outgrown?

  • Should high-volume ingestion in a file-first KB adopt a small operational database layer for stage state and indexing?

  • Will the programming-theory grounding produce better systems than the psychology grounding, or will they converge?

  • Are there systems we're missing that take a fundamentally different approach?


Complete file listing (generated at build time)