cass-memory
Type: note · Status: current · Tags: related-systems
A procedural memory system for AI coding agents that transforms scattered session logs into persistent, cross-agent knowledge. Built by Jeffrey Emanuel (MIT license, TypeScript/Bun), cass-memory implements a three-layer cognitive architecture modeled on human memory: episodic (raw session logs), working (structured diary summaries), and procedural (distilled playbook rules with confidence tracking). Among reviewed systems, it is the closest production-grade sibling to ACE's playbook-learning loop, but with genuine cross-agent session mining, richer feedback scoring, and a safety layer.
Repository: https://github.com/Dicklesworthstone/cass_memory_system
Core Ideas
Three-layer cognitive architecture with explicit memory types. The system explicitly names three layers — episodic, working, and procedural — and assigns each a concrete implementation. Episodic memory is raw session logs from any agent (Claude Code, Cursor, Codex, Aider, etc.), searched via an external cass search engine. Working memory is structured diary entries: accomplishments, decisions, challenges, key learnings, and preferences extracted per session via LLM. Procedural memory is the playbook: a YAML file of scored "bullets" (rules and anti-patterns) that agents retrieve before starting tasks. The three-layer naming maps to human cognitive psychology terminology, but the implementation is a straightforward pipeline: logs -> LLM summarization -> LLM reflection -> scored rule store. The cognitive framing adds no mechanism beyond what "extract, summarize, store" would provide.
Confidence-decayed playbook bullets as the learned substrate. The PlaybookBullet is the central data structure — a Zod-validated schema with 30+ fields including content, category, scope, maturity state, helpful/harmful counters, timestamped feedback events, optional embeddings, source sessions, source agents, and tags. The confidence scoring system applies exponential decay with a 90-day half-life: each feedback event's contribution diminishes over time via Math.pow(0.5, ageDays / halfLifeDays). Harmful feedback counts 4x as much as helpful feedback (configurable). Maturity progresses through candidate -> established -> proven -> deprecated based on decayed feedback ratios. This is the richest feedback-scoring mechanism among reviewed playbook-learning systems — richer than ACE's raw counters, richer than ClawVault's confidence/importance floats. The decay addresses a real problem: stale rules that were helpful six months ago may not be helpful now.
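The decay arithmetic described above can be sketched as follows. This is a minimal illustration of the mechanism, not the actual code: the `FeedbackEvent` shape and parameter names are assumptions; only the `Math.pow(0.5, ageDays / halfLifeDays)` formula, the 90-day half-life, and the 4x harmful multiplier come from the source.

```typescript
// Sketch of decay-weighted confidence scoring (illustrative types, not the
// real PlaybookBullet schema).
type FeedbackEvent = {
  kind: "helpful" | "harmful";
  timestamp: number; // epoch milliseconds
};

const MS_PER_DAY = 86_400_000;

function decayedScore(
  events: FeedbackEvent[],
  now: number,
  halfLifeDays = 90,       // default half-life per the source
  harmfulMultiplier = 4,   // harmful feedback counts 4x (configurable)
): number {
  let score = 0;
  for (const e of events) {
    const ageDays = (now - e.timestamp) / MS_PER_DAY;
    // Each event's contribution halves every `halfLifeDays`.
    const weight = Math.pow(0.5, ageDays / halfLifeDays);
    score += e.kind === "helpful" ? weight : -harmfulMultiplier * weight;
  }
  return score;
}
```

A fresh helpful event contributes 1.0, a 90-day-old one 0.5, and a fresh harmful event -4.0, which is why a rule that was useful six months ago cannot coast on old credit.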
Cross-agent session mining. The system's distinctive claim is that knowledge flows across agent boundaries. Sessions from Claude Code, Cursor, Codex, Aider, and others feed a single memory store. The cass search engine indexes session logs from multiple agents and projects. During reflection, the system searches for related sessions across agents, enriches diary entries with cross-agent context, and extracts rules that any agent can later retrieve. Cross-agent enrichment is privacy-audited (per-session audit log, agent allowlists). This is a genuinely useful architecture for teams running multiple AI coding tools — the knowledge transfer problem is real and underserved.
Jaccard-based conflict detection and deduplication. Before adding new rules, the curation pipeline (curate.ts) checks for conflicts using pre-computed token sets with Jaccard similarity. Conflict detection uses heuristic markers: if two rules have high token overlap but opposite directive markers ("always" vs "never"), they're flagged as contradictions. Deduplication uses configurable similarity thresholds. This is a practical, deterministic mechanism that avoids the cost and unpredictability of LLM-based deduplication, though Jaccard similarity on bag-of-words tokens misses semantic conflicts between rules that use different vocabulary to say opposite things.
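A minimal sketch of the Jaccard-plus-directive-marker check, assuming an 0.5 overlap threshold and short marker lists (both are illustrative choices, not the values in curate.ts):

```typescript
// Tokenize into a bag-of-words set; Jaccard = |A∩B| / |A∪B|.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

// Hypothetical directive-marker lists.
const POSITIVE = ["always", "prefer", "use"];
const NEGATIVE = ["never", "avoid"];

function hasMarker(tokens: Set<string>, markers: string[]): boolean {
  return markers.some((m) => tokens.has(m));
}

function isConflict(ruleA: string, ruleB: string, threshold = 0.5): boolean {
  const a = tokenSet(ruleA);
  const b = tokenSet(ruleB);
  if (jaccard(a, b) < threshold) return false; // low overlap: never flagged
  // High token overlap + opposite directive markers => contradiction.
  return (
    (hasMarker(a, POSITIVE) && hasMarker(b, NEGATIVE)) ||
    (hasMarker(a, NEGATIVE) && hasMarker(b, POSITIVE))
  );
}
```

The sketch also makes the limitation concrete: "Prefer PostgreSQL for persistence" vs "Use SQLite for all storage needs" share almost no tokens, so the Jaccard gate rejects the pair before the marker check ever runs.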
Anti-pattern inversion. When a rule accumulates enough harmful feedback, it doesn't just get deprecated — it gets inverted into an anti-pattern warning. A rule "Cache auth tokens for performance" that receives 3+ harmful marks becomes "PITFALL: Don't cache auth tokens without expiry validation." This is a concrete mechanism for negative knowledge — learning what not to do, not just what to do. Among reviewed systems, Dynamic Cheatsheet and Reflexion have comparable mechanisms (rewriting failing strategies into avoidance guidance), but cass-memory's version is more structured: it operates at individual rule granularity and preserves the provenance chain from the original rule through feedback events to the inverted anti-pattern.
Trauma guard as a safety layer. The trauma.ts module scans session history for "doom patterns" — regex-matched dangerous commands (rm -rf, DROP DATABASE, git push --force, terraform destroy) combined with "apology" keywords in session transcripts. Matched patterns are recorded as trauma entries. This is a novel mechanism among reviewed systems: rather than just validating what agents produce, it retrospectively detects catastrophic events and could prevent recurrence. The implementation is purely heuristic — regex matching on session text for co-occurrence of dangerous commands and apology language — but the concept of mining past failures for safety rules is sound.
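The heuristic reduces to a co-occurrence check, roughly as sketched below. The specific regexes and apology keywords are examples in the spirit of the source's list, not the exact patterns in trauma.ts:

```typescript
// Doom-pattern regexes for dangerous commands (illustrative subset).
const DOOM_PATTERNS: RegExp[] = [
  /rm\s+-rf\b/,
  /DROP\s+DATABASE/i,
  /git\s+push\s+--force/,
  /terraform\s+destroy/,
];

// Apology-language heuristic (illustrative keyword list).
const APOLOGY = /\b(sorry|apolog|mistake|accidentally|oops)\w*/i;

// A transcript chunk is flagged only when a dangerous command AND
// apology language co-occur in the same chunk.
function isTrauma(transcriptChunk: string): boolean {
  return (
    DOOM_PATTERNS.some((p) => p.test(transcriptChunk)) &&
    APOLOGY.test(transcriptChunk)
  );
}
```

The co-occurrence requirement is what keeps mere documentation of `rm -rf` from firing, though as noted it can still trigger on discussion of an incident rather than the incident itself.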
MCP server and CLI with structured JSON output. The system exposes all operations via both a CLI (cm context, cm reflect, cm playbook, etc.) and an MCP server (cm serve). Every command supports --json with clean stdout/stderr separation: stdout is machine-parseable data, stderr is diagnostics. The MCP server exposes tools including cm_context, cm_feedback, cm_outcome, memory_search, and memory_reflect. This makes cass-memory usable as a memory backend for any MCP-compatible agent, similar to Cognee's MCP server but with a procedural-memory focus rather than a knowledge-graph focus.
File-based storage with file locking. Playbooks are YAML files, diaries are JSON, embeddings are cached JSON, audit logs are JSONL. All writes use atomic file operations with a custom lock implementation (lock.ts) for concurrent access safety. No database dependency. This aligns with the filesystem-first pattern seen in most reviewed systems, though the operational complexity (multiple JSON/YAML/JSONL files across global and per-repo directories) is higher than simpler single-directory approaches.
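The atomic-write half of this pattern is the standard write-temp-then-rename idiom; a minimal sketch is below. This is the generic technique, not cass-memory's lock.ts, which layers locking on top of it:

```typescript
import { writeFileSync, renameSync, readFileSync, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write the full payload to a sibling temp file, then rename over the
// target. rename() is atomic on POSIX, so readers never see a torn file.
function atomicWrite(path: string, contents: string): void {
  const tmp = `${path}.tmp-${process.pid}`;
  writeFileSync(tmp, contents);
  renameSync(tmp, path);
}
```

Locking is still needed on top of this for read-modify-write cycles (e.g. updating feedback counters), since two writers can each atomically replace the file and silently drop the other's update.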
Comparison with Our System
| Dimension | cass-memory | Commonplace |
|---|---|---|
| Storage | YAML playbook + JSON diary + JSONL audit logs, file-locked | Markdown files in git |
| Knowledge unit | PlaybookBullet with 30+ Zod-validated fields | Typed note with frontmatter, prose body, and semantic links |
| Learning loop | Session logs -> LLM diary -> LLM reflection -> scored playbook bullets | Human+agent write -> connect -> validate -> mature |
| Update style | Automated: append bullets, update counters, decay, deprecate, invert | Manual curation with progressive formalization |
| Feedback model | Helpful/harmful events with exponential decay (90-day half-life), 4x harmful multiplier, maturity progression | Status field (seedling -> current -> superseded), human judgment |
| Cross-agent | First-class: indexes sessions from multiple AI agents and projects | Single-agent per session, no cross-agent knowledge transfer |
| Conflict detection | Jaccard similarity + directive-marker heuristics (deterministic) | Link articulation + human judgment (semantic) |
| Knowledge lifecycle | candidate -> established -> proven -> deprecated (automated by feedback) | text -> note -> structured-claim (human-guided) |
| Context engineering | cm context assembles relevant rules + history before task start | CLAUDE.md routing + agent-driven progressive disclosure |
| Safety | Trauma guard: regex-scanned doom patterns + apology heuristics | None explicit |
| Curation depth | LLM-driven extraction, deterministic scoring, Jaccard dedup | Manual connection, link-articulation review, semantic-review QA |
Where cass-memory is stronger. Cross-agent session mining is a genuinely underserved problem that commonplace does not address. The confidence decay system is the most sophisticated feedback-scoring mechanism among reviewed playbook systems — it solves the stale-rule problem that pure append-only systems face. The anti-pattern inversion mechanism creates structured negative knowledge rather than just deleting bad rules. The CLI/MCP dual interface with clean JSON output makes integration with any agent framework straightforward.
Where commonplace is stronger. Knowledge has richer internal structure — notes contain arguments, evidence, caveats, and articulated links rather than flat rule strings with counters. Link semantics capture why notes relate (extends, grounds, contradicts), not just that they're similar. The maturation path (text -> note -> structured-claim) produces higher-quality knowledge over time rather than just scoring existing rules up or down. Most importantly: commonplace has a theory of learning that explains which operations count as learning and why — cass-memory has a pipeline that works but no framework for reasoning about when its automation is reliable.
The deepest divergence is the granularity and depth of the learned artifact. A cass-memory playbook bullet is a one-liner with metadata: "Always check token expiry before other auth debugging." A commonplace note is a multi-paragraph argument with evidence, caveats, links to grounding sources, and articulated relationships to other notes. The playbook format optimizes for retrieval speed and automated scoring; the note format optimizes for contextual competence — the ability to reason well about a domain, not just recall tips. These serve different use cases: cass-memory is better for quick procedural guidance ("do this, avoid that"), commonplace is better for understanding ("here's why this works and when it doesn't").
Borrowable Ideas
Confidence decay on learned artifacts. The exponential decay with configurable half-life is a clean mechanism for preventing stale knowledge from dominating retrieval. For commonplace, a lighter version could flag notes whose last-checked date exceeds a threshold, or weight search results by recency without the full scoring apparatus. Ready to borrow in principle — but our notes don't have feedback events, so the mechanism would apply to staleness dating rather than feedback scoring.
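The lighter version for commonplace could be as small as this sketch. The `lastCheckedAt` field and the 90-day threshold are hypothetical; commonplace notes don't currently carry either:

```typescript
// Hypothetical note shape with a staleness-dating field.
type Note = {
  title: string;
  lastCheckedAt: number; // epoch milliseconds
};

// Flag notes whose last-checked date exceeds a threshold, without any
// feedback-event scoring apparatus.
function staleNotes(notes: Note[], now: number, maxAgeDays = 90): Note[] {
  const maxAgeMs = maxAgeDays * 86_400_000;
  return notes.filter((n) => now - n.lastCheckedAt > maxAgeMs);
}
```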
Anti-pattern inversion. Converting harmful rules into explicit warnings preserves negative knowledge rather than just deleting mistakes. For commonplace, the pattern would be: when a note's claim is refuted, rather than just marking it superseded, create an explicit "this approach fails because..." note that links back. We already do this informally through contradiction links, but cass-memory's automated inversion makes it systematic. Ready to borrow — the practice of writing "why not" notes is compatible with our current workflow.
Cross-agent session mining for knowledge transfer. The problem cass-memory solves — knowledge trapped in individual agent sessions — is real for any team running multiple AI tools. Commonplace doesn't address this because it's a single shared knowledge base, but a /reflect skill that could process session transcripts and extract observations (similar to ClawVault's observation capture) would be a lightweight version. Needs a use case first — we'd need enough multi-session volume to justify the pipeline.
Structured JSON output protocol. The discipline of stdout-is-data, stderr-is-diagnostics, exit-code-is-status is a well-executed integration pattern. For commonplace, if we build MCP or CLI interfaces, this protocol should be the default. Just a reference — we don't have a programmatic interface yet.
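If we did adopt the protocol, the core discipline fits in a few lines. The `CommandResult` shape is invented for illustration; only the stdout/stderr/exit-code split comes from the source:

```typescript
// Hypothetical command result: data, diagnostics, and status are kept
// separate so each can be routed to its own channel.
type CommandResult = {
  data: unknown;         // -> stdout: machine-parseable JSON only
  diagnostics: string[]; // -> stderr: human-readable notes
  exitCode: number;      // -> process exit status
};

function renderStdout(result: CommandResult): string {
  return JSON.stringify(result.data); // nothing else ever goes to stdout
}

function renderStderr(result: CommandResult): string {
  return result.diagnostics.map((d) => `[cm] ${d}`).join("\n");
}
```

The payoff is that `cm something --json | jq` works unconditionally: progress messages never corrupt the data stream.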
Trauma guard as retrospective failure mining. The concept of scanning past sessions for catastrophic events and encoding them as safety rules is novel. For commonplace, the pattern generalizes: mine agent history for failures, not just successes. When agents make mistakes during KB operations, capture what went wrong as explicit anti-patterns. Needs a use case first — requires session logging that we don't currently have.
Curiosity Pass
Does the cognitive architecture framing add mechanism or just vocabulary? The system names its three layers "episodic," "working," and "procedural memory" after human cognitive types. But mechanistically: episodic = searchable log files, working = LLM-generated summaries, procedural = scored rule store. The cognitive psychology labels don't shape the implementation — you could rename them "logs," "summaries," and "rules" and the system would behave identically. The framing adds narrative coherence for documentation but not mechanism. Compare with Hindsight, which also uses cognitive psychology framing but implements mechanisms (four-way parallel retrieval, consolidation thresholds) that are specifically inspired by memory research.
The Jaccard conflict detector catches vocabulary overlap, not semantic conflict. Two rules can contradict each other while using completely different words: "Prefer PostgreSQL for persistence" vs "Use SQLite for all storage needs." The Jaccard similarity between these is low (few shared tokens), so the conflict detector wouldn't flag them. The directive-marker heuristic (checking for "always" vs "never") partially compensates, but only catches syntactic contradictions. For a playbook that accumulates rules from multiple agents across many sessions, semantic conflicts between rules using different vocabulary may be the most dangerous kind. The system's own semantic.ts module provides embedding-based similarity that could be used for richer conflict detection, but curate.ts uses Jaccard, not embeddings, for the conflict check.
The "ACE" (Analyze-Curate-Extract) pipeline name doesn't match the actual execution order. The AGENTS.md describes the pipeline as "Analyze, Curate, Extract" but the actual orchestrator flow is: discover sessions -> extract diary (LLM) -> reflect on session (LLM) -> validate deltas -> apply to playbook -> curate (dedup, promote, deprecate). The curation step happens last, after extraction — so the "ACE" acronym doesn't match the actual execution order. This is cosmetic, but it signals that the cognitive-architecture branding was retrofitted onto an implementation that evolved independently of the naming.
How many of the 30+ bullet fields are actually populated in practice? The PlaybookBullet schema has fields for scope, scopeKey, workspace, kind, state, maturity, promotedAt, lastValidatedAt, confidenceDecayHalfLifeDays, pinned, pinnedReason, deprecated, replacedBy, deprecationReason, sourceSessions, sourceAgents, reasoning, tags, and embedding. Many have defaults. In practice, a typical bullet probably has content, category, id, timestamps, and feedback counts populated — with most optional fields at their defaults. The schema's breadth suggests aspirational coverage that may exceed actual usage, similar to how ClawVault's confidence/importance floats imply precision that LLM extraction can't deliver.
The feedback loop depends on agents actually leaving inline comments. The automated learning pipeline assumes agents will write // [cass: helpful b-xyz] or // [cass: harmful b-xyz] during their work. Whether agents actually do this depends on how strongly AGENTS.md instructions are followed. The implicit outcome-based feedback (session success/failure affecting all referenced rules) is a fallback, but it's noisy — a successful session might have used a harmful rule without the rule being the cause of success. The system's learning quality is bottlenecked by feedback signal quality, and the signal quality depends on agent compliance with a convention.
What to Watch
- Whether cross-agent session mining produces genuinely better rules than single-agent reflection. The claim that "a pattern discovered in Cursor automatically helps Claude Code" is powerful if true, but different agents may produce rules that are agent-specific rather than universal.
- Whether confidence decay prevents the playbook from growing unboundedly, or whether rules just accumulate with low scores. The system has deprecation but the threshold for automatic deprecation (harmful ratio > 0.3 with enough signal) may be too conservative.
- Whether the trauma guard catches real incidents or mostly produces false positives. The heuristic (apology language + dangerous command regex) could fire on discussion of dangerous commands rather than actual execution.
- How cass-memory relates to ACE as both mature: they share the playbook-bullet substrate and feedback-counter mechanism, but cass-memory has richer scoring and cross-agent mining where ACE has cleaner role separation.
Relevant Notes:
- ACE — sibling: both implement playbook-learning with scored bullets and feedback counters; cass-memory adds cross-agent mining and richer decay scoring, ACE adds cleaner role separation (generator/reflector/curator)
- ClawVault — sibling: both implement session-to-knowledge pipelines for AI agents; ClawVault has scored observations and session handoffs, cass-memory has cross-agent mining and confidence decay
- automating-kb-learning-is-an-open-problem — exemplifies: the feedback-scoring pipeline automates extraction and scoring but quality depends on agent compliance with feedback conventions — the curation gap applies here too
- trace-derived-learning-techniques-in-related-systems — extends: cass-memory is a code-inspected trace-derived artifact-learning system positioned alongside ClawVault and Autocontext in the survey; introduces the cross-agent session aggregator ingestion pattern
- continuous-learning-requires-durability-not-weight-updates — exemplifies: cass-memory learns entirely through durable artifacts (playbook YAML), not weight updates — a concrete case for the claim that continuous learning can happen outside of weights
- constraining — exemplifies: playbook bullets constrain agent behavior by specifying rules and anti-patterns; confidence scoring is a mechanism for adjusting constraint strength over time
- context-efficiency-is-the-central-design-concern-in-agent-systems — exemplifies: cm context pre-assembles relevant rules and history to spend agent context tokens on proven guidance rather than raw session logs