cass-memory

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

A procedural-memory system for AI coding agents that harvests sessions from multiple tools (Claude Code, Cursor, Codex, Aider, Pi, Gemini CLI, ChatGPT) and compiles them into a shared, confidence-scored YAML playbook. Built by Jeffrey Emanuel in TypeScript on the Bun runtime; licensed MIT with an OpenAI/Anthropic rider. The project ships as the cm CLI plus an MCP HTTP server and is currently alpha (v0.2.5 at the time of review). It positions itself as the memory layer that turns scattered session logs into "procedural memory" — rules and anti-patterns scored by helpful/harmful feedback with a 90-day half-life.

Repository: https://github.com/Dicklesworthstone/cass_memory_system

Core Ideas

Three-layer "cognitive" pipeline over a flat file substrate. The system labels its layers episodic (raw session logs searched via the external cass engine), working (LLM-generated diary entries), and procedural (YAML playbook of PlaybookBullet rules). Mechanistically the pipeline is findUnprocessedSessions() → generateDiary() → reflectOnSession() → validateDelta() → curatePlaybook(), orchestrated by orchestrateReflection() with fine-grained locking: one lock on the workspace processed log, separate locks on global and per-repo playbook YAMLs, with reflection LLM calls executed without holding playbook locks. The psychology-borrowed labels do not shape mechanism — episodic = logs, working = summaries, procedural = rules — but the multi-phase pipeline (diary → reflect → validate → curate) is genuinely separated and testable.

PlaybookBullet as the durable unit. Every learned item is a Zod-validated object with id, content, category, scope, kind (rule vs anti_pattern), maturity (candidate → established → proven → deprecated), helpfulCount / harmfulCount, timestamped feedbackEvents[], sourceSessions, sourceAgents, tags, optional embedding, pinned, deprecated, and replacedBy. This is richer than flat "cheatsheet line" stores but still a one-line rule plus metadata, not a multi-paragraph artifact. Bullets live in ~/.cass-memory/playbook.yaml (global) with optional per-repo overlay playbooks; loadMergedPlaybook() combines them and mergePlaybooks() disambiguates by workspace.

Confidence decay with harmful-multiplied scoring. src/scoring.ts computes effectiveScore = (decayedHelpful − 4 × decayedHarmful) × maturityMultiplier, where each feedback event contributes 0.5^(ageDays / halfLifeDays) with a 90-day half-life (both configurable). Maturity multipliers are candidate: 0.5, established: 1.0, proven: 1.5, deprecated: 0. calculateMaturityState() reads decayed counts and applies ratio thresholds: harmfulRatio > 0.3 with enough signal → deprecated; below minFeedbackForActive → candidate; strong helpful signal below a harmful ratio cap → proven; else established. checkForDemotion() also auto-deprecates when the effective score drops below a negative threshold. This is the most developed feedback-scoring arithmetic in the reviewed playbook systems, and the decay directly addresses the stale-rule problem that plain append-only stores create.
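The scoring arithmetic above can be sketched in a few lines. This is a reconstruction from the review's description of src/scoring.ts, not the repo's exact code; the type and constant names are assumptions.

```typescript
// Sketch of the decayed scoring described above (assumed names/shapes).
type FeedbackEvent = { kind: "helpful" | "harmful"; ageDays: number };
type Maturity = "candidate" | "established" | "proven" | "deprecated";

const HALF_LIFE_DAYS = 90; // configurable in the real system
const HARMFUL_WEIGHT = 4;
const MATURITY_MULTIPLIER: Record<Maturity, number> = {
  candidate: 0.5,
  established: 1.0,
  proven: 1.5,
  deprecated: 0,
};

// Each feedback event's contribution halves every HALF_LIFE_DAYS.
const decay = (ageDays: number): number =>
  Math.pow(0.5, ageDays / HALF_LIFE_DAYS);

function effectiveScore(events: FeedbackEvent[], maturity: Maturity): number {
  let helpful = 0;
  let harmful = 0;
  for (const e of events) {
    if (e.kind === "helpful") helpful += decay(e.ageDays);
    else harmful += decay(e.ageDays);
  }
  return (helpful - HARMFUL_WEIGHT * harmful) * MATURITY_MULTIPLIER[maturity];
}
```

A fresh helpful mark contributes 1.0; the same mark 90 days later contributes 0.5. The 4× harmful weight means one recent harmful event outweighs four recent helpful ones.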

Three feedback sources feed the scorer. (1) Explicit CLI / MCP feedback via cm_feedback / cm mark. (2) Inline comments in session transcripts: parseInlineFeedback() extracts // [cass: helpful b-xyz] and // [cass: harmful b-xyz] markers and turns each into a helpful/harmful delta. (3) Auto-recorded outcomes: extractRuleIdsFromTranscript() plus classifySessionOutcome() infer a session-level success | failure | partial | mixed verdict from the transcript and attribute it to referenced rule IDs; scoreImplicitFeedback() weights the signal by outcome, duration (fast/slow), error count, retries, and regex-matched sentiment (detectSentiment()). Auto-recorded events are stored as FeedbackEvent with a fractional decayedValue so implicit signal counts less than an explicit thumbs-up. A post-v0.2.3 fix de-duplicates rule IDs that already have inline feedback to avoid double-counting.
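The inline-marker convention is simple enough to sketch. This assumes markers of the exact form `// [cass: helpful b-xyz]`; the real parseInlineFeedback() may accept more variants.

```typescript
// Minimal sketch of inline feedback-marker extraction (assumed format).
type InlineFeedback = { bulletId: string; kind: "helpful" | "harmful" };

const MARKER_RE = /\/\/\s*\[cass:\s*(helpful|harmful)\s+(b-[\w-]+)\]/g;

function parseInlineFeedback(transcript: string): InlineFeedback[] {
  const out: InlineFeedback[] = [];
  for (const m of transcript.matchAll(MARKER_RE)) {
    out.push({ kind: m[1] as "helpful" | "harmful", bulletId: m[2] });
  }
  return out;
}
```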

Jaccard-on-tokens conflict detection with directive markers. curate.ts pre-computes a ConflictMeta record per active bullet (tokenized content as a Set, plus booleans for negative/positive/exception markers). For each proposed add it uses a size-ratio fast-skip, then full Jaccard intersection-over-union on token sets, and flags three conflict kinds when overlap exceeds a threshold (0.1 with directive markers, 0.2 without): negation conflict (newNeg !== existingNeg), opposite directives (must vs avoid), and scope conflicts (always vs unless). Exact duplicates go through a hashContent map; near-duplicates reinforce the existing bullet rather than adding a new one. Deprecated bullets are never reinforced — a defense against "zombie" rules. The detector operates on bag-of-words tokens, so it cannot catch semantic conflicts between rules that use different vocabulary; the system has an embedding-based semantic.ts but curate.ts does not use it for conflict checks.
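The negation-conflict branch can be sketched as follows. The thresholds (0.1 with directive markers, 0.2 without) come from the review text; the tokenizer and marker regex are simplified stand-ins for curate.ts's ConflictMeta machinery.

```typescript
// Sketch of the token-Jaccard negation check (simplified stand-in).
const tokenize = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/\W+/).filter(Boolean));

const NEGATIVE = /\b(avoid|never|don't|do not)\b/i;

function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  return inter / (a.size + b.size - inter);
}

function negationConflict(existing: string, proposed: string): boolean {
  const exNeg = NEGATIVE.test(existing);
  const propNeg = NEGATIVE.test(proposed);
  // Lower threshold when a directive marker is present.
  const threshold = exNeg || propNeg ? 0.1 : 0.2;
  return (
    exNeg !== propNeg &&
    jaccard(tokenize(existing), tokenize(proposed)) > threshold
  );
}
```

Note the limitation the review flags: "prefer postgresql" vs "use sqlite" shares almost no tokens and has no directive-marker mismatch, so no conflict fires.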

Anti-pattern inversion as a first-class transformation. invertToAntiPattern() converts a harmful bullet into a new bullet with kind: "anti_pattern", isNegative: true, content rewritten to "AVOID: <cleaned content>. Marked harmful N times.", maturity reset to candidate, and tags extended with ["inverted", "anti-pattern"]. Provenance arrays are copied (not aliased) so the inverted rule carries its source sessions and agents forward. This is concrete machinery for negative knowledge: rather than just deleting a rule that failed, the system stores an explicit pitfall with the same subject matter.
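A sketch of the inversion, using field names from the PlaybookBullet description above rather than the exact repo schema; the id scheme here is hypothetical.

```typescript
// Sketch of invertToAntiPattern (assumed schema subset).
type Bullet = {
  id: string;
  content: string;
  kind: "rule" | "anti_pattern";
  isNegative?: boolean;
  maturity: string;
  tags: string[];
  sourceSessions: string[];
  sourceAgents: string[];
};

function invertToAntiPattern(harmful: Bullet, harmfulCount: number): Bullet {
  const cleaned = harmful.content.replace(/^AVOID:\s*/i, "").trim();
  return {
    ...harmful,
    id: `${harmful.id}-inverted`, // hypothetical id scheme
    kind: "anti_pattern",
    isNegative: true,
    content: `AVOID: ${cleaned}. Marked harmful ${harmfulCount} times.`,
    maturity: "candidate", // reset per the review text
    tags: [...harmful.tags, "inverted", "anti-pattern"],
    // Copy, don't alias, the provenance arrays.
    sourceSessions: [...harmful.sourceSessions],
    sourceAgents: [...harmful.sourceAgents],
  };
}
```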

Two-phase LLM extraction with bounded iteration. Phase 1 (extractDiary() in diary.ts) runs an LLM on a sanitized session transcript (secrets stripped, content capped at 50k chars) to produce a Zod-validated DiaryEntry with accomplishments, decisions, challenges, preferences, key learnings, tags, and search anchors. Diary IDs are deterministic content hashes for idempotency. Phase 2 (reflectOnSession() in reflect.ts) runs the reflector up to maxReflectorIterations (default 3) with early exit when an iteration produces zero deltas or total deltas hit 50. deduplicateDeltas() hashes delta content — add:hashContent(content), replace:bulletId:normalized-content, merge:sortedBulletIds — to prevent duplicates across iterations. After reflection, validateDelta() runs another LLM pass that can refine a rule's wording (REFINE maps to ACCEPT_WITH_CAUTION with reduced confidence) or reject it outright.
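The delta-dedup keys are concrete enough to sketch. normalize() and the key shapes follow the review text; hashContent is replaced here by plain normalization for illustration.

```typescript
// Sketch of deduplicateDeltas keying (normalization stands in for hashing).
type Delta =
  | { op: "add"; content: string }
  | { op: "replace"; bulletId: string; content: string }
  | { op: "merge"; bulletIds: string[] };

const normalize = (s: string): string =>
  s.trim().toLowerCase().replace(/\s+/g, " ");

function deltaKey(d: Delta): string {
  switch (d.op) {
    case "add":
      return `add:${normalize(d.content)}`;
    case "replace":
      return `replace:${d.bulletId}:${normalize(d.content)}`;
    case "merge":
      // Sorting makes the key order-independent.
      return `merge:${[...d.bulletIds].sort().join(",")}`;
  }
}

function deduplicateDeltas(deltas: Delta[]): Delta[] {
  const seen = new Set<string>();
  return deltas.filter((d) => {
    const k = deltaKey(d);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}
```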

Cross-agent session mining. findUnprocessedSessions() in cass.ts discovers sessions through the external cass search engine via cassTimeline() and keyword-fallback searches; agent identity is inferred from path patterns (.claude, .cursor, .codex, .aider, .pi/agent/sessions). enrichWithRelatedSessions() queries cass during diary generation for sessions from other agents that match the current diary's challenges and learnings, and appends snippets to the reflector prompt under a RELATED HISTORY FROM OTHER AGENTS heading. Privacy audit logging (privacy-audit.jsonl) and agent allowlists gate cross-agent access. The ProcessedLog tracks processed session paths so reflection is idempotent across runs.
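The path-pattern agent inference can be sketched as a lookup table. The patterns below follow the paths named in the review; the exact patterns and agent labels in cass.ts may differ.

```typescript
// Sketch of agent-identity inference from session file paths (assumed patterns).
function inferAgent(sessionPath: string): string | null {
  const patterns: [RegExp, string][] = [
    [/\.claude\//, "claude-code"],
    [/\.cursor\//, "cursor"],
    [/\.codex\//, "codex"],
    [/\.aider\//, "aider"],
    [/\.pi\/agent\/sessions\//, "pi"],
  ];
  for (const [re, agent] of patterns) {
    if (re.test(sessionPath)) return agent;
  }
  return null; // unknown tool; caller decides whether to ingest
}
```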

Onboarding loop with gap analysis. cm onboard is an agent-driven cold-start protocol rather than an automated pipeline: onboard status reports progress, onboard gaps categorizes coverage as critical | underrepresented | adequate | well-covered across ten fixed categories (debugging, testing, architecture, workflow, documentation, integration, collaboration, git, security, performance) using keyword-match heuristics, onboard sample --fill-gaps prioritizes sessions scoring highest against gap-targeted queries, and onboard read <path> --template returns a rich JSON package with topic hints, related existing rules, gap summary, and suggested focus — designed so the already-paid-for coding agent can do extraction work itself. Progress persists in ~/.cass-memory/onboarding-state.json.
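The coverage-tier readout can be sketched as a hit count per category mapped to a tier. The ten categories come from the review; the tier cutoffs here are illustrative, not the repo's actual thresholds.

```typescript
// Sketch of the gap-analysis tiering (cutoffs are illustrative).
const CATEGORIES = [
  "debugging", "testing", "architecture", "workflow", "documentation",
  "integration", "collaboration", "git", "security", "performance",
] as const;

type Tier = "critical" | "underrepresented" | "adequate" | "well-covered";

function tierFor(hits: number): Tier {
  if (hits === 0) return "critical";
  if (hits < 3) return "underrepresented";
  if (hits < 10) return "adequate";
  return "well-covered";
}

function gapReport(hitsByCategory: Record<string, number>): Record<string, Tier> {
  const report: Record<string, Tier> = {};
  for (const c of CATEGORIES) report[c] = tierFor(hitsByCategory[c] ?? 0);
  return report;
}
```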

Trauma guard as retrospective safety mining. src/trauma.ts holds a hardcoded DOOM_PATTERNS list (filesystem destruction, DROP DATABASE, terraform destroy, git push --force, git reset --hard, kubectl delete node|namespace|pv|pvc, aws terminate-instances, mkfs, dd of=/dev/, and others). scanForTraumas() runs a two-phase heuristic: search cass for sessions containing apology language (sorry, mistake, destroyed, lost work, …), then grep each matched session for doom patterns co-occurring with apology text. Hits become TraumaEntry records in per-repo and global traumas.jsonl. Pure regex plus proximity detection; no semantic model.
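The two-phase heuristic reduces to two regex passes over the same session text. The DOOM_PATTERNS subset and apology vocabulary below are taken from the review; the real list is longer.

```typescript
// Sketch of the trauma heuristic: apology language plus a doom pattern
// co-occurring in the same session text (illustrative pattern subset).
const DOOM_PATTERNS: RegExp[] = [
  /rm\s+-rf\s+\//,
  /DROP\s+DATABASE/i,
  /git\s+push\s+--force/,
  /git\s+reset\s+--hard/,
  /terraform\s+destroy/,
];

const APOLOGY = /\b(sorry|mistake|destroyed|lost work)\b/i;

function looksLikeTrauma(sessionText: string): boolean {
  // Phase 1: apology language anywhere in the session.
  if (!APOLOGY.test(sessionText)) return false;
  // Phase 2: any doom pattern in the same session.
  return DOOM_PATTERNS.some((p) => p.test(sessionText));
}
```

The sketch also makes the precision problem visible: a postmortem that merely discusses `terraform destroy` alongside "sorry" fires just as readily as an actual incident.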

MCP server and dual output formats. cm serve exposes a JSON-RPC-over-HTTP MCP endpoint with tools/call for cm_context, cm_feedback, cm_outcome, memory_search, and memory_reflect, plus resources/read for cm://playbook, cm://diary, cm://outcomes, cm://stats, and memory://stats. The server enforces a 5MB body guard, a bearer-token MCP_HTTP_TOKEN env, and loopback-only defaults (MCP_HTTP_UNSAFE_NO_TOKEN required to disable auth on non-loopback). Output uses --json and --format toon (TOON is a token-efficient array-of-records encoding added in the v0.2.3→0.2.5 window specifically to reduce LLM-reader cost); stdout is data, stderr is diagnostics, exit 0 is success.
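A tools/call request to the server follows the standard JSON-RPC MCP shape. The argument names, endpoint path, and port below are illustrative assumptions, not taken from the repo.

```typescript
// Shape of a JSON-RPC tools/call request per the MCP convention.
// Argument names and the endpoint path/port are illustrative only.
const body = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "cm_feedback",
    arguments: { bulletId: "b-xyz", kind: "helpful" }, // hypothetical args
  },
};

// A client would POST this with the bearer token, e.g.:
// fetch("http://127.0.0.1:PORT/mcp", {           // path/port illustrative
//   method: "POST",
//   headers: {
//     "content-type": "application/json",
//     authorization: `Bearer ${process.env.MCP_HTTP_TOKEN}`,
//   },
//   body: JSON.stringify(body),
// });
```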

LLM provider abstraction. src/llm.ts uses the Vercel AI SDK to support OpenAI, Anthropic, Google, Ollama (added v0.2.4), AWS Bedrock, and an escape-hatch cli provider. OLLAMA_BASE_URL / OLLAMA_HOST are honored and take precedence over Zod defaults. Recent changes make baseUrl configurable per provider, and the doctor command reports availability per provider.

File-based storage with atomic writes and custom locking. No database. Playbooks are YAML, diaries are JSON, feedback/audit/trauma streams are JSONL, embeddings are cached JSON, and src/lock.ts provides a custom cross-process lock with directory creation guards. A reflection acquires a workspace-scoped orchestrator lock, then separate locks on the global and optional per-repo playbooks during the merge phase.
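The lock ordering can be sketched with a generic withLock() helper standing in for src/lock.ts (the real one is a cross-process, mkdir-guarded lock; the pipeline stubs here are placeholders).

```typescript
// Sketch of the lock nesting: orchestrator lock for the whole run,
// playbook locks only around the merge phase. withLock() is a stand-in.
const events: string[] = [];

async function withLock<T>(name: string, fn: () => Promise<T>): Promise<T> {
  events.push(`acquire ${name}`);
  try {
    return await fn();
  } finally {
    events.push(`release ${name}`);
  }
}

// Stubs standing in for the real pipeline stages.
const runLLMReflection = async (): Promise<string[]> => ["delta-1"];
const mergeDeltas = async (_deltas: string[]): Promise<void> => {};

async function reflect(workspace: string): Promise<void> {
  await withLock(`orchestrator:${workspace}`, async () => {
    // Expensive LLM work runs with no playbook lock held.
    const deltas = await runLLMReflection();
    // Playbook locks bracket only the short merge phase.
    await withLock("playbook:global", () =>
      withLock(`playbook:${workspace}`, () => mergeDeltas(deltas)),
    );
  });
}
```

The point of the shape: writers never block each other across the slow LLM call, only across the fast merge.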

Comparison with Our System

| Dimension | cass-memory | Commonplace |
| --- | --- | --- |
| Storage | YAML playbook + JSON diary + JSONL audit/trauma/outcome logs, custom file locks | Markdown notes in git, frontmatter-typed |
| Knowledge unit | PlaybookBullet: one-line rule + 25+ metadata fields | Typed note: frontmatter + prose body + semantic links |
| Extraction | LLM diary, then LLM reflector, then LLM validator | Human and agent authorship under typed templates |
| Update policy | Automated deltas (add, helpful, harmful, replace, deprecate, merge) via reflector | Manual curation; progressive formalization |
| Feedback model | Explicit mark + inline [cass: helpful b-xyz] + auto-classified outcomes, decayed 90d, harmful 4× | status field (seedling / current / superseded), human judgment |
| Maturity | candidate → established → proven → deprecated via scored thresholds | text → note → structured-claim via manual promotion |
| Conflict / dedup | Jaccard on bag-of-words + directive markers; exact-hash dedup | Link articulation + semantic review + human judgment |
| Cross-agent | First-class: indexes Claude Code, Cursor, Codex, Aider, Pi, others via cass | Single shared KB; no session ingestion |
| Onboarding | cm onboard with gap analysis across ten fixed categories | No fixed categories; authoring into typed notes |
| Safety | Trauma guard (regex doom patterns + apology proximity) | None explicit |
| Integration | cm CLI + MCP HTTP server + --json / --format toon | commonplace-* CLI, markdown files |
| LLM providers | OpenAI, Anthropic, Google, Ollama, Bedrock, CLI | Provider-agnostic; consumed by agents, not called by the KB |

Where cass-memory is stronger. Its cross-agent ingestion solves a real coordination problem we do not address. The arithmetic for confidence decay and harmful weighting is more developed than any maturity mechanism we have, and it is justified by the same concern commonplace has about stale content — just automated. The three-source feedback model (explicit + inline + auto) is a concrete answer to the "where does signal come from" question that automating-kb-learning-is-an-open-problem flags. TOON output and the MCP server show a level of integration polish that a future commonplace MCP interface could study.

Where commonplace is stronger. A cass-memory bullet is a one-liner with metadata; a commonplace note is a multi-paragraph argument with articulated links (extends, contradicts, grounds, …). Link semantics let a reader follow why artifacts relate, not just that they are similar-looking. We can represent caveats, scope conditions, and counter-examples inside a note; cass's Jaccard-with-directive-markers check can only spot word-level contradiction. Most importantly, commonplace has an articulated theory of learning (continuous-learning-requires-durability-not-weight-updates, substrate-class-backend-and-artifact-form-are-separate-axes-that-get-conflated) — cass-memory has a working pipeline without the corresponding theory of when its automation is safe.

Deepest divergence. Cass-memory optimizes for retrieval-speed-times-scoring on short rules. Commonplace optimizes for contextual competence — the ability to reason about a domain. Different use cases: cass is a procedural assistant ("do this, avoid that"); commonplace is a library of claims ("this is why this works and when it does not").

Borrowable Ideas

Decay on last-checked dates. The exponential decay kernel is simple and effective for damping stale signal. Commonplace cannot use it directly because notes have no feedback events, but the same kernel on last-checked could surface notes due for re-review. Ready to borrow — mechanism applies to staleness dating rather than feedback scoring.
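The proposed transplant is small enough to sketch: the same exponential kernel, applied to a note's last-checked date rather than feedback events. Field names here are hypothetical, since commonplace has no such fields today.

```typescript
// Same decay kernel, repurposed for staleness dating (hypothetical fields).
const HALF_LIFE_DAYS = 90;

function freshness(lastCheckedDaysAgo: number): number {
  return Math.pow(0.5, lastCheckedDaysAgo / HALF_LIFE_DAYS);
}

// Surface notes whose freshness has dropped below a review threshold.
function dueForReview(
  notes: { id: string; lastCheckedDaysAgo: number }[],
  threshold = 0.5, // i.e. unchecked for more than one half-life
): string[] {
  return notes
    .filter((n) => freshness(n.lastCheckedDaysAgo) < threshold)
    .map((n) => n.id);
}
```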

Inline feedback markers. // [cass: helpful b-xyz] is a lightweight human-readable convention that an agent can leave in artifacts while working. For commonplace, an equivalent marker in draft documents pointing at a note ID could become a signal that the note was consulted (helpful) or misleading (harmful), captured during review passes. Needs a use case first — we lack a session-transcript substrate to harvest from.

Anti-pattern inversion as a review-system output. Converting a refuted claim into an explicit AVOID: note preserves negative knowledge cheaply. Commonplace already does this informally through contradiction links; making it a first-class review operation (when semantic review flags a claim as refuted, emit an avoid-* note automatically) would be the analog. Ready to borrow — compatible with current review system.

Gap analysis with fixed category taxonomy. The coverage dashboard (ten categories, four coverage tiers, keyword heuristics) is a diagnostic most KBs lack. For commonplace we already have kb/notes/tags-index.md, but not a quantitative "critical / underrepresented" readout. Needs a use case first — our taxonomy is not fixed enough.

Dual output formats (JSON + TOON) with stdout-is-data contract. If commonplace builds an MCP or CLI surface for agents, committing to stdout = data, stderr = diagnostics, exit 0 = success from day one avoids the ad-hoc output-parsing problems many tools have. TOON specifically is aimed at reducing LLM-reader token cost. Just a reference — no CLI consumers yet.

Custom fine-grained locking on file artifacts. The orchestrator-lock / playbook-lock separation (hold the orchestrator lock for the whole run but acquire playbook locks only for the merge phase) is a clean pattern when an expensive LLM step sits between read and write. Commonplace's indexing and note-move operations could benefit from the same shape once concurrency is a concern. Just a reference — no concurrent writers today.

Trauma guard as "mine the log for failures, not just successes." The inversion of the usual "learn from what worked" framing is generalizable. For commonplace, a review pass that scans agent session traces for catastrophic KB operations (accidental deletion, malformed frontmatter commits, broken links introduced under time pressure) and writes them up as explicit anti-patterns would be the analog. Needs a use case first — requires session logging we do not yet capture.

Curiosity Pass

Does the "three-layer cognitive architecture" label shape mechanism? No. Episodic, working, and procedural are implemented as "logs, summaries, rules" — three successive representation layers in a pipeline. The cognitive-psychology framing adds narrative but not constraint. A reader would understand the system equally well if those layers were named phases. Contrast Hindsight in our survey, which names cognitive layers and implements consolidation thresholds taken from memory research.

Jaccard on tokens cannot catch semantic contradictions. Two rules saying opposite things in different vocabularies — "Prefer PostgreSQL" vs "Use SQLite" — have low token overlap and the detector will not fire, even though directive markers are present. The semantic.ts module computes embedding similarity, but curate.ts does not call it during conflict detection. For a playbook that accumulates rules from many agents this is the most dangerous class of conflict. The system has the raw material to do better; it has not been wired.
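What wiring the embedding path into conflict detection could look like: cosine similarity over embeddings catches "Prefer PostgreSQL" vs "Use SQLite" where token Jaccard stays silent. This is a sketch of the missing wiring, not anything semantic.ts currently does; the thresholds are assumptions.

```typescript
// Sketch: flag pairs that are semantically close but lexically distant,
// exactly the class the current Jaccard detector misses.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function semanticConflictCandidate(
  simEmbedding: number, // cosine of the two bullets' embeddings
  simJaccard: number,   // token Jaccard of the two bullets' content
  embThreshold = 0.8,   // assumed cutoff
  jacThreshold = 0.2,   // below this, the Jaccard detector was silent
): boolean {
  return simEmbedding >= embThreshold && simJaccard < jacThreshold;
}
```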

The feedback loop still depends on agent compliance with conventions. The auto-outcome pipeline added since v0.2.3 reduces the burden, but the best signal quality still comes from inline [cass: helpful|harmful b-xyz] markers that agents must remember to write, plus explicit cm outcome calls. classifySessionOutcome() degrades gracefully, but a "successful" session can reinforce a rule that had nothing to do with the success — noisy attribution is the core problem of auto-outcome scoring, and this system inherits it.

Schema breadth vs populated fields. PlaybookBullet has ~25 optional fields (scope, scopeKey, workspace, kind, state, maturity, promotedAt, lastValidatedAt, confidenceDecayHalfLifeDays, pinned, pinnedReason, deprecated, replacedBy, deprecationReason, sourceSessions, sourceAgents, reasoning, tags, embedding, isNegative). In practice a typical new bullet probably carries content, category, id, timestamps, feedback counts, and sourceSessions. The schema suggests a precision that LLM extraction does not always deliver; this is also true of the dimension scoring in scoreImplicitFeedback where sub-point increments ("errors>=2 = +0.7") imply calibration the heuristic does not actually have.

The ACE pipeline naming vs implementation order. AGENTS.md describes the pipeline as "Analyze, Curate, Extract" but orchestrateReflection() executes diary → reflect → validate → curate. Curation runs last. The ACE label appears retrofitted over an implementation that evolved independently. Harmless, but a hint that some of the documentation precedes the implementation rather than describing it.

Trauma guard precision. The heuristic "apology language near doom command in the same session" will fire on discussions of dangerous commands, not just their execution. Pattern-matching rm -rf near "sorry" could hit a code review comment or an incident postmortem. Without an execution-trace source the detector is necessarily limited in recall and noisy in precision.

Trace-derived learning placement. Trace source: raw session logs from multiple AI coding agents — Claude Code JSONL, Cursor, Codex CLI, Aider, Pi, plus file-path pattern matching for agent identity — discovered through the external cass search engine with a configurable lookback window (default 7 days, capped at N sessions); triggers are explicit CLI (cm reflect) or MCP (memory_reflect) invocations, not automatic event hooks. Extraction: a Zod-typed DiaryEntry per session (accomplishments, decisions, challenges, key learnings, preferences, tags, search anchors), then reflector-generated PlaybookDelta operations (add, helpful, harmful, replace, deprecate, merge); this is the same extractDiary() / reflectOnSession() pipeline described above, just recast at the artifact level. A separate LLM validator can refine or reject each delta, and an evidence gate (evidenceCountGate) checks whether cass history supports the proposed rule. The oracle is the LLM reflector plus heuristic classifySessionOutcome() over transcript sentiment regexes plus explicit [cass: helpful|harmful] comments — multi-signal but still without ground-truth labels. Promotion target: inspectable YAML playbook bullets with scored confidence and maturity states; never weights. Shared global playbook plus optional per-repo overlay. Scope: cross-agent, multi-session, with optional workspace scoping; a debugging rule discovered in a Cursor session becomes available to Claude Code on the next cm context call. Timing: offline batch, manually triggered — reflection is not online during the originating session, and the originating agent does not see its own newly-mined rules until the next query. On the survey's axes, cass-memory occupies the cross-agent session aggregator slot on axis 1 (still the clearest instance of that ingestion pattern in the survey) and the artifact-learning pole on axis 2 — there is no weight-promotion path. 
Since v0.2.3 the auto-outcome mechanism adds a third feedback source alongside explicit marks and inline comments, which sharpens rather than challenges the survey's claim that scored-flat-rule systems rely on multi-signal outcome attribution with weak oracles.

What to Watch

  • Whether auto-recorded outcomes (v0.2.3+) end up dominating explicit marks in practice, and whether that shifts scoring quality upward or downward — the system has not yet published stats on signal ratios.
  • Whether curate.ts eventually adopts embedding-based conflict detection from semantic.ts, closing the Jaccard-can't-see-semantics gap.
  • Whether the fixed ten-category taxonomy in gap-analysis.ts holds up as the playbook grows, or whether categories start splitting or consolidating.
  • Whether the TOON format gets adopted outside this repo — if so, commonplace may want a TOON exporter for cm-like consumers.
  • How the Ollama / Bedrock / CLI provider additions shift cost and latency: offline embedding and extraction could change when reflection is triggered (on session close vs daily batch).
  • Whether the trauma-guard approach evolves past regex + proximity toward structured execution-trace mining — the concept is sound but the current implementation has recall/precision problems.

Relevant Notes:

  • ACE — sibling: both implement playbook-learning with scored bullets and feedback counters; cass-memory adds cross-agent mining, richer decay scoring, and auto-classified outcomes where ACE has cleaner role separation (generator/reflector/curator)
  • ClawVault — sibling: both pipeline sessions into confidence-scored artifacts; ClawVault uses observation lines with c=/i= scores and session handoffs, cass-memory uses playbook bullets with decay and cross-agent mining
  • trace-derived-learning-techniques-in-related-systems — extends: cass-memory is the clearest instance of the cross-agent session aggregator ingestion pattern and stays on the artifact-learning pole
  • automating-kb-learning-is-an-open-problem — exemplifies: automated extraction plus scoring plus curation works here end-to-end, but signal quality is bottlenecked by agent compliance with feedback conventions
  • continuous-learning-requires-durability-not-weight-updates — exemplifies: cass-memory learns entirely through durable YAML artifacts, never weights — a concrete instance of the claim
  • constraining — exemplifies: playbook bullets constrain agent behavior by specifying rules and anti-patterns; confidence scoring and maturity states adjust constraint strength over time
  • context-efficiency-is-the-central-design-concern-in-agent-systems — exemplifies: cm context pre-assembles relevant rules and history to spend agent tokens on proven guidance rather than raw logs; TOON output is an explicit token-cost reduction move
  • substrate-class-backend-and-artifact-form-are-separate-axes-that-get-conflated — grounds: cass-memory sits squarely in the symbolic-artifact substrate with a file backend and one-line-plus-metadata artifact form