Cludebot

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

Cludebot (sebbsssss/cludebot) now ships as a pnpm/turborepo monorepo — packages/brain (the clude-bot TypeScript SDK and MCP server), packages/database (Supabase SQL schemas), packages/shared (constants, clients, utilities), plus apps/ for the server, chat UI, dashboard, mobile app, and workers. The core value proposition is a cognitive memory layer for LLM agents: typed storage with differential decay, hybrid retrieval, an automated "dream cycle" that consolidates and compacts memories over time, action-outcome learning, and clinamen anomaly retrieval. Three deployment modes are live — hosted HTTP API with CORTEX_API_KEY, self-hosted Supabase + pgvector + Anthropic, and a local JSON store at ~/.clude/memories.json. The MCP server registers nine tools across those routes, but some capabilities are mode-specific. README pitches benchmark numbers (1.96% hallucination on HaluMem).

Repository: https://github.com/sebbsssss/cludebot

Core Ideas

Five typed memories with per-type decay and retrieval boost. packages/shared/src/utils/constants.ts fixes DECAY_RATES at episodic 0.93 / semantic 0.98 / procedural 0.97 / self_model 0.99 / introspective 0.98 per 24 h cycle, and KNOWLEDGE_TYPE_BOOST adds +0.15 / +0.12 / +0.10 / +0.12 to non-episodic types at recall time. The type is supplied by the caller of brain.store() — there is no content-based classifier. Decay runs daily (cron) and retrieval-time scoring multiplies a composite (recency + relevance + importance + vector similarity + graph boost + cooccurrence) by the stored decay_factor, with weights 1 / 2 / 2 / 4 / 1.5 / 0.4 respectively — vector similarity dominates when embeddings are configured.
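
The scoring shape described above can be sketched in a few lines. The weights and the final decay multiplication follow the constants named in the text; the interface and function names here are illustrative, not cludebot's actual API.

```typescript
// Sketch of the retrieval-time composite score: six weighted signals,
// then multiplied by the stored decay_factor. Field names are assumptions.
interface ScoredMemory {
  recency: number;        // 0..1, newer is higher
  relevance: number;      // keyword/metadata match, 0..1
  importance: number;     // 0..1, assigned at store time
  vectorSim: number;      // cosine similarity, 0..1
  graphBoost: number;     // bond-traversal contribution, 0..1
  cooccurrence: number;   // entity co-occurrence contribution, 0..1
  decayFactor: number;    // multiplicative decay state, 0..1
}

const WEIGHTS = {
  recency: 1, relevance: 2, importance: 2,
  vectorSim: 4, graphBoost: 1.5, cooccurrence: 0.4,
};

function compositeScore(m: ScoredMemory): number {
  const base =
    WEIGHTS.recency * m.recency +
    WEIGHTS.relevance * m.relevance +
    WEIGHTS.importance * m.importance +
    WEIGHTS.vectorSim * m.vectorSim +
    WEIGHTS.graphBoost * m.graphBoost +
    WEIGHTS.cooccurrence * m.cooccurrence;
  return base * m.decayFactor; // decay scales the whole composite
}
```

With vector similarity at weight 4 out of a 10.9 total, a 0.25 swing in similarity outweighs a full swing in recency — which is why the text says similarity dominates when embeddings are configured.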

Seven-phase hybrid recall pipeline. packages/brain/src/memory/memory.ts runs: (1) optional LLM query expansion via OpenRouter/Llama with a timeout; (2) parallel vector search at memory-level and fragment-level through pgvector HNSW; (3) metadata + keyword filtering plus optional BM25 via the experimental tsvector RPC; (4) knowledge-seed injection (curated factual memories that always compete); (5) entity-aware expansion — direct entity recall plus co-occurring entity memories; (6) bond-typed graph traversal over ten link types weighted by BOND_TYPE_WEIGHTS (causes 1.0 > supports 0.9 > concurrent_with 0.8 > resolves 0.8 > happens_before/after 0.7 > elaborates 0.7 > contradicts 0.6 > relates 0.4 > follows 0.3); (7) composite scoring and owner-wallet guard. Fragment-level search decomposes each memory into summary/content-chunk/tag-context fragments, each separately embedded, so partial matches on long memories beat whole-memory averaging.
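
Phase 6's bond-typed traversal can be sketched as follows. The weight table matches the ordering given above; the traversal arithmetic (seed score × link strength × bond weight) is an assumption about the shape, not a transcription of cludebot's code.

```typescript
// Illustrative bond-typed graph traversal: each seed memory passes a
// fraction of its score along outgoing links, scaled by link strength
// and the bond-type weight.
const BOND_TYPE_WEIGHTS: Record<string, number> = {
  causes: 1.0, supports: 0.9, concurrent_with: 0.8, resolves: 0.8,
  happens_before: 0.7, happens_after: 0.7, elaborates: 0.7,
  contradicts: 0.6, relates: 0.4, follows: 0.3,
};

interface Link { target: string; bondType: string; strength: number }

function graphBoosts(seedScore: number, links: Link[]): Map<string, number> {
  const boosts = new Map<string, number>();
  for (const l of links) {
    const w = BOND_TYPE_WEIGHTS[l.bondType] ?? 0; // unknown bonds contribute nothing
    const boost = seedScore * l.strength * w;
    boosts.set(l.target, Math.max(boosts.get(l.target) ?? 0, boost));
  }
  return boosts;
}
```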

Six-operation dream cycle with event-driven and cron triggers, plus an optional JEPA phase. packages/brain/src/memory/dream/cycle.ts runs Consolidation (focal-point questions from recent episodic summaries, retrieval per question, evidence-linked semantic insights, with cop-out detection that discards "Good question..." responses), Compaction (old 7+ day episodic memories with decay < 0.3 and importance < 0.5 grouped by concept and summarized into a single semantic memory linked by evidenceIds), Reflection (self-model updates with evidence citations), Contradiction Resolution (resolves unresolved contradicts links and accelerates decay on the weaker side), Learning (runLearning dispatches to the action-learning pipeline), and Emergence (introspective synthesis, optionally posted to X or caught by setEmergenceHandler in SDK mode). The cycle is triggered when importanceAccumulator crosses REFLECTION_IMPORTANCE_THRESHOLD = 2.0 (with a 30 min min interval) or on the 6 h cron. A 10 min hard timeout wraps all phases. A Phase 4.5 Deep Connection (deep-connection.ts) runs between Learning and Emergence if JEPA_ENABLED=true: it sends sampled memories' embeddings and metadata to an external JEPA service, receives latent embeddings predicted for three relation types (supports, causes, elaborates), matches them against the memory store, filters out links normal recall would already find, and batch-creates the non-obvious links. A circuit breaker (5 failures → 5 min cooldown) degrades gracefully when JEPA is unreachable.

Action-learning loop is still the most complete trajectory-to-rule pipeline among reviewed systems. packages/brain/src/memory/action-learning.ts has three stages. logAction records an ActionRecord (action, reasoning, feature, trigger, related user) as an episodic memory tagged [action, action:<feature>, awaiting_outcome] with importance 0.5. logOutcome records the OutcomeRecord (sentiment, numeric score, measurement method) and updates the original action's importance based on sentiment. The refineStrategies pass groups outcomes by feature tag over the past week, and for any feature where negRate > 0.6 or posRate > 0.7 it asks Claude for a one-sentence procedural rule ("When... / Avoid... / Continue..."), stored as a procedural memory tagged [strategy, learned, strategy:<feature>] with importance 0.75. Successful strategies get a +0.05 importance bump per matching positive outcome (Hebbian reinforcement). A separate trackSocialOutcomes path fetches tweet engagement after 6+ hours and maps it to sentiment — the concrete oracle tying the loop to the social-bot origin.
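
The refiner's decision rule is simple enough to state in code. The thresholds (negRate > 0.6, posRate > 0.7) come from the text; the function name and sentiment type are illustrative.

```typescript
// Minimal sketch of the refineStrategies decision: outcomes grouped by
// feature, a caution rule when failures dominate, a reinforce rule when
// successes dominate, nothing on mixed signal.
type Sentiment = "positive" | "negative" | "neutral";

function strategyKind(outcomes: Sentiment[]): "caution" | "reinforce" | null {
  if (outcomes.length === 0) return null;
  const neg = outcomes.filter((s) => s === "negative").length / outcomes.length;
  const pos = outcomes.filter((s) => s === "positive").length / outcomes.length;
  if (neg > 0.6) return "caution";    // ask the LLM for an "Avoid..." rule
  if (pos > 0.7) return "reinforce";  // ask for a "Continue..." rule
  return null;                        // mixed signal: no rule this week
}
```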

Clinamen as structured anti-relevance retrieval. packages/brain/src/memory/clinamen.ts keeps the "divergence = importance × (1 − vectorSimilarity)" heuristic. Filters: importance ≥ 0.6, decay_factor ≥ 0.2, age ≥ 24 h, source not in {demo, demo-maas, consolidation}, and vector similarity < 0.35. The fallback path drops embedding requirements and shuffles high-importance old memories. Exposed both on the brain API and as the MCP find_clinamen tool.
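
The heuristic and its eligibility filters fit in one function. Thresholds are the ones listed above; field names are illustrative.

```typescript
// Clinamen in one function: divergence = importance × (1 − similarity),
// applied only to memories that pass the eligibility filters.
interface Candidate {
  importance: number; decayFactor: number; ageHours: number;
  source: string; vectorSim: number;
}

const EXCLUDED_SOURCES = new Set(["demo", "demo-maas", "consolidation"]);

function clinamenDivergence(c: Candidate): number | null {
  const eligible =
    c.importance >= 0.6 && c.decayFactor >= 0.2 && c.ageHours >= 24 &&
    !EXCLUDED_SOURCES.has(c.source) && c.vectorSim < 0.35;
  return eligible ? c.importance * (1 - c.vectorSim) : null;
}
```

The inversion is the point: among eligible memories, the *least* query-similar important memory scores highest — structured anti-relevance rather than random sampling.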

Entity knowledge graph with co-occurrence. packages/brain/src/memory/graph.ts (617 lines) extracts entities via regex heuristics (Twitter handles, Solana wallet addresses, token tickers, capitalized multi-word names), stores them in a dedicated entities table with embeddings, links mentions with salience scores, and adds co_mentioned relations for entities that appear in the same memory. findSimilarEntities, getMemoriesByEntity, and getEntityCooccurrences feed the recall pipeline's Phase 5. The hosted mode currently does not expose the entity graph — the README's feature matrix confirms this is a self-hosted-only capability.

Source-aware Hebbian reinforcement with anti-confabulation gate. Every recall increments access_count, resets decay, and strengthens association bonds between co-retrieved memories by LINK_CO_RETRIEVAL_BOOST = 0.05. But INTERNAL_MEMORY_SOURCES = {reflection, emergence, consolidation, dream, active-reflection, introspection, dream-cycle, meditation} get gated: reinforcement scales by INTERNAL_REINFORCEMENT_GATE = 0.3 and importance bumps are +0.005 instead of the external +0.02. Rationale noted in code: Source Monitoring Framework (Johnson et al.) — dream-generated memories should not compound indefinitely through their own recall.
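
The gate reduces to a source check and two scaled constants. Constant names follow the text; the delta-computing function is illustrative.

```typescript
// Source-aware reinforcement gate: dream-generated sources get 0.3x link
// reinforcement and a 4x smaller importance bump than external sources.
const INTERNAL_MEMORY_SOURCES = new Set([
  "reflection", "emergence", "consolidation", "dream",
  "active-reflection", "introspection", "dream-cycle", "meditation",
]);
const LINK_CO_RETRIEVAL_BOOST = 0.05;
const INTERNAL_REINFORCEMENT_GATE = 0.3;

function reinforcementDeltas(source: string) {
  const internal = INTERNAL_MEMORY_SOURCES.has(source);
  return {
    linkBoost: LINK_CO_RETRIEVAL_BOOST * (internal ? INTERNAL_REINFORCEMENT_GATE : 1),
    importanceBump: internal ? 0.005 : 0.02,
  };
}
```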

MCP server now carries nine tools across the hosted, self-hosted, and local routes. packages/brain/src/mcp/server.ts registers recall_memories, store_memory, get_memory_stats, find_clinamen, delete_memory, update_memory, list_memories, batch_store_memories, and extract_skill. Every tool auto-routes to local (JSON file), hosted (HTTP to clude.io), or self-hosted (direct Supabase) based on CLUDE_LOCAL, CORTEX_API_KEY, and SUPABASE_URL, but some capabilities are gated by mode. extract_skill is interesting: takes a domain string, runs seed retrieval plus optional graph expansion over the entity graph, and returns a structured markdown skills document (with optional provenance memory IDs). It is explicitly unavailable in local mode — it needs the entity graph and embeddings.
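
The three-way routing presumably reduces to an environment check like the following. The precedence order (explicit local flag first, then hosted key, then Supabase URL, falling back to local) is an assumption about the shape, not confirmed from server.ts.

```typescript
// Assumed mode resolution from the three env vars named above.
type Mode = "local" | "hosted" | "self-hosted";

function resolveMode(env: Record<string, string | undefined>): Mode {
  if (env.CLUDE_LOCAL === "true") return "local";   // JSON file under ~/.clude
  if (env.CORTEX_API_KEY) return "hosted";          // HTTP to clude.io
  if (env.SUPABASE_URL) return "self-hosted";       // direct Supabase
  return "local";                                   // fallback that always works
}
```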

Confidence-gated recall as an opt-in experimental layer. packages/brain/src/experimental/confidence-gate.ts computes evidence sufficiency from four signals — coverage (how many results vs expected minimum), top score, type diversity, agreement (low stddev across top-3 scores) — weighted 0.30 / 0.35 / 0.15 / 0.20. When the composite falls below threshold (default 0.4), it returns a hedging instruction block to inject ahead of the memory context in the LLM prompt, splitting between "no evidence" and "weak evidence" templates. The recall MCP tool already reports the confidence score back to the agent when _evaluateConfidence is available. Lives under experimental/ alongside BM25, reranker, IRCoT, and RRF merge — all flag-gated via EXP_* env vars, with an orchestrator enhanced-recall.ts designed as a drop-in replacement for recallMemories. None of these are on by default.
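
The composite and the template split can be sketched directly from the description above. The weights and the 0.4 threshold are from the text; the per-signal computations and the coverage-zero rule for the "no evidence" template are simplifying assumptions.

```typescript
// Four-signal evidence-sufficiency score with the stated weights.
interface Signals {
  coverage: number;   // results found / expected minimum, capped at 1
  topScore: number;   // best retrieval score, normalized 0..1
  diversity: number;  // distinct memory types represented, normalized 0..1
  agreement: number;  // 1 − stddev of top-3 scores, clamped to 0..1
}

function confidence(s: Signals): number {
  return 0.30 * s.coverage + 0.35 * s.topScore +
         0.15 * s.diversity + 0.20 * s.agreement;
}

// Below threshold, pick between the two hedging templates.
// (Coverage === 0 → "no evidence" is an assumed split criterion.)
function hedgingTemplate(s: Signals, threshold = 0.4): "none" | "weak" | "no-evidence" {
  const c = confidence(s);
  if (c >= threshold) return "none";
  return s.coverage === 0 ? "no-evidence" : "weak";
}
```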

Local JSON mode is deliberately minimal but complete. packages/brain/src/mcp/local-store.ts (362 lines) stores ~/.clude/memories.json atomically (write-tmp-then-rename), implements keyword + importance + decay + recency scoring, supports clinamen, runs without Supabase, embeddings, LLM calls, or network. No vector search, no entity graph, no dream cycles, no compaction. The trade-off: a self-contained fallback that an MCP-only user gets for free.
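
The atomic-write pattern the local store relies on is standard and worth showing. This sketch uses Node's fs promises API; the temp-file naming and the JSON shape are illustrative, not local-store.ts's actual code.

```typescript
// Write-tmp-then-rename: rename() is atomic on POSIX filesystems, so a
// concurrent reader sees either the old file or the new one, never a
// partially written JSON blob.
import { promises as fs } from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

async function atomicWriteJson(file: string, data: unknown): Promise<void> {
  const tmp = `${file}.${process.pid}.tmp`;
  await fs.mkdir(path.dirname(file), { recursive: true });
  await fs.writeFile(tmp, JSON.stringify(data, null, 2), "utf8");
  await fs.rename(tmp, file); // atomic replace of the previous version
}
```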

Comparison with Our System

| Dimension | Cludebot | Commonplace |
|---|---|---|
| Storage substrate | Supabase (Postgres + pgvector) or local JSON | Filesystem of versioned markdown notes |
| Memory taxonomy | 5-type enum with per-type decay and recall boost | Typed notes (note, structured-claim, adr, index, related-system) with templates and link semantics |
| Write path | SDK/MCP call with LLM importance scoring, fire-and-forget embedding + entity extraction + auto-link + optional Solana memo | Human or agent edits a markdown file; validated by commonplace-validate |
| Retrieval | 7-phase hybrid: expand → vector (memory+fragment) → keyword/BM25 → knowledge-seed → entity graph → bond traversal → composite score × decay | rg search + description scan + tag index + qmd hybrid |
| Consolidation | Automated six-operation dream cycle with event-driven trigger and optional JEPA Phase 4.5 | Manual note writing; /connect assists with discovery |
| Learning loop | Closed action→outcome→lesson pipeline with sentiment oracle and Hebbian reinforcement | Open: human observation, distillation into notes and instructions |
| Graph | Entity graph + 10-type bond graph with traversal weights | Markdown links with semantic verbs (extends, foundation, contradicts, enables) |
| Decay / lifecycle | Continuous: per-type daily decay, compaction of faded old memories, contradiction resolution | status frontmatter field plus human judgment |
| Inspectability | Requires SQL/API/dashboard; local JSON is readable but flat | Every note is a readable file under git |
| Anti-confabulation | Internal-source reinforcement gating + confidence-gated hedging prompt (experimental) | Semantic review bundle + type constraints |
| Integration surface | 9-tool MCP server, HTTP API, TypeScript SDK, dashboard | commonplace-* CLIs, skills, agent conventions |

Where cludebot is stronger. Automated lifecycle. Cludebot's memories consolidate, compact, and resolve contradictions without human intervention, and the action-learning loop is still the tightest trajectory-to-rule system we have reviewed. Hybrid retrieval with vector + BM25 + entity graph + bond traversal is materially more sophisticated than anything we ship. Clinamen remains a unique anti-relevance mechanism. The confidence gate (even in experimental form) is a tool we have nothing analogous to — it gives the calling agent a number that says "stop; your evidence is weak."

Where commonplace is stronger. Structure and inspectability. A cludebot memory is a typed text blob with tags, an embedding, and optional links; commonplace notes have templates, frontmatter discriminators, explicit link verbs, index layers, and a workshop/library separation that cludebot has no equivalent of. Our content is readable files under git; cludebot's is rows in a database or a single JSON blob. The concept ontology that cludebot hardcodes (market_event, whale_activity, holder_behavior, price_action) is still crypto-flavored despite the README's disclaimer — adapting cludebot for a non-social-bot use case requires rebuilding the controlled vocabulary.

What changed since the previous review. The codebase refactored into a pnpm monorepo (packages/brain now holds what was src/, and the SDK/bot/apps are separated). The README still uses a five-phase label, but the implementation now has a 4.5th JEPA Deep Connection phase that calls an external service for latent-relation-type embedding prediction. The MCP surface grew from four tools to nine with full CRUD (delete_memory, update_memory, list_memories, batch_store_memories) and a new extract_skill that synthesizes domain skills documents from the entity graph. A whole experimental/ subsystem appeared with BM25, cross-encoder reranking, RRF merge, IRCoT multi-hop, temporal bonds, and confidence gating — all flag-gated, all documented with expected benchmark deltas. Hosted mode is more prominent; dream cycles now run server-side for hosted agents in hosted-dreams.ts. The JEPA path is the most substantive new mechanism: it is the first time cludebot leans on a learned model outside Claude/OpenRouter to drive memory structure.

Borrowable Ideas

Confidence-gated response generation. Ready to borrow. The four-signal score (coverage, top score, diversity, agreement) over retrieved results is cheap and gives the calling agent a concrete signal that its evidence is weak, plus a hedging instruction it can prepend. Our review bundle and /connect tools could emit something similar — when a KB search returns thin results for a query, tell the agent that in words rather than hoping it infers low confidence from the sparse matches.

Cop-out detection as a cheap quality gate for generated content. Ready to borrow. The regex array in cycle.ts catches the seven most common LLM stall patterns before they get stored as semantic memories. Trivial to add to any pipeline where we accept model-generated notes or review suggestions.
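
An illustrative version of the filter — a regex array screened against LLM output before storage. These example patterns are assumptions about the kind of stall phrases targeted; cycle.ts's actual list is not reproduced here.

```typescript
// Cheap quality gate: reject model output that opens with a stall phrase
// instead of content. Patterns below are illustrative examples.
const COP_OUT_PATTERNS: RegExp[] = [
  /^good question/i,
  /^that'?s (a )?(great|interesting) question/i,
  /^i (don'?t|do not) have enough (information|context)/i,
  /^as an ai/i,
  /^it('?s| is) (hard|difficult) to say/i,
  /^without more (information|context)/i,
  /^i cannot (determine|answer)/i,
];

function isCopOut(text: string): boolean {
  const t = text.trim();
  return COP_OUT_PATTERNS.some((p) => p.test(t));
}
```

Anything this cheap that runs before a store call pays for itself: one rejected stall is one fewer junk semantic memory competing in recall forever.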

Split tagging from mutation. Ready to borrow. Cludebot's Hebbian reinforcement is done in deterministic database updates, not via the LLM — the model proposes associations, code updates counters and decay. Same pattern as ACE: keep score/counter updates out of the LLM path so they remain cheap, audit-able, and independent of generation quality.

Source-aware reinforcement gating. Ready to borrow as a principle. The INTERNAL_MEMORY_SOURCES set and 0.3x reinforcement gate is a concrete implementation of "don't let self-generated artifacts reinforce each other at the same rate as external signal." Applied to commonplace, this is the argument against letting agent-authored notes accumulate the same way human-curated ones do — they need a dampening factor in any downstream ranking or trust score.

Event-driven processing via importance accumulation. Needs a use case first. accumulateImportance + threshold + min-interval is a smarter scheduling primitive than fixed cron for any eventual maintenance sweep. We would need enough regular agent activity to generate the signal.
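
The primitive is small enough to sketch whole. The threshold (2.0) and minimum interval (30 min) are the constants named earlier in this note; the class shape is illustrative.

```typescript
// Event-driven trigger: accumulate importance per store call, fire when
// the running total crosses the threshold AND the min interval has elapsed.
const REFLECTION_IMPORTANCE_THRESHOLD = 2.0;
const MIN_INTERVAL_MS = 30 * 60 * 1000; // 30 minutes

class DreamTrigger {
  private accumulated = 0;
  private lastFired = -Infinity;

  // Returns true when a maintenance sweep should start now.
  accumulate(importance: number, now: number): boolean {
    this.accumulated += importance;
    const due = this.accumulated >= REFLECTION_IMPORTANCE_THRESHOLD &&
                now - this.lastFired >= MIN_INTERVAL_MS;
    if (due) { this.accumulated = 0; this.lastFired = now; }
    return due;
  }
}
```

A busy period fires early and often (capped by the interval); a quiet period fires never — which is exactly the property fixed cron lacks.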

Extract-skill as a first-class tool. Worth watching. extract_skill(domain, depth, include_provenance) turns memories into a shareable markdown skills document — structurally similar to how we imagine commonplace-distill or a workshop-to-library promotion flow might work. Cludebot does it from the entity graph; we would do it from tag indexes plus type filters. Same shape.

Fragment-level embedding for long notes. Needs a use case first. We do not use vector search for primary retrieval, but if we add it, decomposing long notes into summary + chunks + tag-context fragments and embedding each beats whole-note averaging. The overhead is 3-5x embedding writes per note.
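
A sketch of the decomposition, assuming fixed-size content chunks; cludebot's actual chunking strategy and sizes are not confirmed, so the 400-character window here is illustrative.

```typescript
// Decompose a note into separately embeddable fragments: one summary,
// N content chunks, one tag-context string.
interface Fragment { kind: "summary" | "chunk" | "tag-context"; text: string }

function decompose(summary: string, content: string, tags: string[],
                   chunkSize = 400): Fragment[] {
  const frags: Fragment[] = [{ kind: "summary", text: summary }];
  for (let i = 0; i < content.length; i += chunkSize) {
    frags.push({ kind: "chunk", text: content.slice(i, i + chunkSize) });
  }
  frags.push({ kind: "tag-context", text: tags.join(", ") });
  return frags;
}
```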

JEPA-style latent relation prediction. Research-grade only. Using a learned model to predict what should link to a memory — then filtering out what normal recall already finds — is a credible mechanism for surfacing non-obvious connections. For commonplace, the analogous move would be a connection suggestor that looks for links normal search does not surface. Requires a trained encoder; not currently worth the cost.

Curiosity Pass

Does the JEPA Deep Connection phase transform data or just relocate it? It transforms in principle — the JEPA service returns embeddings that represent "what supports/causes/elaborates this memory would look like," and the match phase finds store-memories near those points. That is a genuinely new link candidate set. But the code has three safeguards that limit the mechanism's reach: the filter against memories already returned by normal recall, the circuit breaker, and the markJepaQueried cooldown. Net effect: a small steady trickle of strength=0.5 links, most of which never get reinforced into the 0.8+ range where they dominate traversal. The simpler alternative is the existing entity co-occurrence graph, which is already doing most of the non-obvious linking work without a learned model. Whether JEPA earns its keep is an empirical question the repo does not answer.

What property does the typed-memory taxonomy actually produce? Two things: differential decay, and a recall-time knowledge-type boost that favors distilled content over raw episodes. Both are real, code-confirmed, and work. But the classification is caller-supplied at store time. There is no oracle deciding whether a given memory is actually episodic or semantic — which means the taxonomy's quality is the calling agent's quality at typing. A misrouted episodic-as-semantic memory decays roughly 3.5x slower (98% vs 93% retention per cycle) and gets a +0.15 rank boost it should not have.

Does the confidence gate actually prevent hallucination? Structurally, it reduces the opportunity for hallucination by injecting a hedging instruction when evidence is weak. Whether the instruction is followed depends entirely on the downstream LLM obeying it. The mechanism is a prompt, not an enforcement gate — the LLM can still generate whatever it wants. The claimed HaluMem numbers in the README (1.96% vs 21% average) would require this mechanism to be both active and effective in the benchmark run. Worth verifying externally; the code is doing the right shape of work, but the number depends on the full-system setup.

What is the simpler alternative to the 6-operation dream cycle? Every phase except Consolidation is a narrower operation. Compaction is "summarize and archive faded memories" — implementable as a nightly cron job with a single LLM call per concept group. Reflection is "update self-model"; Contradiction Resolution is "resolve flagged conflicts"; Learning is a separate pipeline dispatched from within the cycle; Emergence is "introspective synthesis." Bundling them into a ceremonial "dream" with inter-phase sleeps is evocative framing more than architectural necessity. The dream metaphor is sticky but the underlying mechanism is six unrelated batch jobs that happen to be triggered together. A simpler design: six independent cron tasks with their own triggers.

What could the action-learning loop actually achieve, even perfectly? The ceiling is "the agent's procedural rules reflect observed outcomes in the domains where outcomes are measurable." In cludebot's case that means tweet engagement → Twitter reply strategies. For domains without a sentiment oracle (private agent work, code tasks, reasoning), the loop degenerates to logAction only — the refiner cannot produce lessons without outcome data. The README's "A non-developer built a 5,750-line autonomous agent on Clude in two weeks" anecdote is the right test: did that agent's procedural memories improve over time, or did they just accumulate? Not answered in the repo.

The social-bot DNA is still visible. Despite the README's disclaimer ("override or ignore" the default concept labels), MEMORY_CONCEPTS still hardcodes market_event, whale_activity, holder_behavior, token_economics, price_action, engagement_pattern, recurring_user, identity_evolution. Entity extraction regexes still prioritize Solana addresses and token tickers. Social outcome tracking still fetches tweet metrics. The apps/ directory includes a price oracle and on-chain Anchor programs. The memory primitives are domain-general; the default configuration is not.

The experimental subsystem is more ambitious than the production pipeline. enhanced-recall.ts is described as a drop-in replacement for recallMemories. If the BM25, reranker, IRCoT, and confidence-gate combination actually delivers the benchmark improvements the README claims, one would expect it on by default. It is not. Either the default pipeline is already good enough, or the experimental layer has reliability/cost issues that keep it flag-gated. Worth watching whether any of the flags flip to on by default in coming releases.

Trace-derived learning placement.

  • Trace source: twofold — live session traces from the autonomous bot (mentions, replies, user interactions ingested as episodic memories with source: mention|conversation|tweet|dm) and repeated action-outcome pairs from the agent's own operations (logged via logAction/logOutcome).
  • Trigger boundaries: per-memory-store (the importance accumulator), per-6-hour cron (dream cycles), per-action (logAction), and per-7-days (the strategy refiner's feature grouping).
  • Extraction: two layers — the dream cycle's Consolidation phase extracts focal-point-guided semantic insights from episodic summaries with LLM-parsed evidence citations; the strategy refiner extracts procedural rules from grouped action-outcome pairs using a caution/reinforce classifier based on negRate > 0.6 or posRate > 0.7.
  • Judges: the cop-out filter and the confidence gate decide which LLM outputs get stored and how the calling LLM is prompted when evidence is thin.
  • Oracle: tweet engagement fetched 6+ hours later; for the hosted bot this is a live external signal.
  • Promotion target: inspectable symbolic artifacts — rows in the memories table with types semantic, procedural, self_model and evidence-linked provenance. JEPA's promotion target is also artifacts: new memory_links rows with strength=0.5. No weights are produced; the external JEPA service is consumed, not trained by cludebot itself.
  • Scope: service-owned per-agent (scoped by owner_wallet) and generalizable within an owner — strategies learned under one agent stay under that agent's owner scope; there is no cross-agent sharing.
  • Timing: online during deployment — dream cycles run while the bot is live, action learning runs continuously, and the hosted worker dreams for all registered agents every 6 hours.

On the survey's axes, cludebot sits on axis 1 as a service-owned trace backend (it owns the agent runtime and mines its own event stream, like OpenViking) and on axis 2 firmly on the artifact-learning side (with JEPA as an externally-hosted weight consumer, not a weight producer). Cludebot strengthens the survey's claim that the richest artifact-learning loops tie into a concrete outcome oracle — social engagement is what makes the strategy refiner's negRate/posRate meaningful, while the cop-out filter and confidence gate are weaker quality guards rather than synthesis oracles. The only refinement to the existing cludebot placement is the external-weights-as-consumer pattern (JEPA), which is more like a service dependency than a learned-weights promotion. No new subtype is warranted; the system remains the clearest example of "service-owned trace backend + artifact promotion + concrete engagement oracle + optional external model for link suggestion."

What to Watch

  • Whether any of the experimental pipeline's flags (BM25, reranker, confidence gate, IRCoT, RRF merge) flip to on by default. A production flip on confidence-gate would be the most interesting because it changes the downstream LLM contract.
  • Whether JEPA Deep Connection produces links that actually get reinforced through co-retrieval, or whether most strength=0.5 JEPA links stay at 0.5 forever. The data is there to measure; the repo does not expose a monitor.
  • Whether the concept ontology (MEMORY_CONCEPTS) gets generalized or becomes pluggable. A pluggable concept schema per owner-wallet would materially change the adaptability story.
  • Whether hosted mode gets entity-graph support. The README feature matrix currently lists "Entity graph: self-hosted only" — closing that gap would make the graph mainstream instead of optional.
  • Whether the HaluMem 1.96% benchmark result is independently reproducible with the default pipeline, or requires the full experimental stack. The code has both paths; the claim does not say which was used.
  • Whether extract_skill moves toward two-way integration with external knowledge systems (shared markdown? SKILL.md-style?). The current output is a one-shot export; bi-directional sync would make cludebot a memory layer for agent skills.
  • Whether the action-learning loop gets a non-social oracle — code review feedback, test pass/fail, user ratings. The pattern generalizes only as far as the oracle does.

Relevant Notes: