Designing a Memory System for LLM-Based Agents

Type: kb/types/note.md · Status: current · Tags: agent-memory, context-engineering, learning-theory

An ideal memory system for LLM-based agents should be defined first by the needs it must satisfy, not by its storage architecture. It must preserve useful evidence, turn experience into future capacity, assemble the right context for bounded calls, steer future behavior when past lessons apply, and revise or retire memory when it stops earning its cost.

The first version of this note proposed an integrated design around trace, observation, episode, and library layers. That integration was premature. The narrower claim has more reach: agent memory is a context engineering (right-knowledge-into-bounded-context) problem. The hard part is deciding which remembered material should affect a future answer, action, artifact, or system rule.

The realistic target is a bounded memory system: one that names its consumers, captures or authors enough evidence for them, uses the strongest available signals first, exposes uncertainty, and moves learned material toward stronger behavior-changing surfaces only when evidence, authority, and maintenance economics justify it.

The Memory Problem Is Crosscutting

Agent memory cuts across the runtime rather than sitting in one component; it is a crosscutting concern, not a separable niche. Storage belongs on the execution substrate, retrieval and activation belong in the context engine, and learning belongs in the loop that turns experience into readable or executable artifacts.

The base path is direct memory creation. A memory system cannot serve its core role until a human or agent can recognize a durable claim, policy, procedure, index, test, instruction, or plugin and write it into the right artifact. Trace-derived extraction then becomes a higher-order loop: the system studies its own sessions to discover memory-creation opportunities that direct authoring missed.

Memory Is More Than Retrieval

The learning loop should not stop at prose memory. Deploy-time learning is the missing middle: deployed agent systems improve across sessions by updating durable system-definition artifacts such as prompts, instructions, schemas, checks, scripts, plugins, and tools. When a learned pattern becomes deterministic enough, it should move toward codification (committing procedure to a symbolic medium), because bookkeeping work is more reliable on a symbolic substrate than when re-run through the LLM each time.

Calling those behavior-changing artifacts "memory" is more than a loose extension of retrieval vocabulary. Human memory vocabulary already distinguishes declarative memory, which supports explicit recall, from procedural memory, which is expressed as learned skill, habit, and action disposition. The KB's Tulving taxonomy note treats the semantic/episodic/procedural split as suggestive but not decisive. The agent analogue here is functional, not biological: a note that answers "what do we know?" is declarative memory, while a checklist, skill, test, guard, or instruction that changes what the agent does next is procedural memory.

That distinction matters because the same remembered material can serve different roles. Axes of artifact analysis distinguishes knowledge-role artifacts, which answer questions, from system-definition-role artifacts, which steer behavior. A decision rationale is knowledge when the agent asks "why did we choose this?" The same rationale becomes system-definition when it prevents the agent from proposing the rejected alternative again.

This role split prevents a common design error: treating memory as better retrieval-augmented generation (RAG). RAG is a declarative-memory pattern: ask a question, retrieve relevant knowledge, and put it in context. Agent memory also needs proceduralization: lessons must become instructions, skills, tests, checks, tools, guardrails, or work-surface changes that alter future action. Search can answer direct questions, but it does not decide which routines should be compiled, when they should fire unasked, or when they should be retired. Knowledge storage does not imply contextual activation: a stored lesson has not helped unless it appears in the right bounded context, with enough priority and framing to change what happens next.

Need 1: Create Memory Directly

Direct memory creation is the base operation. If current work reveals a stable claim, procedure, policy, index entry, validation rule, or tool extension, the natural memory operation is to write that artifact directly. Waiting for a later extractor to rediscover the same lesson from the transcript is a fallback.

This direct path is what makes Commonplace useful before any session-scraping pipeline exists. Notes, indexes, instructions, reviews, validation scripts, and generated indexes are all memory artifacts: they preserve learned structure and change what future agents can find or do.

Direct creation should not mean blank-page authoring. The system should help the agent choose the right artifact shape and satisfy that artifact's quality contract at write time. In Commonplace, collections carry quality and linking requirements, writing skills retrieve those requirements, and type references point to files that describe how to create artifacts of that type. The general pattern is broader than Commonplace: memory destinations should expose their creation contract, not just accept content after the fact.

Realistic memory artifacts include:

  • Notes and decision records for claims, rationales, alternatives, and negative results.
  • Instructions, skills, checklists, and runbooks for repeated work patterns.
  • Indexes and link maintenance for navigation and context discovery.
  • Tests, validators, scripts, plugins, and runtime extensions for deterministic learned behavior.
  • Work-surface updates when the authoritative destination is a ticket, report, dashboard, product configuration, or source document rather than the memory system itself.

Realistic authoring supports include:

  • Routing cues that suggest which artifact class should receive the learned material.
  • Collection or domain quality requirements loaded before writing.
  • Type-specific procedures, templates, schemas, or rubrics for producing valid artifacts.
  • Link and index obligations that make the new memory findable rather than merely stored.
  • Validators, linters, review gates, or preview checks that catch malformed or low-quality memory before promotion.
  • Import tools that convert external knowledge into the system's own artifact forms.

Direct authoring still needs evaluation and lifecycle management. A note can be accurate but hard to find. An instruction can steer behavior incorrectly. A check can fossilize a temporary workaround. Direct memory is not automatically good; it is the first-order capability that trace-derived learning later improves.
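The write-time contract idea can be sketched as a small validation gate. Everything below is illustrative: the `CreationContract` shape and its field names are hypothetical, not Commonplace's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CreationContract:
    """Quality contract a memory destination exposes at write time (hypothetical)."""
    artifact_type: str
    required_fields: list = field(default_factory=list)
    min_links: int = 0   # link obligations: findable, not merely stored

def validate_candidate(contract, frontmatter, links):
    """Return the problems a write-time gate would report before accepting the artifact."""
    problems = []
    for f in contract.required_fields:
        if not frontmatter.get(f):
            problems.append(f"missing required field: {f}")
    if len(links) < contract.min_links:
        problems.append(f"needs at least {contract.min_links} link(s)")
    return problems

note_contract = CreationContract(
    artifact_type="note",
    required_fields=["title", "status", "description"],
    min_links=1,
)

# A note can be accurate yet fail the contract: here the content is fine but unlinked.
problems = validate_candidate(
    note_contract,
    {"title": "Cue staleness", "status": "current", "description": "..."},
    links=[],
)
```

The point of the sketch is that the destination, not the author, owns the contract: the gate rejects an unlinked note even when its content is correct.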

Need 2: Import External Knowledge Into Internal Form

Memory creation does not only happen by writing new artifacts from the current session. The system should also import external knowledge bases, documents, repositories, source snapshots, tickets, notes, or prior archives into its own internal form. Import is not copying; it is a distillation (directed context compression) and constraining (narrowing interpretation space) step. External material is converted into artifacts that obey the receiving system's types, links, quality requirements, provenance rules, and retrieval surfaces.

This matters because much of the memory a system needs already exists elsewhere. A project may have an old wiki, a README forest, issue threads, source snapshots, API docs, chat exports, or another knowledge base. Leaving that material external preserves evidence but does not make it agent-usable. The memory system needs import paths that add structure the external source may not have: summaries, semantic descriptions, typed artifacts, links to existing concepts, authority markers, lifecycle status, and source pointers.

Realistic import methods include:

  • Snapshots that preserve external sources before analysis, so later claims can be audited.
  • Ingest reports or source reviews that classify external material, summarize it, name limitations, and link it into the internal graph.
  • Conversion tools that turn raw text or legacy notes into typed internal artifacts with frontmatter, descriptions, links, and status.
  • Directory or repository ingestion that treats a related file tree as one source unit rather than many disconnected snippets.
  • Re-ingest workflows that rerun classification and connection after the internal KB has changed.
  • Staging in a workshop when the imported material is too large, messy, or contested to promote directly.

The point is not to erase the source's original shape. It is to make enough of that source available in the memory system's own language that future agents can find, trust, activate, and revise it.
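A minimal sketch of that conversion step, assuming the classify, summarize, and link steps are pluggable (rule-based, model-based, or human). None of these names come from a real import tool; they mark where structure gets added that the external source lacked.

```python
import hashlib

def ingest(external_doc, classify, summarize, link_candidates):
    """Convert one external document into a typed internal artifact (illustrative)."""
    raw = external_doc["body"]
    return {
        "type": classify(raw),             # typed artifact, not raw text
        "summary": summarize(raw),         # distillation step
        "links": link_candidates(raw),     # connections into the existing graph
        "status": "candidate",             # not yet promoted to durable memory
        "source": external_doc["path"],    # provenance pointer for later audit
        "source_hash": hashlib.sha256(raw.encode()).hexdigest(),
    }

artifact = ingest(
    {"path": "legacy-wiki/deploy.md", "body": "Deploys run nightly via cron."},
    classify=lambda text: "procedure",
    summarize=lambda text: text[:60],
    link_candidates=lambda text: ["ops/deployment"],
)
```

The source's original shape survives through the pointer and hash; the internal artifact carries the structure the receiving system requires.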

sift-kg is a partial fulfillment of this import need in a document-to-graph setting. It turns document folders into a derived knowledge graph with discovered schemas, materialized pipeline stages, confidence scores, provenance, and human-gated merge/relation review. Commonplace would not copy that graph-first architecture directly, but the implementation shows what import requires beyond "upload documents": schema choice, source preservation, deduplication, review state, and derived artifacts that can be regenerated.

Need 3: Preserve Evidence Without Making History The Next Context

The memory system needs a capture substrate that keeps enough source material for later extraction, audit, debugging, and redistillation. Broad trace retention is useful because the future consumer is often unknown when the session occurs.

For a single-user agent harness, broad retention is usually cheap because the stream is mostly text: prompts, model outputs, tool calls, file diffs, command output, and small structured artifacts. Conventional software and hardware can store that volume at negligible cost compared with the cost of reconstructing missing reasoning later. This justifies storing text traces broadly when retention and redaction policy allow it.

The limits are payload class and scale. Once traces include large binary or media artifacts (movies, long audio, screenshots, screen recordings, datasets, build products, or telemetry firehoses), "store everything" stops being a safe default. Multi-user systems also change the calculation: aggregate volume grows faster, traces contain other people's private or regulated material, authority over retention becomes contested, and search pollution becomes a shared operational cost.

For large payloads, the memory system may keep metadata, thumbnails, transcripts, hashes, sampled excerpts, or provenance pointers instead of raw files. For shared use, it may need per-user retention policies, access controls, or externally managed blobs.

But store-everything is only a capture posture. Session history should not be the default next context because persistence and loading are separate decisions. Raw traces should usually remain outside the acting agent's ordinary context and load only for provenance checks, dispute resolution, debugging, redistillation, or evaluation.

Realistic methods for this need include:

  • Complete session traces with tool calls, timestamps, outputs, errors, and final artifacts.
  • Structured event logs that capture actions, decisions, errors, approvals, and produced artifacts without preserving every token.
  • Artifact provenance records that link durable notes, policies, decisions, tests, scripts, or plugins back to the sessions and sources that produced them.
  • Redacted trace stores where secret scrubbing and retention policy run before extraction or model inspection.
  • Selective capture in high-risk domains where privacy, legal retention, media payloads, or data volume make broad logging unacceptable.

The design choice is not "store everything or summarize everything." It is "store eligible material under policy, then load selectively." Distillation (directed context compression) works better when source material remains available, but the source material still needs governance.
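The "store eligible material under policy, then load selectively" posture can be sketched as a capture decision per payload. The size threshold and retention classes are invented for illustration.

```python
import hashlib

TEXT_LIMIT = 256 * 1024   # illustrative threshold, not a recommended value

def capture_record(payload, kind, retention_class):
    """Decide what the trace store keeps for one event payload."""
    data = payload if isinstance(payload, bytes) else payload.encode()
    keep_body = kind == "text" and len(data) <= TEXT_LIMIT
    return {
        "mode": "full" if keep_body else "metadata-only",
        "body": payload if keep_body else None,   # large/binary: pointers, not blobs
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "retention_class": retention_class,       # policy runs before extraction
        "loaded_by_default": False,               # persistence and loading are separate
    }

small = capture_record("user: actually, use rsync", "text", "standard")
blob = capture_record(b"\x00" * (1024 * 1024), "binary", "short")
```

Note that both records keep a hash and size even when the body is dropped, so later claims about the payload remain auditable.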

Need 4: Use Trace-Derived Extraction As Meta-Learning

Once direct memory creation exists, the system can improve it by studying what direct authoring missed. Session logs contain latent memory-creation opportunities, but they differ by oracle strength.

Corrections are strongest because the log contains both a negative and a positive signal. Silent failures are weaker: the task appears completed, but the trace shows errors, retries, fallback paths, warning output, or weakened guarantees. Preferences are distributed over many accept/reject events. Procedures show up as recurring action sequences. Discoveries and broad syntheses have the weakest immediate oracle; their value often appears only through later reuse.

The system therefore needs a meta-learning taxonomy ordered by signal quality, not by topic popularity. It should start where the oracle is strongest and delay automation where the oracle is weak. In bitter lesson terms, memory systems should prefer scalable search and learning where feedback is strong, while keeping weak-oracle knowledge work in reviewable artifacts until evaluation can justify relaxation.

Realistic methods include:

  • Narrow, schema-constrained extraction prompts for one signal type at a time.
  • Classifiers or simple rules for explicit events: user correction, command failure, retry, fallback, approval, rejection, or repeated tool sequence.
  • Batch analysis over many sessions for preferences, procedures, and recurring failure patterns.
  • Human or agent review queues for weak-oracle candidates such as discoveries, broad design principles, or high-impact policy changes.
  • Confidence, source pointers, and candidate status fields so extracted items do not masquerade as durable knowledge.
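The strongest-oracle-first ordering can be sketched as a rule-based classifier over trace events. The regex heuristics and event shapes are stand-ins for whatever signals a real trace store exposes.

```python
import re

CORRECTION = re.compile(r"\b(actually|wrong|instead)\b")

def classify_event(event):
    """Return the strongest signal this event matches, checked strongest-oracle-first."""
    role, text = event.get("role"), event.get("text", "").lower()
    if role == "user" and CORRECTION.search(text):
        return "correction"        # negative and positive signal both in the log
    if role == "tool" and (event.get("exit_code", 0) != 0 or "retry" in text):
        return "silent_failure"    # task looked done, but the trace shows trouble
    if role == "user" and event.get("verdict") in ("accept", "reject"):
        return "preference"        # weak alone; aggregated over many events
    return None                    # discoveries need batch analysis, not event rules
```

Narrow rules like these are deliberately dumb: they feed candidates into schema-constrained extraction rather than trying to infer the lesson themselves.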

Reviewed trace-mining systems provide evidence for this meta-learning path, but not a complete solution. Trace-derived learning techniques in related systems shows many systems mining traces into tips, memories, rules, procedures, and skills. Commonplace shows the first-order path: useful memory can be written directly when the agent or maintainer already understands the artifact to create. Both paths still need evaluation, promotion, retirement, and evidence that memory changes future behavior.

cass-memory partially fulfills the trace-to-procedure path: it mines sessions from multiple coding agents into a shared YAML playbook, tracks feedback, and stores source sessions on each rule. REM fulfills a narrower trace-to-fact path by storing episodes and compressing clusters into short semantic memories. The contrast is useful because it separates the extraction problem from the later questions of lifecycle, authority, and behavioral uptake.

Need 5: Serve Multiple Consumers, Not One Retrieval Interface

A memory system has more than one consumer. A human maintainer asks why a decision was made. An acting agent needs constraints before it acts. A context scheduler needs compact metadata and budget rules. A reviewer needs provenance. A learning loop needs candidate observations. Governance needs authority, redaction, retention, and retirement state. These consumers should not be forced through one retrieval interface.

No single surface satisfies all of these needs. Search is useful for question answering. Navigation is useful when the reader must follow articulated relationships. Triggered activation is useful when the agent would not know to ask. Trace replay is useful when a summary is under suspicion. Active work artifacts are useful when the task is not yet finished.

Realistic method families include:

  • Search over traces, observations, source summaries, and durable artifacts for direct questions.
  • Link navigation and indexes for reasoning through curated knowledge rather than isolated snippets.
  • Progressive-disclosure pointers: descriptions, tags, source links, episode summaries, cue titles, and compact evidence records that help the context engine decide what not to load.
  • Retrospective episodes for "what happened when we tried this?" questions, where a bounded effort needs narrative recall.
  • Active workshops (work-in-flight spaces with state and expiration) or work-surface artifacts for current state, unresolved alternatives, task queues, experiments, and discussion threads.
  • Trace excerpts for audit and redistillation when compressed memory is insufficient.

The important distinction is between retrospective memory and active work. A functioning KB needs a workshop layer, not just a library: work in motion needs state, dependencies, expiration, and unresolved alternatives. Retrospective episodes should not replace the active work surface.
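The workshop layer's minimum viable shape (state, dependencies, expiration, and unresolved alternatives) might look like the following sketch; the field names and default expiry are assumptions, not an existing schema.

```python
from dataclasses import dataclass, field

@dataclass
class WorkshopItem:
    """Work in motion: has state and expiration, unlike retrospective memory."""
    title: str
    state: str = "open"                               # open -> resolved / expired
    depends_on: list = field(default_factory=list)
    alternatives: list = field(default_factory=list)  # still unresolved
    expires_after_days: int = 30                      # illustrative default

def sweep(items, age_days):
    """Expire stale items so they stop masquerading as current work."""
    for item in items:
        if item.state == "open" and age_days(item) > item.expires_after_days:
            item.state = "expired"
    return [i for i in items if i.state == "open"]

items = [WorkshopItem("migrate index format"), WorkshopItem("retry cue ranking")]
live = sweep(items, age_days=lambda i: 45 if i.title.startswith("migrate") else 3)
```

The expiration sweep is what separates a workshop from a library: library notes are retired by review, while work-in-flight decays by default.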

Need 6: Activate Behavior-Changing Memory Before The Mistake

The system must not merely answer "what do we know?" It must sometimes answer a question the agent did not ask: "what past lesson applies to the action I am about to take?"

This is the system-definition side of memory. Continual learning's open problem is behaviour, not knowledge: adding retrievable facts is easier than changing future action. A stored correction only matters operationally if it fires before the agent repeats the corrected behavior.

Realistic activation methods form a range:

  • Always-loaded instructions for stable, high-frequency, low-cost constraints.
  • On-reference loading when a document, source, issue, or artifact is explicitly mentioned.
  • On-invoke loading through skills, tools, or workflows that carry their own instructions.
  • On-situation loading through typed cues that match proposed actions, task domains, risk markers, or decision spaces.
  • Checklists, tests, scripts, lint rules, approval gates, or runtime guardrails when the lesson can be moved from prose toward symbolic enforcement.

Typed cue indexes provide the on-situation loading form of this family. A cue can carry a trigger condition, lesson, source pointer, role, consequence weight, and placement target. Matching can use rules, embeddings, action classifiers, or LLM relevance judgments. The choice depends on consequence, false-positive tolerance, and cost.
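A typed cue with rule-based matching might look like this sketch; set-intersection triggers stand in for the embedding, classifier, or LLM-judgment matchers mentioned above, and the budget parameter stands in for context ranking rules.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    trigger: set      # rule-based here; embeddings or an LLM judge are alternatives
    lesson: str
    source: str       # provenance pointer back to the durable artifact
    weight: float     # consequence weight
    placement: str    # where the lesson should land in context

def fire(cues, proposed_action, budget=2):
    """Return the highest-consequence cues that match the proposed action."""
    tokens = set(proposed_action.lower().split())
    hits = [c for c in cues if c.trigger & tokens]
    return sorted(hits, key=lambda c: -c.weight)[:budget]

cues = [
    Cue({"rm", "delete"}, "confirm scope before destructive ops",
        "notes/destructive-ops.md", 0.9, "pre-action"),
    Cue({"deploy"}, "check migration state first",
        "notes/deploy-order.md", 0.7, "pre-action"),
]
fired = fire(cues, "rm -rf build artifacts then deploy")
```

The lesson fires before the action, unasked, which is what distinguishes this from search: the agent never formulated a query.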

The harder requirement is behavioral faithfulness. A cue that fires and enters context has not succeeded unless it changes downstream action in the intended direction. High-priority system-definition material needs evidence that it earns its context budget: WITH/WITHOUT comparisons, perturbation tests, post-action trace audits, or other checks against behavior. Large Language Model Agents are not Always Faithful Self-Evolvers is the cautionary example: written or compressed memories can improve measured behavior without being used in the way their designers assume.

Synapptic is the clearest reviewed system that treats activation as something to test rather than assume. It extracts behavioral guards from Claude Code sessions, runs WITH/WITHOUT ablations with an LLM judge, records per-model verdicts, and excludes guards marked redundant or harmful before compiling them into assistant-facing memory surfaces. The oracle is still soft, but the test is aimed at the right question: whether the remembered rule changes behavior enough to earn its prompt budget.

Need 7: Promote Only When Future Value Exceeds Maintenance Cost

The system needs a promotion path from candidate memory to durable artifacts, but promotion is not automatically good. Durable artifacts create obligations: review, update, invalidation, connection, retirement, and consistency with sources. System-definition promotions add risk because they change behavior.

Candidate observations should therefore remain cheaper and less authoritative than library notes, policies, instructions, tests, or scripts. Promotion is justified when future retrieval or activation value exceeds review and maintenance cost.

Realistic promotion destinations include:

  • Knowledge notes, decision records, source reviews, indexes, and negative-result records for material whose value is explanatory.
  • Procedures, skills, runbooks, checklists, and instructions for recurring work patterns.
  • Tests, validators, linters, scripts, plugins, runtime extensions, and guardrails when the learned rule is deterministic enough for codification (committing procedure to a symbolic medium).
  • Always-loaded policy only when the rule is stable, high-frequency, and cheap enough to spend context on every session.
  • Existing domain work surfaces, such as tickets, product configuration, dashboards, CRM records, reports, or documentation, when those are the actual source of authority.

This promotion path is a constraining (narrowing interpretation space) gradient: prose candidate, curated note, instruction, checklist, test, script, guardrail. Stronger constraints reduce interpretation space and increase reliability, but they also increase brittleness and maintenance cost. Spec mining is codification's operational mechanism gives the practical loop for moving repeated failures or procedures toward executable checks.

Promotion thresholds should depend on signal type and role. One serious correction may deserve a candidate cue immediately. A preference may require several consistent decisions across sessions. A discovery may need later reuse. Always-loaded or enforced system-definition artifacts should require stronger review than knowledge notes.
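Those signal-dependent thresholds can be made concrete as a small decision table; the specific counts and review flags are illustrative, not recommended policy.

```python
# Counts and review flags are illustrative stand-ins for real promotion policy.
PROMOTION_RULES = {
    "correction": {"min_events": 1, "needs_review": False},  # one serious correction
    "preference": {"min_events": 3, "needs_review": False},  # consistent across sessions
    "discovery":  {"min_events": 2, "needs_review": True},   # value proven by later reuse
    "system_definition": {"min_events": 2, "needs_review": True},  # strongest gate
}

def promotion_decision(signal_type, event_count):
    """Decide whether a candidate stays, promotes, or goes to a review queue."""
    rule = PROMOTION_RULES[signal_type]
    if event_count < rule["min_events"]:
        return "stay-candidate"
    return "review-queue" if rule["needs_review"] else "promote"
```

The asymmetry is deliberate: behavior-changing destinations route through review even at the same evidence count, because their failure mode is worse.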

Need 8: Keep Source Of Truth And Compiled Views From Drifting

Memory systems often create derived surfaces: a note produces a cue, a policy produces a checklist, a convention produces a lint rule, a guide produces an AGENTS.md excerpt, or a trace-derived observation produces a generated reminder. These artifacts put knowledge where it can act, but become dangerous when they turn into independent sources of truth.

The system needs source-of-truth rules for every behavior-changing derivative. A library-derived cue should be treated as a compiled view, not as a separate policy. It should carry provenance, source version or hash, generation time, owner, and regeneration rules. If the source changes, the cue must regenerate or be marked stale. Direct edits to compiled cues should either flow back to the source or remain candidate-stage material.
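The compiled-view discipline (provenance, source hash, generation time, and a staleness rule) can be sketched as follows; the dict shape is illustrative.

```python
import hashlib, time

def compile_view(source_path, source_text, render):
    """Emit a derived surface that records exactly where it came from."""
    return {
        "body": render(source_text),
        "source": source_path,
        "source_hash": hashlib.sha256(source_text.encode()).hexdigest(),
        "generated_at": time.time(),
        "status": "current",
    }

def check_staleness(view, current_source_text):
    """If the source changed, the view must regenerate or be marked stale."""
    if hashlib.sha256(current_source_text.encode()).hexdigest() != view["source_hash"]:
        view["status"] = "stale"
    return view["status"]

# A library note compiled into an assistant-facing excerpt.
view = compile_view("notes/deploy-policy.md", "always run migrations first",
                    render=str.upper)
```

The hash is what makes the view a compiled artifact rather than an independent policy: any edit to the source is detectable, and direct edits to the view have nowhere legitimate to live.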

This need applies more strongly to system-definition artifacts than to ordinary summaries because drift can change behavior. Always-loaded context mechanisms in agent harnesses shows that behavior-shaping context can live in many places: prompts, files, tool descriptions, capabilities, configs, skills, and memory. The more surfaces exist, the more explicit the authority model must be.

The same Synapptic design also provides a concrete compiled-view pattern: the YAML profile is the durable state, while Claude memory, Cursor rules, Copilot instructions, AGENTS.md, and other assistant files are render targets with target-specific filtering. That is closer to the right authority model than treating every emitted prompt file as an independent memory.

Need 9: Retire, Redact, Supersede, And Relax Memory

Learning is incomplete without forgetting and revision. Raw traces can contain secrets or obsolete assumptions. Observations can be duplicates, wrong, low-value, or superseded. Cues can grow stale. Policies can become too broad. Tests can fossilize temporary workarounds.

The memory system needs lifecycle fields and maintenance operations at every layer. Append-only capture is useful for provenance, but indexes, extracted observations, and activated policy surfaces must support redaction, decay, supersession, retirement, and relaxation.

Realistic methods include:

  • Retention classes and redaction status on traces before model extraction.
  • Candidate, accepted, superseded, rejected, and retired states on extracted observations.
  • Duplicate clustering and source consolidation for repeated observations.
  • Recency decay tempered by consequence and recurrence, so old high-impact corrections do not disappear merely because they are old.
  • Retirement tests for cues that fire often but do not change behavior or produce too many false positives.
  • Relaxation paths from rigid enforcement back to prose guidance when a codified rule proves brittle.
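The consequence-tempered decay bullet can be made concrete as a scoring rule. The half-life, floor behavior, and retirement threshold below are invented for illustration, not tuned values.

```python
import math

def retention_score(age_days, consequence, recurrences, half_life_days=90.0):
    """Recency decay tempered by consequence and recurrence."""
    recency = math.exp(-age_days * math.log(2) / half_life_days)
    # Consequence acts as a floor: old high-impact corrections survive decay.
    return max(consequence, recency) * (1.0 + math.log1p(recurrences))

def retirement_candidates(items, threshold=0.2):
    """Items below threshold go to retirement review, not silent deletion."""
    return [i for i in items if retention_score(**i["signals"]) < threshold]

old_high_impact = {"id": "cue-12",
                   "signals": {"age_days": 400, "consequence": 0.9, "recurrences": 0}}
old_low_value = {"id": "cue-40",
                 "signals": {"age_days": 400, "consequence": 0.05, "recurrences": 1}}
stale = retirement_candidates([old_high_impact, old_low_value])
```

Both items are equally old; only the low-consequence one falls below the threshold, which is the property the bullet asks for.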

This is the lifecycle side of the same context-efficiency problem. Every stale artifact competes for attention, search rank, review time, or behavioral authority.

Reviewed systems show both partial fulfillment and a common failure mode. cass-memory has candidate/established/proven/deprecated states, harmful-feedback weighting, and decay. REM defines active, contradicted_by, and superseded columns but does not wire them into an actual update path. The distinction is architectural: lifecycle metadata is only memory management if some process reads and acts on it.

Need 10: Make Authority Explicit

The memory system must say who or what is allowed to write, promote, activate, enforce, revise, and retire memory. The comparative review's agency trilemma remains decisive: no option combines high agency, high throughput, and high curation quality without trade-offs. Agent-managed memory has task context but spends reasoning budget. External services scale but guess what matters. Humans curate well but slowly. Learned policies need strong oracles.

Authority should vary by risk:

  • Automatic systems can capture traces and propose low-authority candidates.
  • Extractors can write observations with confidence, source pointers, and candidate status.
  • Context engines can activate low-risk cues under explicit budget and ranking rules.
  • Human or reviewed-agent workflows should approve durable knowledge artifacts when source interpretation matters.
  • High-priority system-definition surfaces, always-loaded instructions, checks, guardrails, and executable policies need the strongest review or behavioral evaluation.
  • Retirement and relaxation should be scheduled work, not accidental decay.

This requirement prevents a memory system from silently rewriting the agent's behavior. A system can choose automation, human review, learned policy, or hybrid authority, but the choice is part of the architecture.

Need 11: Evaluate Memory By Effects, Not By Existence

The system should not count "memory written" as learning. It should evaluate whether memory improved the future task, answer, artifact, or behavior.

Evaluation dimensions include:

  • Direct retrieval: can the system answer the question that motivated storage?
  • Navigability: can an agent or human follow links and provenance to understand why an answer is trustworthy?
  • Activation: does relevant policy fire before the action where it matters?
  • Behavioral uptake: does the fired memory change the downstream plan, tool use, or artifact in the intended direction?
  • Context efficiency: does the memory earn the tokens, latency, and attention it consumes?
  • Lifecycle health: are stale, duplicate, low-value, sensitive, or superseded memories retired or demoted?
  • Promotion economics: do durable artifacts get reused enough to justify their maintenance burden?

These evaluation dimensions are separable. QA-style retrieval tests can pass while activation fails. A cue can fire while behavior remains unchanged. A note can be accurate but too hard to find. A policy can become harmful after the domain changes.
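A WITH/WITHOUT ablation in the Synapptic style can be sketched as a paired comparison. `run_task` and `judge` are stand-ins for a real harness and scorer, and the verdict thresholds are arbitrary.

```python
def with_without_verdict(run_task, task, memory, judge, trials=3):
    """Paired ablation: does the memory change judged outcomes enough to keep it?"""
    deltas = []
    for seed in range(trials):
        scored_with = judge(run_task(task, memory=memory, seed=seed))
        scored_without = judge(run_task(task, memory=None, seed=seed))
        deltas.append(scored_with - scored_without)
    avg = sum(deltas) / len(deltas)
    if avg > 0.05:
        return "keep"               # earns its prompt budget
    return "redundant" if avg >= -0.05 else "harmful"

# Stand-in harness where the memory reliably improves the judged outcome.
verdict = with_without_verdict(
    run_task=lambda task, memory, seed: 1.0 if memory else 0.6,
    task="restore backup",
    memory="confirm scope before destructive ops",
    judge=lambda outcome: outcome,
)
```

The oracle is as soft as the judge, but the pairing is aimed at the right question: whether behavior changed, not whether memory exists.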

A Practical Build Order Follows The Needs

The needs do not imply one integrated architecture. They imply a practical order: build where the signal is strongest, the risk is lowest, and the behavioral test is clearest.

  1. Support direct authoring of notes, instructions, checks, scripts, plugins, and indexes as the first-order memory operation.
  2. Add import paths for existing external knowledge bases, documents, repositories, and source archives.
  3. Capture eligible traces with provenance, redaction, and retention rules so later meta-learning has evidence.
  4. Extract explicit corrections and silent failures as missed memory-creation opportunities before trying to infer discoveries.
  5. Test whether correction-derived cues activate in plausible future situations.
  6. Test whether activated cues actually change behavior.
  7. Add promotion queues for high-confidence candidates whose future value exceeds maintenance cost.
  8. Add work-surface and episode support only where retrospective narrative or active state materially improves future tasks.
  9. Move repeated, high-confidence lessons toward instructions, checks, scripts, plugins, or guardrails only when authority and evaluation are explicit.

Direct authoring can produce useful memory before trace mining exists at all. Trace-derived extraction is a meta-learning layer over that base: it looks for memory that should have been created but was not. The failure tests are sequential: can correction extraction produce useful candidates, can those candidates activate, and do activated cues change behavior? If not, more trace processing only makes the system more elaborate. The memory system should grow from validated needs, not from an attractive taxonomy.

What Remains Open

The hardest open problem is structural pattern detection across sessions. Many important lessons do not share keywords: stale pricing tables, outdated runbooks, and deprecated templates can all be instances of "derived artifacts drift from sources of truth." Recognizing that causal structure requires deeper analysis than ordinary search.

Discovery extraction also remains weak. Corrections and failures have visible signals; discoveries often have only surprise, elaboration, or later reuse. The realistic stance is to surface discovery candidates, not automatically graduate them.

The boundary between learned memory and work-surface authority is domain-dependent. Software projects have tests, linters, code review, issue trackers, and deployment gates. Other domains may lack those surfaces or have different authorities. A memory design must first name the domain's observable traces, recurring tasks, durable work surfaces, evaluable outcomes, and authority to modify behavior.

Finally, learned memory-management policy is attractive but oracle-dependent. Where the domain has clear success metrics, a learned policy may outperform inspectable heuristics. In open-ended knowledge work, reviewable rules, provenance, and behavioral tests remain the safer default. Memory management sits on the same bitter-lesson boundary: relax into learned policy where feedback is good, and keep artifact-side control where feedback is weak.


Relevant Notes: