Hindsight
Type: ../types/agent-memory-system-review.md · Status: current · Tags: trace-derived
Hindsight, from Vectorize, is a production agent memory backend for retaining agent/user traces, extracting structured facts, retrieving across semantic, lexical, graph, and temporal paths, and reflecting over raw facts, consolidated observations, mental models, and directives. The inspected repository is not just an SDK wrapper: it contains the FastAPI service, MCP tools, PostgreSQL/Oracle database layer, background worker, generated clients, OpenClaw and framework integrations, local embedded packages, documentation, benchmarks, and monitoring assets.
Repository: https://github.com/vectorize-io/hindsight
Reviewed commit: 9784f6573a5bcba6ac6fd9dfb70929e5318857ce
Last checked: 2026-05-16
Core Ideas
Retain turns submitted traces into extracted memory units, not opaque transcript rows. retain_async and retain_batch_async accept text content plus context, event date, document id, metadata, entities, tags, observation scopes, update mode, and optional fact-type override, then route into a retain orchestrator that chunks, extracts facts, embeds them, resolves entities, builds links, stores chunks/documents, and writes fact rows in batches (memory_engine.py, orchestrator.py, types.py). Raw submitted text remains as documents and chunks; derived facts become memory_units with fact_type values of world, experience, or observation (models.py).
The fact taxonomy separates source claims from consolidated observations. Hindsight's durable fact rows distinguish world facts, bank experiences/actions, and observations. World and experience facts are extracted from retained content; observations are later produced by consolidation and stored in the same memory_units table with proof_count, source_memory_ids, and history columns added by the "new knowledge architecture" migration (models.py, new_knowledge_architecture.py). That makes observations derived knowledge artifacts with explicit source-memory lineage, not a separate handwritten note layer.
Storage is database-first but multi-surface. The canonical runtime substrate is PostgreSQL with pgvector HNSW indexes, JSONB metadata, arrays for tags/source ids, entity/link tables, async-operation rows, audit logs, webhook tables, and optional file storage in PostgreSQL bytea, S3, GCS, or Azure (models.py, retrieval.py, postgresql.py, audit.py). The repo also supports embedded PostgreSQL through pg0, external PostgreSQL, Oracle AI Database, generated SDK clients, HTTP, MCP, and local background servers (pg0.py, server.py, README.md).
Recall is a four-way retrieval pipeline over fact types. recall_async runs N-by-four retrieval: for each requested fact type, Hindsight runs semantic vector search, BM25/full-text search, graph retrieval, and temporal retrieval, then merges with reciprocal rank fusion, reranks, diversifies, and token-filters results (memory_engine.py, retrieval.py, fusion.py, graph_retrieval.py). Tags and tag groups are real retrieval controls, not only labels: they filter retain, recall, observation search, mental-model refresh, and directive injection.
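The fusion step can be illustrated with a minimal reciprocal rank fusion sketch. This is not Hindsight's code from fusion.py: the function name, the `k=60` constant (a common default in the RRF literature), and the sample ids are assumptions chosen for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked id lists into one ranking.

    Each list contributes 1 / (k + rank) per item. k=60 is an assumed
    default, not a value confirmed from the reviewed repository.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical per-path results for one fact type.
semantic = ["m1", "m2", "m3"]
bm25 = ["m2", "m1", "m4"]
graph = ["m2", "m5"]
temporal = ["m3", "m2"]

fused = reciprocal_rank_fusion([semantic, bm25, graph, temporal])
print(fused)
```

An item that appears in all four paths (here `m2`) outranks items that are first in only one path, which is the property that makes rank fusion attractive when the paths produce incomparable raw scores.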
Consolidation creates observations as a background learning layer. The consolidation engine consumes unconsolidated world/experience facts, asks an LLM for create/update/delete actions, stores observations as memory_units(fact_type='observation'), records source ids, aggregates time fields and tags, and marks source memories as consolidated (consolidator.py, prompts.py). This is the main trace-derived distillation step: raw retained traces produce facts, and repeated/new facts produce generalized observations with proof counts.
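A rough sketch of how create/update/delete actions might be applied to observation rows is shown below. The action dictionary shape, the `Observation` class, and the rule that `proof_count` equals the number of distinct source ids are all assumptions for illustration; the review did not confirm how consolidator.py computes these fields.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    # Field names mirror the columns named in the review; the class is hypothetical.
    id: str
    text: str
    source_memory_ids: list = field(default_factory=list)
    proof_count: int = 0
    history: list = field(default_factory=list)


def apply_actions(observations, actions):
    """Apply LLM-proposed actions to a dict of observations keyed by id."""
    for action in actions:
        kind = action["action"]
        if kind == "create":
            sources = list(dict.fromkeys(action["source_ids"]))  # de-duplicate, keep order
            observations[action["id"]] = Observation(
                action["id"], action["text"], sources, len(sources)
            )
        elif kind == "update":
            obs = observations[action["id"]]
            obs.history.append(obs.text)  # keep the previous wording
            obs.text = action["text"]
            for source_id in action["source_ids"]:
                if source_id not in obs.source_memory_ids:
                    obs.source_memory_ids.append(source_id)
            # Assumption: proof count tracks distinct supporting facts.
            obs.proof_count = len(obs.source_memory_ids)
        elif kind == "delete":
            observations.pop(action["id"], None)
        else:
            raise ValueError(f"unknown action: {kind}")
    return observations


observations = apply_actions({}, [
    {"action": "create", "id": "o1", "text": "User prefers tea", "source_ids": ["f1"]},
    {"action": "update", "id": "o1", "text": "User prefers green tea",
     "source_ids": ["f1", "f2"]},
])
print(observations["o1"])
```

The point of the sketch is the lineage bookkeeping: every mutation either extends the source list or records the prior text, so an observation never loses the trail back to the facts that justified it.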
Reflect is an agent/tool loop with authority gradients. reflect_async does not simply stuff all memory into a prompt. It builds a reflection agent that can search mental models, search observations, recall raw facts, and expand chunks/documents; it loads bank profile and active directives, computes budget-dependent iteration limits, and returns text plus tool/LLM traces and used memory ids (memory_engine.py, agent.py, tools.py, models.py). The prompt explicitly treats mental models as highest-quality summaries, observations as consolidated but possibly stale, and raw facts as ground truth fallback (prompts.py).
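The authority gradient can be pictured as a tiered lookup that falls through to lower tiers when a higher tier has no fresh answer. The sketch below is a deliberate simplification: Hindsight's reflect agent is an LLM tool loop, whereas this uses substring matching, and the `stale` flag, the store layout, and the function name are hypothetical.

```python
# Highest-authority tier first, raw facts as the ground-truth fallback.
AUTHORITY_ORDER = ["mental_model", "observation", "raw_fact"]


def tiered_lookup(query, stores):
    """Return the first tier's fresh hits plus a trace of every search made."""
    trace = []
    needle = query.lower()
    for tier in AUTHORITY_ORDER:
        hits = [m for m in stores.get(tier, []) if needle in m["text"].lower()]
        trace.append({"tool": "search_" + tier, "hit_ids": [m["id"] for m in hits]})
        # Hypothetical staleness flag: stale summaries do not end the search.
        fresh = [m for m in hits if not m.get("stale", False)]
        if fresh:
            return fresh, trace
    return [], trace


stores = {
    "mental_model": [],
    "observation": [{"id": "o1", "text": "Alice prefers email", "stale": True}],
    "raw_fact": [{"id": "f1", "text": "Alice asked for Slack messages instead of email"}],
}

results, trace = tiered_lookup("email", stores)
print([m["id"] for m in results])
print(trace)
```

Returning the trace alongside the answer mirrors the review's observation that reflect exposes which tools ran and which memory ids were used.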
Directives are stored rules with prompt-level force. Hindsight has a dedicated directives table and CRUD methods; reflect_async loads active directives scoped by tags, injects them into the system prompt as mandatory rules, and reminds the agent that responses should be rejected if they violate them (memory_engine.py, prompts.py, new_knowledge_architecture.py). These are system-definition artifacts: they instruct and constrain future reflect outputs.
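Tag-scoped directive injection might look roughly like the following. The matching rule (untagged directives apply everywhere, tagged directives apply when any tag overlaps with the request), the prompt wording, and the dictionary shape are assumptions; the actual scoping semantics in memory_engine.py were not reproduced here.

```python
def build_system_prompt(base_prompt, directives, request_tags):
    """Append active, in-scope directives to a system prompt as numbered rules."""
    tags = set(request_tags)
    active = [
        d for d in directives
        # Assumed scoping: no tags means global; otherwise any overlap matches.
        if d["active"] and (not d["tags"] or tags & set(d["tags"]))
    ]
    if not active:
        return base_prompt
    rules = "\n".join(f"{i}. {d['text']}" for i, d in enumerate(active, start=1))
    return (
        f"{base_prompt}\n\n"
        "MANDATORY DIRECTIVES (a response that violates these must be rejected):\n"
        f"{rules}"
    )


directives = [
    {"text": "Never reveal internal ticket ids.", "tags": ["support"], "active": True},
    {"text": "Answer in formal English.", "tags": [], "active": True},
    {"text": "Use metric units.", "tags": ["engineering"], "active": True},
    {"text": "Old rule.", "tags": ["support"], "active": False},
]

prompt = build_system_prompt("You are a reflection agent.", directives, ["support"])
print(prompt)
```

Keeping directives in their own section, rather than mixing them into retrieved context, is what gives them instruction-level rather than evidence-level authority.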
The production surfaces are broad. The FastAPI app exposes bank, retain, file-retain, recall, reflect, memory, document, entity, operation, directive, mental-model, tag, webhook, and monitoring endpoints; the shared MCP layer registers retain, sync retain, recall, reflect, bank operations, mental-model operations, directives, document/memory reads, operation controls, and cleanup tools (http.py, mcp_tools.py). The OpenClaw plugin starts an embedded Hindsight server or uses an external API, derives banks from agent context, retains async, recalls before prompt construction, and applies per-bank missions (index.ts).
Comparison with Our System
| Dimension | Hindsight | Commonplace |
|---|---|---|
| Primary substrate | PostgreSQL/Oracle tables with pgvector/full-text/entity/link indexes, async operations, audit logs, and object/file storage | Git-tracked Markdown notes, type specs, schemas, generated indexes, review artifacts, and command-line validators |
| Raw source layer | documents and chunks retain submitted text and source chunks | Source snapshots, logs, work artifacts, and note citations |
| Derived knowledge | Extracted world/experience facts, consolidated observation facts, mental models | Typed notes, reviews, indexes, instructions, reports |
| Strong authority artifacts | Directives, tool schemas, bank config, operation validators, tenant filters, prompt missions | Instructions, skills, type specs, validation rules, review gates |
| Retrieval | Semantic, BM25, graph, temporal, RRF, reranking, MMR, tag filters | rg, descriptions, indexes, authored links, skills, generated reports |
| Learning loop | Trace-to-fact extraction, background observation consolidation, refreshable mental models, directive injection | Human/agent review, workshop-to-library promotion, validation, semantic review |
| Production controls | Multi-tenant schemas, API keys, operation validators, async workers, audit logs, webhooks, metrics | Git history, deterministic validation, explicit collection contracts, local CLI workflows |
Hindsight is much more production-backend-shaped than commonplace. It treats memory as an online service with tenants, API keys, async workers, webhooks, generated clients, MCP, monitoring, and embeddable runtime packages. Commonplace treats memory as a repository knowledge system where durable artifacts are inspectable Markdown and authority is carried by file type, collection convention, validation, and review.
The strongest conceptual overlap is the separation between source traces, distilled knowledge artifacts, and behavior-shaping system-definition artifacts. Hindsight's raw submitted text and chunks are evidence. Extracted world/experience facts are knowledge artifacts when retrieved as context. Observations are derived knowledge artifacts with stronger summarizing authority but still traceable through source_memory_ids. Directives, operation validators, tenant filters, bank config, prompt missions, and MCP/HTTP schemas are system-definition artifacts because they configure, route, filter, validate, instruct, or enforce future behavior.
The main divergence is inspectability and governance. Hindsight has good operational lineage inside the database, especially for observations and async operations, but most retained state is opaque to git-native review. Commonplace is weaker as a live service but stronger as an artifact governance system: claims, instructions, and reviews can be read, linked, validated, diffed, replaced, and discussed as files.
Hindsight also demonstrates a real hierarchy that commonplace mostly describes in methodology: raw facts, observations, mental models, and directives are distinct runtime surfaces with different activation timing and authority. That is valuable evidence for the claim that "memory" must be decomposed by storage substrate, representational form, lineage, and behavioral authority.
Read-back: both. Agents can recall or reflect through memory tools, while OpenClaw recall and active directives can be inserted during prompt construction.
Borrowable Ideas
Observation as a first-class derived fact type. Commonplace could borrow the explicit distinction between raw retained facts and consolidated observations for workshop logs or review outputs. The borrowable part is not the database schema; it is the lifecycle label: an observation should carry source ids, proof count or support count, tags/scope, and history. Ready as vocabulary; implementation needs a concrete trace-derived workflow.
Directives as a separate authority lane. Hindsight does not make every memory equally instructive. Directives are stored and injected through a mandatory prompt section. Commonplace already separates instructions from notes, but Hindsight's tag-scoped directive injection is a useful runtime analogue if we build task-specific context packs.
Treat retrieval traces as product observability. Reflect returns tool traces, LLM traces, used memory ids, used observation ids, used mental-model ids, and applied directives. Commonplace semantic review has evidence artifacts, but ordinary navigation/retrieval still leaves little structured trace. A lightweight trace of "what evidence affected this answer" would improve reviewability.
Expose write tools only behind validation hooks. Hindsight's operation validator extension can reject or enrich retain, recall, reflect, consolidation, and bank writes before execution, and post-operation hooks receive results and usage signals (operation_validator.py). If commonplace exposes more agent-write surfaces, validation hooks should come before convenience.
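For commonplace, a validation-first write surface could be sketched as below. This is not the interface of Hindsight's operation_validator.py: `run_operation`, `require_tags`, `stamp_source`, and `RejectedOperation` are hypothetical names, and the sketch only shows the ordering of validate, execute, and post-hook.

```python
class RejectedOperation(Exception):
    """Raised by a validator to block an operation before it executes."""


def require_tags(op, payload):
    # Example rejecting validator: untagged retains are refused.
    if op == "retain" and not payload.get("tags"):
        raise RejectedOperation("retain requires at least one tag")
    return payload


def stamp_source(op, payload):
    # Example enriching validator: fill in a default source label.
    return {**payload, "source": payload.get("source", "agent")}


def run_operation(op, payload, handler, validators, post_hooks=()):
    """Run validators (reject or enrich), then the handler, then post-hooks."""
    for validate in validators:
        payload = validate(op, payload)
    result = handler(payload)
    for hook in post_hooks:
        hook(op, payload, result)
    return result


store, audit = [], []


def save(payload):
    store.append(payload)
    return len(store)


def record(op, payload, result):
    audit.append((op, result))


count = run_operation(
    "retain", {"text": "note", "tags": ["demo"]}, save,
    [require_tags, stamp_source], [record],
)
print(count, store, audit)
```

Because validators run before the handler, a rejected write never touches the store, and the post-hook only ever sees operations that actually executed.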
Use async operation rows as a user-facing lifecycle surface. Hindsight makes long retain/file/consolidation/webhook work visible through operation ids, status, retry, cancel, worker polling, and child operations. Commonplace commands are mostly synchronous; if future review or ingestion tasks become long-running, operation rows or file-backed equivalents would be cleaner than ad hoc logs.
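A file-backed or in-memory equivalent of an operation row could be as small as the state machine below. The status names and allowed transitions are assumptions for illustration, not Hindsight's actual operation states, and the `Operation` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed lifecycle; the reviewed repo may use different status names.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "failed": {"pending"},  # retry puts the operation back in the queue
    "completed": set(),
    "cancelled": set(),
}


@dataclass
class Operation:
    id: str
    kind: str
    status: str = "pending"
    parent_id: Optional[str] = None  # set for child operations
    history: list = field(default_factory=list)

    def transition(self, new_status):
        """Move to new_status, refusing transitions the lifecycle does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"{self.status} -> {new_status} is not allowed")
        self.history.append(self.status)
        self.status = new_status


op = Operation(id="op-1", kind="retain")
for status in ("running", "failed", "pending", "running", "completed"):
    op.transition(status)
print(op.status, op.history)
```

Recording the status history on the row itself is what turns the operation into a user-facing lifecycle surface rather than a log line.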
Keep embedded/local mode real. Hindsight's hindsight-all package can start an API server in a background thread on a free localhost port with embedded PostgreSQL, letting integrations use the same HTTP client path as deployments. Commonplace should keep this pattern in mind for any future service layer: local-first adoption should not require a hosted control plane.
Trace-derived learning placement
Trace source. Hindsight qualifies as trace-derived learning. The source trace is user/agent/tool content submitted through retain, file-retain, MCP retain, framework integrations, LLM wrappers, and plugins. Retain inputs can carry context, timestamps, document ids, metadata, entities, tags, and observation scopes, so a deployment can ingest chat turns, tool traces, documents, and operational events as scoped evidence (mcp_tools.py, index.ts, types.py).
Extraction. Extraction is multi-stage. Retain splits/chunks submitted text, asks a retain LLM to extract facts, validates/coerces fact records, generates embeddings, resolves entities, and creates temporal, semantic, entity, and causal links. Consolidation then consumes unconsolidated world/experience facts and asks a consolidation LLM to create, update, or delete observation facts. Mental-model refresh can rerun a stored source query through reflect and overwrite or delta-edit a pinned synthesis (fact_extraction.py, orchestrator.py, consolidator.py, memory_engine.py).
Storage substrate. Raw retained text persists in documents, chunks, and optional file/object storage. Extracted facts persist in memory_units; entities, co-occurrences, and links persist in graph-adjacent tables. Observations persist as memory_units with fact_type='observation', source ids, proof count, timestamps, tags, and history. Mental models and directives persist in their own tables. Operations, audit logs, webhooks, and webhook deliveries persist as service-control tables.
Representational form. Raw traces and extracted facts are prose wrapped in symbolic database rows. Entities, tags, timestamps, fact types, document ids, source ids, validation status, and operation states are symbolic. Embeddings and vector indexes are distributed-parametric retrieval state. The operative behavior-shaping form is mixed: prose facts and observations become active through symbolic filters, graph traversal, vector search, reranking, and prompt/tool schemas.
Lineage. Lineage is strongest from observations back to source facts: observations store source_memory_ids and proof counts, and reflect can include source facts for observation results. Raw chunks preserve source text under document/chunk ids. Lineage is weaker for LLM extraction decisions themselves: a fact row carries context, metadata, entities, tags, and time fields, but not a complete extraction prompt/version/judge record for every fact. Async operations and audit logs improve operational lineage, not semantic provenance.
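The observation-to-source path described above can be sketched as a simple walk over rows. The row dictionaries, the `chunk_id` field on fact rows, and the function name are hypothetical; only `fact_type` and `source_memory_ids` come from the review.

```python
def trace_lineage(observation_id, memory_units, chunks):
    """Resolve an observation to its source facts and their originating chunks."""
    obs = memory_units[observation_id]
    if obs["fact_type"] != "observation":
        raise ValueError(f"{observation_id} is not an observation")
    lineage = []
    for source_id in obs["source_memory_ids"]:
        fact = memory_units.get(source_id)
        if fact is None:
            # A dangling source id is reported rather than silently dropped.
            lineage.append({"fact_id": source_id, "fact": None, "chunk": None})
            continue
        # Assumption: each fact row points at the chunk it was extracted from.
        chunk = chunks.get(fact.get("chunk_id"))
        lineage.append({"fact_id": source_id, "fact": fact["text"], "chunk": chunk})
    return lineage


memory_units = {
    "o1": {"fact_type": "observation", "text": "User prefers green tea",
           "source_memory_ids": ["f1", "f2", "f9"]},
    "f1": {"fact_type": "world", "text": "User ordered green tea", "chunk_id": "c1"},
    "f2": {"fact_type": "experience", "text": "Agent recommended sencha", "chunk_id": "c2"},
}
chunks = {"c1": "I'll have the green tea, thanks.", "c2": "You might enjoy sencha."}

lineage = trace_lineage("o1", memory_units, chunks)
for step in lineage:
    print(step)
```

What the walk cannot recover is exactly the gap the paragraph names: nothing in these rows says which prompt, model, or judgment produced each fact from its chunk.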
Behavioral authority. Raw chunks, documents, world facts, experience facts, entity observations, and observations are knowledge artifacts when consumed as evidence, context, explanation, or advice. Embeddings, indexes, tag filters, rank fusion, rerankers, budgets, operation validators, tenant filters, and worker status are system-definition artifacts because they route, rank, validate, isolate, and schedule behavior. Directives are the clearest system-definition artifacts: they are injected into reflect prompts as mandatory rules. Mental models sit between the two: they are knowledge artifacts when searched as curated summaries, but their stored source query and refresh trigger have system-definition force over regeneration.
Scope. The default scope is per bank, with optional tenant schema isolation and tag/tag-group subscopes. Integrations derive bank ids from application context, so Hindsight can operate per user, per agent, per session, per project, or per tenant depending on deployment policy.
Timing. Retain can run synchronously or as async operations. Consolidation runs as background work after retain, through broker/worker task processing. Reflect is online at answer time. Mental-model refresh is explicit or trigger-driven. Webhooks and audit logging add asynchronous side effects around the memory lifecycle.
Survey placement. Hindsight belongs in the trace-to-fact plus trace-to-observation family, with a separate directive authority lane. It strengthens the survey claim that trace-derived systems need a split between raw evidence and distilled artifacts: Hindsight keeps source chunks/facts, derives observations, lets humans or systems create mental models/directives, and uses retrieval/tool traces to expose what shaped a reflect response.
Curiosity Pass
The "biomimetic" framing is less important than the authority layering. The durable design contribution is not that memories resemble human memory; it is that Hindsight separates raw facts, observations, mental models, directives, retrieval controls, and operational controls.
Observations are the best-grounded distilled layer. They carry source ids and proof counts, and they remain searchable through the same fact retrieval machinery. Mental models may be more readable, but the source-query refresh path depends on reflect behavior and LLM synthesis quality.
Directives are powerful enough to need governance. The code treats them as mandatory prompt rules. That makes them useful for steering agents, but also dangerous if created through weakly governed write paths. Tenant auth, operation validators, tags, and audit logs are therefore not peripheral enterprise features; they are part of the memory safety model.
The database substrate buys speed and service ergonomics at the cost of ordinary review. Hindsight can do live recall, background jobs, tenant isolation, metrics, and webhooks in ways a file-only KB cannot. But the retained artifacts are harder to diff, review, merge, or reason about without purpose-built UI/API surfaces.
The benchmark claims should be treated separately from this implementation review. The repo includes benchmark tooling and the README reports LongMemEval strength, but this review inspected the memory backend architecture rather than reproducing benchmark results or judging the paper methodology.
What to Watch
- Whether extracted fact rows gain richer semantic provenance: extraction prompt version, model, confidence, source chunk offsets, and review state.
- Whether directives get stronger promotion/review workflows, since they carry instruction-level authority inside reflect.
- Whether mental models converge with observations or stay a separate user-curated/refreshed synthesis layer.
- Whether local embedded mode remains operationally equivalent to server mode as multi-tenant, worker, webhook, and storage features expand.
- Whether OpenClaw, Claude Code, Codex, LiteLLM, and framework integrations preserve the same bank/tag/authority semantics or drift into connector-specific behavior.
- Whether Hindsight's tool and LLM traces become usable evaluation artifacts for downstream answer quality rather than only observability/debugging data.
Bottom Line
Hindsight is one of the strongest production memory backends in this review set: it has a real trace-to-fact retain pipeline, multi-mode retrieval, derived observations, reflect-time tool use, prompt-level directives, tenant isolation, async operations, audit logs, webhooks, MCP, integrations, and local embedded deployment. Commonplace should not copy its database substrate wholesale, but should take seriously its artifact split: raw traces/sources, extracted facts, consolidated observations, curated mental models, mandatory directives, and operational controls are different retained artifacts with different lineage and behavioral authority.
Relevant Notes:
- Trace-derived learning techniques in related systems - extends: Hindsight is a production trace-to-fact and trace-to-observation system with directive authority as a separate lane.
- Axes of artifact analysis - exemplifies: Hindsight's raw chunks, facts, observations, mental models, directives, embeddings, indexes, and operations need separate substrate, form, lineage, and authority labels.
- Knowledge artifact - defined-in: raw facts, chunks, observations, and mental models usually advise or evidence later responses.
- System-definition artifact - defined-in: directives, validators, tenant filters, schemas, indexes, and tool surfaces instruct, route, validate, rank, or configure behavior.
- Use trace-derived extraction - exemplifies: retain and consolidation extract durable memory artifacts from traces rather than storing transcripts only.
- Knowledge storage does not imply contextual activation - contrasts: Hindsight tightly couples storage to recall and reflect-time activation.