Hindsight

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

Hindsight (vectorize-io/hindsight) is Vectorize.io's open-source agent memory service. It ingests conversation and document content, extracts typed facts with an LLM, stores them alongside entity and causal links in PostgreSQL with pgvector, and retrieves them through four parallel strategies fused by reciprocal rank fusion and cross-encoder reranking. The system is a Python/FastAPI monorepo (hindsight-api-slim) with a Next.js control plane, Rust CLI, auto-generated Python/TypeScript/Rust clients, and a growing set of framework integrations (LiteLLM, LangGraph, CrewAI, Pydantic AI, AG2, Claude Code, Codex, OpenClaw, Paperclip, and more). Versions 0.4.x through 0.5.0 have shipped since the last review, along with an embedded mode (hindsight-embed) that runs the full engine as a self-managed local daemon with no external Postgres.

Core Ideas

Three fact types shape the physical store, not just labels. VALID_RECALL_FACT_TYPES is now {"world", "experience", "observation"}. The opinion type was removed in Alembic migration g2h3i4j5k6l7_remove_opinion_fact_type and backfill-deleted in i4d5e6f7g8h9_delete_opinions. Each type has a partial HNSW embedding index plus per-bank variants (a4b5c6d7e8f9_fix_per_bank_vector_index_type), so retrieval plans use a type-scoped vector index rather than a full sequential scan. The taxonomy is simpler than the previous four-way split but more aggressively embedded in the query planner — the retrieve_semantic_bm25_combined function builds a UNION ALL with one arm per fact_type to force the planner onto per-type indexes.
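The planner-steering trick can be illustrated with a small query builder. This is a sketch only — the table name `memory_units`, column names, and the parameter style are assumptions, not Hindsight's actual schema or the real `retrieve_semantic_bm25_combined` SQL:

```python
# Illustrative only: a UNION ALL with one arm per fact type keeps the
# planner on each type's partial vector index instead of a full scan.
FACT_TYPES = ("world", "experience", "observation")

def per_type_semantic_sql(limit_per_type: int = 50) -> str:
    arms = []
    for ft in FACT_TYPES:
        arms.append(
            "(SELECT id, fact_type, embedding <=> %(query_vec)s AS dist\n"
            "   FROM memory_units\n"
            f"  WHERE fact_type = '{ft}'\n"  # matches a partial-index predicate
            "  ORDER BY embedding <=> %(query_vec)s\n"
            f"  LIMIT {limit_per_type})"
        )
    return "\nUNION ALL\n".join(arms)

sql = per_type_semantic_sql()
```

The key point is that each arm's `WHERE fact_type = …` predicate matches a partial-index predicate exactly, which is what makes the per-type index usable.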

Retain is a multi-phase LLM pipeline bounded by budgets. retain/orchestrator.py orchestrates fact extraction (fact_extraction.py, ~2300 lines), entity resolution against canonical entities, embedding generation, chunk storage, and link creation (entity, semantic kNN at top-5 similarity ≥ 0.7, and causal links causes/caused_by/enables/prevents). Extraction has three schema variants — standard ExtractedFact, ExtractedFactVerbose (forces every field to be detailed), and VerbatimExtractedFact (uses the original chunk text as fact text and only extracts metadata) — plus a no-causal variant. Fact type is derived from the LLM's world/assistant classification, with assistant mapped to experience. The system is still write-heavy — every retain triggers extraction, entity matching (trigram + SequenceMatcher), and post-commit async consolidation.

Graph retrieval is now link expansion, not spreading activation. MPFP with meta-path traversal is gone. The new default is LinkExpansionRetriever (search/link_expansion_retrieval.py, ~550 lines), which expands from semantic seeds through three signals in a single CTE: entity co-occurrence via unit_entities with LATERAL caps at graph_per_entity_limit (default 200), precomputed semantic kNN edges (each fact linked to its top-5 most similar facts at retain time), and explicit causal chains scored at weight + 1.0. A graph_expansion_timeout drops entity expansion entirely when it blows the budget. The move trades the breadth of meta-path traversal for a simpler, bounded, and vastly cheaper query — most of the graph work now happens at retain time.
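A minimal in-memory sketch of the three-signal expansion. The real scoring lives in SQL, and the 0.5 co-occurrence weight is invented for illustration — only the per-entity cap, the retain-time kNN edges, and the weight + 1.0 causal score come from the source:

```python
from collections import defaultdict

def expand_links(seeds, entity_edges, knn_edges, causal_edges,
                 per_entity_limit=200):
    """Expand semantic seed facts through three precomputed signals.
    entity_edges: fact -> facts sharing an entity (capped per entity)
    knn_edges:    fact -> [(fact, similarity)] from retain-time top-5 kNN
    causal_edges: fact -> [(fact, weight)]; scored at weight + 1.0
    """
    scores = defaultdict(float)
    for seed in seeds:
        for nbr in entity_edges.get(seed, [])[:per_entity_limit]:
            scores[nbr] += 0.5  # assumed co-occurrence weight, for illustration
        for nbr, sim in knn_edges.get(seed, []):
            scores[nbr] += sim
        for nbr, weight in causal_edges.get(seed, []):
            scores[nbr] += weight + 1.0  # causal chains outrank similarity
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

The +1.0 offset guarantees any causal neighbour outscores any similarity-only neighbour, which matches the source's description of causal chains as the strongest signal.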

Recall is genuinely four-way and per-fact-type parallel. retrieve_all_fact_types_parallel runs semantic + BM25 together in one connection (a single SQL statement), runs temporal spreading in the same connection when a temporal constraint is extracted from the query, and fans out graph retrieval per fact_type on its own asyncio tasks. Temporal spreading is a deliberate BFS across memory_links with link_type IN ('temporal','causes','caused_by','enables','prevents'), capped at per_source_limit=10 and max_iterations=5 — explicitly bounded, not unbounded spreading activation. Results are fused with RRF (k=60) and cross-encoder reranked (local sentence-transformers or TEI).
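Both bounded mechanisms are small enough to sketch directly. The RRF formula and the cap parameters come from the source; the data shapes are illustrative:

```python
from collections import deque

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each retriever contributes 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def bounded_bfs(start, links, per_source_limit=10, max_iterations=5):
    """BFS over links, explicitly bounded -- not spreading activation."""
    visited, frontier = {start}, deque([start])
    for _ in range(max_iterations):
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for nbr in links.get(node, [])[:per_source_limit]:
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
        if not frontier:
            break
    return visited
```

With k=60, a fact appearing mid-list in two retrievers reliably outranks a fact appearing near the top of only one — RRF rewards agreement over any single retriever's confidence.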

Observations are evidence-grounded, auto-consolidated, and maintained by CRUD operations. consolidation/consolidator.py watches newly retained facts and runs an LLM prompt that emits creates, updates, and deletes against existing observations as explicit actions — the closest the system gets to CLAW/ExpeL-style mutation verbs. The observation data model (reflect/observations.py) requires ObservationEvidence with exact quotes and memory timestamps. Trends (STABLE, STRENGTHENING, WEAKENING, NEW, STALE) are computed from evidence timestamps. Observations live in memory_units with fact_type='observation', so they participate in recall alongside world/experience facts.
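Trend classification from evidence timestamps might look like the following. The five labels come from the source; the 30-day window and the counting rules are assumptions, not Hindsight's actual thresholds:

```python
from datetime import datetime, timedelta

def classify_trend(evidence_times: list[datetime], now: datetime,
                   window: timedelta = timedelta(days=30)) -> str:
    """Classify an observation's trend from its evidence timestamps.
    Window and rules are illustrative assumptions."""
    if not evidence_times:
        return "STALE"
    recent = [t for t in evidence_times if now - t <= window]
    older = [t for t in evidence_times if now - t > window]
    if not older:
        return "NEW"            # all evidence is fresh
    if not recent:
        return "STALE"          # no fresh support at all
    if len(recent) > len(older):
        return "STRENGTHENING"  # evidence accumulating
    if len(recent) < len(older):
        return "WEAKENING"      # evidence tailing off
    return "STABLE"
```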

Reflect is an agentic tool-calling loop with hierarchical retrieval and directive-aware termination. reflect/agent.py runs up to 10 iterations with four retrieval tools — search_mental_models, search_observations, recall, expand — and a done terminator. The hierarchy is still curated → consolidated → raw. New: Directives (engine/directives/) are hard rules injected into prompts (e.g., "Always respond in formal English"). When directives apply, the done tool schema is rewritten to require a directive_compliance field listing how the answer satisfies each rule, and the answer is rejected if it doesn't comply. For reflect output behavior, dispositions and directives form a two-tier policy layer: soft tuning plus hard compliance gates.
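The done-tool rewrite can be sketched as a JSON-schema transform. Only the directive_compliance field name comes from the source; the base schema layout and field details are assumptions:

```python
import copy

# Hypothetical base schema for the `done` terminator tool.
BASE_DONE_SCHEMA = {
    "name": "done",
    "parameters": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}

def with_directive_gate(schema: dict, directives: list[str]) -> dict:
    """When directives apply, require a per-rule compliance statement."""
    if not directives:
        return schema
    gated = copy.deepcopy(schema)
    gated["parameters"]["properties"]["directive_compliance"] = {
        "type": "array",
        "minItems": len(directives),
        "items": {"type": "string"},
        "description": "For each directive, state how the answer complies: "
                       + "; ".join(directives),
    }
    gated["parameters"]["required"].append("directive_compliance")
    return gated
```

Putting the gate in the tool schema (rather than the prompt) means the model structurally cannot terminate without addressing each rule — the prompt injection is soft, the schema requirement is hard.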

Pluggable extension system for auth, tenancy, and operation validation. extensions/ ships TenantExtension (Supabase, API key), HttpExtension (custom middleware), MCPExtension (custom MCP tools), and OperationValidatorExtension (pre-operation validation hooks for retain/recall/reflect/consolidate/mental-model operations). Operators load extensions via env vars like HINDSIGHT_API_TENANT_EXTENSION=mypackage.tenant:MyClass. This is the architectural gateway Hindsight opens for enterprise deployments without forking the core.

Webhooks, durable async operations, and audit logging. webhooks/manager.py delivers event notifications (retain completed, consolidation done, reflect finished) with retry schedule [5s, 5m, 30m, 2h, 5h] and HMAC signing. worker/poller.py and async_operations tables turn every retain/reflect/consolidate into a resumable background operation with metadata types in operation_metadata.py. Multi-tenant schemas (target_schema context variable) support per-tenant row isolation. engine/audit.py provides an AuditLogger with contextvar-scoped capture. These are infrastructure-level concerns the previous review didn't cover; they're now central to the system's positioning as a production backend.
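The signing side is standard HMAC; a stdlib sketch under assumptions — the canonicalisation, header scheme, and hash choice are not confirmed by the source, only the retry schedule is:

```python
import hashlib
import hmac
import json

# Retry schedule from the source: [5s, 5m, 30m, 2h, 5h], in seconds.
RETRY_SCHEDULE_S = [5, 5 * 60, 30 * 60, 2 * 3600, 5 * 3600]

def sign_payload(secret: bytes, payload: dict) -> str:
    """HMAC-SHA256 over a canonical JSON body (scheme assumed)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, payload: dict, signature: str) -> bool:
    """Constant-time comparison against a recomputed signature."""
    return hmac.compare_digest(sign_payload(secret, payload), signature)
```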

Many integration packages and a Claude Code plugin. hindsight-integrations/ ships LiteLLM wrapper, LangGraph, CrewAI, AG2, Strands, Pydantic AI, LlamaIndex, OpenAI Agents, Agno, AI SDK, plus plugins for Claude Code, Codex, OpenClaw, NemoClaw, Paperclip, and OpenCode. The Claude Code plugin (hindsight-integrations/claude-code/) hooks into UserPromptSubmit for auto-recall, Stop for auto-retain, and SessionStart/SessionEnd for daemon lifecycle — injecting recalled memories as additionalContext invisible to the transcript. Framework reach is now a deliberate investment, not a side project.

Comparison with Our System

| Dimension | Hindsight | Commonplace |
| --- | --- | --- |
| Substrate | PostgreSQL + pgvector; rows with embeddings and typed links | Markdown files under git |
| Taxonomy | world / experience / observation (schema-enforced, per-type HNSW) | knowledge / self / operational (convention + type system) |
| Write path | LLM extraction + entity resolution + embedding + link creation + async consolidation | Human writes markdown; zero LLM cost at write time |
| Retrieval | 4-way parallel (semantic, BM25, link-expansion graph, temporal spreading) + RRF + cross-encoder | rg, description scanning, area indexes |
| Graph | Precomputed memory_links (entity, semantic kNN, causal) expanded at query time | Explicit markdown links with articulated relationship semantics |
| Consolidation | LLM-driven CRUD actions over observations, evidence-grounded with trends | Human curates; /connect proposes |
| Policy layer | Dispositions (soft, 1-5) + Directives (hard, compliance-gated in done tool) | Instructions and type rules; convention-bound |
| Governance | Extension system for tenants, HTTP middleware, operation validators | Pre-commit hooks + review bundle + semantic QA |
| Async & events | Durable operation queue, webhooks, audit logger, per-tenant schema | Git and filesystem; no event bus |
| Agent integration | Framework integrations + Claude Code / Codex / OpenClaw plugins with auto-recall/retain | Commonplace CLI + cp-skill-* skills in Claude Code |
| Inspectability | Requires API/UI or SQL to browse; plus audit logs | Every note is a readable file in git |

Where Hindsight is stronger. Operational depth. Multi-tenant schemas, webhooks with signed retries, durable operation queues, extensions for auth and validation, per-type partial HNSW indexes, pluggable graph retrievers — this is the most production-ready memory system we've reviewed. The directive system closes a specific gap: the ability to enforce hard rules across reflect outputs, something our instructions-as-prose layer can't guarantee. Framework reach means an agent author using LangGraph, AG2, or Claude Code can get memory with a few lines and no separate KB workflow.

Where commonplace is stronger. Curation discipline. Our notes articulate why claims relate (grounds, contradicts, extends) rather than extracting typed edges automatically. Every artifact is inspectable, diffable, and reviewable by a human in seconds. The distillation note defines the vocabulary; our curation discipline is what keeps distillation trustworthy, because the output stays in a human+agent loop rather than an LLM's read-through. We have no auth story, no webhooks, no tenants — and in return, no operational surface to maintain.

The fundamental trade-off has not changed, but the curve has steepened. Hindsight has doubled down on "memory as a service": versioned clients, extension hooks, framework integrations, embedded mode, SOTA benchmark scores. This makes the contrast sharper — Hindsight is optimising for agents that consume memory via API, while commonplace is optimising for collaborative knowledge work where the substrate itself is the artifact.

Borrowable Ideas

Explicit CRUD actions for consolidated knowledge. Hindsight's consolidator emits creates/updates/deletes as typed actions against observations — this matches ExpeL's ADD/EDIT/REMOVE mutation verbs pattern and makes the maintenance loop auditable. If commonplace ever automates synthesis (beyond the current manual /connect), the verb-set pattern is the right shape for the change log. Needs a use case first — we deliberately keep synthesis human-authored, so the prerequisite is identifying a class of changes that are safe to automate.
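If commonplace ever did adopt the pattern, the change log might take this shape — a hypothetical sketch, not any existing commonplace or Hindsight API:

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ConsolidationAction:
    """One auditable mutation, in the spirit of create/update/delete verbs."""
    verb: Literal["create", "update", "delete"]
    target: str                 # note path or observation id
    content: str = ""           # new body for create/update
    evidence: list[str] = field(default_factory=list)  # supporting quotes

def apply_actions(store: dict[str, str],
                  actions: list[ConsolidationAction]) -> dict[str, str]:
    """Pure function: returns the store after replaying the action log."""
    out = dict(store)
    for a in actions:
        if a.verb == "delete":
            out.pop(a.target, None)
        else:
            out[a.target] = a.content
    return out
```

Because the log is a plain list of typed actions, every proposed change is reviewable before it is applied — which is exactly the auditability property the verb-set pattern buys.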

Evidence-grounded observation model. Observations requiring ObservationEvidence with exact quotes and timestamps is a borrowable discipline. Our notes already cite sources, but we don't require a structured evidence list with machine-readable quote+timestamp fields. For structured-claim notes, adding a lightweight evidence frontmatter could make claim-to-source traceability programmatic. Ready to borrow — small frontmatter extension.

Directive-as-compliance-gate. Hindsight injects hard rules into prompts and rewrites the done tool schema to require compliance confirmation per rule. For commonplace, this maps to constitution-style invariants that must hold on generated artifacts (e.g., "every note has a discriminating description"). Our semantic review bundle is the post-hoc version; directives-in-schema are the pre-commit version. Ready to borrow — could apply in review skills and note-generation templates.

Extension system for operation validation. The OperationValidatorExtension pattern — pre-operation hooks with typed contexts — is a clean way to inject governance without forking the core. For commonplace, the equivalent would be pluggable pre-commit validators scoped per command (commonplace-review, commonplace-connect). Needs a use case first — currently our validators are hard-coded; YAGNI until we have a second consumer.

Per-type HNSW partitioning. Forcing the planner onto partial indexes via UNION ALL per fact_type is a concrete technique. Not directly borrowable (we don't use a vector DB), but the underlying principle — make taxonomy visible to the retrieval layer, not just the storage layer — applies to our area and tag scoping. Our tag-index methodology already points this way.

What We Should Not Borrow

LLM-mediated writes. Every retain still costs an LLM call (often several — fact extraction, entity resolution, consolidation). Adopting this would break our near-zero-cost write path and introduce silent extraction errors. For a human-authored KB, the savings from automated extraction do not offset the loss of authorial control.

PostgreSQL-as-substrate. The capabilities that make Hindsight production-grade (per-tenant schemas, partial HNSW indexes, async operations, webhooks) are only possible because the substrate trades direct file-level inspectability for database-backed operational structure. This is the inverse of our inspectable-substrate thesis and a direct counterexample in the design space.

Fully automated consolidation. The evidence-grounded observations are a step up from the previous version, but the consolidator still auto-promotes without human review. The curation discipline we maintain through manual writing would be undermined if an LLM decided when claims graduate into first-class notes.

Curiosity Pass

What property does Hindsight now claim? That biomimetic fact typing (world/experience/observation), LLM-extracted structure at retain time, and agentic hierarchical retrieval at reflect time produce measurably better long-term memory than RAG or graph-only alternatives. LongMemEval SOTA (independently reproduced by Virginia Tech per README) remains their benchmark evidence. The removal of opinion and the swap from MPFP to link-expansion are worth interrogating — if "four fact types plus meta-path graph" was the winning formula a few months ago, what changed?

Does the mechanism transform data or relocate it? Retain still genuinely transforms — LLM extraction produces structured JSON with entities and causal hints, entity resolution canonicalises, and consolidation synthesises observations. Reflect also transforms via tool-use reasoning. But the graph retrieval layer has gotten thinner — link expansion is closer to vector search + entity join than a genuine graph algorithm. The MPFP ablation framing ("mass propagation along meta-paths") no longer applies; the current path is short and shallow. If the benchmark scores held across that swap, the graph layer may have been doing less than its complexity suggested.

What is the simpler alternative? Vector search over an embedded fact-type dimension + cross-encoder reranking, with separate consolidation as a background job. The current LinkExpansionRetriever is not far off this; entity co-occurrence is the main extra signal. If an ablation showed entity expansion adds ≤5% over semantic + BM25 + rerank, the graph complexity is decoration. The system's own evolution (MPFP → LinkExpansion) hints this way.

What is the ceiling? Bounded by LLM quality for extraction, entity resolution quality for canonicalisation (still trigram-based, still susceptible to "NYC" ≠ "New York City"), and consolidation judgment for what promotes to observations. The directive system is an interesting new ceiling-raiser — hard rules give the system guarantees the LLM cannot override — but it moves the compliance problem into the directive-writing loop.

Trace-Derived Learning Placement

Hindsight ingests conversation/document content via retain, extracts typed facts through an LLM under structured schemas (ExtractedFact, ExtractedFactVerbose, VerbatimExtractedFact), and promotes them into PostgreSQL-stored memory_units with explicit entity, semantic-kNN, and causal links. Trace source: client-submitted content (any text — conversation turns, documents, tool outputs) parameterised by optional context, event_date, tags, and metadata. Trigger boundaries are per-retain-call (per-turn is the typical case in the Claude Code plugin; the Stop hook batches with a sliding window). Extraction: an LLM extracts facts with who/what/when/where/why, entities, and within-batch causal relations — the oracle is the extractor LLM itself, with schema validation and optional disabling of causal fields. Promotion target: database-backed symbolic artifacts (memory_units with fact_type), not weights. Observations are second-level symbolic artifacts built by LLM-issued creates/updates/deletes over existing observations, with evidence quotes required. Memory is service-owned (PostgreSQL schema per tenant), not ephemeral. Scope: per-bank, generalisable within a bank; banks are explicitly isolated. No cross-bank mining. Timing: online — every retain runs extraction immediately; consolidation runs async post-commit but still within seconds-to-minutes of ingestion.

On the survey's axes: axis 1 — service-owned trace backend (Hindsight defines its own content API, owns the storage format, and separates archive from extraction from consolidation; comparable to OpenViking and REM but with richer per-fact typing and agentic reflection on top). Axis 2 — symbolic artifact learning, never weights. Artifact structure is "typed durable observations with evidence" (close to ClawVault's observation ledger but evidence-grounded and maintained by explicit CRUD verbs). This placement strengthens the survey's claim that service-owned trace backends can mine without agent runtime coupling, and adds one concrete example of evidence-required observation models. It does not warrant a new subtype — the service-owned trace backend bucket already fits; Hindsight raises the bar within it with directive-gated reflect and CRUD-verb consolidation.

What to Watch

  • Does the link-expansion retriever keep benchmark parity with MPFP? The simplification from meta-path propagation to query-time LATERAL joins is a big architectural bet. If benchmark scores held, the meta-path layer was over-engineered; if they drop, the next iteration may reintroduce it in a different form.
  • Does opinion come back? Dropping a fact type is unusual — either a cleanup of a feature that didn't earn its keep, or preparation for a different representation of uncertainty (e.g., confidence fields on world facts). The migration names suggest the former.
  • Are directives composable with dispositions and consolidations at scale? Directives rewrite the done tool schema; dispositions change the system prompt; consolidation writes observations that reflect reads. Three interacting layers with different lifecycles (runtime, per-call, background) could develop corner cases — e.g., an observation promoted when a directive was not in place but retrieved after it was.
  • Does the extension system accumulate production-shaped requirements, or stay thin? Tenancy, operation validators, and HTTP middleware are the current extension points. Whether this stays a clean abstraction or becomes a god-plug for every enterprise hook is the architecture question for the next release.
  • Do framework integrations commoditise memory? With Claude Code, Codex, OpenClaw, LangGraph, LiteLLM, CrewAI, etc., Hindsight is positioning as "the memory layer" across agent runtimes. If this succeeds, memory becomes a service tier that no specific agent KB needs to own — which affects our framing of commonplace vs commercial offerings.

Relevant Notes:

#related-systems #memory-architecture #trace-derived