Hindsight

Type: note · Status: current · Tags: related-systems

Hindsight (vectorize-io/hindsight) is an open-source agent memory system by Vectorize.io that organises memories into biomimetic data structures — world facts, experiences, and mental models — and retrieves them through four parallel search strategies merged via reciprocal rank fusion and cross-encoder reranking. It claims state-of-the-art performance on LongMemEval (independently reproduced by Virginia Tech). The server is Python/FastAPI backed by PostgreSQL with pgvector, with client SDKs in Python, TypeScript, and Rust.

Core Ideas

Biomimetic memory taxonomy drives storage, not just retrieval. Memories are classified at ingest into four fact_type values — world (general knowledge), experience (agent's own actions), opinion (stance with confidence score), and observation (auto-consolidated insight). Each type gets its own partial HNSW index, so the query planner can retrieve from specific memory types without sequential scans. The taxonomy isn't a post-hoc label; it shapes the physical storage layout.
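
A minimal sketch of the routing idea, using per-type in-memory partitions as a stand-in for pgvector's partial HNSW indexes (the names and the brute-force cosine scan are illustrative, not Hindsight's code):

```python
import math

# One "partition" per fact_type, mirroring the per-type partial indexes.
partitions = {"world": [], "experience": [], "opinion": [], "observation": []}

def add_memory(fact_type, vec, payload):
    partitions[fact_type].append((vec, payload))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query_vec, fact_types, k=2):
    # Only the requested partitions are scanned; other types are never touched.
    candidates = [item for t in fact_types for item in partitions[t]]
    ranked = sorted(candidates, key=lambda it: cosine(query_vec, it[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]

add_memory("world", [1.0, 0.0], "Paris is in France")
add_memory("experience", [0.9, 0.1], "I booked a flight to Paris")
```

The payoff is that a query scoped to `["world"]` pays nothing for experiences or opinions, which is what the partial-index layout buys at the database level.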

Retain is an LLM-mediated extraction pipeline, not a store-and-index operation. Every retain call triggers: LLM-based fact extraction (structured JSON with who/what/when/where/why), entity resolution via trigram similarity against existing canonical entities, embedding generation, link creation (temporal, semantic, causal, entity), and async consolidation into observations. This means every write has LLM cost — the system trades write-time computation for read-time accuracy.
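
The shape of that write path, with the LLM extractor, entity matcher, and embedder injected as callables (all names are illustrative stand-ins, and link creation is elided):

```python
def retain(text, bank, extract, resolve_entity, embed):
    """Sketch of the retain pipeline: extract -> resolve -> embed -> store -> consolidate."""
    stored = []
    for fact in extract(text):                       # one LLM call per retain
        fact["entities"] = [resolve_entity(e, bank) for e in fact["entities"]]
        fact["embedding"] = embed(fact["text"])
        bank["facts"].append(fact)                   # link creation omitted in this sketch
        stored.append(fact)
    bank["needs_consolidation"] = True               # async observation pass would fire here
    return stored

bank = {"facts": [], "needs_consolidation": False}
stored = retain(
    "Acme shipped v2 on Monday", bank,
    extract=lambda t: [{"text": t, "entities": ["Acme"], "fact_type": "world"}],
    resolve_entity=lambda e, b: e.lower(),           # stand-in for trigram matching
    embed=lambda t: [0.1, 0.2],
)
```

Every stage that is a lambda here is a model call or database query in the real system, which is where the per-write cost comes from.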

Four-way parallel retrieval with late fusion. Recall runs semantic (HNSW vector similarity), BM25 (full-text), graph (MPFP traversal), and temporal (spreading activation) retrieval in parallel, then merges via reciprocal rank fusion (k=60) followed by cross-encoder reranking. No single strategy dominates; the fusion lets each strategy contribute where it's strongest (semantic for paraphrase, BM25 for exact terms, graph for entity chains, temporal for time-anchored queries).
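
Reciprocal rank fusion itself is tiny; a sketch with k=60 (the result lists here are made up, and reranking is out of scope):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists into one.

    rankings: list of lists of ids, best-first. Each id scores
    1 / (k + rank) per list it appears in; k=60 dampens the advantage
    of topping any single list, so broad agreement wins.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["m3", "m1", "m7"]
bm25     = ["m1", "m9", "m3"]
graph    = ["m7", "m3"]
temporal = ["m9"]
fused = reciprocal_rank_fusion([semantic, bm25, graph, temporal])
```

Note that `m3` wins despite never being ranked by the temporal strategy: appearing in three of four lists beats a single first place, which is exactly the "no single strategy dominates" property.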

Multi-Path Fact Propagation is a sublinear graph algorithm. MPFP (the default graph retrieval) propagates mass from semantic seed nodes along predefined meta-path patterns (e.g., [semantic, causes], [entity, temporal]) with a forward-push mechanism. Mass pruning (threshold=1e-6) bounds the active frontier, so exploration stays sublinear even on large graphs. Lazy edge loading via LATERAL joins fetches only top-k neighbours per (node, link_type), sharing the cache across patterns. This is the most sophisticated graph retrieval among reviewed memory systems.
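
A toy forward-push sketch of the meta-path idea (the decay factor and data structures are assumptions for illustration, not Hindsight's implementation, and lazy edge loading is elided):

```python
def mpfp(graph, seeds, meta_path, threshold=1e-6, decay=0.5):
    """Push mass along one typed meta-path, e.g. ["semantic", "causes"].

    graph: {node: {link_type: [(neighbour, weight), ...]}}
    seeds: {node: initial_mass} from semantic search.
    Nodes whose pushed mass falls below `threshold` are pruned, which keeps
    the active frontier small regardless of total graph size.
    """
    frontier = dict(seeds)
    for link_type in meta_path:                  # one hop per meta-path step
        nxt = {}
        for node, mass in frontier.items():
            for neighbour, weight in graph.get(node, {}).get(link_type, []):
                pushed = mass * weight * decay
                if pushed >= threshold:          # mass pruning bounds the frontier
                    nxt[neighbour] = nxt.get(neighbour, 0.0) + pushed
        frontier = nxt
    return frontier

graph = {
    "a": {"semantic": [("b", 1.0)]},
    "b": {"causes": [("c", 0.8)]},
}
scores = mpfp(graph, {"a": 1.0}, ["semantic", "causes"])
```

The meta-path constrains each hop to one link type, so the traversal explores `semantic` edges then `causes` edges rather than flooding all neighbours at every step.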

Two-level knowledge consolidation separates automation from curation. Observations are auto-generated by a consolidation engine that watches new facts post-retain and uses LLM to create/update/delete higher-order summaries. Mental models are user-curated queries that get refreshed on demand via reflect. Each observation carries a computed trend (stable, strengthening, weakening, new, stale) derived from evidence timestamps. This two-tier structure lets the system accumulate without requiring human attention at every step while preserving a curation point.
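
The trend classification can be sketched from evidence timestamps alone (the window thresholds here are illustrative assumptions, not Hindsight's actual values):

```python
from datetime import datetime, timedelta

def trend(evidence_times, now, recent_window_days=30, stale_after_days=90):
    """Classify an observation by the recency profile of its evidence."""
    if not evidence_times:
        return "stale"
    newest = max(evidence_times)
    if now - newest > timedelta(days=stale_after_days):
        return "stale"                                   # nothing recent at all
    recent = sum(1 for t in evidence_times
                 if now - t <= timedelta(days=recent_window_days))
    older = len(evidence_times) - recent
    if older == 0:
        return "new"                                     # all evidence is fresh
    if recent > older:
        return "strengthening"                           # fresh evidence dominates
    if recent == 0:
        return "weakening"                               # still alive, but cooling
    return "stable"

now = datetime(2025, 6, 1)
def days_ago(n):
    return now - timedelta(days=n)
```

The useful property is that the label is derived, not stored: it stays honest as time passes without anyone updating a status field.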

Dispositions tune reasoning without changing storage. Each memory bank has three personality traits (skepticism, literalism, empathy; each 1–5) that are injected into the reflect system prompt. Dispositions affect only reflect, not retain or recall — the same facts produce different answers depending on the bank's personality. This cleanly separates storage invariants from interpretation policy.
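
The mechanism is plain prompt assembly; a sketch with invented trait wording (only the 1-5 trait structure comes from Hindsight):

```python
# Illustrative descriptions; the real system's prompt text may differ.
DISPOSITION_DESCRIPTIONS = {
    "skepticism": "question claims and weigh supporting evidence",
    "literalism": "interpret statements exactly as written",
    "empathy": "consider the emotional context of remembered events",
}

def build_reflect_prompt(base_prompt, dispositions):
    """Inject per-bank traits into the reflect system prompt.
    Storage (retain/recall) never sees these; only reflect does."""
    lines = [f"- {trait} ({level}/5): {DISPOSITION_DESCRIPTIONS[trait]}"
             for trait, level in dispositions.items()]
    return base_prompt + "\n\nDispositions:\n" + "\n".join(lines)

prompt = build_reflect_prompt(
    "You answer strictly from stored memories.",
    {"skepticism": 4, "empathy": 2},
)
```

Because the traits live outside the stored facts, two banks with identical contents can legitimately disagree in their answers.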

Reflect is an agentic tool-calling loop, not a search endpoint. The reflect operation runs up to 10 iterations of an LLM tool-calling loop with hierarchical retrieval tools: search_mental_models → search_observations → recall → expand (graph traversal) → done. The hierarchy encourages the agent to check curated knowledge before falling back to raw facts. This makes reflect qualitatively different from recall — it's a reasoning process, not a retrieval query.
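
The loop shape, with the model call abstracted into a callable (the curated-before-raw ordering is enforced by the prompt, not by this code):

```python
def reflect(question, tools, llm_step, max_iters=10):
    """Agentic loop sketch: the LLM picks a tool each turn until it signals done.

    tools: dict mapping tool names (search_mental_models, search_observations,
    recall, expand) to callables. `llm_step` stands in for the model call and
    returns either (tool_name, argument) or ("done", final_answer).
    """
    transcript = []
    for _ in range(max_iters):
        tool, arg = llm_step(question, transcript)
        if tool == "done":
            return arg
        transcript.append((tool, tools[tool](arg)))   # feed results back next turn
    return None                                       # iteration budget exhausted

# Scripted model for demonstration: one search, then an answer.
steps = iter([("search_mental_models", "travel plans"), ("done", "final answer")])
tools = {
    "search_mental_models": lambda q: ["model-1"],
    "search_observations": lambda q: [],
    "recall": lambda q: [],
    "expand": lambda node: [],
}
answer = reflect("What are my travel plans?", tools, lambda q, t: next(steps))
```

The bounded iteration count is the only hard guarantee; everything else about the hierarchy is soft guidance to the model.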

Comparison with Our System

| Dimension | Hindsight | Commonplace |
| --- | --- | --- |
| Storage substrate | PostgreSQL + pgvector; memories are rows with embeddings | Filesystem-first; notes are markdown files under version control |
| Memory taxonomy | world / experience / opinion / observation (enforced by schema) | knowledge / self / operational (enforced by convention and type system) |
| Write cost | High — every retain requires LLM extraction + entity resolution + embedding | Low — human writes markdown; no LLM call required for storage |
| Consolidation | Automated — consolidation engine creates observations post-retain | Manual — human writes notes, /connect discovers relationships |
| Retrieval | Four-way parallel search with fusion and reranking | rg keyword search + description scanning + area filtering |
| Graph model | Directed links (temporal, semantic, entity, causal) between memory units | Standard markdown links with explicit relationship semantics |
| Verification | LLM confidence scores on opinions; trend tracking on observations | Type system + structural validation (/validate) + semantic review |
| Inspectability | Opaque to humans without the API; requires UI or SQL to browse | Fully inspectable — every note is a readable file |

Where Hindsight is stronger. Retrieval at scale. The four-way fusion with MPFP and cross-encoder reranking will outperform keyword search on large corpora, especially for paraphrase and temporal queries. Automated consolidation means the system learns without human attention. Benchmark results confirm this — LongMemEval SOTA is a real signal.

Where commonplace is stronger. Inspectability and human curation. Every note is a readable file with explicit link semantics; the methodology IS the content. There's no opaque database layer between the human and the knowledge. The type system and structural validation catch errors that Hindsight's LLM-driven pipeline might introduce silently. And write cost is near zero — no LLM call to store a thought.

The fundamental trade-off. Hindsight automates the retain→consolidate→recall loop at the cost of inspectability and write-time computation. Commonplace keeps the human in the loop at the cost of retrieval sophistication and automated consolidation. These are complementary, not competing — Hindsight targets agent-operated memory at scale; commonplace targets human-AI collaborative knowledge work.

Borrowable Ideas

Trend computation on consolidated knowledge — computing whether an observation is strengthening, weakening, stable, new, or stale from evidence timestamps. Commonplace notes have status (current/outdated/speculative) set manually; deriving staleness from link recency could automate part of this. Needs a use case first — our link structure doesn't carry timestamps, so the prerequisite is richer link metadata.

Disposition-as-prompt-injection for per-context reasoning — the pattern of storing personality traits on a memory bank and injecting them into the system prompt is clean and general. We could use this for area-specific reasoning styles (e.g., a "speculative" area that gets higher tolerance for uncertain claims). Ready to borrow — requires only instruction-level changes.

Hierarchical retrieval in reflect — the tool ordering (curated → consolidated → raw) is a sound heuristic: check the best knowledge first, fall back to ground truth. Our /connect skill could adopt this pattern — check existing indexes and connection reports before searching raw notes. Ready to borrow.

Meta-path patterns for graph traversal — defining typed traversal paths ([entity, temporal], [semantic, causes]) rather than blind BFS/DFS. If we ever build automated link traversal, meta-paths are the right abstraction. Needs a use case first — our graph is small enough for manual traversal.

What We Should Not Borrow

Fully automated consolidation. Hindsight's consolidation engine creates observations without human review. In our system, the human is the oracle — automated consolidation would undermine the curation point that makes distillation trustworthy. The observations Hindsight creates are useful for retrieval but lack the grounding discipline our notes require.

Database-first storage. PostgreSQL with pgvector is the right choice for Hindsight's scale and access patterns, but adopting it would sacrifice our core property: every piece of knowledge is a readable, version-controlled file. The inspectable substrate, not supervision, defeats the blackbox problem.

LLM-mediated writes. Requiring an LLM call for every retain operation adds cost, latency, and a failure mode (extraction errors) to the write path. Our system's zero-LLM-cost writes are a feature, not a limitation.

Curiosity Pass

What property does Hindsight actually claim? That biomimetic memory organisation (world/experience/observation) combined with multi-strategy retrieval produces better long-term memory performance than flat vector search or knowledge graphs alone. The LongMemEval benchmark confirms retrieval accuracy, but doesn't test whether the biomimetic taxonomy specifically causes the gain vs. the four-way fusion and reranking doing most of the work.

Does the mechanism transform data or just relocate it? The retain pipeline genuinely transforms — LLM extracts structured facts from unstructured text, entity resolution canonicalises mentions, and consolidation synthesises observations from facts. The reflect pipeline also transforms via its agentic reasoning loop. This is not just storage-and-retrieval.

What's the simpler alternative? Vector search + reranking (skip the graph, skip consolidation). Hindsight's own architecture suggests this would lose ~15-20% retrieval accuracy based on ablation patterns in similar systems. The MPFP graph traversal and observation layer are where the marginal gains come from.

What's the ceiling? The system depends on LLM quality for fact extraction and consolidation — extraction errors compound into graph errors. Entity resolution uses string similarity (trigram + SequenceMatcher), which will fail on semantically equivalent but lexically different entities. The consolidation engine can only synthesise what it extracts; conceptual insights that require cross-document reasoning beyond the consolidation window won't emerge.
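
The entity-resolution ceiling is easy to demonstrate with a simplified pg_trgm-style similarity (padding here is a simplification of PostgreSQL's, which pads each word separately):

```python
def trigrams(s):
    """Character trigrams of a padded, lowercased string."""
    s = f"  {s.lower()} "                       # simplified pg_trgm-style padding
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity over trigram sets, as pg_trgm computes it."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

Surface variants like "New York" vs. "New York City" score well above typical match thresholds, while "NYC" vs. "New York City" scores near zero despite naming the same entity: abbreviations and aliases fragment into separate canonical entities.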

What to Watch

  • Does the biomimetic taxonomy survive scaling? Four fact types × four retrieval strategies = combinatorial complexity. As banks grow past 100K memories, does the per-type HNSW partitioning continue to help, or does it fragment the index?
  • Consolidation quality at volume. Auto-generated observations are only as good as the LLM's synthesis. Do observations drift or contradict each other as the fact base grows? The trend computation helps detect staleness but not contradiction.
  • Entity resolution ceiling. Trigram similarity will miss "NYC" ↔ "New York City" and similar semantic equivalences. How much retrieval accuracy is lost to entity fragmentation in production?
  • Competition from simpler baselines. As embedding models improve, does simple vector search + reranking close the gap with four-way fusion? If so, Hindsight's architectural complexity becomes a liability.

Extends three-space agent memory maps to Tulving taxonomy — Hindsight's world/experience/observation taxonomy is the closest production implementation of the theoretical three-space separation, with benchmark evidence that the separation improves retrieval.

Extends three-space memory separation predicts measurable failure modes — Hindsight's per-type HNSW partitioning and type-aware retrieval are a direct test: if flat memory predicted cross-contamination failures, Hindsight's type separation should avoid them.

Grounds distillation — Hindsight's consolidation engine (facts → observations → mental models) is a concrete implementation of automated distillation; trend tracking shows the temporal dynamics of distilled knowledge.

Contrasts inspectable substrate not supervision defeats the blackbox problem — Hindsight's PostgreSQL storage is opaque to humans without API/UI mediation, making it a concrete example of the supervision-over-substrate approach.

Contrasts ephemeral computation prevents accumulation — Hindsight is maximally anti-ephemeral: every interaction triggers extraction, storage, linking, and consolidation. Nothing is discarded.

Exemplifies constraining and distillation both trade generality for reliability speed and cost — the retain pipeline constrains (fact typing, entity canonicalisation) while consolidation distils (observations from facts); both trade the generality of raw text for structured, retrievable knowledge.

Extends memory management policy is learnable but oracle-dependent — Hindsight's consolidation engine learns what to consolidate using task-completion oracles (LLM judgement); the same oracle-dependency limitation applies.

Tests CLAW learning is broader than retrieval — Hindsight's reflect operation goes beyond retrieval into reasoning (tool-calling loop with hierarchical knowledge access), but the retain/recall path remains retrieval-focused. The system partially validates the claim that learning requires more than search.

#related-systems #learning-theory #memory-architecture