OpenViking

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

A context database for AI agents built by Volcengine (ByteDance's cloud unit), licensed AGPL-3.0. At GitHub HEAD 37e106a958f4e427cfc922e32828f5c16d08c388 (inspected 2026-04-13), the repo has accumulated roughly 360 commits since the previous review. In that span it split the chat runtime into a bot/ package (Vikingbot, derived from Nanobot); added account-namespace-shared sessions, hotness-blended rerank in the hierarchical retriever, and account-level multi-tenancy on every URI; rewrote the memory deduplicator around MERGE/DELETE action lists; introduced an eight-category extractor (profile/preferences/entities/events/cases/patterns/tools/skills) that specializes ToolSkillCandidateMemory with duration and call-count statistics; added a RequestContext+PathLock-guarded two-phase commit_async that archives under a distributed lock and then spawns memory extraction via asyncio.create_task; and added LoCoMo benchmark scripts targeting Supermemory, Mem0, and OpenClaw. The MCP-server code has been removed from the Python tree; MCP integration now lives in the OpenClaw plugin and in Vikingbot's MCP client (essentially a vendored Nanobot MCP client). The filesystem metaphor remains the dominant framing — viking://resources/, viking://user/<space>/memories/, viking://agent/<space>/memories/, viking://agent/<space>/skills/, viking://session/<id>/ — backed by AGFS plus a VikingDB vector index and reached only through the openviking-server HTTP service or the ov CLI/Rust ov_cli.

Repository: https://github.com/volcengine/OpenViking

Core Ideas

Virtual filesystem as presentation layer over AGFS + vector index. Everything the agent can reach lives under a viking:// URI whose first segment is a scope (resources, user, agent, session) and whose second segment is a per-account space (ctx.user.user_space_name() / agent_space_name()). openviking/server/routers/filesystem.py exposes ls, tree, stat, mkdir, and the service/fs_service.py layer implements them against a Go-based AGFS for content storage and a VikingDB-backed vector index for search and directory children. Nothing on the filesystem is addressable with cat or grep — the CLI exposes ov grep as a server-side scan. The interesting design commitment is that every piece of context, including session transcripts under the session scope, ends up under the same URI tree with the same hidden-file conventions (.abstract.md, .overview.md, .relations.json), so the retrieval and listing machinery does not need scope-specific code paths.

L0/L1/L2 as a storage-level property, not a retrieval trick. ContextLevel is a hard enum (ABSTRACT=0, OVERVIEW=1, DETAIL=2) that participates in the vector index schema. HierarchicalRetriever.LEVEL_URI_SUFFIX = {0: ".abstract.md", 1: ".overview.md"} shows the convention: every directory is represented in the index three times (abstract, overview, full directory record), and individual leaf files get an L2 entry. The write path builds these deterministically — the memory extractor returns a CandidateMemory with all three fields populated (abstract, overview, content) in one LLM call, so the three-tier representation is a precondition for insertion, not a derived enrichment. This is the single most copyable architectural choice in the system: tiered loading is enforced by the type signature of what enters storage.
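
A minimal sketch of that invariant, reusing the names the review cites (ContextLevel, LEVEL_URI_SUFFIX, CandidateMemory); the constructor check and index_entries helper are illustrative, not OpenViking's actual code:

```python
from dataclasses import dataclass
from enum import IntEnum

class ContextLevel(IntEnum):
    ABSTRACT = 0   # ~100-token gist, indexed as <dir>.abstract.md
    OVERVIEW = 1   # ~2k-token summary, indexed as <dir>.overview.md
    DETAIL = 2     # full content, the leaf record itself

LEVEL_URI_SUFFIX = {ContextLevel.ABSTRACT: ".abstract.md",
                    ContextLevel.OVERVIEW: ".overview.md"}

@dataclass
class CandidateMemory:
    abstract: str
    overview: str
    content: str

    def __post_init__(self):
        # Illustrative invariant: all three tiers must be populated
        # before anything is allowed into storage.
        if not (self.abstract and self.overview and self.content):
            raise ValueError("all three context levels are required")

def index_entries(dir_uri: str, m: CandidateMemory) -> dict:
    # One directory yields three index records, one per tier.
    return {dir_uri + LEVEL_URI_SUFFIX[ContextLevel.ABSTRACT]: m.abstract,
            dir_uri + LEVEL_URI_SUFFIX[ContextLevel.OVERVIEW]: m.overview,
            dir_uri: m.content}
```

The point of the sketch: because the only constructor requires all three fields, "tiered representation exists" is a type-level precondition of insertion rather than a post-hoc enrichment.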

Hierarchical recursive retrieval with score propagation and convergence stop. HierarchicalRetriever.retrieve (openviking/retrieve/hierarchical_retriever.py, 617 lines) runs: (1) optional intent analysis in IntentAnalyzer to turn conversation context into up to ~5 TypedQuery items, (2) a global vector search seeded by context_type roots (viking://user/<space>/memories, etc.) or explicit target directories, (3) a priority queue of candidate directories keyed by a running score, with SCORE_PROPAGATION_ALPHA = 0.5 blending the current directory's score with each child's match (final_score = alpha * child + (1 - alpha) * parent_score), (4) convergence detection that stops after MAX_CONVERGENCE_ROUNDS = 3 rounds with unchanged top-k, and (5) a hotness blend at conversion time (HOTNESS_ALPHA = 0.2; hotness_score = sigmoid(log1p(active_count)) * exp(-ln2 * age_days / 7)). Level-2 leaves are treated as terminal (not re-expanded). L0 entries participate in the queue but are added as "starting points" filtered out of the initial candidate pool — the queue walks directories while leaves accumulate in collected_by_uri.
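
The scoring arithmetic can be re-derived directly from the constants the paragraph quotes (SCORE_PROPAGATION_ALPHA = 0.5, HOTNESS_ALPHA = 0.2, 7-day half-life); this is a reconstruction for illustration, not the repo's code:

```python
import math

SCORE_PROPAGATION_ALPHA = 0.5
HOTNESS_ALPHA = 0.2
HALF_LIFE_DAYS = 7.0

def propagated(child_match: float, parent_score: float) -> float:
    # final_score = alpha * child + (1 - alpha) * parent_score
    return SCORE_PROPAGATION_ALPHA * child_match \
        + (1 - SCORE_PROPAGATION_ALPHA) * parent_score

def hotness(active_count: int, age_days: float) -> float:
    # hotness = sigmoid(log1p(active_count)) * exp(-ln2 * age_days / half_life)
    frequency = 1.0 / (1.0 + math.exp(-math.log1p(active_count)))
    recency = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    return frequency * recency

def blended(semantic_score: float, active_count: int, age_days: float) -> float:
    # 0.8 * semantic + 0.2 * hotness at conversion time
    return (1 - HOTNESS_ALPHA) * semantic_score \
        + HOTNESS_ALPHA * hotness(active_count, age_days)
```

Worth noting from the numbers: hotness is bounded in [0, 1] and enters at weight 0.2, so it can shift at most 0.2 of the final score; the recency factor halves every 7 days regardless of how often the memory was accessed.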

Eight-category memory extractor feeding an LLM-gated deduplicator. MemoryExtractor.extract (1504 lines) formats messages plus a history_summary into compression.memory_extraction, asks the VLM for a list of categorized memories, then, for tools/skills candidates only, calibrates the LLM-proposed name against ToolPart traces collected during the session and attaches call_time, success_time, duration_ms, prompt_tokens, and completion_tokens from tool_stats_map/skill_stats_map. A tools/skills candidate whose tool name cannot be matched to a recorded trace, or whose call count is zero, is dropped before it reaches the deduplicator. MemoryDeduplicator.deduplicate then vector-searches the candidate against similar memories in the same category URI prefix and, if anything matches, asks the VLM for one of three top-level decisions — SKIP / CREATE / NONE — plus a per-existing MERGE or DELETE action list. NONE means "don't create anything but resolve conflicts among existing entries." The LLM's decision is the authority; the code normalizes pathological combinations (e.g., CREATE+empty actions, SKIP+non-empty actions) but does not second-guess classifications.
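
A sketch of the decision schema plus one plausible normalization of the pathological combinations; the schema names follow the review, but the specific normalization rules here are assumptions, not read from the repo:

```python
from dataclasses import dataclass, field
from enum import Enum

class Decision(Enum):
    SKIP = "skip"      # candidate adds nothing new
    CREATE = "create"  # insert the candidate as a new memory
    NONE = "none"      # create nothing, but resolve conflicts among existing entries

class ActionType(Enum):
    MERGE = "merge"    # rewrite an existing memory using the candidate
    DELETE = "delete"  # remove an existing memory

@dataclass
class ExistingMemoryAction:
    target_uri: str
    action: ActionType
    merged_content: str = ""

@dataclass
class DedupResult:
    decision: Decision
    actions: list = field(default_factory=list)

def normalize(result: DedupResult) -> DedupResult:
    # Hypothetical rules: SKIP means "do nothing", so stray actions are
    # dropped; NONE with no actions has nothing left to mean, so it
    # collapses to SKIP. Classifications are never second-guessed.
    if result.decision is Decision.SKIP and result.actions:
        return DedupResult(Decision.SKIP, [])
    if result.decision is Decision.NONE and not result.actions:
        return DedupResult(Decision.SKIP, [])
    return result
```

The shape to copy is the two-level structure: one candidate-level decision, plus a list of per-existing-memory actions that can fire independently of whether the candidate is created.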

Two-phase commit: PathLock-guarded snapshot, background extraction. Session.commit_async acquires a distributed filesystem LockContext scoped to the session's URI, increments compression.compression_index, copies the live messages into an archive, and clears the live session list — all under the lock. The lock is then released, the raw messages.jsonl is written to viking://session/<id>/history/archive_NNN/, and a TaskRecord is created in the global task_tracker. Only then is asyncio.create_task(self._run_memory_extraction(...)) fired; extraction (intent-aware chunking, per-category extraction, deduplication, queue-backed vector write) runs out of band. The API returns {status: accepted, task_id} immediately. A FailedPreconditionError is raised if any prior archive is marked failed — so a crashed extraction blocks future commits until explicitly cleared.
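
The shape of that two-phase commit can be sketched with stdlib asyncio; here asyncio.Lock stands in for the distributed PathLock, a RuntimeError stands in for FailedPreconditionError, and the extraction body is a placeholder:

```python
import asyncio
import itertools

class Session:
    """Toy model of snapshot-then-extract commit (not OpenViking's code)."""

    def __init__(self):
        self._lock = asyncio.Lock()          # stand-in for distributed PathLock
        self.messages = []
        self.archives = []
        self.failed_archives = set()
        self.compression_index = 0
        self.tasks = {}
        self._ids = itertools.count(1)

    async def commit_async(self) -> dict:
        if self.failed_archives:
            # A crashed extraction blocks further commits until cleared.
            raise RuntimeError("previous extraction failed; clear it first")
        async with self._lock:               # phase 1: snapshot under the lock
            self.compression_index += 1
            snapshot, self.messages = self.messages, []
            self.archives.append(snapshot)
        task_id = f"task-{next(self._ids)}"
        # Phase 2: extraction runs out of band; the API returns immediately.
        self.tasks[task_id] = asyncio.create_task(self._extract(snapshot))
        return {"status": "accepted", "task_id": task_id}

    async def _extract(self, snapshot) -> int:
        await asyncio.sleep(0)               # placeholder for chunk/extract/dedup
        return len(snapshot)
```

The durability split is the point: phase 1 is short, lock-guarded, and guaranteed; phase 2 is long, retry-able, and tracked by the task handle.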

Account-namespaced URIs with role-based tenancy. RequestContext.user.account_id, user_space_name(), and agent_space_name() are plumbed through every storage call. HierarchicalRetriever._get_root_uris_for_type returns viking://user/<user_space>/memories and viking://agent/<agent_space>/memories — both derived from the request's UserIdentifier. The Role enum (ROOT, ADMIN, USER) gates what a request can cross-read, and feat/account-namespace-shared-sessions (new since the last review) lets explicitly shared namespaces cross-read sessions across users. This is a different tenancy model than cognee's ACL-on-datasets: here the namespace lives in the URI, so unauthorized cross-account access is impossible by path construction, not by filter rewriting.

Session transcripts are themselves first-class directories. viking://session/<id>/history/archive_NNN/messages.jsonl exists alongside abstract and overview overlays, indexed into the same vector store. Session archives are retrievable by the same hierarchical machinery as resources or memories — a completed session is searchable context, not just input to extraction. The LoCoMo evaluation in benchmark/locomo/ relies on this: the multi-session dialogue recall task scores OpenViking against Mem0, Supermemory, and OpenClaw's native memory, and the reported advantage (52% task completion at 4.2M input tokens vs LanceDB at 44.5% with 51.6M tokens) is almost entirely on the retrieval-from-archived-sessions axis.

OpenClaw plugin is the real integration story. examples/openclaw-plugin/ is a TypeScript integration that hooks before_prompt_build, session_start, session_end, agent_end, and before_reset against an OpenClaw agent runtime. It recalls memories in parallel from viking://user/memories and viking://agent/memories, reranks by "is it a leaf memory with level==2 / does it look like a preference / event / what is lexical overlap with the current query," trims under a token budget, and prepends a <relevant-memories> block. The plugin does not use MCP for this — it speaks directly to the OpenViking HTTP API with X-OpenViking-Account / X-OpenViking-User / X-OpenViking-Agent headers. The Vikingbot (forked from Nanobot) now also speaks MCP as a client to third-party servers, but its own knowledge lookup is a direct OpenViking call.
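
The ranking heuristic can be illustrated with a small sketch (Python here for consistency with the other examples, though the plugin itself is TypeScript); the tuple ordering and token accounting are assumptions about the general shape, not the plugin's actual code:

```python
def rerank_and_trim(memories: list, query: str, token_budget: int) -> list:
    """Order by leaf-ness, type nudges, lexical overlap; greedily trim to budget."""
    query_terms = set(query.lower().split())

    def score(m: dict) -> tuple:
        overlap = len(query_terms & set(m["abstract"].lower().split()))
        return (
            m.get("level") == 2,                             # prefer leaf memories
            m.get("category") in {"preferences", "events"},  # type nudge
            overlap,                                         # lexical tie-break
        )

    picked, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        if used + m["tokens"] <= token_budget:
            picked.append(m)
            used += m["tokens"]
    return picked
```

The interesting design choice is where this runs: the trimming happens client-side in the plugin, after parallel recalls from the user and agent scopes, so the server never needs to know the caller's token budget.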

Comparison with Our System

| Dimension | OpenViking | Commonplace |
| --- | --- | --- |
| Storage substrate | AGFS (Go, local/S3/memory backends) + VikingDB vector index, reached through HTTP service | Markdown files in git, readable by any tool |
| Filesystem semantics | Virtual viking:// URIs with account-namespace prefixes | Actual POSIX paths, shell-native |
| Tiered loading | L0 abstract (~100 tok) / L1 overview (~2k tok) / L2 full content; auto-generated at write, enum-enforced in index schema | Link context phrase (~20 tok) / description (~50 tok) / note body; human-authored, convention-enforced via WRITING.md |
| URI scopes / indexed content types | resources, user, agent, session scopes; indexed content types are Resource / Memory / Skill, with Memory subdivided into eight categories at extraction time | text / note / structured-claim / adr / index / instruction-note (lifecycle with seedling → current) |
| Retrieval | Intent analyzer → typed queries → hierarchical recursive search with 0.5/0.5 parent/child score propagation, hotness blend, rerank optional | Agent navigation via links, descriptions, and curated indexes |
| Link semantics | .relations.json with reason strings, aggregated bottom-up | Typed link markup (extends, grounds, contradicts, refines) with articulated context phrases |
| Multi-tenancy | Account/user/agent spaces in the URI; RequestContext threaded through every call | Single repository, no tenancy model |
| Write path | commit_async with PathLock + background task + TaskRecord | Human or agent edits, followed by commonplace-validate |
| Session management | First-class: session directory, archived JSONL, compression indices, background extraction | None — sessions are not modeled |
| Update discipline | LLM-gated SKIP/CREATE/NONE + per-memory MERGE/DELETE | Human review, status transitions, review-system warnings |
| Observability | RetrievalStatsCollector, Prometheus observer, retrieval trajectory logs | Git history, validate output, review bundles |
| Integrations | OpenClaw plugin (TS), Vikingbot (Python, with MCP client), Claude Code / OpenCode memory plugins (TS) | Directly edited by an agent in the session |

Where OpenViking is ahead. Three-tier representation is a storage precondition rather than a quality checklist — a note cannot enter the index without all three levels populated. Session management is production-grade: PathLock-guarded archival, TaskRecord for async extraction, re-enqueue counters on the vector write queue, tool/skill call statistics carried as first-class fields on candidate memories. Multi-tenancy is baked into URIs, so a slip in filter construction cannot leak data across accounts. RetrievalStatsCollector and the Prometheus observer give operational visibility that commonplace has no equivalent to. The OpenClaw plugin's memory-ranking code — leaf-vs-directory preference, lexical-overlap tie-breaking, preference/event type nudges — is exactly the kind of retrieval heuristic that our pure agent-navigation model offloads to the LLM.

Where commonplace is ahead. Substrate. OpenViking's filesystem is virtual: ls, find, grep are server RPCs, so git, standard shell tools, and any non-OpenViking editor are outside the loop. Our methodology rides on actual files — a human can rg the repository, diff two branches, or operate via commonplace-* CLIs with no server running. Link semantics: OpenViking's .relations.json stores reason strings, but there is no machinery that uses the why of a relation for navigation or quality gating; our typed links (extends/grounds/contradicts/refines) and articulated context phrases encode the cognitive step a reader should take. Lifecycle: OpenViking's memory items are created-once, merged-on-dedup, or deleted — there is no seedling→current→superseded maturation or type promotion (text→note→structured-claim). Review as a methodology: the review-system with semantic/structural bundles has no analog; OpenViking's quality signal comes from retrieval stats and hotness, not from a curator's judgment.

The deepest convergence is still the tiered loading pattern. OpenViking arrived at a three-layer representation for the same reason we did — making the follow-or-skip decision cheap — but baked it into storage. We get the same effect through writing convention and link discipline. The fact that two systems with very different architectures (service-backed virtual filesystem, git-backed markdown) both converged on the same three layers suggests the pattern is structural to context engineering, not incidental.

The deepest divergence is what "session" means. OpenViking treats sessions as first-class, archivable, retrievable URIs. Our missing workshop layer is basically an admission that we don't have this yet. OpenViking's commit_async + background extraction is the closest thing among reviewed systems to a production implementation of workshop-to-library promotion — the archive is the workshop, the extracted memories are the library, and TaskRecord is the lifecycle handle.

Borrowable Ideas

Make L0/L1/L2 a type-system invariant, not a writing checklist. Our notes have descriptions (close to L1) and link phrases (close to L0), but the body of a note can be rewritten, moved, or deleted independently of those fields, and nothing forces the three to stay in sync. OpenViking's model — CandidateMemory(abstract, overview, content) is the only way a memory enters storage — suggests a tighter invariant. Concretely: commonplace-validate could check that every note has a high-quality description (length, specificity), that every inbound link to it carries a context phrase, and that the description is consistent with the first paragraph of the body. Ready to borrow now — incremental hardening of existing fields, no new infrastructure.
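
A sketch of what that commonplace-validate check could look like; the thresholds and the first-paragraph vocabulary-overlap proxy are made up for illustration:

```python
def check_note_tiers(description: str, body: str, inbound_context_phrases: list) -> list:
    """Return warnings when the L0/L1/L2 analogs drift out of sync."""
    warnings = []
    if len(description.split()) < 8:  # arbitrary illustrative threshold
        warnings.append("description too short to serve as an overview tier")
    for phrase in inbound_context_phrases:
        if not phrase.strip():
            warnings.append("inbound link missing a context phrase (L0 analog)")
    # Crude consistency proxy: the description should share vocabulary
    # with the body's first paragraph.
    first_para = body.strip().split("\n\n")[0].lower()
    desc_terms = {w for w in description.lower().split() if len(w) > 4}
    if desc_terms and not any(w in first_para for w in desc_terms):
        warnings.append("description inconsistent with the body's first paragraph")
    return warnings
```

Even this crude version enforces the direction of the invariant: a note whose three representations have drifted apart gets flagged at validation time instead of silently degrading retrieval.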

Stats-enriched candidate classes for operational knowledge. ToolSkillCandidateMemory extends CandidateMemory with call_time, success_time, duration_ms, best_for, common_failures, recommendation. When the system extracts a "tool usage" memory, those fields are not optional prose — they are the schema. We could do the equivalent for structured-claim notes about recurring KB operations: a subtype with required "preconditions," "observed failure modes," "success heuristic" fields. Needs a use case first — would only pay off if we find ourselves writing the same structure repeatedly.

Two-phase commit pattern for workshop→library promotion. The split between commit_async phase 1 (lock-guarded snapshot, guaranteed durable) and phase 2 (background extraction, retry-able, tracked by a TaskRecord) is a cleaner workshop-layer design than "the agent occasionally writes library notes." If we ever implement the workshop layer, the pattern is: (1) an atomic snapshot of the workshop-state under a filesystem lock, (2) background promotion to library artifacts, (3) a tracker entry that lets a follow-up session resume or retry. Ready to borrow conceptually — the primitives (filesystem locks, git, task records) already exist.

Retrieval trajectory logging even without automated retrieval. OpenViking's RetrievalStatsCollector records per-query latency, score distribution, rerank usage. The observer pattern under storage/observers/ is lightweight — a protocol with hook points, attached to the storage layer. An equivalent for commonplace would be a session-level log of "which notes the agent read in what order, and what was loaded but unused" — recoverable from the Claude Code transcript but not currently surfaced. Useful the moment we need to debug why an agent did not find a note.

URI-level account namespacing as a data-isolation pattern. Even without multi-user commonplace, the insight that scope belongs in the path rather than in a filter is valuable. If we ever support per-domain knowledge spaces (e.g., kb-company/ alongside kb-personal/), encoding the scope into the URI prefix — rather than tagging notes and filtering — means mistakes become impossible rather than just caught. Needs a use case first — currently single-scope.
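
The "impossible by construction" property can be sketched with plain path resolution; kb-company/ and kb-personal/ are the hypothetical scopes from the paragraph above:

```python
from pathlib import Path

def scoped_path(root: str, scope: str, relative: str) -> Path:
    """Resolve a path inside one scope; traversal out of it fails loudly."""
    base = (Path(root) / scope).resolve()
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise PermissionError(f"{relative!r} escapes scope {scope!r}")
    return candidate
```

Because every lookup goes through the scope prefix, a mis-built filter cannot leak across scopes; the worst a bug can do is raise, which is the property the URI-namespace design buys OpenViking.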

LLM-gated MERGE/DELETE actions on dedup, not just SKIP/CREATE. When commonplace finally automates any form of note consolidation (duplicate notes, superseded notes), the DedupResult.actions: List[ExistingMemoryAction] pattern — a candidate-level decision plus per-existing-target actions, normalized against pathological combinations — is a useful schema. Not ready to borrow — we do not currently automate dedup; but if we do, copy this shape rather than inventing one.

Curiosity Pass

Does the hierarchy actually matter for retrieval quality, or does the LoCoMo win come from tiered loading? The reported LoCoMo advantage (52% vs 44.5% task completion at 1/12 the input tokens) sits next to a hierarchical retriever whose score propagation coefficient is 0.5/0.5 and whose convergence stop is conservative (3 rounds). But the token-cost win is mostly explained by loading L0 abstracts instead of full documents — even a pure flat vector search with the same L0/L1/L2 tiers would realize most of the saving. The hierarchical machinery (global search → directory queue → recursive drill-down) is only doing work when directory proximity correlates with semantic proximity, which is true for well-structured resources but not obviously true for organically accumulated session memories. The repo has openviking/eval/ragas/ and newly added LoCoMo scripts against Mem0 and Supermemory, but no published A/B between their own retriever and a flat-vector variant on the same tiers. What would this mechanism achieve even if it worked perfectly? Hierarchical retrieval can outperform flat search when a query benefits from inheriting context from an ancestor directory (e.g., "the OAuth doc under the auth section under the API docs"). For preferences or events mined from sessions, ancestor context is weaker, so the marginal gain is smaller. The claim that hierarchy is what makes OpenViking good is probably overstated; tiered loading is what makes OpenViking cheap.

The "filesystem paradigm" is a service API dressed in path notation. viking://resources/... looks like a filesystem and the agent interacts via ls/tree/stat/mkdir, but those are HTTP endpoints on openviking-server. ov grep is a server-side scan, not a client-side POSIX tool. You cannot cat a viking:// URI, diff two with git, open one in vim, or mount the tree with FUSE unless you install the optional bot-fuse extra. The "manage agent memory like managing local files" framing is about the mental model, not the tooling surface. This matters because most of the argument for files-not-database is precisely about tooling interoperability — if ls is an API, ls is no longer the thing that made files good. The filesystem metaphor relocates database records into path-looking namespaces without transforming the access model. The metaphor may actually reduce clarity: a user who assumes real filesystem semantics will be surprised when mv across accounts is forbidden by URI construction, or when grep does not see uncommitted writes in flight through the write queue.

Eight categories claim structure the mechanism does not enforce. MemoryCategory has eight values, but category assignment is whatever the extraction LLM writes in the JSON response, defaulting to PATTERNS on parse failure. There is no schema constraint that makes a "preference" structurally different from a "pattern" once inserted — only the directory path differs. Compare with our type system: a structured-claim requires Evidence / Reasoning / Caveats sections that commonplace-validate can check. OpenViking's eight categories are labels, not constraints. The one exception is ToolSkillCandidateMemory, where the extra statistics fields are a schema discontinuity — the tools/skills categories really are different from the others. This is strong evidence for a design principle we already follow: types earn their keep when they have structural obligations, not just naming conventions.

Hotness blending has a 7-day half-life baked in. DEFAULT_HALF_LIFE_DAYS: float = 7.0. With HOTNESS_ALPHA = 0.2, the blend is 0.8 * semantic + 0.2 * hotness, where hotness is bounded in [0, 1]. A memory accessed many times, most recently a week ago, scores ~1.0 on frequency times exp(-ln2) = 0.5 on recency, so ~0.5 hotness and at most ~0.1 added to the blend; a never-accessed memory from today scores sigmoid(log1p(0)) = 0.5 on frequency times 1.0 on recency, also ~0.5 hotness. The scale is modest enough that the blend rarely flips rankings — but it is enough to decay a memory that has not been retrieved in a month to near zero, which biases retrieval toward recent activity. For a KB methodology, the analog would be "weight recent edits and recent reads above older ones," which is almost the opposite of what long-lived knowledge wants. Be careful what to borrow here.

Self-evolution is still session-extraction; memories are rarely revised. The documentation sells "the agent gets smarter with use" as self-evolution. The actual mechanism: at session commit, extract new memories, dedup against existing, merge conflicting ones. The MERGE action does rewrite existing memory content, so this is genuinely more than extract-and-append — but the merge is per-conflict and driven by the current session's candidate, not by cross-session synthesis, pattern detection across many sessions, or contradiction-driven revision. Events and cases are marked "no update." Preferences and entities are mergeable. This places OpenViking in the same slot as the comparative review identified for most systems: everyone automates extraction, almost nobody automates synthesis. The newly added LoCoMo benchmark does not change this — it measures recall-from-archive, not quality-of-synthesis.

Trace-derived learning placement. Trace source: OpenViking owns a structured message schema (openviking.message with ToolPart for tool calls; Message.to_jsonl() for archive format). A session's lifetime trace consists of user/assistant messages plus tool invocations with prompt_tokens, completion_tokens, duration_ms, and success/error status collected in tool_stats_map/skill_stats_map. The trigger boundary is session.commit_async: every commit snapshots the live message list, archives it as archive_NNN/messages.jsonl, and fires a single extraction task. Extraction: one LLM call proposes a list of CandidateMemory items in eight categories (profile/preferences/entities/events/cases/patterns/tools/skills), with tools/skills specializing into ToolSkillCandidateMemory that carry runtime statistics. The VLM proposes candidate memories; the deduplicator then decides SKIP/CREATE/NONE plus per-existing MERGE/DELETE. A parsed candidate list is accepted, but storage promotion still flows through dedup rather than landing directly in the index. Promotion target: inspectable artifacts only (markdown+JSON blobs under viking://user/<space>/memories/ or viking://agent/<space>/memories/ with L0/L1/L2 representations in the vector index). No weights, no compiled runtime state. Scope: cross-task within a namespace, but single-namespace scoped — a memory extracted in session A can be retrieved in session B for the same account, but never crosses account boundaries. Timing: online-at-commit, background via asyncio.create_task, with a TaskRecord for tracking and FailedPreconditionError blocking further commits if extraction failed irrecoverably. On the survey's axes, OpenViking is unchanged from the prior placement: axis 1 is service-owned trace backend (OpenViking owns the message schema and accepts structured traffic via HTTP, separating archive from extraction), and axis 2 is symbolic artifact learning (eight typed categories, no weight promotion). 
The new ToolSkillCandidateMemory with enforced statistics fields strengthens the "typed durable observations" subtype — OpenViking is now further from ClawVault's weekly-reflection pattern and closer to a structured-record system, but not enough to warrant a new subtype.

What to Watch

  • Whether feat/account-namespace-shared-sessions lands on main and what cross-account memory sharing rules it introduces — the URI-namespace isolation is the system's cleanest guarantee, and anything that lets a memory cross it deserves scrutiny.
  • Whether the newly added benchmark scripts (LoCoMo vs Mem0, Supermemory, OpenClaw) are run with ablations over tiered loading alone vs tiered loading + hierarchical retrieval. The current published numbers don't separate those two effects.
  • Whether the MCP tool surface is reintroduced on the server side or stays relegated to client-side wrappers (OpenClaw plugin + Vikingbot). Without a first-party MCP server, "OpenViking as MCP memory backend" is an integrator task.
  • Whether the MERGE action on dedup grows into genuine synthesis — revising a memory based on contradiction with a later memory, rather than conflict with a same-session candidate. Merging across sessions is where the "automating synthesis" wall would move.
  • Whether the filesystem metaphor ever gets actual-file backing (the bot-fuse optional extra is a hint, but not a default). If they ship FUSE or a git-backed mode by default, the tooling-interoperability argument closes.
  • Whether hotness blending's 7-day half-life gets tuned per content type — a session event probably should decay faster than a user preference, but the current constant treats them the same.

Relevant Notes:

  • files-not-database — contrasts: OpenViking's filesystem is a service API dressed in path notation, not actual files; its tooling is outside the loop for git, grep, and text editors, which is exactly the case this note argues against
  • context-efficiency-is-the-central-design-concern-in-agent-systems — exemplifies: enum-enforced L0/L1/L2 at the storage schema level is the most concrete implementation of context efficiency as an invariant
  • agents-navigate-by-deciding-what-to-read-next — extends: L0 abstracts serve the follow-or-skip pointer role; the three tiers formalize progressive disclosure as a storage precondition
  • three-space-agent-memory-echoes-tulvings-taxonomy-but-the-analogy-may-be-decorative — partially maps: Resource/Memory/Skill roughly corresponds to semantic/episodic/procedural, but the eight-category Memory subdivision mixes semantic preferences with episodic events inside one space
  • automating-kb-learning-is-an-open-problem — exemplifies: extraction is industrialized (8 categories, per-category dedup, MERGE/DELETE actions), but synthesis across sessions remains unsolved even here
  • a-functioning-kb-needs-a-workshop-layer-not-just-a-library — exemplifies: session.commit_async is the cleanest production implementation of two-phase workshop-to-library promotion among reviewed systems
  • distillation — exemplifies: L0/L1/L2 generation is distillation at three resolution levels, produced in one LLM call at write time rather than at read time
  • agentic-memory-systems-comparative-review — extends: OpenViking holds the developer-managed service + virtual-filesystem position on the agency axis; the new multi-tenant URI scheme sharpens the distinction from single-user vaults
  • cognee — sibling: both use poly-layer storage (content substrate + vector index) with LLM-heavy ingestion; OpenViking's filesystem metaphor vs cognee's pipeline metaphor are presentations of similar "invest at ingestion to make retrieval cheap" infrastructure
  • crewai-memory — sibling: both handle session-derived memory with LLM deduplication; CrewAI uses scope trees for namespace separation, OpenViking encodes the namespace directly in URIs
  • trace-derived-learning-techniques-in-related-systems — axis placement: service-owned trace backend (axis 1), symbolic artifact learning (axis 2), typed durable observations subtype with a tool/skill statistics specialization