OpenViking

Type: note · Status: current · Tags: related-systems

A context database for AI agents built by Volcengine (ByteDance's cloud division), open-sourced March 2026 under Apache 2.0. OpenViking replaces flat vector storage with a virtual filesystem (viking:// protocol) that organizes memory, resources, and skills into a hierarchical directory structure with tiered content loading (L0/L1/L2). The system's production pedigree comes from ByteDance's internal vector search infrastructure powering TikTok since 2019. At launch it reached #1 trending on GitHub with 13k+ stars.

Repository: https://github.com/volcengine/OpenViking

Core Ideas

Filesystem paradigm as the unifying abstraction. Everything lives under a viking:// URI scheme with four top-level scopes: resources/ (user-added knowledge), user/ (user memories and preferences), agent/ (skills, instructions, learned patterns), and session/ (conversation state). Agents interact via filesystem commands — ls, read, find, grep, glob, tree — rather than database queries. Each directory contains hidden files (.abstract.md, .overview.md, .relations.json) that provide progressive detail layers. The insight is that filesystem operations are deterministic and observable — you can trace exactly which paths were accessed — while vector search is a black box. The trade-off: the filesystem metaphor makes context management intuitive for developers but the directories are virtual (backed by AGFS + vector index), so the "filesystem" is a presentation layer over a database, not actual files on disk. The metaphor promises tool interoperability (any editor, git, grep) that the implementation doesn't deliver — you need the OpenViking client or HTTP API.

L0/L1/L2 tiered context loading as a native storage primitive. Every directory automatically generates three resolution levels: L0 (.abstract.md, ~100 tokens) for quick relevance filtering, L1 (.overview.md, ~2k tokens) for understanding scope and deciding whether to drill deeper, L2 (original files) for full content loaded only on demand. Generation is bottom-up: leaf file summaries feed into parent overviews, parent abstracts aggregate child abstracts. This is LLM-automated progressive disclosure baked into the storage layer — every write triggers async semantic processing that builds the summary hierarchy. The reported cost reduction is dramatic: 550 tokens average per retrieval vs full content loading, claimed 95% reduction. This is the single most borrowable idea in the system, and the first production implementation that makes progressive disclosure a structural property of storage rather than an application-layer concern.
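The bottom-up generation can be sketched as a post-order traversal: leaves summarize their own content, parents aggregate child abstracts. This is a minimal illustration only — the `Node` structure, word budgets, and stub summarizer are assumptions standing in for OpenViking's async LLM pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    content: str = ""                 # L2: original file content (leaves only)
    children: list["Node"] = field(default_factory=list)
    abstract: str = ""                # L0 (~100 tokens in OpenViking)
    overview: str = ""                # L1 (~2k tokens)

def summarize(text: str, budget: int) -> str:
    """Stand-in for the LLM summarizer: truncate to a word budget."""
    return " ".join(text.split()[:budget])

def build_tiers(node: Node) -> None:
    """Post-order traversal: children first, then aggregate upward."""
    for child in node.children:
        build_tiers(child)
    if node.children:
        # Parent overview aggregates child abstracts; the parent abstract
        # then compresses that overview another level.
        node.overview = " | ".join(c.abstract for c in node.children)
        node.abstract = summarize(node.overview, budget=20)
    else:
        node.overview = summarize(node.content, budget=200)
        node.abstract = summarize(node.content, budget=20)
```

The key property this preserves is that every write only needs to regenerate summaries along one root-to-leaf path, not the whole tree.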

Hierarchical recursive retrieval. Rather than flat vector search, retrieval follows a directory-aware algorithm: (1) intent analysis generates 0-5 typed queries from conversation context, (2) global vector search locates high-scoring starting directories, (3) recursive drill-down searches within those directories using a priority queue with score propagation (final_score = 0.5 * child_score + 0.5 * parent_score), (4) convergence detection stops after 3 rounds of unchanged top-k results. Source code (hierarchical_retriever.py:313-457) confirms the algorithm. Optional rerank and hotness scoring (0.2 * hotness + 0.8 * semantic) blend usage frequency into results. The question is whether this hierarchical traversal measurably outperforms flat vector search — the repo has evaluation infrastructure (openviking/eval/ragas/) but no published comparisons showing the hierarchy's marginal value.
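Steps (2)-(3) can be sketched as a priority-queue drill-down with the stated score propagation. The data layout here is hypothetical; in the real system child scores come from vector search within each directory, and convergence detection is omitted.

```python
import heapq

def drill_down(scores: dict[str, float], children: dict[str, list[str]],
               roots: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    # Max-heap via negated scores, seeded with high-scoring start directories.
    heap = [(-scores[r], r, scores[r]) for r in roots]
    heapq.heapify(heap)
    leaves: list[tuple[float, str]] = []
    while heap:
        _, path, propagated = heapq.heappop(heap)
        kids = children.get(path, [])
        if not kids:                  # leaf file: record its propagated score
            leaves.append((propagated, path))
            continue
        for kid in kids:
            # final_score = 0.5 * child_score + 0.5 * parent_score
            blended = 0.5 * scores[kid] + 0.5 * propagated
            heapq.heappush(heap, (-blended, kid, blended))
    return sorted(leaves, reverse=True)[:top_k]
```

The blend means a leaf's ranking depends on the whole ancestor chain, which is exactly the attenuation property examined in the Curiosity Pass below.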

Three context types with distinct lifecycles. Resources (user-added, static knowledge), Memory (agent-extracted, dynamically updated), and Skill (callable capabilities, static definitions). This maps partially to the three-space separation: resources ≈ semantic/knowledge space, agent memories ≈ procedural space (cases, patterns), user memories ≈ episodic/self space. But the mapping is imperfect — user memories mix episodic events with semantic preferences, and agent memories mix procedural patterns with what are really semantic generalizations. The type system is pragmatic rather than theoretically grounded.

Session-driven memory extraction with LLM deduplication. On session commit, the system: (1) archives messages with structured summaries, (2) extracts memories into 8 categories (profile, preferences, entities, events for user; cases, patterns, tools, skills for agent — documentation claims 6, code has 8), (3) runs vector pre-filtering to find similar existing memories, (4) uses an LLM to decide per candidate: SKIP (duplicate), CREATE (new), or NONE (resolve against existing) with per-item MERGE or DELETE actions. The dedup pipeline (memory_deduplicator.py) is well-engineered with normalization rules (e.g., CREATE+MERGE auto-normalized to NONE). This is more sophisticated than Mem0's CRUD decisions and comparable to CrewAI's consolidation — but like both, it handles deduplication, not synthesis.
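The normalization step can be illustrated with a toy rule table. The function below mirrors the documented SKIP/CREATE/NONE vocabulary and the CREATE+MERGE-to-NONE rule, but is a hypothetical sketch, not memory_deduplicator.py's actual logic.

```python
def normalize(decision: str, actions: list[str]) -> tuple[str, list[str]]:
    # Hypothetical sketch of dedup decision normalization.
    # CREATE with MERGE actions is contradictory: merging means resolving
    # against existing memories, so the decision is coerced to NONE.
    if decision == "CREATE" and "MERGE" in actions:
        return "NONE", actions
    # SKIP marks a duplicate; per-item actions are meaningless and dropped.
    if decision == "SKIP":
        return "SKIP", []
    return decision, actions
```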

Dual-layer storage separating content from index. AGFS (Agent File System, implemented in Go with local/S3/memory backends) stores actual content. A vector index (local, HTTP, or Volcengine VikingDB) stores URIs, embeddings, and metadata but no file content. All reads go through AGFS; the vector index is purely for search. Deletes and moves automatically sync between layers. This is a clean separation of concerns, but makes OpenViking a server dependency rather than a tooling-agnostic substrate.
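The contract between the two layers can be sketched with in-memory stand-ins. `DualStore` is illustrative only (AGFS is a Go service and the index a vector database); what it shows is the invariant that reads never touch the index and mutations sync both layers.

```python
class DualStore:
    """Sketch of the dual-layer contract: content store + search-only index."""

    def __init__(self):
        self.content = {}     # AGFS role: uri -> bytes
        self.index = {}       # vector-index role: uri -> embedding/metadata

    def write(self, uri: str, data: bytes, embedding: list[float]) -> None:
        self.content[uri] = data
        self.index[uri] = embedding

    def read(self, uri: str) -> bytes:
        return self.content[uri]        # never served from the index

    def delete(self, uri: str) -> None:
        self.content.pop(uri, None)
        self.index.pop(uri, None)       # layers stay in sync

    def move(self, src: str, dst: str) -> None:
        self.content[dst] = self.content.pop(src)
        self.index[dst] = self.index.pop(src)
```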

MCP server for universal agent integration. Exposes search, add_memory, add_resource, list_memories, list_resources, and get_status as MCP tools. Supports HTTP/SSE (recommended, multi-session safe) and stdio (single-session only, prone to storage contention). Configuration documented for Claude Code, Claude Desktop, Cursor, and OpenClaw. This makes OpenViking usable as a memory backend for any MCP-compatible agent — the most practical integration story among reviewed systems.
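The six-tool surface is small enough to sketch as a dispatcher. This is a hypothetical illustration of the minimal API shape, not OpenViking's server code; the backing lists stand in for the real storage layers.

```python
from typing import Callable

class MemoryBackend:
    """Toy backend exposing the six documented tool names (illustrative only)."""

    def __init__(self):
        self.memories: list[str] = []
        self.resources: list[str] = []

    def tools(self) -> dict[str, Callable]:
        # Maps each documented MCP tool name to a handler.
        return {
            "search": lambda q: [x for x in self.memories + self.resources
                                 if q in x],
            "add_memory": self.memories.append,
            "add_resource": self.resources.append,
            "list_memories": lambda: list(self.memories),
            "list_resources": lambda: list(self.resources),
            "get_status": lambda: {"memories": len(self.memories),
                                   "resources": len(self.resources)},
        }
```

The point of the small surface area: any MCP client that can call these six tools gets a complete read/write memory backend without knowing anything about AGFS or the index.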

Comparison with Our System

| Dimension | OpenViking | Commonplace |
| --- | --- | --- |
| Storage | Virtual filesystem (AGFS + vector index), server-dependent | Actual markdown files in git, tool-agnostic |
| Knowledge unit | File/directory with auto-generated L0/L1/L2 layers | Typed note with frontmatter, prose body, semantic links |
| Progressive disclosure | Native: L0 abstract → L1 overview → L2 full content, auto-generated, guaranteed | Manual: link context phrase (~20 tok) → description (~50 tok) → full note; human-crafted, convention-enforced |
| Context types | Resource / Memory / Skill (3 types) | text / note / structured-claim / adr / index (5+ types with lifecycle) |
| Retrieval | Hierarchical recursive retrieval with intent analysis + vector + rerank | Agent-driven navigation via links, descriptions, search, indexes |
| Link structure | .relations.json per directory with reason strings | Markdown links with articulated relationship semantics (extends, grounds, contradicts) |
| Knowledge evolution | Memory dedup (skip/create/merge) + event/case accumulation | Status transitions (seedling → current → superseded), type promotion, link refinement |
| Session management | Built-in: message recording, compression, memory extraction | None — no session infrastructure |
| Learning theory | None — pragmatic engineering | Constraining, distillation, codification framework |
| Agency model | Developer-managed service | Human+agent collaborative |
| Context efficiency | Structural (L0/L1/L2 baked into storage) | Navigational (descriptions as retrieval filters, follow-on-read) |
| Observability | Retrieval trajectory visualization, per-query stats | Git history, link audit, /validate |
| Multi-agent | MCP server, HTTP API, multiple deployment modes | Single-agent per session |

Where OpenViking is stronger. The L0/L1/L2 tiered loading is auto-generated and guaranteed — every directory gets all three tiers at write time, while our equivalent (link phrases, descriptions, full text) depends on authoring discipline and may be absent or low-quality. Session management with automatic memory extraction is production infrastructure we lack entirely. The MCP server makes it immediately usable by any agent. Retrieval observability (visualized trajectories, per-query statistics via RetrievalStatsCollector) exceeds what git history provides for debugging navigation failures. Multi-format ingestion (PDF, HTML, code with AST skeleton extraction via tree-sitter, images, video) is far broader than our markdown-only substrate.

Where commonplace is stronger. Knowledge has a lifecycle — notes mature through status transitions, links articulate why things relate (not just that they do), descriptions are retrieval filters crafted for agents deciding what to load. OpenViking's .relations.json stores reason strings but these are user-provided annotations, not the elaborative encoding that produces navigable connections. Our types (text → note → structured-claim → adr) provide progressive formalization — knowledge acquires structure as understanding develops. OpenViking has no maturation path: a memory is either present or deduplicated. Most critically, our system is actual files — readable by any tool, diffable in git, browsable on GitHub — while OpenViking's "filesystem" is a virtual layer requiring its own client.

The deepest divergence is what "filesystem" means. OpenViking uses the filesystem as a metaphor — a virtual namespace backed by AGFS and vector databases, accessed through a proprietary API. Our system uses the filesystem as the actual substrate — markdown files in git, readable by cat, editable in any editor, versionable by any tool. Both claim the benefits of filesystem-based knowledge management (structure, observability, developer familiarity), but only one delivers tool-agnostic interoperability. This matters because the argument for files over databases isn't just about simplicity — it's about avoiding the vendor lock-in that any service layer introduces.

The convergence is more revealing than the divergence. OpenViking independently arrives at progressive disclosure (L0/L1/L2 ≈ description/overview/full-content), hierarchical organization (directory tree ≈ area indexes), type separation (Resource/Memory/Skill ≈ source/note/instruction), and observable retrieval. A system built for production scale at ByteDance converges on the same structural patterns that a methodology-focused KB derives from learning theory. The patterns are robust; the implementations differ in whether structure is maintained by conventions and human judgment (us) or by automated pipelines and LLM processing (them).

Borrowable Ideas

L0/L1/L2 as a storage design pattern. The idea that every knowledge unit should have three resolution levels is powerful — and we already have it. Our link context phrases (~20 tokens, enough to decide whether to follow) serve as L0, description fields (~50 tokens, enough to decide whether to read the full note) serve as L1, and the note body is L2. The token budgets differ (theirs: ~100 / ~2k / unlimited; ours: ~20 / ~50 / unlimited) and ours is human-crafted rather than auto-generated. What OpenViking adds is making the pattern structural — auto-generated at write time, guaranteed to exist for every directory. Our system relies on convention and discipline (the WRITING.md checklist) to ensure descriptions and link phrases are present and high-quality. The borrowable insight is not the tiered pattern itself but the guarantee: could /validate enforce that every note has both a quality description (L1) and that every link to it carries a context phrase (L0)?
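The enforcement question could be prototyped as a lint pass. The frontmatter key, link syntax, and word-count thresholds below are assumptions about this KB's conventions — a sketch of what the check might look like, not an existing /validate implementation.

```python
import re

def validate_note(text: str) -> list[str]:
    """Lint a note for the two guarantees: an L1 description and
    L0 context phrases around every outbound link (assumed conventions)."""
    problems = []
    # Assumed convention: a `description:` frontmatter line is the L1 tier.
    if not re.search(r"^description:\s*\S+", text, re.MULTILINE):
        problems.append("missing L1 description")
    for line in text.splitlines():
        for m in re.finditer(r"\[[^\]]+\]\([^)]+\.md\)", line):
            # Assumed convention: the prose around a link on its line is
            # the L0 context phrase; require at least a few words of it.
            context = (line[:m.start()] + line[m.end():]).strip(" -•")
            if len(context.split()) < 3:
                problems.append(f"bare link without context phrase: {m.group(0)}")
    return problems
```

A check like this turns the tiering from a discipline into a guarantee, which is precisely what OpenViking gets for free from auto-generation.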

Bottom-up summary aggregation for indexes. OpenViking aggregates child abstracts into parent overviews. Our area indexes are manually curated with context phrases. A /summarize-index operation that reads descriptions from all linked notes and generates a structured overview would reduce the manual burden of index maintenance. Needs a use case first — our index entries carry editorial context phrases that auto-aggregation would lose.

Retrieval observability. OpenViking's RetrievalStatsCollector tracks per-query metrics (latency, score distribution, zero-result rate, rerank usage). If we ever add automated retrieval beyond agent-driven navigation, this instrumentation pattern is ready to borrow. The idea of preserving retrieval trajectories (which directories were accessed, in what order, why) is useful even for manual navigation — a /trace command that records the agent's navigation path during a task would make navigation failures debuggable. Ready to borrow conceptually.

Session commit as a memory extraction trigger. The pattern of accumulating observations during a session, then extracting durable knowledge on commit, is what our missing workshop layer needs. OpenViking's 8-category extraction (profile, preferences, entities, events, cases, patterns, tools, skills) is a concrete schema for what to extract. Ready to borrow — the categories are a useful starting template for workshop-to-library promotion.

MCP as a universal memory interface. Exposing KB operations as MCP tools would make commonplace usable as a knowledge backend for agents outside Claude Code. The OpenViking tool set (search, add_memory, add_resource, list_memories, list_resources) is a minimal viable surface area. Needs a use case first — currently single-runtime.

Curiosity Pass

Does the hierarchy actually matter for retrieval quality? The hierarchical recursive retriever is the system's most architecturally ambitious component. But the score propagation formula (0.5 * child + 0.5 * parent) means parent score dilutes child score by half at every level. For a deeply nested resource (e.g., viking://resources/project/docs/api/auth/oauth.md), the retrieval score is substantially attenuated by hierarchical weighting even when the leaf content is a perfect match. The convergence detection (stop after 3 rounds of unchanged top-k) is conservative and may miss relevant content in unexplored branches. The find() API bypasses intent analysis entirely — it's direct vector search. If most real usage goes through find() rather than search(), the hierarchy's value is primarily organizational (human-readable structure) rather than retrieval-enhancing. What could this mechanism achieve even if it works perfectly? Hierarchical retrieval can only outperform flat search when the query benefits from structural context — when "nearby" in the directory tree correlates with "related" in content. For organically structured content (a well-organized codebase), this holds. For heterogeneous memory accumulated over time, directory proximity may not track semantic proximity.
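The attenuation is easy to quantify: under the 0.5/0.5 blend, a leaf's final score is the midpoint of its own score and the propagated ancestor score, so even a perfect match under mediocre ancestors is capped well below 1.0.

```python
def propagated_score(ancestor_scores: list[float], leaf_score: float) -> float:
    """Apply final_score = 0.5 * child + 0.5 * parent down an ancestor chain."""
    s = ancestor_scores[0]
    for a in ancestor_scores[1:]:
        s = 0.5 * a + 0.5 * s          # each level blends with the running score
    return 0.5 * leaf_score + 0.5 * s

# Four ancestor levels scoring 0.4 each (as in the nested oauth.md example):
# the perfect-match leaf (1.0) surfaces at only 0.7, below a flat-search
# sibling that scores 0.75 on its own content.
print(propagated_score([0.4, 0.4, 0.4, 0.4], 1.0))  # → 0.7
```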

The "filesystem paradigm" is a presentation layer, not a paradigm. OpenViking's filesystem is virtual — backed by AGFS (a Go service) and a vector database. The viking:// URIs look like paths but resolve through an API, not an actual filesystem. The filesystem operations (ls, read, mkdir, mv) are service methods, not POSIX calls. You cannot cat a viking:// file, grep the workspace with standard tools, or diff two states with git. The "manage agent memory like managing local files" claim is about the mental model, not the tooling. This matters because the real benefits of filesystem-based knowledge management — tool-agnostic access, free versioning, universal browsing — require actual files. The mechanism relocates database records into path-like namespaces without transforming the access model.

Memory extraction categories claim structure that the mechanism doesn't enforce. The 8 memory categories (profile, preferences, entities, events, cases, patterns, tools, skills) are labels applied by LLM extraction. Nothing prevents the LLM from classifying a user preference as a pattern, or an event as a case. The categories are suggestions to the extractor, not structural constraints. Compare with our type system where structured-claim requires Evidence/Reasoning/Caveats sections — the structure is verifiable. OpenViking's categories are naming, not constraining.

The 95% cost reduction claim deserves scrutiny. The claim that L0/L1/L2 reduces token consumption by 95% (550 tokens average vs full content) compares against a strawman: loading all context upfront with no filtering. Any system with a search step (including vanilla RAG) loads only relevant content, not everything. The fair comparison is OpenViking's tiered loading vs traditional RAG retrieval with the same number of chunks. The LoCoMo evaluation (README benchmarks) shows OpenViking + OpenClaw achieving 52% task completion with 4.2M input tokens vs LanceDB's 44.5% with 51.6M tokens — a genuine 12x cost reduction with better accuracy. But this benchmark is specifically tuned for multi-session dialogue recall, where tiered loading has maximum advantage. The cost reduction on single-turn QA would be smaller.

Self-evolution is session-extracted memory accumulation, not evolution. The documentation promises that agents "get smarter with use" through a "self-evolution mechanism." The actual mechanism: at session end, the LLM extracts memories from the conversation and writes them to the memory directories. Subsequent sessions can retrieve these memories. This is automated note-taking, not evolution. Evolution would mean: existing memories update their content in response to new evidence, patterns strengthen or weaken based on outcomes, contradictions are detected and resolved. OpenViking's memories are immutable once extracted (events and cases are explicitly marked "no update"). Preferences and entities are appendable but not revisable. The comparative review identified this exact gap: "everyone automates extraction, almost nobody automates synthesis."

What to Watch

  • Whether the hierarchical retrieval demonstrably outperforms flat vector search on their own benchmarks. The openviking/eval/ragas/ framework exists but published A/B comparisons would settle the question.
  • Whether the L0/L1/L2 generation maintains quality as knowledge volume grows — does bottom-up aggregation produce coherent overviews when a parent directory has dozens of child directories with diverse content, or do the overviews become generic summaries that lose navigational value?
  • Whether the "filesystem paradigm" evolves toward actual filesystem backing (making the virtual directories accessible as real files for git, grep, editor access) — this would close the gap between the metaphor and the benefits it claims.
  • Whether the LoCoMo benchmark results generalize beyond multi-session dialogue recall to other agent knowledge tasks (code understanding, multi-hop reasoning, synthesis).
  • How the system handles contradictory memories — the dedup pipeline can detect near-duplicates, but what happens when a new memory contradicts (rather than duplicates) an existing one? The LLM decides, but with what guidelines?

Relevant Notes:

  • files-not-database — contrasts: OpenViking uses the filesystem as metaphor over a database backend, while this note argues for actual files; OpenViking is evidence that even filesystem-metaphor systems gravitate toward path-based organization
  • context-efficiency-is-the-central-design-concern-in-agent-systems — exemplifies: L0/L1/L2 is the most concrete implementation of context efficiency as a storage-layer concern; the tiered loading directly addresses the volume dimension of context cost
  • agents-navigate-by-deciding-what-to-read-next — extends: L0 abstracts serve exactly the "pointer with context for the follow-or-skip decision" role; the three tiers formalize what this note describes as progressive disclosure
  • three-space-agent-memory-maps-to-tulving-taxonomy — partially maps: Resource/Memory/Skill roughly corresponds to semantic/episodic/procedural, but the mapping leaks at boundaries (user preferences are semantic knowledge stored in "memory" space)
  • automating-kb-learning-is-an-open-problem — exemplifies: memory extraction is automatable; memory evolution (synthesis, contradiction resolution, reformulation) remains unsolved even in a well-funded production system
  • agentic-memory-systems-comparative-review — extends: OpenViking occupies the developer-managed service position on the agency dimension, with filesystem-metaphor storage that is neither truly files-first nor traditional database-first
  • a-functioning-kb-needs-a-workshop-layer-not-just-a-library — exemplifies: session management with commit-triggered extraction is a working implementation of workshop-to-library promotion
  • distillation — exemplifies: L0/L1/L2 generation is automated distillation at three resolution levels; the bottom-up aggregation is hierarchical distillation where child summaries feed parent overviews
  • cognee — sibling: both use dual-layer storage (content + vector index), both invest LLM compute at ingestion to make retrieval cheap; OpenViking's filesystem metaphor vs Cognee's pipeline metaphor are different presentations of similar infrastructure
  • crewai-memory — sibling: both handle session-derived memory with LLM deduplication; CrewAI uses scope trees for namespace separation, OpenViking uses directory hierarchy for the same purpose