Napkin

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

Napkin is a TypeScript/Bun knowledge system for LLM agents operating on Obsidian-compatible markdown vaults. At v0.7.2 it ships as a single npm package (npm install -g napkin-ai) with three delivery paths: a positional-argument napkin CLI, a Napkin SDK class importable from other TypeScript code, and a set of pi extensions that register napkin_* tools and drive background distillation. The project consolidated the earlier repo/package identity split: the the-shift-dev fork is gone and the canonical repo is Michaelliv/napkin again, with package metadata and docs now aligned. The core design thesis — progressive disclosure across four context levels over plain markdown files — has been validated on public long-term-memory benchmarks and is now the headline claim in the README.

Repository: https://github.com/Michaelliv/napkin

Core Ideas

SDK-first architecture, CLI as a thin wrapper. The 0.7.0 refactor separated src/core/ (pure logic, returns typed data, throws on failure) from src/commands/ (Commander wrappers that format output). Napkin in src/sdk.ts instantiates by walking up from a path to find .napkin/, then exposes overview, search, read, create, append, linksBack, tasks, bases, canvases, and thirty-plus other methods over the same vault. This turns napkin from "a CLI you shell out to" into "a library you can embed," which is what makes the pi-karpathy tool extension cheap to implement.
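The vault-discovery step ("walk up from a path to find .napkin/") is simple enough to sketch standalone. This is an illustrative re-implementation, not Napkin's actual code; the function name findVaultRoot is hypothetical:

```typescript
import { existsSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

// Walk up from startPath until a directory containing .napkin/ is found.
// Returns the vault root, or null if the filesystem root is reached first.
export function findVaultRoot(startPath: string): string | null {
  let dir = resolve(startPath);
  while (true) {
    if (existsSync(join(dir, ".napkin"))) return dir;
    const parent = dirname(dir);
    if (parent === dir) return null; // hit the filesystem root
    dir = parent;
  }
}
```

The real SDK presumably resolves the vault once at construction, so every subsequent call (overview, search, read, ...) operates relative to the same root.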

The progressive-disclosure ladder is now canonical and benchmark-backed. The four levels are explicit in the README, the skill file, and the code: L0 NAPKIN.md (~200 tokens, always loaded); L1 napkin overview (~1-2k tokens, TF-IDF keywords per folder); L2 napkin search (~2-5k tokens, ranked BM25 with optional snippets); L3 napkin read (full content). This is no longer just an ergonomic choice — the bench/ directory now contains live harnesses against LongMemEval (ICLR 2025), HotpotQA, and LoCoMo that exercise exactly this ladder. Reported LongMemEval numbers (92% Oracle, 91% S, 83% M at 100 questions each, described as zero preprocessing) are the strongest empirical argument in the repo that "plain BM25 over markdown plus an agent loop" is competitive with embedding/graph/summary stacks.

BM25 search with a fingerprinted persistent cache. src/core/search.ts builds a MiniSearch index over basename (boost 2) and content, then composites BM25 with links * 0.5 (raw inbound-link count) and a 0-1 normalized recency component. The 0.6.0 change that matters: the index, doc metadata, and backlink counts are serialized to .napkin/search-cache.json, keyed on an MD5 fingerprint of all file paths and mtimes. A cold search rebuilds; a warm search with no file changes reuses the cached index but still re-reads docs for snippet extraction. This is a small, local, honest optimization — no embeddings, no vector store, no background indexer.
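The composite described above can be sketched as a pure function. This is a minimal sketch, not the code in src/core/search.ts; the field names and the assumption that recency is min-max normalized over the vault's mtimes are mine, while the 0.5 link weight and 0-1 recency range come from the description above:

```typescript
// One candidate document's ranking signals.
interface RankSignals {
  bm25: number;    // MiniSearch/BM25 relevance for the query
  links: number;   // raw inbound-link count
  mtimeMs: number; // last-modified time in ms
}

// Composite score: BM25 + 0.5 * inbound links + a 0-1 recency term
// (newest file in the vault gets 1, oldest gets 0).
export function compositeScore(
  doc: RankSignals,
  oldestMs: number,
  newestMs: number,
): number {
  const span = newestMs - oldestMs;
  const recency = span === 0 ? 1 : (doc.mtimeMs - oldestMs) / span;
  return doc.bm25 + doc.links * 0.5 + recency;
}
```

Note how the recency term is normalized against the whole vault's mtime range, which is exactly why (as discussed later in the Curiosity Pass) older but on-topic notes lose ground as a vault grows.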

Agent-shaped retrieval defaults remain a deliberate design stance. Default snippetLines: 0 in DEFAULT_CONFIG still trades context for hit-only snippets. Scores are computed (the CLI exposes them behind --score for opt-in inspection) but the LLM-facing default output emphasizes file/line pointers over numeric ranks. The search SDK returns a typed SearchResult with file, score, links, modified, snippets, leaving the decision of what to render to the caller. The skill file, README, and pi-tool promptGuidelines all repeat the "overview → search → read" workflow so the agent hears the same ladder regardless of which surface it uses.
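Reconstructed from the field names listed above, the result type looks roughly like this; the exact declarations in src/core/search.ts may differ:

```typescript
// Sketch of the typed search result, reconstructed from the review's
// field list (file, score, links, modified, snippets); types are assumed.
interface SearchResult {
  file: string;       // vault-relative path, the primary pointer
  score: number;      // composite rank, opt-in via --score on the CLI
  links: number;      // inbound-link count
  modified: Date;     // mtime feeding the recency component
  snippets: string[]; // empty when snippetLines is 0 (the default)
}
```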

Native pi tools replace CLI shell-outs for LLM callers. The new pi-karpathy extension (.pi/extensions/pi-karpathy/index.ts) registers typed tools — napkin_overview, napkin_search, napkin_read, napkin_create, napkin_append, napkin_daily, napkin_links, napkin_lint, napkin_files, napkin_tags, napkin_tasks, napkin_properties — directly over the SDK, with TypeBox schemas and promptGuidelines that tell the model when to reach for each tool. napkin_lint is a genuinely new affordance: it fuses orphans, deadends, unresolved links, and missing-property checks into one vault-health call, optionally parameterized by a requiredProperties list. This is richer than "CLI passthrough" — it bundles operator-grade validation into an agent-callable surface.
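A tool registration of this shape might look like the sketch below. This is hypothetical: the real extension uses TypeBox schemas and pi's registration API, while this sketch uses a plain JSON-Schema-style object to stay dependency-free; only the tool name and the "overview first" guideline text are taken from the repo's documented workflow:

```typescript
// Hypothetical shape of a typed tool definition; not the pi extension API.
const napkinSearchTool = {
  name: "napkin_search",
  description: "Ranked BM25 search over the vault (L2 of the ladder).",
  // The real extension declares this with TypeBox; plain JSON Schema here.
  parameters: {
    type: "object",
    properties: {
      query: { type: "string" },
      snippetLines: { type: "number" }, // 0 by default: hit-only results
    },
    required: ["query"],
  },
  // Workflow hint returned to the model alongside the schema.
  promptGuidelines:
    "Use napkin_overview first to understand vault structure before searching.",
};
```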

Distill extension keeps LLM automation architecturally isolated. .pi/extensions/distill/index.ts is unchanged in spirit but tightened in detail. On session_start it starts a countdown timer (default 60 min), watches the session file byte size to skip idle intervals, forks the session to a temp dir via SessionManager.forkFrom, and spawns a separate pi subprocess with a fixed DISTILL_PROMPT. The distiller prompt is instructive: it tells the forked agent to run napkin overview, napkin template list/read, napkin search before creating, and to append to existing notes rather than duplicate. All promotion terminates in plain markdown via the same SDK every other caller uses — there is no private distillation store.
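The idle gate is the interesting small mechanism here: distillation only fires when the session file has actually grown. A minimal sketch of that check (the helper name and return shape are assumptions, not the extension's actual code):

```typescript
import { statSync } from "node:fs";

// Compare the session file's current byte size against the size recorded
// at the last check; only fork a distill run when it has changed.
export function shouldDistill(
  sessionFile: string,
  lastSizeBytes: number,
): { run: boolean; size: number } {
  const size = statSync(sessionFile).size;
  return { run: size !== lastSizeBytes, size };
}
```

In the real extension this check sits behind the countdown timer (default 60 min), so an idle session costs nothing: the timer fires, the size is unchanged, and no subprocess is spawned.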

Obsidian-surface parity as an agent feature, not just a human-compat feature. The SDK now covers tags, tasks, bookmarks, aliases, properties (frontmatter read/set/remove/list), daily notes with folder-and-format config, templates, canvas file CRUD (nodes, edges, groups, colors), and Bases (.base YAML queries compiled into in-memory SQLite with Jexl formulas). The bet is that agents benefit from addressing the same higher-level vault constructs humans use, not a flattened agent-only substrate — so napkin_properties set status done is an agent-legible operation on the same YAML frontmatter an Obsidian user would edit.

Comparison with Our System

| Dimension | Napkin | Commonplace |
| --- | --- | --- |
| Primary substrate | Obsidian-compatible markdown vault (.napkin/ holds config, content lives alongside .obsidian/) | Repo-first KB (kb/notes/, kb/reference/, kb/work/), markdown files under git review |
| Integration surface | CLI + TypeScript SDK + pi extensions (native tools + context injection + background distill) | Skills in .claude/skills/.agents/skills plus commonplace-* commands, harness-loaded |
| Always-loaded context | NAPKIN.md at L0; pi-karpathy or napkin-context extension injects overview as a custom message | AGENTS.md/CLAUDE.md with routing, indexes, skill descriptions |
| Progressive disclosure | Hard-coded L0-L3 ladder documented in three places (README, skill, tool guidelines) | Distributed ladder: routing → descriptions → indexes → rg → read |
| Search | BM25 over basename+content with backlink and recency boosts, fingerprinted cache | rg + descriptions + explicit indexes |
| Retrieval evaluation | Live benchmarks in-repo (LongMemEval, HotpotQA, LoCoMo) | No benchmark harness; evaluation is design argument and semantic review bundles |
| Link model | Wikilinks + backlinks; links are structural and support lint/orphan/deadend checks | Explicit semantic link phrases ("foundation:", "extends:", "contrasts:") with relationship typing |
| Knowledge structure | Folder templates + YAML frontmatter + tags + Bases views | Typed notes (note, structured-claim, adr, index, related-system) with methodology for each |
| Learning/distillation | pi-extension distill forks session, agent subprocess writes into vault via SDK; no private scored store | Semantic review bundles, fix-workflow, human/agent-in-the-loop promotion through workshop layer |
| Vault lint | napkin_lint tool: orphans, deadends, unresolved links, required-property coverage | commonplace-validate plus review bundles |

The two systems sit at different layers of the same general argument. Both trust plain markdown; both are skeptical of heavy vector stacks; both bet that progressive disclosure is the right answer for long-term agent memory. Where Napkin commits its structure is in agent-facing tooling: a ladder of four commands, a typed SDK, a persistent search cache, and now a benchmark harness that empirically defends the design. Where we commit our structure is in the documents: typed notes, semantic links, description discipline, and a theory of distillation/constraining/codification that explains why the artifacts take the shapes they do.

Napkin is still stronger on the interface we are deliberately thin on — it packages the "agent loop over a markdown KB" into a single installable tool. We are still stronger on the inside of the notes — type-driven curation, link semantics, review cadence, and methodology that persists across projects rather than per-vault templates.

Borrowable Ideas

Benchmark the retrieval ladder. The single biggest change since the March review is that Napkin now has bench/longmemeval-eval.ts, bench/hotpotqa-eval.ts, and bench/locomo-eval.ts as reproducible harnesses. We have no equivalent — our argument for commonplace's structure is theory and design review. A minimal borrow would be to stand up one or two benchmark harnesses (LongMemEval-S is the approachable size) that drive our CLI and type system. Ready to try now as a small self-contained experiment; would be especially valuable for tight-scope comparisons (e.g., does our link-semantic discipline change retrieval accuracy vs. a plain-markdown baseline?).

Fingerprint-keyed search cache. The 0.6.0 cache (mtime+path MD5 fingerprint, MiniSearch JSON + docs + backlinks serialized to one file) is a tiny, inspectable amortization that removes most of the cost of re-indexing. If we ever build local search on top of descriptions/indexes, the fingerprinting pattern is ready to borrow — especially for the backlink/link-graph precomputation, where rebuild cost scales with vault size but change rate is low.
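The fingerprint half of the pattern fits in a few lines. A minimal sketch, assuming MD5 over (path, mtime) pairs as described above; the sorted, delimiter-joined serialization is my choice for determinism, not necessarily Napkin's:

```typescript
import { createHash } from "node:crypto";

// Hash every (path, mtime) pair so that any add, remove, rename, or edit
// changes the fingerprint and invalidates the cached index.
export function vaultFingerprint(
  files: { path: string; mtimeMs: number }[],
): string {
  const h = createHash("md5");
  for (const f of [...files].sort((a, b) => a.path.localeCompare(b.path))) {
    h.update(`${f.path}:${f.mtimeMs};`);
  }
  return h.digest("hex");
}
```

The cache consumer then does one comparison: if the stored fingerprint matches the freshly computed one, deserialize the index; otherwise rebuild and overwrite.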

Vault lint as an agent-callable tool. napkin_lint bundles orphan notes, deadend notes, unresolved links, and missing required properties into one call with a requiredProperties parameter. Our validator does similar things per-file; an aggregate surface with configurable property coverage would be useful for review sessions. Ready to borrow now — we have the underlying signals.
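A toy version of the aggregate makes the shape of the borrow concrete. This sketches the kind of report napkin_lint is described as returning over a link graph; the function name, input shape, and report field names are all assumptions:

```typescript
// A note is its path, its outgoing wikilink targets, and its frontmatter.
interface Note {
  path: string;
  outgoing: string[];
  properties: Record<string, unknown>;
}

// Aggregate vault-health report: orphans (no inbound links), deadends
// (no outbound links), unresolved link targets, and notes missing any
// of the required frontmatter properties.
export function lintVault(notes: Note[], requiredProperties: string[] = []) {
  const paths = new Set(notes.map((n) => n.path));
  const inbound = new Map<string, number>();
  for (const n of notes)
    for (const t of n.outgoing) inbound.set(t, (inbound.get(t) ?? 0) + 1);
  return {
    orphans: notes.filter((n) => !(inbound.get(n.path) ?? 0)).map((n) => n.path),
    deadends: notes.filter((n) => n.outgoing.length === 0).map((n) => n.path),
    unresolvedLinks: notes.flatMap((n) => n.outgoing.filter((t) => !paths.has(t))),
    missingProperties: notes
      .filter((n) => requiredProperties.some((p) => !(p in n.properties)))
      .map((n) => n.path),
  };
}
```

The point of the aggregate is that one call answers "how healthy is this vault?" instead of four per-file checks, which is the right granularity for a review session.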

Typed SDK for our CLI surface. Napkin's split into core/ (data) and commands/ (I/O) is a refactor pattern we could adopt if we start needing programmatic access to validators/indexers beyond shell callouts. Not urgent, but sets up cleanly for building agent tools on top. No use case yet.

Tool-level promptGuidelines as control flow. The pi-karpathy tools include structured guidelines like "Use napkin_overview first to understand vault structure before searching" and "Append to existing notes instead of creating duplicates — search first." These are not prose comments — they are returned to the LLM as tool hints. If we expose commonplace commands as agent tools, co-locating workflow hints with each tool is a better pattern than burying them in ambient instructions. Ready to borrow the day we wrap any of commonplace-* for agent use.

Distill-through-SDK, not distill-into-private-store. The distill extension still terminates in plain vault notes via the same SDK used by humans and agents. If we ever add trace-derived capture, this is the right architectural pattern: the extractor should emit the same artifacts that a manual review would, using the same write path. Ready to borrow as a principle now.

Curiosity Pass

The benchmark numbers are strong but paint a narrower picture than the headline claim. 83% on LongMemEval_M at 100 questions with Sonnet is real and beats the GPT-4o RAG baseline. But the benchmark setup preprocesses the haystack into per-round markdown notes organized by day directory, with mtimes restored from session timestamps. That is not "zero preprocessing" — it is structured ingestion tuned to match napkin's overview/recency signals. The win is still meaningful; the phrasing oversells. What the benchmark actually shows is that when the ingestion schema matches napkin's retrieval assumptions, BM25+mtime+backlinks plus an agent loop can match or beat heavier stacks. That is a useful, narrower claim than the README makes.

The "BM25 + links * 0.5 + recency" score is a linear combination, not a learned reranker. The property it produces is: prefer fresh files with more inbound links, given term match. The mechanism mostly relocates scoring from MiniSearch's built-in ranking into a hand-tuned composite. The simpler alternative is BM25 alone; the more complex alternative is a learned reranker. Napkin's choice is pragmatic but brittle — the 0.5 and 1.0 weights are magic constants without justification in the code, and the recency normalization across all docs in the vault will make older fresh-on-topic notes look unattractive in a large vault. It works empirically at benchmark scales; the ceiling in a messy real vault is unclear.

napkin_lint is a richer affordance than the old CLI surface, but still diagnostic. It transforms signals (orphan/deadend/unresolved/missing property) into a readable report. It does not propose fixes, merge duplicates, or repair broken links. For a pure lint tool that is fine; for an "agent-actionable vault health" framing the absence of repair hooks is the limiting factor. The real question is whether we want our validator to stay diagnostic-only or to gain a fix-suggestion mode.

The pi-karpathy extension makes the SDK a more important surface than the CLI. With typed tools plus promptGuidelines, the agent never sees the CLI's output-formatting choices. That is architecturally cleaner, but it also means the "score-hiding defaults" and "hint text as control flow" ideas that were distinctive in the CLI are now duplicated in the tool-description strings. Both surfaces must stay in agreement for the agent-behavior story to hold. This is a maintenance burden that did not exist when there was only one surface.

The distill extension has not grown a verification layer. Same critique as before: forking a session and letting a model write notes produces plausible artifacts, but the loop has no oracle, no deduplication across runs beyond "search before you write," and no measurement of whether distilled notes later help or hurt retrieval. Even with the benchmark harness in-repo, no benchmark evaluates distill's output — the benchmarks measure retrieval-plus-reasoning on pre-written notes.

The repo identity consolidated cleanly. The March review flagged the the-shift-dev/Michaelliv split as a maintenance signal. That split is resolved: remote, package metadata, CLAUDE.md, and README all point at Michaelliv/napkin. Nothing to watch here anymore.

Trace-derived learning placement. Napkin still qualifies.

  • Trace source: the distill extension consumes opaque pi session files (read via ctx.sessionManager.getSessionFile()), monitoring byte-size growth to decide when to fork. It does not parse messages; it delegates consumption to a subprocess pi invocation.
  • Trigger boundaries: timer interval (default 60 min), manual /distill, and a session-file-size change gate.
  • Extraction: not a transcript-to-JSON schema. The DISTILL_PROMPT gives the forked agent a fixed procedure — napkin overview, napkin template list/read, search before create, use templates, add wikilinks — and lets it decide what is worth capturing. The oracle is the distiller-agent's own judgment, not a scorer.
  • Promotion target: plain markdown notes written through the same SDK any caller uses. No private store, no scoring, no service-owned memory layer.
  • Scope: per-vault, single-agent, single-session. The fork-and-subprocess pattern isolates the distill run but does not coordinate across vaults or sessions.
  • Timing: online, during deployment, with a countdown timer and a size-change gate. No offline staged cycles; no cross-session aggregation.

On the survey's axes: axis 1 (ingestion pattern) — live session mining with a timer trigger, structurally closest to Napkin's existing survey placement; axis 2 (artifact vs weights) — symbolic artifacts only, no weight promotion. This update does not split or weaken the current survey entry for Napkin; the distill mechanism is unchanged in design. What is new relative to trace-derived learning is context-adjacent: the pi-karpathy extension now injects the overview as a custom message on session start, meaning always-loaded context and trace-derived learning terminate in the same artifact (notes) and share a read path. That does not need a new subtype in the survey, but it is a nice example of a system where the "what gets loaded" and "what gets written" sides of trace-derived learning use the same tool surface.

What to Watch

  • Whether the benchmark harness expands to evaluate distill output (not just retrieval), which would close the "does distilled content help later tasks?" loop the design currently assumes without measuring.
  • Whether the search score composite (BM25 + links * 0.5 + recency) stays hand-tuned or gets replaced with a reranker once larger real vaults stress the magic constants.
  • Whether napkin_lint gains repair affordances (merge orphans, resolve broken links, auto-add missing properties) or stays diagnostic-only.
  • Whether the SDK's method surface stabilizes at v1.0 — at 30+ methods it already covers most of Obsidian's higher-level affordances, but the constructor shape and scaffolding surface still moved this cycle (0.6/0.7).
  • Whether the pi-karpathy tool catalog (typed schemas + promptGuidelines) becomes a template other markdown-KB systems copy, making "wrap your KB as pi tools" a standard adoption path.

Relevant Notes: