sift-kg
Type: ../types/agent-memory-system-review.md · Status: current
sift-kg is a Python CLI by Juan Ceresa for turning local document collections into a NetworkX knowledge graph, then serving that graph through terminal query commands, export formats, an HTML viewer, and an optional bundled agent skill. It is best read as an extraction-first knowledge construction pipeline: raw documents are knowledge artifacts, prompt/config/schema files are system-definition artifacts, and the generated graph becomes a durable knowledge artifact for humans and agents, while review queues and CLI flags decide how much authority the graph has after extraction.
Repository: https://github.com/juanceresa/sift-kg
Reviewed revision: d786991c024f5401f113fc0cb70aee96dd1bd3bf
Core Ideas
The storage substrate is a local project directory, not a database. Runtime configuration comes from CLI flags, SIFT_ environment variables, .env, and sift.yaml, with a default output/ directory created by SiftConfig (src/sift_kg/config.py). Raw documents remain in the user's input directory. Derived state is file-backed: output/extractions/*.json, output/discovered_domain.yaml, output/graph_data.json, output/communities.json, output/entity_descriptions.json, output/narrative.md, output/graph.html, output/merge_proposals.yaml, and output/relation_review.yaml (src/sift_kg/extract/extractor.py, src/sift_kg/pipeline.py, src/sift_kg/resolve/io.py). SQLite is only an export target, not the live store (src/sift_kg/export.py).
Document ingestion preserves text provenance but not a full source archive. sift extract recursively discovers supported files, reads them through a Kreuzberg backend for many document formats or a legacy pdfplumber backend, and can route near-empty PDFs to OCR via local engines or Google Cloud Vision (src/sift_kg/ingest/reader.py, src/sift_kg/ingest/kreuzberg_extractor.py, src/sift_kg/ingest/ocr.py). The chunker uses character windows with 10% overlap and sentence-boundary fallback, carrying start_char, end_char, chunk_index, and total_chunks in memory (src/sift_kg/ingest/chunker.py). Persisted extraction JSON stores document path, chunk count, model, domain, chunk size, timestamp, entity contexts, and relation evidence, but not the chunk offsets themselves (src/sift_kg/extract/models.py, src/sift_kg/extract/extractor.py).
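The chunking behavior described above can be illustrated with a minimal sketch. It is not the code in src/sift_kg/ingest/chunker.py: the `Chunk` class, the `chunk_text` function, the default `chunk_size`, and the exact rule for preferring a sentence boundary are all assumptions made for illustration. Only the 10% overlap and the four carried fields come from the review.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_char: int
    end_char: int
    chunk_index: int
    total_chunks: int = 0


def chunk_text(text: str, chunk_size: int = 1000, overlap_ratio: float = 0.10) -> list[Chunk]:
    """Split text into overlapping character windows, preferring sentence ends."""
    overlap = int(chunk_size * overlap_ratio)
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Assumed rule: cut after the last ". " found in the window's second half;
            # if none is found, keep the plain character cut.
            cut = text.rfind(". ", start + chunk_size // 2, end)
            if cut != -1:
                end = cut + 1
        chunks.append(Chunk(text[start:end], start, end, len(chunks)))
        if end == len(text):
            break
        # Step back by the overlap, but always make forward progress.
        start = max(end - overlap, start + 1)
    for chunk in chunks:
        chunk.total_chunks = len(chunks)
    return chunks
```

In a sketch like this the offsets exist only on the in-memory `Chunk` objects, which mirrors the gap noted above: unless they are written into the extraction JSON, a later reader cannot map a quote back to a character range.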
Schemas are behavior-shaping system-definition artifacts. The default schema-free path samples up to five first chunks, asks an LLM to design entity and relation types, and saves the result as discovered_domain.yaml for reuse; bundled or custom domains instead provide YAML entity types, relation types, source/target direction constraints, extraction hints, canonical-name restrictions, fallback relations, review-required flags, and prompt-injected system_context (src/sift_kg/domains/discovery.py, src/sift_kg/domains/models.py, src/sift_kg/domains/loader.py, src/sift_kg/domains/bundled/academic/domain.yaml). This is system-definition authority because schema content constrains what the LLM may extract, how relations are normalized, which directions are legal, and which relations enter review.
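To make the direction-constraint idea concrete, here is a small sketch of how a relation type with allowed source and target entity types could classify an extracted edge. The `RelationType` class, its field names, and the `WORKS_FOR` example are hypothetical and are not taken from src/sift_kg/domains/models.py.

```python
from dataclasses import dataclass, field


@dataclass
class RelationType:
    name: str
    # Empty sets mean "unconstrained" in this sketch (an assumption).
    source_types: set[str] = field(default_factory=set)
    target_types: set[str] = field(default_factory=set)


def direction_status(rel: RelationType, source_type: str, target_type: str) -> str:
    """Classify an edge as 'ok', 'flip' (legal only when reversed), or 'invalid'."""

    def fits(src: str, tgt: str) -> bool:
        src_ok = not rel.source_types or src in rel.source_types
        tgt_ok = not rel.target_types or tgt in rel.target_types
        return src_ok and tgt_ok

    if fits(source_type, target_type):
        return "ok"
    if fits(target_type, source_type):
        return "flip"
    return "invalid"


works_for = RelationType("WORKS_FOR", {"PERSON"}, {"ORGANIZATION"})
print(direction_status(works_for, "ORGANIZATION", "PERSON"))  # prints "flip"
```

A check of this shape is what gives the schema operational force: the same LLM output is kept, reversed, or rejected depending on what the YAML declares.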
Extraction is prompt-governed, cached, and confidence-bearing. For each document, sift-kg generates a short document context, then sends each chunk through a single combined JSON prompt that asks for entities and explicit relations with confidence, context quotes, and evidence quotes (src/sift_kg/extract/prompts.py, src/sift_kg/extract/extractor.py). The LiteLLM client supplies provider portability, JSON repair, retries, rate limiting, token/cost accounting, and low-temperature calls (src/sift_kg/extract/llm_client.py). Cached extraction files are reused unless the model, domain name, or chunk size changes, so the extraction JSONs are derived knowledge artifacts with explicit but coarse invalidation metadata (src/sift_kg/extract/extractor.py).
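The cache rule can be sketched as a comparison of three stored fields against the current run. This is an illustration only: the function name `cache_is_fresh` and the JSON keys `model`, `domain`, and `chunk_size` are assumptions, not the actual keys written by src/sift_kg/extract/extractor.py.

```python
import json
from pathlib import Path


def cache_is_fresh(path: Path, model: str, domain: str, chunk_size: int) -> bool:
    """Return True when a cached extraction matches the current run settings."""
    if not path.exists():
        return False
    try:
        cached = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # treat an unreadable cache file as stale
    if not isinstance(cached, dict):
        return False
    stored = (cached.get("model"), cached.get("domain"), cached.get("chunk_size"))
    return stored == (model, domain, chunk_size)
```

A check of this shape shows why the invalidation is coarse: it compares run settings only, so it would not notice a source file whose contents changed or a schema edited under the same domain name.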
Graph construction normalizes LLM output before it becomes the main memory surface. sift build loads extraction JSONs, creates deterministic {type}:{normalized_name} entity IDs, adds DOCUMENT nodes and MENTIONED_IN provenance edges, pre-deduplicates near-identical entity names, canonicalizes repeated relation triples, and, by default, aggregates relation confidence across repeated mentions with a product-complement rule (src/sift_kg/graph/builder.py, src/sift_kg/graph/prededup.py, src/sift_kg/graph/knowledge_graph.py). Post-processing activates passive relations, removes self-loops and some transitive LOCATED_IN redundancy, prunes entities with only metadata edges, normalizes undefined relation types, and flips edges whose endpoint types contradict domain direction constraints (src/sift_kg/graph/postprocessor.py). The graph is a mixed-form retained artifact: a symbolic JSON graph plus prose evidence strings and confidence fields.
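Two of the normalization steps above lend themselves to a short sketch: deterministic entity IDs and confidence aggregation. The normalization rule (lowercasing, collapsing non-alphanumerics to underscores) is an assumption, and the aggregation reads "product complement" as one minus the product of per-mention complements, which is the usual meaning but is not confirmed against src/sift_kg/graph/knowledge_graph.py. Both function names are hypothetical.

```python
import re
from math import prod


def entity_id(entity_type: str, name: str) -> str:
    """Build a deterministic '{type}:{normalized_name}' ID (normalization assumed)."""
    normalized = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{entity_type.lower()}:{normalized}"


def combine_confidence(confidences: list[float]) -> float:
    """Product complement: 1 - prod(1 - c_i) over all mentions of one relation."""
    return 1.0 - prod(1.0 - c for c in confidences)


print(entity_id("PERSON", "Ada  Lovelace"))  # prints "person:ada_lovelace"
print(combine_confidence([0.5, 0.5]))        # prints 0.75
```

Under this reading, every additional mention can only raise the aggregate, so two independent 0.5 mentions yield 0.75. That makes the score a measure of accumulated support rather than of correctness.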
Review gates are explicit but mostly queue-based. Entity resolution asks an LLM to propose duplicate entity merges and variant EXTENDS relations, optionally using sentence-transformer clustering to batch candidates; cross-type identical names create deterministic merge proposals (src/sift_kg/resolve/resolver.py, src/sift_kg/resolve/clustering.py). Low-confidence relations and domain-marked relation types are written to relation_review.yaml, while merge proposals go to merge_proposals.yaml; both use DRAFT, CONFIRMED, and REJECTED states and an interactive Rich terminal reviewer with auto-approve/reject thresholds (src/sift_kg/graph/builder.py, src/sift_kg/resolve/models.py, src/sift_kg/resolve/reviewer.py). sift apply-merges rewrites graph nodes/edges for confirmed merges and removes rejected relations (src/sift_kg/resolve/engine.py). The queue files are system-definition artifacts both while pending and once decided, because their statuses authorize graph mutation.
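The queue pattern can be sketched as a status field that only thresholds or a reviewer may move out of DRAFT. The `Proposal` class, the function names, and the threshold values below are hypothetical; only the three status names come from the review.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    description: str
    confidence: float
    status: str = "DRAFT"


def auto_triage(proposals: list[Proposal], approve_at: float = 0.9,
                reject_below: float = 0.3) -> list[Proposal]:
    """Move clear-cut DRAFT items to a decision; leave the rest for a human."""
    for p in proposals:
        if p.status != "DRAFT":
            continue  # never override an earlier decision
        if p.confidence >= approve_at:
            p.status = "CONFIRMED"
        elif p.confidence < reject_below:
            p.status = "REJECTED"
    return proposals


def to_apply(proposals: list[Proposal]) -> list[Proposal]:
    """Only confirmed proposals are allowed to mutate the graph."""
    return [p for p in proposals if p.status == "CONFIRMED"]
```

The design point is that the status value, not the proposal itself, carries the authority: a DRAFT entry is inert no matter how confident the model was.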
Narrative and viewer outputs are derived presentation surfaces, not sources of truth. sift narrate builds an overview from top entities and relations, generates relationship-chain and timeline prose, writes per-entity descriptions, rewrites banned generic phrases, and can replace placeholder communities with LLM-named themes (src/sift_kg/narrate/generator.py, src/sift_kg/narrate/prompts.py). sift view strips metadata nodes/edges, applies graph filters, generates a pyvis HTML graph, and injects custom JavaScript controls for search, type/relation/community filters, detail sidebars, trails, and community regions (src/sift_kg/visualize.py, src/sift_kg/viewer/app.js). sift search, sift query, sift topology, and sift info --json are the more direct agent-facing surfaces over the graph (src/sift_kg/cli.py).
The bundled agent skill turns the graph into persistent advisory context. The repository ships .agents/skills/sift-kg/SKILL.md, which tells an agent to orient at session start with sift info --json and sift topology, query entities before answering domain questions, link disconnected communities, generate graph-grounded suggestions, and confirm before running LLM-costing extraction or resolution (.agents/skills/sift-kg/SKILL.md). The graph remains a knowledge artifact when it grounds answers; the skill is a system-definition artifact because it gives agents behavioral instructions about when and how to consume that graph.
Comparison with Our System
| Dimension | sift-kg | Commonplace |
|---|---|---|
| Primary substrate | Raw documents plus derived JSON/YAML/Markdown/HTML files | Authored markdown notes, type specs, indexes, and workflows |
| Knowledge construction | LLM extracts entities/relations from documents into a graph | Agents/humans write typed notes and curated links directly |
| Schema authority | Domain YAML or LLM-discovered schema steers extraction and graph normalization | Collection/type specs and link vocabularies steer authoring and validation |
| Provenance | Document IDs, document paths, entity context quotes, relation evidence, support counts/docs | Source links, authored citations, frontmatter, review status, links, validation |
| Review | Merge and relation queues mutate graph after human/threshold decisions | Git review, deterministic validation, semantic review, explicit note workflows |
| Retrieval/activation | CLI graph queries, topology summaries, search, HTML viewer, exports, bundled skill | rg, descriptions, indexes, authored semantic links, skills, validation commands |
| Lifecycle | Re-extract on model/domain/chunk-size changes; rebuild graph; apply review decisions | Edit canonical notes directly; regenerate indexes; retire or replace artifacts explicitly |
sift-kg is much stronger than commonplace at fast extraction from unstructured source documents. It can take a folder of PDFs, DOCX files, HTML, text, images, or examples, use a discovered or predefined schema, and generate an explorable graph without first asking an agent to hand-author notes. That is useful when the goal is initial map-making over a corpus rather than durable methodology writing.
Commonplace is stronger at artifact contracts. In sift-kg, a relation with evidence and confidence is useful, but the graph does not carry a collection-level writing contract, link semantics, status lifecycle, or review rationale beyond queue status and confidence. The authoritative behavior-shaping surfaces are mostly code, prompts, domain YAML, and the agent skill. In commonplace, the system-definition artifacts are more inspectable as prose conventions and type specs, while the knowledge artifacts are authored to be read, revised, and linked by agents over time.
The biggest design difference is where trust enters. sift-kg starts with LLM extraction and later adds review gates, confidence aggregation, and provenance fields. Commonplace starts with constrained artifact types and review procedures, then uses search and indexes to activate those artifacts. sift-kg's approach scales ingestion; commonplace's approach scales maintainability and interpretability.
Read-back: pull — agents deliberately query, search, inspect topology, or follow the bundled skill to consume the generated graph as advisory context.
Borrowable Ideas
Schema discovery as a temporary workshop accelerator. Worth borrowing for ingestion workshops, not for promoted library notes. A commonplace source-ingest workflow could ask an LLM to propose temporary entity/relation categories for a corpus, use them to navigate source evidence, and discard or rewrite them before promotion into durable notes.
Review queues as first-class files. Ready to borrow. merge_proposals.yaml and relation_review.yaml are simple and inspectable, and they mutate the graph only after explicit status changes. Commonplace could use the same pattern for candidate link merges, note relocation decisions, or extracted claim review before writing library artifacts.
Support counts and support-document aggregation on derived relations. Ready to borrow for any future graph or index layer. The mentions, support_count, support_documents, and support_doc_count fields give a cheap trust signal without pretending confidence is ground truth.
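As a rough illustration of how such fields could be derived, the sketch below aggregates mention records for one relation. The four output field names come from the review; their exact semantics, the `aggregate_support` function, and the `document_id` key on each mention are assumptions.

```python
def aggregate_support(mentions: list[dict]) -> dict:
    """Summarize how many mentions and distinct documents back one relation."""
    documents = sorted({m["document_id"] for m in mentions})
    return {
        "mentions": mentions,
        "support_count": len(mentions),       # assumed: one per mention
        "support_documents": documents,       # assumed: distinct source documents
        "support_doc_count": len(documents),
    }


summary = aggregate_support([
    {"document_id": "report.pdf", "evidence": "quote one"},
    {"document_id": "report.pdf", "evidence": "quote two"},
    {"document_id": "memo.docx", "evidence": "quote three"},
])
print(summary["support_count"], summary["support_doc_count"])  # prints "3 2"
```

Separating mention count from distinct-document count matters because three quotes from one report are weaker corroboration than three quotes from three reports.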
Agent-facing topology commands. Ready as an interface pattern. sift topology and sift query are shaped for bounded-context consumption: communities, bridges, isolated nodes, and entity neighborhoods are better agent inputs than a full graph dump.
Derived narrative with source-excerpt discipline. Worth studying, but not a direct replacement for notes. The banned-phrase rewrite loop and proportional evidence instructions are useful quality pressure for generated reports, but narrative output should remain a knowledge artifact derived from graph evidence, not a canonical system-definition artifact.
Curiosity Pass
"Every entity and relation links back to source" is directionally true but coarse. The extraction models keep context and evidence quotes, document IDs, document paths, relation support documents, and mention records. They do not persist chunk character offsets or page-level provenance in the graph, even though ingestion can format page markers. For many exploration tasks this is enough; for legal or audit-grade provenance it is weaker than the README's strongest language.
The default schema-free path still becomes schema-bound after discovery. The project markets "schema-free" extraction, but the implemented happy path discovers a concrete schema once, saves it, and reuses it unless forced. That is the right mechanism because it reduces cross-chunk drift, but the durable behavior-shaping artifact is still a schema file.
The graph is memory only when paired with an activation surface. graph_data.json by itself is a derived knowledge artifact. It changes agent behavior through sift query, sift topology, sift search --json, the viewer, exports, and especially the bundled skill. Without those surfaces it is just a stored graph.
Human review exists, but graph validity still depends heavily on extraction quality. The review gates cover duplicate entities, variant relations, low-confidence relations, and domain-required relation review. They do not verify every high-confidence extraction, every entity type assignment, every evidence quote, or every generated narrative claim.
The test suite covers the machinery more than end-to-end truth. Tests exercise extraction models/prompts, graph support aggregation, pre-deduplication, merge/relation review IO and application, config/domain loading, export support columns, CLI JSON surfaces, communities, view filters, LLM JSON parsing, and narrative helpers (tests/). That is good engineering coverage for deterministic behavior; it does not measure extraction precision or downstream answer quality.
This is not trace-derived learning. The code extracts from user-supplied documents and writes derived graphs, schemas, reports, review queues, and exports. It does not mine prior agent/human operation traces, session logs, tool trajectories, or rollout histories into durable behavior-changing rules, prompts, schemas, embeddings, or weights. The closest feedback loop is human approval of graph edits, but that is curation of a corpus-derived graph, not learning from operational traces.
What to Watch
- Whether provenance becomes page/offset-addressable instead of document-plus-quote only.
- Whether review gates expand from entity/relation decisions into extraction audits or narrative fact checks.
- Whether discovered schemas get versioning, diffing, or invalidation semantics when users hand-edit them.
- Whether the bundled agent skill gets packaged into the install path and stays synchronized with CLI behavior.
- Whether graph growth across incremental extraction gets stronger lifecycle support for stale documents, changed source files, and superseded extraction artifacts.
Relevant Notes:
- Storage substrate - defined-in: separates raw document folders, derived JSON/YAML/Markdown/HTML outputs, and export formats
- Knowledge artifact - defined-in: classifies extracted graph, reports, source quotes, and topology outputs when consumed as evidence or context
- System-definition artifact - defined-in: classifies prompts, domain schemas, review queues, and the bundled agent skill when they steer extraction, review, or agent behavior
- Behavioral authority - defined-in: clarifies why the same graph is advisory context while schemas and review statuses carry stronger operational force
- Axes of artifact analysis - rationale: useful lens for separating sift-kg's source documents, derived graph, review files, prompts, schemas, and exports
- Cognee - compares-with: both build graph-shaped knowledge from documents, but Cognee is a database/poly-store knowledge engine while sift-kg is a local file-backed CLI pipeline
- Siftly - compares-with: both emphasize ingestion and derived artifacts, but Siftly focuses on deterministic/resumable high-volume loading while sift-kg focuses on LLM graph extraction and visualization