Ingest a directory
Type: kb/types/instruction.md
Target: $ARGUMENTS — path to a directory containing the source material (typically under tmp/ or kb/work/<workshop>/, both gitignored). Do not place cloned repos under kb/sources/ — that directory is tracked and the .ingest.md is the only artifact per source that belongs there.
If no target is given, ask the user for a directory path.
Slug. The slug is the basename of the input directory path — e.g. tmp/position-bias/ → slug position-bias. The ingest report is always written to kb/sources/<slug>.ingest.md, regardless of where the source directory lives.
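The slug rule above can be sketched in a few lines of Python. This is only an illustration: the helper name report_path_for is hypothetical and not part of the skill.

```python
from pathlib import Path


def report_path_for(source_dir: str) -> Path:
    """Map an input directory to its ingest report path (slug = directory basename)."""
    # Path() drops a trailing slash, so "tmp/position-bias/" and
    # "tmp/position-bias" produce the same slug.
    slug = Path(source_dir).name
    if not slug:
        raise ValueError(f"cannot derive a slug from {source_dir!r}")
    # The report always lands in kb/sources/, wherever the source directory lives.
    return Path("kb/sources") / f"{slug}.ingest.md"


print(report_path_for("tmp/position-bias/").as_posix())
```

Note that the parent directory never affects the result, so two source directories with the same basename would map to the same report file.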
When to use
- The source is a tree of related files, not a single document — most commonly a cloned code repository, but also a paper plus supplementary material, or a grouped set of related snapshots.
- Cross-file signal matters: the README claims X and a test file measures X; the README alone would not carry the same evidentiary weight.
For single-file sources, use /cp-skill-ingest instead.
Prerequisites
- The directory exists at the given input path and contains the source material. Cloning the repo or downloading the files is out of scope; do it beforehand, into a gitignored location (e.g. tmp/<slug>/).
- kb/sources/COLLECTION.md exists (required by the connect skill invoked in Step 5, which runs against the ingest report that lands in kb/sources/).
Step 1: Explore the directory
Run ls and inspect the tree shape. Classify files into categories:
- Thesis / intent — README, top-level docs, project description
- Implementation — main code files, entry points, core modules
- Validation — tests, evaluations, benchmarks, example runs
- Artifacts — prompts, datasets, configs, experimental parameters
- Output — results files, logs, figures (if committed)
Ignore: vendored dependencies (node_modules/, .venv/), lockfiles, compiled artifacts, .git/, generated docs.
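As a rough aid for the exploration pass, the following sketch lists candidate files while pruning the ignored directories named above. The function name candidate_files and the lockfile names are assumptions made for illustration; compiled artifacts and generated docs are not detected here and still need a manual look.

```python
import os

# Directories the step says to ignore.
IGNORED_DIRS = {"node_modules", ".venv", ".git"}
# Assumed examples of lockfiles; extend for the ecosystem at hand.
LOCKFILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}


def candidate_files(root: str) -> list[str]:
    """Return root-relative paths of files worth classifying, in walk order."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = sorted(d for d in dirnames if d not in IGNORED_DIRS)
        for name in sorted(filenames):
            if name in LOCKFILES:
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return found
```

The output is only the input to classification; deciding which category a file belongs to remains a judgment call.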
Step 2: Select important files
Pick a bounded set (typically 5–15 files) to read in full. Bias toward load-bearing files, not completeness.
Explicitly include-or-skip each of these categories:
- README / top-level docs — usually include
- 1–3 core implementation files that carry the central claim — include
- Tests or evaluations for the central claim — include if the source's trustworthiness rests on "the code runs"
- Prompts, datasets, or configs that define what the system actually does — include if the source is an LLM artifact
- Individual demos or examples — usually skip unless the demo IS the claim
- Aggregate results / outputs (CSVs, result tables, figures) — include when the repo is a data publication (findings are the contribution; the code is scaffolding). Skip when the repo is a software tool (outputs are throwaway artifacts of running it).
- Vendored, generated, or build-artifact files — skip
Record your selection as a one-line justification per file; this becomes the File Manifest in Step 4.
Step 3: Read the selected files
Read each selected file in full. Form a composite understanding:
- What is the source's central claim or contribution?
- What evidence does the tree carry for that claim (code, tests, data)?
- What claims in the README are not supported by code or tests in the tree?
- What is the source's scope — what does it NOT claim or cover?
Step 4: Write a draft ingest report
Write to kb/sources/<slug>.ingest.md, where <slug> is the basename of the input directory (e.g. input tmp/position-bias/ → output kb/sources/position-bias.ingest.md). Fill only Classification, Summary, and File Manifest. Leave the four connect-informed sections (Connections Found, Extractable Value, Limitations, Recommended Next Action) as the literal string TO BE FILLED.
Do not pre-draft the four placeholder sections even though Step 3 gave you the material to write them. The point of deferring them is that the connect report in Step 5 reshapes what counts as "new" (Extractable Value should exclude anything the KB already captures) and surfaces tensions (Limitations often cites KB notes the agent hasn't considered). Pre-drafting forces a rewrite and loses the filter connect provides.
Draft frontmatter:
---
description: {one-line retrieval filter — what makes this source distinctive}
source_snapshot: {input-directory-path, e.g. tmp/<slug>/}
ingested: "{current UTC date}"
type: kb/sources/types/ingest-report.md
source_type: code-repository
domains: [{tag1}, {tag2}, {tag3}]
---
Note: source_snapshot points to the working copy, which is typically gitignored and ephemeral. The Pin line below (commit hash) is the canonical identifier for reproducibility.
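One possible way to obtain the commit hash for the Pin line is sketched below, assuming git is installed. The helper names are hypothetical. The top-level check is there because a non-repo directory under tmp/ sits inside the KB's own checkout, and git would otherwise report the KB's HEAD instead of the source's.

```python
import subprocess
from pathlib import Path
from typing import Optional


def _git(source_dir: str, *args: str) -> Optional[str]:
    """Run a git command inside source_dir; return stripped stdout or None on failure."""
    try:
        out = subprocess.run(
            ["git", "-C", source_dir, *args],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def head_commit(source_dir: str) -> Optional[str]:
    """Return the HEAD hash only if source_dir is itself the root of a git checkout."""
    top = _git(source_dir, "rev-parse", "--show-toplevel")
    if top is None or Path(top).resolve() != Path(source_dir).resolve():
        return None  # not a repo, or nested inside some other repo
    return _git(source_dir, "rev-parse", "HEAD")
```

When this returns None, the unpinned form of the Pin line with the capture date applies.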
Enum gap. source_type: code-repository is not in the current kb/sources/types/ingest-report.schema.yaml enum. Extend the enum (add code-repository to the list) as a companion change; otherwise validation will fail. If the directory is not a code repo (e.g. paper + supplements), pick the closest existing value or extend the enum for that case too.
Draft body:
# Ingest: {repo or project name}
Source: {input-directory-path} (ephemeral; see Pin)
Captured: {date from README/git if known}
From: {upstream URL if known, e.g. GitHub repo URL}
Pin: {commit hash if known, else "unpinned — captured HEAD at <date>"}
## Classification
Type: code-repository — {brief justification, e.g. "working eval harness with ~N tests"}
Domains: {tag1}, {tag2}, {tag3}
Author: {credibility signal, or "unknown"}
## Summary
{one paragraph — central claim and how the tree supports it}
## File Manifest
Files read in full:
- `path/to/file1.md` — {one-line justification}
- `path/to/file2.py` — {one-line justification}
- ...
## Connections Found
TO BE FILLED
## Extractable Value
TO BE FILLED
## Limitations (our opinion)
TO BE FILLED
## Recommended Next Action
TO BE FILLED
Step 5: Run connect on the draft
Invoke /cp-skill-connect kb/sources/<slug>.ingest.md.
Connect reads the draft (Classification + Summary + File Manifest carry enough signal for prospecting) and writes its report to kb/reports/connect/sources/<slug>.ingest.connect.md.
Wait for the skill to complete before proceeding.
Step 6: Revise the ingest report using connect output
Read the connect report. Replace the four TO BE FILLED sections.
Relative-link depth. Links you author here are written from kb/sources/<slug>.ingest.md. Use ../notes/…, ../reference/…, ../agent-memory-systems/…, ./<other-source>.md for sibling sources. The connect report lives at a different depth (kb/reports/connect/sources/…), so do not copy its link strings verbatim — rewrite paths relative to the ingest report.
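The depth difference can be illustrated with a small sketch that computes the same link from both locations. The note path and the helper name relink are made up for the example; paths are taken as relative to the repository root.

```python
import posixpath


def relink(target: str, written_from: str) -> str:
    """Relative link to `target` as written inside the file `written_from`."""
    return posixpath.relpath(target, start=posixpath.dirname(written_from))


note = "kb/notes/example-note.md"  # hypothetical note

# From the ingest report: one level up.
print(relink(note, "kb/sources/position-bias.ingest.md"))
# prints ../notes/example-note.md

# From the connect report: three levels up, so its link strings do not transfer.
print(relink(note, "kb/reports/connect/sources/position-bias.ingest.connect.md"))
# prints ../../../notes/example-note.md
```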
Connections Found — summarise which notes the source connects to, relationship types, and the key insight about how this source fits (or doesn't) into the KB graph.
Extractable Value — 3–7 items, each with an effort tag ([quick-win], [experiment], [deep-dive], [just-a-reference]). Focus on what is NEW relative to the connections found. For a code repository specifically, look for:
- Empirical findings the code produces (cite the test or eval file)
- Methods or experimental designs adaptable to our work
- Prompts, benchmarks, or datasets worth reusing
- Claims in the README that the code does NOT support (trustworthiness gap)
Assess reach — does the finding transfer beyond this specific benchmark, model set, or domain? Flag low-reach items.
Limitations (our opinion) — code-repository checks:
- Did the agent verify the code runs, or only read it? State which.
- Missing baselines, restricted model sets, cherry-picked benchmarks
- README claims not backed by code or tests in the tree
- Single-author / single-team context — would the result transfer?
- Unpinned repo state — the analysis rots as the repo evolves; the Pin line above should record the commit or capture date
Recommended Next Action — one specific action (new note with title, update to existing note, brainstorm topic, file as reference).
Verify
- kb/sources/<slug>.ingest.md exists and has no TO BE FILLED markers.
- File Manifest entries match the files actually read in Step 3.
- Every relative link in Connections Found resolves.
- Frontmatter passes schema validation (with the enum extension applied).
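The placeholder check in the list above is easy to mechanize. A minimal sketch follows, with hypothetical helper names; it covers only the TO BE FILLED check, not link resolution or schema validation.

```python
from pathlib import Path

PLACEHOLDER = "TO BE FILLED"


def unfilled_lines(report_text: str) -> list[int]:
    """Return 1-based line numbers that still contain the placeholder string."""
    return [
        number
        for number, line in enumerate(report_text.splitlines(), start=1)
        if PLACEHOLDER in line
    ]


def check_report(path: str) -> list[int]:
    """Read the report at `path` and list the lines that still need filling."""
    return unfilled_lines(Path(path).read_text(encoding="utf-8"))
```

An empty list means the report has no remaining placeholders.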
Tell the user where the report was saved and the recommended next action.
Do NOT
- Do not extract atomic claims from individual files — this is ingestion, not decomposition.
- Do not write files under kb/notes/, kb/reference/, or kb/instructions/. The deliverable is the .ingest.md only.
- Do not modify files inside the source directory.
- Do not skip Step 5 — connect results are load-bearing for Connections Found and Extractable Value.
- Do not run /cp-skill-connect on each selected file individually; one pass on the draft ingest-report is the design.