Workshop: review-revise-gated
Goal: find review and revision arrangements that reliably produce the kinds of improvements we made manually to the session-history note, then codify those arrangements as reusable instructions.
Status: active. Using the gate-based review system. Gates are copied into gates/ for experiment isolation.
Materials
- baseline.md — the note as of 3450a4f (2026-03-20), before any edits
- target.md — the note after manual review and revision (2026-03-25)
- change-catalogue.md — 16 named changes across 4 categories (accessibility, clarity, structure, cosmetic), each with baseline text, problem, and desired direction
- gates/ — local copy of review gates, organized by bundle (accessibility, complexity, frontmatter, prose, semantic, sentence, structural)
Instructions
- run-review.md — apply gates to baseline.md, write per-bundle review files to a run directory
- run-revise.md — revise baseline.md based on review findings, write the revised note to the run directory
Experiment protocol
- Start from baseline.md
- Run run-review.md with the desired bundles, writing to a new run directory
- Run run-revise.md on the review findings, writing revised.md to the same run directory
- Score the result against change-catalogue.md → {run}/scores.md
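The steps above can be sketched as a small helper that sets up the run-directory layout the protocol implies. This is a minimal sketch: the helper and the run id are hypothetical, while the artifact names (revised.md, scores.md) come from the protocol itself.

```python
import tempfile
from pathlib import Path

def new_run_dir(root: Path, run_id: str) -> Path:
    """Create an isolated directory for one experimental run."""
    run = root / run_id
    run.mkdir(parents=True, exist_ok=True)
    return run

# Hypothetical next run; using a temp dir here so the sketch has no side effects.
run = new_run_dir(Path(tempfile.mkdtemp()), "run-09")

# run-review.md writes per-bundle review files into this directory,
# run-revise.md writes revised.md, and scoring writes scores.md.
expected_outputs = ["revised.md", "scores.md"]
```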
Scoring
Each experimental run scores against the change catalogue:
- Hit — makes a change in the same direction (not necessarily identical text)
- Miss — doesn't catch the problem
- Mistake — introduces a new problem or moves in the wrong direction
A run's score is coverage (hits / 16), with mistakes tracked as a separate penalty count. A good arrangement has high coverage and zero mistakes.
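The scoring rule can be written down directly. A minimal sketch, assuming each catalogue change gets one of the three outcomes above; the function name and the outcome encoding are hypothetical, the arithmetic (hits / 16, mistakes counted separately) is from the note.

```python
from collections import Counter

def score_run(outcomes, total=16):
    """Score one run against the change catalogue.

    outcomes: dict mapping change id -> "hit" | "miss" | "mistake".
    Returns (coverage, mistake_count).
    """
    counts = Counter(outcomes.values())
    return counts["hit"] / total, counts["mistake"]

# Run-08's revision result: 11/16 hits, no mistakes.
outcomes = {f"change-{i}": "hit" for i in range(11)}
outcomes.update({f"change-{i}": "miss" for i in range(11, 16)})
coverage, mistakes = score_run(outcomes)  # → (0.6875, 0)
```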
Results
Summary across all runs
Early runs (01-04) used monolithic review instructions. Runs 05-08 used the gate system. Run data for 01-07 was removed; run-08 is the current baseline.
| Metric | Runs 01-02 (monolithic) | Runs 03-04 (v2 monolithic) | Runs 05-07 (gates) | Run-08 (gates, tuned) |
|---|---|---|---|---|
| Revision hits | 6 | 11-12 | 9-11 | 11 |
| Wrong direction | 0-1 | 0-1 | 0 | 0 |
| Factual errors | 0-1 | 0 | 0 | 0 |
| Detection (WARN+INFO) | 2-9 | 11 | 9-14 | 14 |
Run-08: current gate system
34 gates across 7 bundles. Key gate additions from the experiment:
- prose/bridge-paragraph-duplication — split from redundant-restatement to force separate checking
- sentence/concept-attribution — catches prose claims about what linked notes contain
- sentence/clause-packing — catches revision bloat (sentences overloaded with clauses)
Revision: 11/16 hits, 0 wrong-direction, 0 factual errors.
Detection: 14/16. Two items undetected (S2 merge sections, S3 compress taxonomy). Two items detected but not fixed: C3 at INFO (concept-attribution), S1 partially fixed (bridge paragraph trimmed but not deleted).
See run-08/scores.md for per-change scoring and run-08/gate-noise-audit.md for gate reliability analysis.
Remaining gaps
- S2 (merge sections) — no gate checks for section-merging opportunities
- S3 (compress taxonomy) — complexity gates consistently approve the taxonomy, and the completeness gate pushes toward expansion (the wrong direction). Likely irreducible editorial judgment.
- S1 (bridge paragraph) — detected but revision trimmed wrong part. Detection solved; revision quality is the bottleneck
- C3 (concept attribution) — detected at INFO, needs WARN to trigger action
Key findings
Gate granularity matters. Decomposing monolithic reviews into individual gates makes it possible to improve individual checks, add new checks, and select relevant subsets per note.
Separate gates for separate patterns. When two failure modes share a gate, the reviewer satisfies the gate by finding either one. Splitting redundant-restatement into section-opening and bridge-paragraph gates was necessary to catch S1.
Exhaustive-checking language helps. Adding "report all instances, not just the first" to gate tests improved detection coverage.
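The two findings above combine into a simple pattern for gate definitions: one failure mode per gate, each phrased with exhaustive-checking language. A sketch of what that looks like as data; the record schema is an assumption, not the workshop's actual gate format.

```python
# Hypothetical gate records illustrating one-pattern-per-gate plus
# exhaustive-checking phrasing. Field names are illustrative only.
gates = [
    {
        "id": "prose/redundant-restatement",
        "test": ("Does any section opening restate the previous section? "
                 "Report all instances, not just the first."),
    },
    {
        "id": "prose/bridge-paragraph-duplication",  # split out so S1 gets its own check
        "test": ("Does any bridge paragraph duplicate adjacent content? "
                 "Report all instances, not just the first."),
    },
]
```

Because the bridge-paragraph pattern has its own gate, a reviewer can no longer satisfy a shared gate by finding only the section-opening case.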
Severity thresholds drive revision behavior. The reviser treats WARN as actionable and INFO as optional. Five gates have unstable severity (framing-mismatch, general-before-specific, unidentified-references, redundant-restatement, confidence-miscalibration): they flip between WARN and INFO across runs on the same baseline. The noise audit (run-08/gate-noise-audit.md) documents specific threshold fixes.
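The WARN/INFO split amounts to a filter between detection and revision. A minimal sketch, assuming findings are simple records with a severity field; the field names and record shape are hypothetical.

```python
# The reviser acts on WARN; INFO findings are surfaced but optional.
ACTIONABLE = {"WARN"}

def actionable_findings(findings):
    """Return only the findings the reviser is expected to fix."""
    return [f for f in findings if f["severity"] in ACTIONABLE]

findings = [
    {"gate": "sentence/concept-attribution", "severity": "INFO"},   # detected, not fixed (C3)
    {"gate": "prose/bridge-paragraph-duplication", "severity": "WARN"},
]
to_fix = actionable_findings(findings)
```

This is why C3 went unfixed in run-08: detected at INFO, it never crossed the threshold into the reviser's work queue.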
Competing findings create wrong-direction risk. The completeness-boundary-cases gate correctly detects taxonomy gaps but recommends expansion when compression is the right fix. The gate now includes guidance to consider compression before recommending expansion.
Detection ceiling: 14/16. Revision ceiling: ~12/16. The two undetected items (S2, S3) are editorial judgments that resist gate formulation. Iteration (reviewing revised output again) can recover 1 additional hit, as demonstrated in early runs.