Improvement log

Append one line per observation. Don't fix anything — just record it.

Format: - path/to/note.md: what needs improving

  • ABSTRACTION: [memory-management-policy-is-learnable-but-oracle-dependent (AgeMem: trajectory->weights), trajectory-informed-memory-generation source (trajectory->tips), constraining-during-deployment-is-continuous-learning (deployment experience->artifacts)] share unnamed structure: trajectory-to-improvement mechanisms that differ by output substrate (weights vs tips vs files), with the substrate determining the inspectability-automation trade-off
  • ABSTRACTION: [agents-md-should-be-organized-as-a-control-plane, context-engineering, why-ai-systems-dont-learn-and-what-to-do-about-it source] share unnamed structure: control-plane abstractions recur at repository, prompt, and learning-architecture levels, but the transfer conditions between those levels are not yet named
  • SYNTHESIS: [three-space-agent-memory-maps-to-tulving-taxonomy (content-type axis), multi-agent-memory-computer-architecture-perspective source (hierarchy-level axis)] — two independent decompositions of agent memory from different traditions (cognitive science vs computer architecture) that together predict a two-axis taxonomy: content type x hierarchy level
  • ABSTRACTION: [llm-context-is-composed-without-scoping (single-agent scoping failure), multi-agent-memory-computer-architecture-perspective source (multi-agent consistency challenge), synthesis-is-not-error-correction (error amplification from uncoordinated agents)] share unnamed structure: shared mutable state without coordination primitives, manifesting at different scales (within-context, across-agents, across-outputs)
  • ABSTRACTION: [memory-management-policy-is-learnable-but-oracle-dependent (AgeMem RL), trajectory-informed-memory-generation source (tip extraction), openclaw-rl source (live RL from next-state signals)] share unnamed structure: oracle-gated learning from interaction — all three use task-level oracles to learn from deployment interactions but differ in what they produce (policy in weights, tips in text, behavioral improvement in weights), forming a substrate spectrum that the KB's deploy-time learning framework covers only at the artifact end
  • kb/notes/agent-orchestration-needs-coordination-guarantees-not-just-coordination-channels.md: table of composition modes (contamination/inconsistency/amplification) should add a fourth row for accountability vacuum in delegation chains, with liability firebreaks as the missing primitive — the Tomasev delegation paper provides the source material
  • SYNTHESIS: [decomposition-heuristics-for-bounded-context-scheduling (decompose for context efficiency), the-boundary-of-automation-is-the-boundary-of-verification (automate where verification is cheap), intelligent-ai-delegation source (decompose until verification is feasible)] — verification cost as the single governing constraint behind both task decomposition and delegation decisions, currently stated independently in each source but never unified
  • kb/work/prompt-bottleneck/how-a-system-built-around-prompt-limitations-would-look.md: workshop text with no frontmatter — needs /convert before connections can be formalized
  • kb/work/prompt-bottleneck/how-a-system-built-around-prompt-limitations-would-look.md: "Notes for mining later" section identifies 5 potential standalone claims — strong split candidate
  • kb/notes/context-engineering.md: this note defines context engineering narrowly (routing, loading, scoping, maintenance) but the prompt-bottleneck exploration argues the scope should include storage format, retrieval architecture, knowledge lifecycle, session boundaries, and inter-agent communication — tension worth surfacing
  • ABSTRACTION: [session-history-should-not-be-the-default-next-context (session level), llm-mediated-schedulers-are-a-degraded-variant (scheduler level), conversation-vs-prompt-refinement-in-agent-to-agent-coordination (agent-agent level), how-a-system-built-around-prompt-limitations-would-look (system architecture level)] share unnamed structure: "conversational architecture is the universal degraded variant" — each note argues against conversation-as-default at a different architectural layer
  • kb/notes/rlm-has-the-model-write-ephemeral-orchestrators-over-sub-agents.md: missing source citation — the neural_avb RLM thread (kb/sources/recursive-language-models-what-finally-gave-me-the-aha-moment-2035040781074145412.md) is primary evidence for this note's claims but is not cited
  • kb/sources/recursive-language-models-what-finally-gave-me-the-aha-moment-2035040781074145412.md: raw capture with no .ingest.md — needs /ingest to produce structured analysis before formal source-layer connections
  • kb/notes/evaluation-index.md: generated "Other tagged notes" section is empty despite evaluation-tagged notes like evaluation-automation-is-phase-gated-by-comprehension.md — rebuild or investigate generator/tag coverage
  • kb/notes/the-boundary-of-automation-is-the-boundary-of-verification.md: load-bearing synthesis note is still seedling while multiple notes use it as a foundation; maturity/status likely needs review
  • SYNTHESIS: [bounded-context-orchestration-model (topology/decomposition), llm-context-is-composed-without-scoping (scope isolation), error-correction-works-above-chance-oracles-with-decorrelated-checks (verification)] — the Xinming Tu source proves these three form a causal chain (topology creates decomposition boundaries, isolation manufactures verifiable units, verification exploits the structure) but no KB note yet names this dependency ordering as a unified principle
  • kb/notes/scheduler-llm-separation-exploits-an-error-correction-asymmetry.md: status is speculative but the Xinming Tu structured-test-time-scaling source provides formal evidence for the same separation — candidate for status promotion
  • SYNTHESIS: [knowledge-storage-does-not-imply-contextual-activation (relevant knowledge fails to activate), agent-context-is-constrained-by-soft-degradation-not-hard-token-limits (irrelevant knowledge dilutes attention)] — indiscriminate context loading produces a double failure (false salience from irrelevant content + activation failure from relevant knowledge), both invisible due to soft degradation; no note yet names this combined failure model
  • kb/notes/knowledge-storage-does-not-imply-contextual-activation.md: open question "does more context help or hurt by diluting cues?" is directly answered by agent-context-is-constrained-by-soft-degradation — the volume dimension of soft degradation IS cue dilution; the two notes should cross-link
  • SYNTHESIS: [induction-bias-sequence-models-ebrahimi-2026.ingest (algebraic state tracking), convexbench-can-llms-recognize-convex-functions.ingest (compositional reasoning), pathway-beyond-transformers-sudoku-bench (constraint satisfaction)] — three sources from different problem domains converging on "transformers fail at calculator-class structured reasoning despite trivial verification"; could become a catalog note grounding the bitter-lesson-boundary with diverse empirical evidence, with each source assessed for evidence quality
  • kb/notes/bitter-lesson-boundary.md: the Relevant Notes section has grown to 7 entries and now includes two source-layer papers providing quantitative evidence, but the note body doesn't reference them — the grounding evidence lives only in the link annotations
  • kb/sources/pathway-beyond-transformers-sudoku-bench.md: no .ingest.md exists; the source makes strong claims that need structured limitations analysis before connections carry weight
  • SYNTHESIS: [superarc source (recursive compression), esolang-bench source (code generation OOD), pathway-sudoku source (constraint satisfaction), ebrahimi induction-bias source (state tracking)] — four sources now converge on "LLMs score near zero on well-specified problems that strip training-distribution shortcuts"; the log already flagged a three-source version (line 26) and SuperARC is a fourth independent domain with the strongest methodology (AIT-grounded, mathematically proven equivalence between compression and prediction)
  • kb/sources/superarc-ait-benchmark-llm-compression-abstraction.md: no .ingest.md exists; the source is rich enough to warrant proper ingest with domain classification, author credibility assessment, and extractable value catalog
  • kb/notes/reverse-compression-is-the-failure-mode-where-llm-output-expands-without-adding-information.md: the note lacks examples outside KB writing; SuperARC's print-statement-only code generation (programs that simply print the target sequence) is reverse-compression in a formal benchmark context and would strengthen the note's empirical grounding
  • kb/notes/first-principles-reasoning-selects-for-explanatory-reach-over-adaptive-fit.md: the note needs more concrete empirical examples of reach-vs-fit; SuperARC's integer-vs-binary sequence performance gap (LLMs perform well on integer sequences from training data but fail on binary sequences requiring genuine compression) is a clean illustration
  • SYNTHESIS: [gsm-dc source (math reasoning with distractor injection), convexbench source (compositional symbolic reasoning), llm-webagents-long-context-reasoning-benchmark source (multi-session web agent tasks)] — three sources document the same irrelevant-context degradation pattern at different abstraction levels; no note yet names this cross-level consistency or argues that mitigation must be architectural (scoping/selective loading) rather than content-level (summarization/RAG)
  • kb/notes/agent-context-is-constrained-by-soft-degradation-not-hard-token-limits.md: already cites GSM-DC; should also cite the web agent benchmark as evidence extending the soft-degradation claim from isolated reasoning to multi-session agentic tasks
  • kb/notes/effective-context-is-task-relative-and-complexity-relative-not-a-fixed-model-constant.md: open question asks "Which natural-language tasks exhibit the same complexity-dominant collapse that ConvexBench shows in symbolic reasoning?" — the web agent benchmark is a partial answer
  • kb/notes/process-structure-and-output-structure-are-independent-levers.md: the note has no empirical evidence for independent degradation under noise; GSM-DC's PAcc vs SAcc metrics under distractor injection provide exactly that evidence (Finding IV: irrelevant context degrades path selection AND arithmetic execution independently)
  • kb/sources/gsm-dc-llm-reasoning-distracted-irrelevant-context.md: no .ingest.md exists; the source is rich enough to warrant /ingest with structured value extraction, especially the training intervention findings (Findings III-V) and PRM methodology
  • kb/notes/effective-context-is-task-relative-and-complexity-relative-not-a-fixed-model-constant.md: GSM-DC's controlled DAG methodology directly answers the open question about "a clean empirical regime where volume can be varied while task difficulty and compositional complexity stay mostly fixed" — this should be cited
  • SYNTHESIS: [agent-context-is-constrained-by-soft-degradation (qualitative two-axis model), convexbench source (depth axis), paulsen-mecw source (volume axis), gsm-dc source (interaction: delta(rs) grows with depth)] — GSM-DC uniquely provides the quantified interaction term between the two axes; a synthesis note could formalize the cost surface E(m;rs) ~ m^delta(rs) with all three empirical pillars
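
A minimal sketch of the cost surface the last entry proposes, assuming m is context volume and rs is reasoning-step depth (the two axes of the soft-degradation model); the functional form and symbol meanings are hypothesized from the entry, not established anywhere in the KB yet:

```latex
% Hypothesized cost surface: E = expected error, m = context volume,
% r_s = reasoning-step depth. GSM-DC supplies the interaction term:
% the exponent \delta grows with depth, so added volume hurts more
% on deeper problems.
E(m;\, r_s) \sim m^{\delta(r_s)}, \qquad \delta'(r_s) > 0
```

Under this sketch, ConvexBench pins the depth axis, Paulsen/MECW pins the volume axis, and GSM-DC's quantified delta(rs) supplies the cross-term a synthesis note would need.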