Sources Directory
Type: index
- "Creative Thinking" (blog-post)
- A-Mem: Agentic Memory for LLM Agents (academic-paper)
- A-MEM: Learning Operations Analysis (note) — Dissects A-MEM's four fully-automatic operations (construct, link, evolve, retrieve) — all accretive, none curative — identifying the missing vocabulary (delete, split, reorganize, assess quality) that separates accumulation from curation
- Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents (academic-paper)
- Agentic Code Reasoning (academic-paper) — Semi-formal reasoning templates (explicit premises, execution traces, formal conclusions) improve LLM code verification by 5-12pp across patch equivalence, fault localization, and code QA tasks
- Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents (academic-paper)
- Agentic Note-Taking 23: Notes Without Reasons (x-article)
- AI Components for a Deterministic System (An Example) (blog-post)
- Andrej Karpathy talks about "Claws" (blog-post)
- Automated linking improves retrieval but may degrade navigability (note) — Triangulates A-MEM, Notes Without Reasons, and the open-problem note — automated linking improves retrieval (QA benchmarks) but degrades navigability (agent trust in link infrastructure); the distinction is adjacency versus connection
- Beyond Transformers: Sudoku Bench (blog-post) — Pathway's BDH model achieves 97.4% accuracy on extreme Sudoku while leading LLMs score 0%, using the gap as evidence that transformer architecture has fundamental limits for constraint-satisfaction reasoning and arguing for post-transformer latent-space models.
- Can Complexity and Uncomputability Explain Intelligence? SuperARC: A Test for Artificial Super Intelligence Based on Recursive Compression (academic-paper) — Introduces SuperARC, an AIT-grounded benchmark showing frontier LLMs score near zero on recursive compression tasks and newer versions often regress, while neuro-symbolic CTM/BDM methods achieve perfect scores — evidence that statistical pattern matching differs fundamentally from algorithmic abstraction.
- Cognee: Knowledge Engine for AI Agent Memory (github-repo)
- Context Engineering for AI Agents in Open-Source Software (academic-paper)
- Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs (academic-paper) — Empirical study defining and measuring Maximum Effective Context Window (MECW) across 11 frontier LLMs — finds MECW is drastically smaller than advertised MCW, shifts by task type, and that large context windows cause hallucination rates to approach 100%.
- Continual Learning in Token Space (blog-post) — Letta reframes continual learning for agents as optimization over learned context rather than weights, arguing token-space memory is the primary transferable substrate for long-lived agents
- ConvexBench: Can LLMs Recognize Convex Functions? (academic-paper)
- Dario Amodei — "We are near the end of the exponential" (blog-post) — Anthropic CEO's capability-timeline predictions — verifiable domains get confident timelines, unverifiable ones get hedged, implicitly confirming oracle-strength thesis
- EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages (academic-paper) — OOD code benchmark using esoteric languages to separate transferable reasoning from benchmark memorization and contamination
- Evaluating Long-Context Reasoning in LLM-Based WebAgents (academic-paper) — Benchmark showing LLM-based web agents fail badly under long context with injected irrelevant task sequences — success rates drop from 40-50% to under 10% at 150k tokens, with loop and lost-objective failures dominating; implicit RAG provides only modest relief.
- From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (academic-paper)
- Graphiti: Temporal Knowledge Graph for AI Agents (github-repo)
- Harness Engineering Is Cybernetics (x-article)
- Harness Engineering: Leveraging Codex in an Agent-First World (blog-post)
- How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark (academic-paper) — Introduces GSM-DC, a controlled benchmark using symbolic DAGs to systematically measure how irrelevant context degrades LLM reasoning — quantifies power-law error scaling with distractor count, and shows Hard-IC training plus PRM-guided tree search are the most effective robustness interventions.
- Improving AI Skills with autoresearch & evals-skills (x-article)
- Ingest: "Creative Thinking" (conceptual-essay) — Shannon's 1952 lecture cataloguing six explicit problem-solving operators (simplification, analogy, restatement, generalization, structural analysis, inversion) as a portable creative toolkit
- Ingest: A-MEM: Agentic Memory for LLM Agents (scientific-paper) — Zettelkasten-inspired flat agent memory with embedding linking and LLM-driven evolution — benchmark success without curation operations or inspectable links
- Ingest: Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents (scientific-paper) — Formal framework (ABC) extending Design-by-Contract to autonomous agents — introduces probabilistic compliance model (p,delta,k), Lyapunov drift bounds, hard/soft constraint separation with typed recovery, and a YAML DSL for specifying behavioral contracts
- Ingest: Agentic Code Reasoning (scientific-paper) — Semi-formal reasoning templates (explicit premises, execution traces, formal conclusions) improve LLM code verification by 5-12pp — empirical evidence for structure-as-distribution-selector and interpretation-narrowing with quantified cost (2.8x steps)
- Ingest: Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for LLM Agents (scientific-paper) — RL-trained unified LTM/STM memory policy for LLM agents — confirms memory management is learnable when task-completion oracles exist, but operates on opaque weights and low-reach facts
- Ingest: Agentic Note-Taking 23: Notes Without Reasons (conceptual-essay) — First-person agent testimony that propositional link semantics differ in kind from embedding adjacency, with a Goodhart corruption argument and an unresolved curation-scaling question
- Ingest: AI Components for a Deterministic System (An Example) (practitioner-report) — Evans argues that separating modeling (schema creation) from classification (schema application) tames LLM non-determinism — a practitioner case study of constraining via taxonomy freezing
- Ingest: Andrej Karpathy talks about "Claws" (conceptual-essay)
- Ingest: Beyond Transformers: Sudoku Bench (practitioner-report) — Company blog using Sudoku benchmark (97.4% vs 0% LLM) to argue transformers are fundamentally limited for constraint satisfaction; undisclosed BDH architecture, weak methodology, but adds a third problem domain to the architectural-limits evidence cluster alongside Ebrahimi and ConvexBench
- Ingest: Cognee: Knowledge Engine for AI Agent Memory (tool-announcement) — Pipeline-first knowledge engine with custom Pydantic schemas for LLM entity extraction, poly-store graph+vector design, and an undersized enrichment phase that concretely marks the boundary between automatable extraction and open enrichment problems
- Ingest: Context Engineering for AI Agents in Open-Source Software (scientific-paper) — First empirical study of AI context files across 466 OSS projects — provides naturalistic data on content categories, five writing styles as constraint strategies, add-then-modify evolution pattern, and 50% stagnation rate that grounds and challenges KB constraining theory
- Ingest: Context Is What You Need — The Maximum Effective Context Window for Real World Limits of LLMs (scientific-paper) — Empirical study measuring Maximum Effective Context Window (MECW) across 11 frontier LLMs — finds MECW is up to 99% smaller than advertised MCW, varies by task type, and that exceeding MECW drives hallucination rates toward 100%; directly grounds the KB's bounded-context theory with multi-model dose-response data
- Ingest: Continual Learning in Token Space (conceptual-essay) — Letta reframes continual learning as optimizing learned context rather than weights, but the KB's stronger frame is weight space versus repo artifacts, including codified procedures
- Ingest: ConvexBench: Can LLMs Recognize Convex Functions? (scientific-paper) — Benchmark proving LLM compositional reasoning collapses with depth (not token count), recovered by recursive decomposition with focused context — quantitative evidence for scheduling model predictions
- Ingest: Dario Amodei — "We are near the end of the exponential" (conversation-thread) — Anthropic CEO's capability-timeline predictions implicitly confirm oracle-strength thesis — verifiable domains (coding, math) get confident timelines while unverifiable domains (novel writing, science) get hedged ones
- Ingest: EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages (scientific-paper) — Esoteric-language code benchmark arguing standard coding scores mostly measure pretraining fit, with interpreter feedback beating textual critique on OOD tasks
- Ingest: Evaluating Long-Context Reasoning in LLM-Based WebAgents (scientific-paper) — Ingest of NeurIPS 25 workshop paper benchmarking LLM web agents under long context (25k-150k tokens) with injected irrelevant task sequences — provides agent-level empirical evidence for soft degradation, loop entrapment, and objective loss, extending GSM-DC's distractor findings to multi-session agentic tasks.
- Ingest: From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (scientific-paper)
- Ingest: Graphiti: Temporal Knowledge Graph for AI Agents (design-proposal) — Graph-first agent memory with bi-temporal edge invalidation — the strongest counterexample to files-first architecture in the surveyed memory systems
- Ingest: Harness Engineering Is Cybernetics (conceptual-essay)
- Ingest: Harness Engineering: Leveraging Codex in an Agent-First World (practitioner-report) — Practitioner report on 1M LOC fully agent-generated codebase — harness engineering as constrain/inform/verify/correct, entropy management via background cleanup agents, error messages as dual-function constraining
- Ingest: How Is LLM Reasoning Distracted by Irrelevant Context? (scientific-paper) — Controlled benchmark quantifying how irrelevant context degrades LLM reasoning via power-law error scaling with distractor count — strongest empirical grounding for the soft-degradation thesis in this KB; training and inference-time mitigations tested.
- Ingest: Improving AI Skills with autoresearch & evals-skills (practitioner-report) — Three-take Auto Research field report where optimization only worked after manual error analysis, failure taxonomy design, and judge calibration across the Three Gulfs.
- Ingest: Intelligent AI Delegation (scientific-paper) — Google DeepMind delegation framework centers verifiability, liability, trust, and 11 task axes in agent delegation; notable for accountability vacuum and liability firebreaks in long chains
- Ingest: Language Models, Like Humans, Show Content Effects on Reasoning Tasks (scientific-paper) — Empirical demonstration that LLMs mirror human content effects on reasoning (syllogisms, NLI, Wason) — content bias survives scaling and instruction tuning but chain-of-thought partially restores content-independent reasoning
- Ingest: Large Language Model Agents Are Not Always Faithful Self-Evolvers (scientific-paper) — Causal-intervention paper showing compressed agent memories can improve systems yet fail faithfulness tests, making behavioral dependence the missing metric for self-evolving agents
- Ingest: Lessons from Building AI Agents for Financial Services (practitioner-report) — Production practitioner report on building AI agents for financial services — validates files-not-database at commercial scale (S3-first with derived PostgreSQL), documents skill shadowing as user-customization mechanism, and articulates "model eats scaffolding" as an explicit design principle with fiscal-period normalization as calculator-regime counterexample
- Ingest: Letta (MemGPT): Stateful Agents with Self-Managed Memory (design-proposal) — Agent memory platform where the LLM self-manages a three-tier memory hierarchy (core/recall/archival) using an OS analogy — the strongest existing exemplar of the agent-self-managed agency model, now evolving toward git-backed memory files
- Ingest: Mem0: Universal Memory Layer for AI Agents (tool-announcement) — Mem0's two-phase add pipeline (extract facts + LLM-judged CRUD reconciliation) is the purest production example of automated accretion-without-synthesis — now contextualized against eleven systems in the comparative review
- Ingest: Mesa Optimizers and Language Recursion (conceptual-essay) — Speculative essay arguing mesa optimizers may emerge suddenly because language recursion and learned search both compress many cases into reusable generative rules.
- Ingest: Meta-Harness: End-to-End Optimization of Model Harnesses (scientific-paper) — Controlled ablation showing raw execution traces (10 MTok/iter) outperform summaries and scores-only feedback by 10+ points in automated outer-loop harness search — strongest empirical case for diagnostic richness as a binding constraint on automated improvement
- Ingest: Minimum Viable Ontology / Domain Maps (conversation-thread) — Tweet thread proposing "minimum viable ontology" — the smallest term list to orient a newcomer in a domain — with a vibecoded prototype (domainmaps.co) and pedagogical framing via "conceptual thresholds"
- Ingest: Multi-Agent Memory from a Computer Architecture Perspective (scientific-paper) — Computer-architecture analogy for multi-agent memory — shared/distributed paradigms, three-layer hierarchy, consistency protocols as the critical unsolved problem
- Ingest: On the "Induction Bias" in Sequence Models (scientific-paper) — 190k-run empirical study showing transformers need orders-of-magnitude more data than RNNs for state tracking due to the absence of a step-by-step induction bias; introduces sharing factor kappa quantifying cross-length mechanism reuse
- Ingest: OpenClaw-RL: Train Any Agent Simply by Talking (scientific-paper) — RL framework that trains agents from live next-state signals (user replies, tool outputs, terminal feedback, GUI state) during deployment — collapses the training/deployment boundary and challenges the KB's three-timescale model by performing weight updates from interactions the agent is already having.
- Ingest: Post by @deepfates — LLM "memory" as context stuffing (conversation-thread) — Deepfates argues LLM "memory" is just context-stuffing that creates false salience (Chekhov's gun), advocates agentic context-building, but concludes weight updates are necessary — directly contradicts this KB's durability-not-weights position
- Ingest: Post by @koylanai (conceptual-essay) — Argues that pairwise judging plus round-robin win rates is a better evaluation primitive than absolute scoring for open-ended LLM tasks with no hard ground truth
- Ingest: Professional Software Developers Don't Vibe, They Control (scientific-paper) — Empirical study (N=112) finding experienced developers control AI agents through SE practices, not vibe coding — grounds constraining, underspecification, and programming-practices-transfer arguments
- Ingest: Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations (scientific-paper) — Empirical study measuring code LLM stability under emotion/personality prompt variations — finds performance and stability are decoupled objectives, smaller models can be more stable, and emotional prompting reveals confidence miscalibration invisible to standard benchmarks
- Ingest: Recursive Language Models - what finally gave me the 'aha' moment (practitioner-report) — Detailed practitioner walkthrough of RLM architecture via six-architecture comparison (direct gen, RAG, ReAct, CodeAct, CodeAct+subagents, RLM) — the most concrete evidence for REPL-as-substrate, symbolic variable return, and scaffold-level truncation in the KB
- Ingest: Skill Synthesis — Materializing Knowledge as Skills (practitioner-report) — Sentry co-founder's practitioner report on synthesizing Claude Code skills from domain-specific source material (commit history, security patches, OWASP docs) — found 8 real IDORs missed by professional pen testing
- Ingest: Slate: Moving Beyond ReAct and RLM (practitioner-report) — Practitioner report on thread-weaving agent architecture — bounded worker threads return compressed episodes to an orchestrator, solving working memory, strategic coherence, and task decomposition simultaneously; the strongest practitioner convergence evidence for the bounded-context orchestration model to date
- Ingest: Solving a Million-Step LLM Task with Zero Errors (scientific-paper) — MAKER achieves zero errors over one million LLM steps via maximal decomposition into single-step microagents with first-to-ahead-by-k voting and red-flagging — proves O(s ln s) cost scaling when hard per-step oracles exist
- Ingest: Spacebot: AI Agent for Teams and Communities (tool-announcement)
- Ingest: Structured Test-Time Scaling: From Multi-Agent Systems to General Inference Architectures (conceptual-essay) — Formal proof that topology compression, scope isolation, and verification form a causal dependency chain enabling hierarchical MAS to bypass exponential error accumulation — directly grounds the KB's separate treatments of decomposition, scoping, and error correction as a unified principle
- Ingest: SuperARC — Can Complexity and Uncomputability Explain Intelligence? (scientific-paper) — Ingest of SuperARC — AIT-grounded benchmark where frontier LLMs score phi ~0.03 while neuro-symbolic CTM/BDM achieves 1.000 on recursive compression; newer models regress; print-statement-only outputs demonstrate zero algorithmic abstraction
- Ingest: The Anatomy of an Agent Harness (conceptual-essay) — Practitioner taxonomy deriving harness components (filesystem, bash, sandboxes, memory, context management, long-horizon execution) from model limitations — provides the component anatomy that bridges Lopopolo's practice and the cybernetics framing
- Ingest: The Bitter Lesson (conceptual-essay)
- Ingest: The Bug That Shipped (practitioner-report) — 3,700-trial practitioner evidence that coding models can diagnose deployment failures when explicitly probed but rarely surface them in undirected self-review
- Ingest: The File System Is the New Database: How I Built a Personal OS for AI Agents (practitioner-report)
- Ingest: The Flawed Ephemeral Software Hypothesis (conceptual-essay) — Essay distinguishing vibe coding from true software ephemerality, arguing that state, integration, interface stability, and auditability keep important systems anchored to durable artifact stacks.
- Ingest: The Spec Is the New Code — A Guide to Spec Driven Development (practitioner-report) — MercadoLibre engineering lead's practitioner guide to Spec Driven Development — the spec/plan/task/implement cascade as methodology for eliminating agent ambiguity, with ecosystem convergence evidence and maturity-level progression
- Ingest: Toulmin Argument (conceptual-essay) — Pedagogical treatment of Toulmin's six-part argument model — canonical source for the structured-claim type's Evidence/Reasoning/Caveats sections
- Ingest: Towards a Science of AI Agent Reliability (scientific-paper)
- Ingest: Towards a Science of Scaling Agent Systems (scientific-paper)
- Ingest: Trajectory-Informed Memory Generation for Self-Improving Agent Systems (scientific-paper) — IBM pipeline extracts strategy/recovery/optimization tips from agent execution trajectories and injects them at runtime — subtask granularity and LLM-guided retrieval drive gains, especially on complex tasks (+14.3 pp SGC); provides a concrete closed learning loop with inspectable output but narrow oracle (AppWorld task completion).
- Ingest: What spec-driven development gets wrong (practitioner-report) — Augment's argument that spec-driven development fails unless agents co-maintain the spec — bidirectional spec as a mechanism for matching maintenance throughput to generation throughput
- Ingest: What Survives in Multi-Agent Systems (conceptual-essay) — Applied bitter-lesson analysis predicting which multi-agent patterns survive stronger models — argues filesystem, forking, and spawning are structural while fixed orchestration is a vision feature
- Ingest: When code is free, research is all that matters (conceptual-essay) — Investor/researcher argument that oracle availability (not capability) determines automation boundary for cognitive work — research taste is unautomatable because problem selection has no ground truth
- Ingest: Why AI systems don't learn and what to do about it (conceptual-essay) — Position paper arguing current AI externalizes learning into human-run MLOps and proposing an A-B-M architecture where meta-control arbitrates observation and action learning for lifelong adaptation.
- Intelligent AI Delegation (academic-paper) — Google DeepMind framework for intelligent AI delegation — proposes adaptive protocols covering task decomposition, multi-objective optimization, trust/reputation, verifiable completion, and security for human-AI and AI-AI delegation networks, with explicit analysis of how MCP, A2A, AP2, and UCP map onto these requirements.
- Language Models, Like Humans, Show Content Effects on Reasoning Tasks (academic-paper)
- Large Language Model Agents Are Not Always Faithful Self-Evolvers (academic-paper) — Causal-intervention paper showing self-evolving agents rely on raw trajectories more faithfully than condensed experience, exposing a compression-faithfulness gap across frameworks, models, and environments
- Lessons from Building AI Agents for Financial Services (x-article)
- Letta (MemGPT): Stateful Agents with Self-Managed Memory
- Mem0: Universal Memory Layer for AI Agents (github-repo)
- Mesa Optimizers and Language Recursion (blog-post) — Speculative blog post connecting mesa optimizers to language recursion by treating both as compressed generative rules that can appear as sudden capability jumps.
- Meta-Harness: End-to-End Optimization of Model Harnesses (academic-paper) — Stanford/MIT paper proposing Meta-Harness, an outer-loop system that uses a coding agent with full filesystem access to prior code and execution traces to automatically search over and optimize LLM harnesses — outperforming hand-engineered baselines on text classification and TerminalBench-2.
- Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead (academic-paper) — Position paper reframing multi-agent memory management through a computer architecture lens — proposes shared vs. distributed memory paradigms, a three-layer hierarchy (I/O, cache, memory), and identifies memory consistency as the most urgent unresolved challenge for scalable multi-agent systems.
- On the "Induction Bias" in Sequence Models (academic-paper)
- OpenClaw-RL: Train Any Agent Simply by Talking (academic-paper) — Framework that converts live next-state signals (user replies, tool outputs, terminal feedback, GUI state) into RL rewards and token-level supervision, enabling a single policy to personalize and improve on agentic tasks simultaneously.
- Post by @deepfates (x-post)
- Post by @koylanai (x-post)
- Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025 (academic-paper)
- Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations (academic-paper)
- Recursive Language Models - what finally gave me the 'aha' moment (x-article)
- Skill Synthesis: Materializing Knowledge as Skills (x-post)
- Solving a Million-Step LLM Task with Zero Errors (academic-paper)
- Spacebot: AI Agent for Teams and Communities (web-page)
- Structured Test-Time Scaling: From Multi-Agent Systems to General Inference Architectures (academic-paper) — Unified theoretical framework explaining how three structural mechanisms (topology compression, scope isolation, verification) enable hierarchical multi-agent systems to bypass exponential error accumulation in test-time scaling.
- The Anatomy of an Agent Harness (x-article)
- The Bitter Lesson (encyclopedia-article)
- The Bug That Shipped (x-article)
- The File System Is the New Database: How I Built a Personal OS for AI Agents (x-article)
- The Flawed Ephemeral Software Hypothesis (blog-post) — Essay arguing AI makes software more malleable, not ephemeral, because validation, state, interface stability, and auditability remain the load-bearing bottlenecks.
- The Spec Is the New Code. A Guide to Spec Driven Development (x-article)
- Thread by @melodyskim (x-thread)
- Toulmin Argument (documentation)
- Towards a Science of AI Agent Reliability (academic-paper)
- Towards a Science of Scaling Agent Systems (academic-paper)
- Trajectory-Informed Memory Generation for Self-Improving Agent Systems (academic-paper) — IBM Research framework that extracts three categories of actionable tips (strategy, recovery, optimization) from agent execution trajectories and injects them at runtime — evaluated on AppWorld showing up to 14.3 pp gains in scenario goal completion.
- What spec-driven development gets wrong (x-article)
- What Survives in Multi-Agent Systems
- When code is free, research is all that matters (x-article)
- Why AI systems don't learn and what to do about it (academic-paper) — Dupoux, LeCun, and Malik argue current AI externalizes learning into human-run MLOps, then propose an A-B-M architecture where observation learning, action learning, and a meta-control plane are integrated for lifelong adaptation.