Sources Directory
Type: kb/types/index.md
- "Creative Thinking" (snapshot)
- A-Mem: Agentic Memory for LLM Agents (snapshot)
- A-MEM: Learning Operations Analysis (note) - Dissects A-MEM's four fully-automatic operations (construct, link, evolve, retrieve) — all accretive, none curative — identifying the missing vocabulary (delete, split, reorganize, assess quality) that separates accumulation from curation
- Agent Behavioral Contracts for Reliable Agents (snapshot)
- Agentic Code Reasoning (snapshot) - Semi-formal reasoning templates (explicit premises, execution traces, formal conclusions) improve LLM code verification by 5-12pp across patch equivalence, fault localization, and code QA tasks
- Agentic Memory for LLM Agents (snapshot)
- Agentic Note-Taking 23: Notes Without Reasons (snapshot)
- AI Components for a Deterministic System (An Example) (snapshot)
- Andrej Karpathy talks about "Claws" (snapshot)
- Automated linking improves retrieval but may degrade navigability (note) - Triangulates A-MEM, Notes Without Reasons, and the open-problem note — automated linking improves retrieval (QA benchmarks) but degrades navigability (agent trust in link infrastructure); the distinction is adjacency versus connection
- Autoreason: Self-Refinement That Knows When to Stop (snapshot) - Self-refinement paper that makes "do nothing" a first-class candidate via blind fresh-agent Borda tournaments, finding gains mostly in mid-tier models with a generation-evaluation gap
- Beyond Transformers: Sudoku Bench (snapshot) - Pathway's BDH model achieves 97.4% accuracy on extreme Sudoku while leading LLMs score 0%, using the gap as evidence that transformer architecture has fundamental limits for constraint-satisfaction reasoning and arguing for post-transformer latent-space models.
- Coding Agents are Effective Long-Context Processors (snapshot) - Benchmark paper arguing coding agents process long contexts better by turning text corpora into file-system-native tool workflows rather than latent attention or fixed retrieval
- Cognee: Knowledge Engine for AI Agent Memory (snapshot)
- Components of A Coding Agent (snapshot) - Raschka's breakdown of the six architectural components of a coding agent harness — distinguishing the harness from the model and arguing that context quality drives apparent model quality.
- Context Engineering for AI Agents in Open-Source Software (snapshot)
- Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs (snapshot) - Empirical study defining and measuring Maximum Effective Context Window (MECW) across 11 frontier LLMs — finds MECW is drastically smaller than advertised MCW, shifts by task type, and that large context windows cause hallucination rates to approach 100%.
- Continual Learning in Token Space (snapshot) - Letta reframes continual learning for agents as optimization over learned context rather than weights, arguing token-space memory is the primary transferable substrate for long-lived agents
- ConvexBench: Can LLMs Recognize Convex Functions? (snapshot)
- Dario Amodei — "We are near the end of the exponential" (snapshot) - Anthropic CEO's capability-timeline predictions — verifiable domains get confident timelines, unverifiable ones get hedged, implicitly confirming oracle-strength thesis
- EsoLang-Bench (snapshot) - OOD code benchmark using esoteric languages to separate transferable reasoning from benchmark memorization and contamination
- Evaluating Long-Context Reasoning in LLM-Based WebAgents (snapshot) - Benchmark showing LLM-based web agents fail badly under long context with injected irrelevant task sequences — success rates drop from 40-50% to under 10% at 150k tokens, with loop and lost-objective failures dominating; implicit RAG provides only modest relief.
- Everything you need to know about LLM memory (snapshot) - Notion essay arguing that LLM memory needs retrieval, salience, summarization, forgetting, and memory objects rather than raw chat logs.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering (snapshot) - Survey paper framing LLM agent progress as externalization into memory, skills, protocols, and harness engineering rather than only stronger model weights.
- From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (snapshot)
- Graphiti: Temporal Knowledge Graph for AI Agents (snapshot)
- Harness Engineering Is Cybernetics (snapshot)
- Harness Engineering: Leveraging Codex in an Agent-First World (snapshot)
- How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark (snapshot) - Introduces GSM-DC, a controlled benchmark using symbolic DAGs to systematically measure how irrelevant context degrades LLM reasoning — quantifies power-law error scaling with distractor count, and shows Hard-IC training plus PRM-guided tree search are the most effective robustness interventions.
- Improving AI Skills with autoresearch & evals-skills (snapshot)
- Infinite midwit (snapshot) - Adam Mastroianni's objective-vs-subjective intelligence framing for why AI competence and benchmarks still miss taste, wisdom, and idea selection.
- Ingest: "Creative Thinking" (ingest-report) - Shannon's 1952 lecture cataloguing six explicit problem-solving operators (simplification, analogy, restatement, generalization, structural analysis, inversion) as a portable creative toolkit
- Ingest: A-MEM: Agentic Memory for LLM Agents (ingest-report) - Zettelkasten-inspired flat agent memory with embedding linking and LLM-driven evolution — benchmark success without curation operations or inspectable links
- Ingest: Agent Behavioral Contracts for Reliable Agents (ingest-report) - Formal framework (ABC) extending Design-by-Contract to autonomous agents — introduces probabilistic compliance model (p,delta,k), Lyapunov drift bounds, hard/soft constraint separation with typed recovery, and a YAML DSL for specifying behavioral contracts
- Ingest: Agentic Code Reasoning (ingest-report) - Semi-formal reasoning templates (explicit premises, execution traces, formal conclusions) improve LLM code verification by 5-12pp — empirical evidence for structure-as-distribution-selector and interpretation-narrowing with quantified cost (2.8x steps)
- Ingest: Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for LLM Agents (ingest-report) - RL-trained unified LTM/STM memory policy for LLM agents — confirms memory management is learnable when task-completion oracles exist, but operates on opaque weights and low-reach facts
- Ingest: Agentic Note-Taking 23: Notes Without Reasons (ingest-report) - First-person agent testimony that propositional link semantics differ in kind from embedding adjacency, with a Goodhart corruption argument and an unresolved curation-scaling question
- Ingest: AI Components for a Deterministic System (An Example) (ingest-report) - Evans argues that separating modeling (schema creation) from classification (schema application) tames LLM non-determinism — a practitioner case study of constraining via taxonomy freezing
- Ingest: Andrej Karpathy talks about "Claws" (ingest-report) - Willison and Karpathy framing "Claw" as a term of art for local persistent AI-agent systems with scheduling, context, tools, and personal-hardware execution.
- Ingest: Autoreason: Self-Refinement That Knows When to Stop (ingest-report) - Autoreason paper showing self-refinement improves only when candidate synthesis is paired with blind comparative judging and incumbent survival, with gains concentrated in the generation-evaluation gap
- Ingest: Beyond Transformers: Sudoku Bench (ingest-report) - Company blog using Sudoku benchmark (97.4% vs 0% LLM) to argue transformers are fundamentally limited for constraint satisfaction; undisclosed BDH architecture, weak methodology, but adds a third problem domain to the architectural-limits evidence cluster alongside Ebrahimi and ConvexBench
- Ingest: Coding Agents are Effective Long-Context Processors (ingest-report) - Benchmark paper claiming coding agents beat RAG and context scaling on long-context tasks by using filesystem-native search, slicing, and scripting
- Ingest: Cognee: Knowledge Engine for AI Agent Memory (ingest-report) - Pipeline-first knowledge engine with custom Pydantic schemas for LLM entity extraction, poly-store graph+vector design, and an undersized enrichment phase that concretely marks the boundary between automatable extraction and open enrichment problems
- Ingest: Components of A Coding Agent (ingest-report) - Practitioner decomposition of coding agent harnesses into six named components, with the central claim that apparent model quality is really context quality — independent convergent evidence for the KB's context-efficiency thesis.
- Ingest: Context Engineering for AI Agents in Open-Source Software (ingest-report) - First empirical study of AI context files across 466 OSS projects — provides naturalistic data on content categories, five writing styles as constraint strategies, add-then-modify evolution pattern, and 50% stagnation rate that grounds and challenges KB constraining theory
- Ingest: Continual Learning in Token Space (ingest-report) - Letta reframes continual learning as optimizing learned context rather than weights, but the KB's stronger frame is weight space versus repo artifacts, including codified procedures
- Ingest: ConvexBench: Can LLMs Recognize Convex Functions? (ingest-report) - Benchmark proving LLM compositional reasoning collapses with depth (not token count), recovered by recursive decomposition with focused context — quantitative evidence for scheduling model predictions
- Ingest: Dario Amodei — "We are near the end of the exponential" (ingest-report) - Anthropic CEO's capability-timeline predictions implicitly confirm oracle-strength thesis — verifiable domains (coding, math) get confident timelines while unverifiable domains (novel writing, science) get hedged ones
- Ingest: EsoLang-Bench (ingest-report) - Esoteric-language code benchmark arguing standard coding scores mostly measure pretraining fit, with interpreter feedback beating textual critique on OOD tasks
- Ingest: Evaluating Long-Context Reasoning in LLM-Based WebAgents (ingest-report) - Ingest of NeurIPS 25 workshop paper benchmarking LLM web agents under long context (25k-150k tokens) with injected irrelevant task sequences — provides agent-level empirical evidence for soft degradation, loop entrapment, and objective loss, extending GSM-DC's distractor findings to multi-session agentic tasks.
- Ingest: Everything you need to know about LLM memory (ingest-report) - Rosebud Journal memory essay reframing LLM memory as a policy stack over raw/derived artifacts, retrieval timing, curation, and forgetting propagation
- Ingest: Externalization in LLM Agents (ingest-report) - Survey paper unifying LLM agent memory, skills, protocols, and harness engineering as externalized cognitive infrastructure rather than model-weight capability alone
- Ingest: From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (ingest-report) - Epiplexity paper formalizing extractable structure for computationally bounded observers, useful for observer-relative information value and context-efficiency theory.
- Ingest: Graphiti: Temporal Knowledge Graph for AI Agents (ingest-report) - Graph-first agent memory with bi-temporal edge invalidation — the strongest counterexample to files-first architecture in the surveyed memory systems
- Ingest: Harness Engineering Is Cybernetics (ingest-report) - Conceptual thread framing harness engineering as cybernetic feedback-loop design: sensors, actuators, constraints, and externalized judgment.
- Ingest: Harness Engineering: Leveraging Codex in an Agent-First World (ingest-report) - Practitioner report on 1M LOC fully agent-generated codebase — harness engineering as constrain/inform/verify/correct, entropy management via background cleanup agents, error messages as dual-function constraining
- Ingest: How Is LLM Reasoning Distracted by Irrelevant Context? (ingest-report) - Controlled benchmark quantifying how irrelevant context degrades LLM reasoning via power-law error scaling with distractor count — strongest empirical grounding for the soft-degradation thesis in this KB; training and inference-time mitigations tested.
- Ingest: Improving AI Skills with autoresearch & evals-skills (ingest-report) - Three-take Auto Research field report where optimization only worked after manual error analysis, failure taxonomy design, and judge calibration across the Three Gulfs.
- Ingest: Infinite midwit (ingest-report) - Objective-vs-subjective intelligence essay arguing that AI's real bottleneck is taste and boringness judgment, not benchmarked competence
- Ingest: Intelligent AI Delegation (ingest-report) - Google DeepMind delegation framework centers verifiability, liability, trust, and 11 task axes in agent delegation; notable for accountability vacuum and liability firebreaks in long chains
- Ingest: Into the Unknown: Self-Learning Large Language Models (ingest-report) - Hallucination-driven self-learning LLM paper proposing Points in the Unknown, a self-question/search/train loop, and metrics for selecting models that can discover factual knowledge gaps
- Ingest: Language Models, Like Humans, Show Content Effects on Reasoning Tasks (ingest-report) - Empirical demonstration that LLMs mirror human content effects on reasoning (syllogisms, NLI, Wason) — content bias survives scaling and instruction tuning but chain-of-thought partially restores content-independent reasoning
- Ingest: Large Language Model Agents Are Not Always Faithful Self-Evolvers (ingest-report) - Causal-intervention paper showing compressed agent memories can improve systems yet fail faithfulness tests, making behavioral dependence the missing metric for self-evolving agents
- Ingest: Lessons from Building AI Agents for Financial Services (ingest-report) - Production practitioner report on building AI agents for financial services — validates files-not-database at commercial scale (S3-first with derived PostgreSQL), documents skill shadowing as user-customization mechanism, and articulates "model eats scaffolding" as an explicit design principle with fiscal-period normalization as calculator-regime counterexample
- Ingest: Letta (MemGPT): Stateful Agents with Self-Managed Memory (ingest-report) - Agent memory platform where the LLM self-manages a three-tier memory hierarchy (core/recall/archival) using an OS analogy — the strongest existing exemplar of the agent-self-managed agency model, now evolving toward git-backed memory files
- Ingest: LLM Knowledge Bases (ingest-report) - Karpathy on agent-maintained research wikis in Obsidian — index files and brief summaries replacing fancy RAG at roughly 100-article scale
- Ingest: LLM Wiki (ingest-report) - Karpathy's long-form agent-maintained wiki manifesto — explicit raw/wiki/schema architecture plus index/log separation beyond his earlier X-post workflow sketch
- Ingest: Maximum Effective Context Window (ingest-report) - Empirical study measuring Maximum Effective Context Window (MECW) across 11 frontier LLMs — finds MECW is up to 99% smaller than advertised MCW, varies by task type, and that exceeding MECW drives hallucination rates toward 100%; directly grounds the KB's bounded-context theory with multi-model dose-response data
- Ingest: Mem0: Universal Memory Layer for AI Agents (ingest-report) - Mem0's two-phase add pipeline (extract facts + LLM-judged CRUD reconciliation) is the purest production example of automated accretion-without-synthesis — now contextualized against eleven systems in the comparative review
- Ingest: Memory Intelligence Agent (ingest-report) - MIA mixed-substrate deep-research agent memory paper — search trajectories become both workflow memory and Planner weight updates during test-time learning
- Ingest: Memory Scaling for AI Agents (ingest-report) - Databricks memory-scaling experiments showing enterprise agent gains from external memory only when retrieval, distillation, and governance scale with the store
- Ingest: Mesa Optimizers and Language Recursion (ingest-report) - Speculative essay arguing mesa optimizers may emerge suddenly because language recursion and learned search both compress many cases into reusable generative rules.
- Ingest: Meta-Harness: End-to-End Optimization of Model Harnesses (ingest-report) - Controlled ablation showing raw execution traces (10 MTok/iter) outperform summaries by 10+ points in automated harness search — first empirical evidence for diagnostic richness as binding constraint
- Ingest: Minimum Viable Ontology / Domain Maps (ingest-report) - Tweet thread proposing "minimum viable ontology" — the smallest term list to orient a newcomer in a domain — with a vibecoded prototype (domainmaps.co) and pedagogical framing via "conceptual thresholds"
- Ingest: Multi-Agent Memory from a Computer Architecture Perspective (ingest-report) - Computer-architecture analogy for multi-agent memory — shared/distributed paradigms, three-layer hierarchy, consistency protocols as the critical unsolved problem
- Ingest: Natural-Language Agent Harnesses (ingest-report) - NLAH paper externalizes agent control logic as portable natural-language artifacts — key empirical finding: explicit structure helps only when it tightens alignment with evaluator acceptance criteria, not by adding process layers
- Ingest: Novel Memory Forgetting Techniques for Autonomous AI Agents (ingest-report) - Formula-based adaptive forgetting with constrained optimization for agent memory — the inspectable alternative to RL-trained memory policy, with empirical evidence that uncontrolled accumulation causes false memory propagation
- Ingest: On the "Induction Bias" in Sequence Models (ingest-report) - 190k-run empirical study showing transformers need orders-of-magnitude more data than RNNs for state tracking due to absence of step-by-step induction bias; introduces sharing factor kappa quantifying cross-length mechanism reuse
- Ingest: OpenClaw-RL: Train Any Agent Simply by Talking (ingest-report) - RL framework that trains agents from live next-state signals (user replies, tool outputs, terminal feedback, GUI state) during deployment — collapses the training/deployment boundary and challenges the KB's three-timescale model by performing weight updates from interactions the agent is already having.
- Ingest: Post by @deepfates — LLM "memory" as context stuffing (ingest-report) - Deepfates argues LLM "memory" is just context-stuffing that creates false salience (Chekhov's gun), advocates agentic context-building, but concludes weight updates are necessary — directly contradicts this KB's durability-not-weights position
- Ingest: Post by @koylanai (ingest-report) - Argues that pairwise judging plus round-robin win rates is a better evaluation primitive than absolute scoring for open-ended LLM tasks with no hard ground truth
- Ingest: Professional Software Developers Don't Vibe, They Control (ingest-report) - Empirical study (N=112) finding experienced developers control AI agents through SE practices, not vibe coding — grounds constraining, underspecification, and programming-practices-transfer arguments
- Ingest: Prompt Stability in Code LLMs (ingest-report) - Empirical study measuring code LLM stability under emotion/personality prompt variations — finds performance and stability are decoupled objectives, smaller models can be more stable, and emotional prompting reveals confidence miscalibration invisible to standard benchmarks
- Ingest: Psychology already solved AI memory — identity isn't stored, it's constructed (ingest-report) - Thread proposing five psychology principles (Conway, Damasio, Bruner, Klein & Nichols) for AI memory as identity construction — directly engages the KB's open question about whether cognitive science analogies are decorative or mechanistic
- Ingest: Recursive Language Models - what finally gave me the 'aha' moment (ingest-report) - Detailed practitioner walkthrough of RLM architecture via six-architecture comparison (direct gen, RAG, ReAct, CodeAct, CodeAct+subagents, RLM) — the most concrete evidence for REPL-as-substrate, symbolic variable return, and scaffold-level truncation in the KB
- Ingest: Scaling Managed Agents: Decoupling the brain from the hands (ingest-report) - Anthropic Managed Agents report showing brain/hand/session interface decomposition, durable session logs, and stale harness assumptions as model capability changes
- Ingest: Self-training Large Language Models through Knowledge Detection (ingest-report) - EMNLP paper turning unknown-detection scores into filtered DPO preference data, with selective self-training reducing hallucination and limiting forgetting on Wikipedia QA
- Ingest: Skill Synthesis — Materializing Knowledge as Skills (ingest-report) - Sentry co-founder's practitioner report on synthesizing Claude Code skills from domain-specific source material (commit history, security patches, OWASP docs) — found 8 real IDORs missed by professional pen testing
- Ingest: Slate: Moving Beyond ReAct and RLM (ingest-report) - Practitioner report on thread-weaving agent architecture — bounded worker threads return compressed episodes to an orchestrator, solving working memory, strategic coherence, and task decomposition simultaneously; the strongest practitioner convergence evidence for the bounded-context orchestration model to date
- Ingest: Solving a Million-Step LLM Task with Zero Errors (ingest-report) - MAKER achieves zero errors over one million LLM steps via maximal decomposition into single-step microagents with first-to-ahead-by-k voting and red-flagging — proves O(s ln s) cost scaling when hard per-step oracles exist
- Ingest: Spacebot: AI Agent for Teams and Communities (ingest-report) - Spacebot README ingest covering process-typed concurrent agent runtime architecture, branch scoping, cortex supervision, and typed unified memory
- Ingest: Structured Test-Time Scaling: From Multi-Agent Systems to General Inference Architectures (ingest-report) - Formal proof that topology compression, scope isolation, and verification form a causal dependency chain enabling hierarchical MAS to bypass exponential error accumulation — directly grounds the KB's separate treatments of decomposition, scoping, and error correction as a unified principle
- Ingest: SuperARC — Can Complexity and Uncomputability Explain Intelligence? (ingest-report) - Ingest of SuperARC — AIT-grounded benchmark where frontier LLMs score phi ~0.03 while neuro-symbolic CTM/BDM achieves 1.000 on recursive compression; newer models regress; print-statement-only outputs demonstrate zero algorithmic abstraction
- Ingest: The "Mismanaged Geniuses" Hypothesis (ingest-report) - Hypothesis that current frontier LMs are bottlenecked by learned decomposition/scaffold policy rather than base capability, using RLMs and orchestrator-subagent systems as evidence
- Ingest: The Anatomy of an Agent Harness (ingest-report) - Practitioner taxonomy deriving harness components (filesystem, bash, sandboxes, memory, context management, long-horizon execution) from model limitations — provides the component anatomy that bridges Lopopolo's practice and the cybernetics framing
- Ingest: The Bitter Lesson (ingest-report) - Wikipedia-contextualized capture of Sutton's Bitter Lesson, useful for scaling arguments and caveats about general methods versus hand-coded knowledge.
- Ingest: The Bug That Shipped (ingest-report) - 3,700-trial practitioner evidence that coding models can diagnose deployment failures when explicitly probed but rarely surface them in undirected self-review
- Ingest: The File System Is the New Database: How I Built a Personal OS for AI Agents (ingest-report) - Practitioner report on a file-based personal OS for AI agents, useful as self-reported evidence for filesystem-first context engineering.
- Ingest: The Flawed Ephemeral Software Hypothesis (ingest-report) - Essay distinguishing vibe coding from true software ephemerality, arguing that state, integration, interface stability, and auditability keep important systems anchored to durable artifact stacks.
- Ingest: The Geometry of Forgetting (ingest-report) - Embedding-memory paper arguing that interference and low effective dimensionality, not time decay, drive forgetting and false recall in similarity retrieval.
- Ingest: The Price of Meaning: Why Every Semantic Memory System Forgets (ingest-report) - Formal no-escape theorem for semantic memory interference, with exact-record and symbolic-verifier escape clauses that sharpen retrieval-vs-verification tradeoffs.
- Ingest: The Second Brain Trap (ingest-report) - PlugLab AI founder reframes "second brain" failure as stored knowledge that never activates in context, then proposes trigger-rich graph structure as the fix
- Ingest: The Spec Is the New Code — A Guide to Spec Driven Development (ingest-report) - MercadoLibre engineering lead's practitioner guide to Spec Driven Development — the spec/plan/task/implement cascade as methodology for eliminating agent ambiguity, with ecosystem convergence evidence and maturity-level progression
- Ingest: Toulmin Argument (ingest-report) - Pedagogical treatment of Toulmin's six-part argument model — canonical source for the structured-claim type's Evidence/Reasoning/Caveats sections
- Ingest: Towards a Science of AI Agent Reliability (ingest-report) - Reliability framework paper arguing mean task success is inadequate for agents, replacing it with consistency, robustness, predictability, and safety.
- Ingest: Towards a Science of Scaling Agent Systems (ingest-report) - Controlled multi-agent scaling paper showing coordination gains depend on task decomposability, verification, and context overhead rather than agent count.
- Ingest: tracecraft (ingest-report) - S3-backed CLI coordination tool for multi-agent systems — exemplifies coordination-without-guarantees and the files-over-database bet applied to inter-agent state rather than knowledge storage
- Ingest: Trajectory-Informed Memory Generation for Self-Improving Agent Systems (ingest-report) - IBM pipeline extracts strategy/recovery/optimization tips from agent execution trajectories and injects at runtime — subtask granularity and LLM-guided retrieval drive gains, especially on complex tasks (+14.3 pp SGC); provides a concrete closed learning loop with inspectable output but narrow oracle (AppWorld task completion).
- Ingest: Transformers Learn In-Context by Gradient Descent (ingest-report) - Mechanistic ICML paper showing in-context regression can be implemented as gradient descent inside Transformer forward passes, sharpening the internal half of the KB's in-context-learning theory
- Ingest: What spec-driven development gets wrong (ingest-report) - Augment's argument that spec-driven development fails unless agents co-maintain the spec — bidirectional spec as a mechanism for matching maintenance throughput to generation throughput
- Ingest: What Survives in Multi-Agent Systems (ingest-report) - Applied bitter-lesson analysis predicting which multi-agent patterns survive stronger models — argues filesystem, forking, and spawning are structural while fixed orchestration is a vision feature
- Ingest: When code is free, research is all that matters (ingest-report) - Investor/researcher argument that oracle availability (not capability) determines automation boundary for cognitive work — research taste is unautomatable because problem selection has no ground truth
- Ingest: Why AI systems don't learn and what to do about it (ingest-report) - Position paper arguing current AI externalizes learning into human-run MLOps and proposing an A-B-M architecture where meta-control arbitrates observation and action learning for lifelong adaptation.
- Intelligent AI Delegation (snapshot) - Google DeepMind framework for intelligent AI delegation — proposes adaptive protocols covering task decomposition, multi-objective optimization, trust/reputation, verifiable completion, and security for human-AI and AI-AI delegation networks, with explicit analysis of how MCP, A2A, AP2, and UCP map onto these requirements.
- Into the Unknown: Self-Learning Large Language Models (snapshot) - Self-learning LLM paper proposing Points in the Unknown, hallucination-based unknown detection, self-questioning/search/training loop, and self-learning capability metrics
- Language Models, Like Humans, Show Content Effects on Reasoning Tasks (snapshot)
- Large Language Model Agents Are Not Always Faithful Self-Evolvers (snapshot) - Causal-intervention paper showing self-evolving agents rely on raw trajectories more faithfully than condensed experience, exposing a compression-faithfulness gap across frameworks, models, and environments
- Lessons from Building AI Agents for Financial Services (snapshot)
- Letta (MemGPT): Stateful Agents with Self-Managed Memory (snapshot)
- LLM Wiki (snapshot) - Karpathy's idea file for agent-maintained personal wikis, centered on a persistent markdown layer between raw sources and query-time chat
- Mem0: Universal Memory Layer for AI Agents (snapshot)
- Memory Intelligence Agent (snapshot) - MIA paper on converting deep-research search trajectories into workflow memory and Planner test-time training
- Memory Scaling for AI Agents (snapshot) - Databricks AI Research argument and experiments for external-memory scaling as a third agent improvement axis alongside model and inference scaling
- Mesa Optimizers and Language Recursion (snapshot) - Speculative blog post connecting mesa optimizers to language recursion by treating both as compressed generative rules that can appear as sudden capability jumps.
- Meta-Harness: End-to-End Optimization of Model Harnesses (snapshot) - Stanford/MIT paper proposing Meta-Harness, an outer-loop system that uses a coding agent with full filesystem access to prior code and execution traces to automatically search over and optimize LLM harnesses — outperforming hand-engineered baselines on text classification and TerminalBench-2.
- Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead (snapshot) - Position paper reframing multi-agent memory management through a computer architecture lens — proposes shared vs. distributed memory paradigms, a three-layer hierarchy (I/O, cache, memory), and identifies memory consistency as the most urgent unresolved challenge for scalable multi-agent systems.
- Natural-Language Agent Harnesses (snapshot) - Proposes externalizing agent control logic (contracts, roles, stages, failure taxonomy) as portable natural-language artifacts (NLAHs) with an Intelligent Harness Runtime, evaluated on SWE-bench and OSWorld — key finding: explicit structure helps only when it tightens alignment with evaluator acceptance criteria.
- Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency (snapshot) - Adaptive budgeted forgetting framework for long-horizon conversational agents — relevance scoring (recency, frequency, semantic alignment) plus constrained optimization to prune memory while reducing false memory propagation.
- On the "Induction Bias" in Sequence Models (snapshot)
- OpenClaw-RL: Train Any Agent Simply by Talking (snapshot) - Framework that converts live next-state signals (user replies, tool outputs, terminal feedback, GUI state) into RL rewards and token-level supervision, enabling a single policy to personalize and improve on agentic tasks simultaneously.
- Post by @deepfates (snapshot)
- Post by @karpathy (snapshot)
- Post by @koylanai (snapshot)
- Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025 (snapshot)
- Prompt Stability in Code LLMs (snapshot)
- Psychology already solved AI memory — identity isn't stored, it's constructed (snapshot) - Thread arguing AI memory should adopt psychology's model of identity construction through autobiographical memory, citing Conway, Damasio, Bruner, and Klein & Nichols
- Recursive Language Models - what finally gave me the 'aha' moment (snapshot)
- Scaling Managed Agents: Decoupling the brain from the hands (snapshot) - Anthropic's Managed Agents architecture argues for stable brain, hand, and session interfaces that outlast changing agent harness implementations.
- Self-training Large Language Models through Knowledge Detection (snapshot) - EMNLP 2024 paper on self-training LLMs by filtering DPO preference data to unknown samples using reference-free contradiction scores.
- Skill Synthesis: Materializing Knowledge as Skills (snapshot)
- Solving a Million-Step LLM Task with Zero Errors (snapshot)
- Spacebot: AI Agent for Teams and Communities (snapshot)
- Structured Test-Time Scaling: From Multi-Agent Systems to General Inference Architectures (snapshot) - Unified theoretical framework explaining how three structural mechanisms (topology compression, scope isolation, verification) enable hierarchical multi-agent systems to bypass exponential error accumulation in test-time scaling.
- SuperARC: Recursive Compression Benchmark (snapshot) - Introduces SuperARC, an AIT-grounded benchmark showing frontier LLMs score near zero on recursive compression tasks and newer versions often regress, while neuro-symbolic CTM/BDM methods achieve perfect scores — evidence that statistical pattern matching differs fundamentally from algorithmic abstraction.
- The "Mismanaged Geniuses" Hypothesis (snapshot)
- The Anatomy of an Agent Harness (snapshot)
- The Bitter Lesson (snapshot)
- The Bug That Shipped (snapshot)
- The File System Is the New Database: How I Built a Personal OS for AI Agents (snapshot)
- The Flawed Ephemeral Software Hypothesis (snapshot) - Essay arguing AI makes software more malleable, not ephemeral, because validation, state, interface stability, and auditability remain the load-bearing bottlenecks.
- The Geometry of Forgetting (snapshot) - Embedding-space account of human-like forgetting and false memory — interference, low effective dimensionality, and semantic clustering reproduce classic memory effects.
- The Price of Meaning: Why Every Semantic Memory System Forgets (snapshot) - Formal no-escape theorem paper arguing semantic memory systems face interference-driven forgetting and false recall under finite effective dimensionality.
- The Second Brain Trap (snapshot) - PlugLab AI article arguing that second-brain systems fail when stored notes never activate in working context.
- The Spec Is the New Code. A Guide to Spec Driven Development (snapshot)
- Thread by @melodyskim (snapshot)
- Toulmin Argument (snapshot)
- Towards a Science of AI Agent Reliability (snapshot)
- Towards a Science of Scaling Agent Systems (snapshot)
- tracecraft (snapshot) - S3-backed CLI coordination layer for multi-agent AI systems — shared memory, messaging, task claiming, and barriers stored as JSON files in any S3-compatible bucket, with no servers or databases required.
- Trajectory-Informed Memory Generation for Self-Improving Agent Systems (snapshot) - IBM Research framework that extracts three categories of actionable tips (strategy, recovery, optimization) from agent execution trajectories and injects them at runtime — evaluated on AppWorld showing up to 14.3 pp gains in scenario goal completion.
- Transformers Learn In-Context by Gradient Descent (snapshot) - Mechanistic ICML paper showing linear self-attention can implement gradient descent for in-context regression and trained Transformers can recover that construction.
- What spec-driven development gets wrong (snapshot)
- What Survives in Multi-Agent Systems
- When code is free, research is all that matters (snapshot)
- Why AI systems don't learn and what to do about it (snapshot) - Dupoux, LeCun, and Malik argue current AI externalizes learning into human-run MLOps, then propose an A-B-M architecture where observation learning, action learning, and a meta-control plane are integrated for lifelong adaptation.