Incremental Self-Improvement

Type: ../types/agent-memory-system-review.md · Tags: trace-learning

Incremental Self-Improvement is Jürgen Schmidhuber's 1995 paradigm for a lifelong reinforcement-learning system that changes how it learns by executing self-modification programs. Coverage here is doc-grounded: the cited technical report describes a concrete toy implementation and experiments, but no reachable repository has been inspected, so the mechanisms below are reported from the paper and local ingest rather than code-grounded findings.

Source: On Learning How to Learn Learning Strategies, revised FKI-198-94 technical report snapshot captured 2026-04-29 from https://people.idsia.ch/~juergen/fki198-94.pdf.

Reviewed version: revised January 31, 1995 technical report; local ingest Ingest: On Learning How to Learn Learning Strategies, ingested 2026-04-29.

Core Ideas

Self-modification is ordinary action. The reported system runs an assembler-like Turing-equivalent language where action subsequences can either interact with the environment or modify the probability distribution over future instruction sequences, including future self-modification programs.
The learned state is a policy distribution. The durable behavior-shaping object is not a note, memory record, or retrieved document; it is the context-dependent probability distribution over future program-cell contents and arguments.
Reward-gated retention replaces manual promotion. A fixed top-level credit-assignment strategy keeps probability changes made by self-modification programs only while their payoff-per-time exceeds the system baseline or the preceding useful self-modification; otherwise old distributions are restored from a stack.
Whole-life accounting charges learning to action. The paper treats system life as one non-resettable sequence and includes learning, evaluation, and rollback overhead in payoff-per-time, so a learning strategy must earn back the time it consumes.
Context efficiency is policy-side, not prompt-side. There is no LLM context window or retrieval budget. The closest reported analogue is that future behavior is influenced through compact probability values rather than loading accumulated traces, but this buys economy by making the retained state opaque and non-citable.

Artifact analysis

Claim-level (no code inspected):

Storage substrate: in-memory — the reported implementation stores work cells, program cells, probability distributions, and a rollback stack inside finite runtime storage for the system's life; no external persistent repository, database, or file-backed memory is described.
Representational form: parametric — the central retained artifact is a numeric probability distribution over instruction and argument values. The paper also describes symbolic primitives and an unmodifiable top-level policy, but the learned operative part is parametric.
Lineage — trace-extracted: probability changes are produced by self-modification programs during system life and retained or rolled back based on subsequent action/reward/time history. The primitive set and payoff function are authored prior bias; the retained policy distribution is derived from lived trajectories under that bias.
Behavioral authority — the probability distribution is a system-definition artifact: it directly changes future action sampling and therefore the agent's behavior. Stack entries have rollback/audit authority over that policy state, not knowledge-reference authority.

The reported promotion path is reward-gated rather than review-gated: candidate self-modifications become retained policy state when they improve measured payoff-per-time, and they can later be demoted by restoring older distributions. This is a strong promotion mechanism only where the reward oracle is strong enough to carry both validity and value.

Comparison with Our System

Commonplace promotes knowledge into inspectable files, links, type specs, and validation rules; Incremental Self-Improvement promotes policy changes into a runtime distribution. Commonplace's retained artifacts can be searched, cited, reviewed, and retired by provenance. Schmidhuber's retained state can steer behavior continuously, but the learned distribution is not readable knowledge and does not explain itself.

The useful comparison is promotion discipline. Schmidhuber collapses "is this true/useful?" into payoff-per-time: if a self-modification improves reward rate enough, it stays. Commonplace separates those questions because KB artifacts often have delayed, ambiguous payoff: a claim may be valid but low-value, promising but under-grounded, or useful only for a narrow future task.

Borrowable Ideas

Charge learning to the same budget as acting. Ready to borrow as a design principle. Review, distillation, connection, and validation workflows should account for the time and context they consume, not only the apparent quality of the produced artifact.
Make promotion reversible. Ready to borrow conceptually. The rollback stack is a useful analogue for preserving enough prior state and lineage to retire or supersede a bad rule without pretending it was never learned.
Treat primitives and payoff as prior bias. Useful as framing, not an implementation task. In Commonplace, type specs, validators, link labels, and skills are the primitive set that makes some learning trajectories cheap and others unlikely.
Do not borrow scalar promotion without the oracle. Needs a strong use case. Payoff-per-time works cleanly only when reward is timely and meaningful; KB promotion needs separate source-fidelity, reasoning-value, and maintenance-cost gates.

Write side

Write agency: automatic — self-modification programs change the retained policy distribution during system life, and the fixed top-level strategy retains or rolls those changes back from observed payoff history.

Curation operations: evolve invalidate promote — self-modification evolves the existing policy distribution, the rollback stack invalidates changes that no longer qualify, and payoff-per-time gates promote candidate modifications into retained policy state.

Trace-learning

The paper warrants trace-learning placement because it reports durable policy updates derived from system-life action/reward history.

Trace source — a lifelong sequence of instructions, self-modification programs, environmental interactions, reward events, and elapsed time.
Extraction — self-modification programs propose probability changes; the unmodifiable top-level strategy uses payoff-per-time as the retention oracle and restores old distributions when a modification stops qualifying.
Scope and timing — online, within a single system life; no resettable episode boundary is assumed, and the reported experiments use toy environments to illustrate operation rather than broad empirical scaling.
Survey placement — a historical trajectory-to-policy case for the trace-learning survey, lower-confidence for modern agent-memory comparison because the evidence is a paper snapshot, not inspectable source.

Read-back

Read-back: push — retained probability changes automatically shape future instruction sampling whenever the system acts; no agent-initiated lookup is required. This is not an engineered relevance-gated activation path in the Commonplace sense, so there is no push-activation tag.

Curiosity Pass

The design is explicit that apparent usefulness may be a fluke; the top-level gate preserves modifications by observed payoff history, not by proving causal responsibility.
The paper's most transferable mechanism may be the rollback discipline, not the self-modifying language.
A simpler modern analogue would be validation-gated skill or harness editing: retain behavior changes only when they beat the incumbent under a strong evaluator, but keep the edited artifact readable.

What to Watch

A reachable implementation of this specific paradigm. If one appears and is inspected, this should promote to agent-memory-system-review and the four-field claims should be verified against code.
Modern Gödel-machine or self-improving harness systems that make reward-gated promotion operational for inspectable agent artifacts; that would test whether the rollback/promotion idea transfers beyond parametric policy state.
Any attempt to replace payoff-per-time with weaker semantic review or task-completion proxies, because that boundary determines whether the mechanism can inform KB promotion rather than only benchmarked agents.

Relevant Notes

trace-learning techniques in related systems — compares-with: places this paper as a lower-confidence historical trajectory-to-policy case
Self-Training-LLM — compares-with: both promote experience into model-side behavior rather than a readable KB artifact, but Self-Training-LLM is repo-inspected and dataset-driven
Meta-Harness — compares-with: a modern code-inspected system where benchmark-scored runs select improved harness variants rather than self-modifying instruction probabilities
choosing what to learn requires both validity and learning value gates — rests-on: explains why KB promotion cannot collapse correctness and value into one payoff-per-time gate
oracle strength spectrum — rests-on: reward-gated self-modification only works cleanly where the evaluation signal can carry the promotion decision
Ingest: On Learning How to Learn Learning Strategies — evidenced-by: local ingest classifies the report and extracts the payoff-per-time, reversible-promotion, and oracle-dependence lessons

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search