HyperAgents

Type: note · Status: current · Tags: related-systems

HyperAgents is a Meta research codebase for open-ended agent self-improvement through self-editing. The checked-in system runs a thin task agent and an even thinner meta agent over a fixed set of benchmark domains, then stores each generation as patch files plus evaluation artifacts inside an archive. It is closer to an evolutionary experiment harness than to a memory system: knowledge persists as git diffs, scores, and lineage metadata, not as curated notes, semantic links, or explicit lessons.

Repository: https://github.com/facebookresearch/Hyperagents

Core Ideas

Outer-loop evolution, inner-loop minimalism. The real self-improvement logic lives mostly in generate_loop.py, not meta_agent.py. MetaAgent.forward() issues only a broad instruction to modify the repo, using the repo's bash and editor tools. Parent selection, staged and full evaluation, archive updates, and cleanup all happen in the outer harness. The system's intelligence is therefore distributed asymmetrically: the meta agent mutates, but the harness decides which mutations survive.
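The division of labor can be sketched as a toy loop. Everything here is illustrative, not the repo's actual API: `mutate` stands in for MetaAgent.forward(), `evaluate` for the benchmark harness, and a deliberately simple best-first rule stands in for the repo's richer parent selection. The point is the asymmetry: the mutation step is dumb, and all selection pressure lives in the harness.

```python
import random

class Archive:
    """Keeps every scored generation; nothing is discarded."""
    def __init__(self):
        self.generations = []  # list of (genome, score) pairs

    def add(self, genome, score):
        self.generations.append((genome, score))

    def select_parent(self):
        # Harness-side selection: the mutation step never sees this logic.
        return max(self.generations, key=lambda g: g[1])[0]

def mutate(genome):
    # Stands in for MetaAgent.forward(): a broad, undirected edit.
    return genome + random.choice([-1, 1])

def evaluate(genome):
    # Stands in for the external benchmark; here, distance to a target.
    return -abs(genome - 10)

archive = Archive()
archive.add(0, evaluate(0))
for _ in range(50):
    parent = archive.select_parent()
    child = mutate(parent)
    archive.add(child, evaluate(child))  # the harness decides survival

best = max(archive.generations, key=lambda g: g[1])
```

Even with a blind mutation operator, scores improve because the archive plus evaluator supply the direction the mutator lacks.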

Git diffs are the learning substrate. Each generation writes model_patch.diff, metadata.json, and evaluation outputs. Later generations replay ancestor diffs into a fresh Docker container with apply_diffs_container(). This is durable symbolic learning through executable artifacts: the system does not update weights and does not store prose memories, but it does accumulate a versioned lineage of code mutations that can be replayed, tested, and branched from.
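A minimal sketch of lineage replay, assuming each generation stores its mutation as a replayable patch. Here patches are plain functions standing in for applying model_patch.diff files in order inside a fresh container; the function names and config fields are invented for illustration.

```python
def replay_lineage(base_state, lineage_patches):
    """Rebuild a descendant by applying ancestor patches in order."""
    state = dict(base_state)  # fresh copy stands in for a fresh container
    for patch in lineage_patches:
        state = patch(state)
    return state

# Two generations of hypothetical mutations to an agent config.
gen1 = lambda s: {**s, "max_steps": 20}
gen2 = lambda s: {**s, "tools": s["tools"] + ["editor"]}

base = {"max_steps": 10, "tools": ["bash"]}
descendant = replay_lineage(base, [gen1, gen2])

# The base is untouched, so any prefix of the lineage can be re-branched.
branch = replay_lineage(base, [gen1])
```

The useful property is exactly the one the note names: because mutations are ordered, replayable artifacts rather than mutations of a live state, any ancestor can be reconstructed, tested, or branched from at will.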

Verification is benchmark-gated and mostly external. HyperAgents works because the downstream tasks come with concrete evaluation harnesses. The repo checks that edited agents still import, runs staged evaluation before full evaluation, and uses downstream evaluation to decide whether a descendant is worth keeping, even though the current valid_parent condition is looser than the surrounding framing suggests. This is still a hard-oracle-heavy loop. The "self-improvement" comes less from introspection than from being embedded in an environment that can cheaply reject bad descendants.

Archive search replaces reflection memory. Parent selection in utils/gl_utils.py samples from scored ancestors, optionally penalizing nodes that already have many children. The repo also supports an archive-level "ensemble" mode that reuses predictions from previously evaluated generations. HyperAgents remembers by branching from scored ancestors rather than by retrieving distilled rules or textual reflections into prompt context.
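A hedged sketch of that selection style, in the spirit of utils/gl_utils.py but with invented field names and weights: sampling mass favors high-scoring ancestors, discounted by how many children they already have, so heavily exploited nodes are revisited less often.

```python
import random

def sample_parent(nodes, penalty=0.5, rng=random):
    """nodes: dicts with 'score' (higher is better) and 'children' (count).
    Weight = score discounted by existing children; illustrative only."""
    weights = [
        max(n["score"], 0.0) / (1.0 + penalty * n["children"])
        for n in nodes
    ]
    if sum(weights) == 0:
        return rng.choice(nodes)
    return rng.choices(nodes, weights=weights, k=1)[0]

# A slightly worse but unexplored ancestor wins most draws:
# gen3 weight = 0.9 / 3 = 0.3, gen7 weight = 0.8 / 1 = 0.8.
archive = [
    {"id": "gen3", "score": 0.9, "children": 4},
    {"id": "gen7", "score": 0.8, "children": 0},
]
rng = random.Random(0)
counts = {"gen3": 0, "gen7": 0}
for _ in range(1000):
    counts[sample_parent(archive, rng=rng)["id"]] += 1
```

This is the sense in which archive search substitutes for reflection memory: "what to try next" is encoded in sampling weights over scored history, not in retrieved lessons.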

Task generality is architecture-wide but harness-local. The README claims the system can optimize "any computable task," but the checked-in implementation supports a fixed allowlist of domains with bespoke data loaders, scoring keys, reports, and sometimes domain-specific harnesses. The architecture is portable in principle, but actual onboarding still requires hand-built evaluation and input formatting code.

Comparison with Our System

| Dimension | HyperAgents | Commonplace |
| --- | --- | --- |
| Primary objective | Improve an agent codebase over benchmark generations | Build and maintain a navigable knowledge base for agents and humans |
| Persistent substrate | Git diffs, generation metadata, eval reports, archived predictions | Markdown notes, frontmatter, semantic links, curated indexes |
| Mutation unit | Code patches to meta_agent.py, task_agent.py, helpers, or selection logic | Notes, indexes, instructions, and occasionally scripts |
| Learning signal | External benchmark scores and import checks | Structural validation, semantic review, human judgment, traversal usefulness |
| Memory model | Scored ancestor archive; replay lineage into fresh containers | Library/workshop split with durable notes and explicit relationships |
| Retrieval/navigation | Parent selection over scored generations | Search, descriptions, indexes, and link-following |
| Verification strength | Strong where task harnesses exist | Mixed; many KB mutations remain judgment-heavy and weakly verifiable |

HyperAgents is stronger than Commonplace at one thing we still mostly theorize about: fully automated improvement loops over durable executable artifacts when the task family has a cheap evaluator. Commonplace is stronger at knowledge shape, navigation, and meaning-preserving maintenance. HyperAgents treats improvement as search over program variants; Commonplace treats improvement as curation over linked documents and procedures.

That difference matters. HyperAgents can afford aggressive autonomous mutation because its judged outputs are benchmark reports and import checks. We cannot transfer that autonomy level directly into KB maintenance because most important KB mutations do not have comparably hard oracles. In our terms, HyperAgents lives on the easier side of the boundary of automation: benchmark tasks with explicit rewards.

Borrowable Ideas

Staged evaluation before expensive full runs. Ready to borrow now. generate_loop.py first runs a smaller staged evaluation and only pays for the broader evaluation if the candidate clears that gate. This is directly applicable to costly maintenance or search loops in our system: cheap screens first, expensive judgment later.
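The pattern itself is small enough to state as code. This is a sketch of the gating idea, not the repo's implementation; the evaluator functions and the gate threshold are illustrative stand-ins (think: a five-task screen before a hundred-task run).

```python
def staged_evaluate(candidate, cheap_eval, full_eval, gate=0.5):
    """Return (score, stage); the full evaluation only runs past the gate."""
    screen = cheap_eval(candidate)
    if screen < gate:
        return screen, "rejected_at_screen"  # full cost never paid
    return full_eval(candidate), "full"

# Hypothetical evaluators: the screen is a small, noisy sample of the
# full benchmark, so screen and full scores roughly agree.
cheap = lambda c: c["quality"]
full = lambda c: c["quality"] * 0.95

good = {"quality": 0.8}
bad = {"quality": 0.2}
score_good, stage_good = staged_evaluate(good, cheap, full)  # clears gate
score_bad, stage_bad = staged_evaluate(bad, cheap, full)     # screened out
```

The design choice worth copying is that the screen's verdict is recorded alongside the score, so rejected candidates are cheap but still auditable.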

Patch lineage as compact provenance for autonomous mutations. Needs a use case first. HyperAgents stores improvement history as replayable diffs rather than full transcripts. For any future automated KB mutation loop, a diff-first provenance layer would be much easier to audit and branch from than raw conversation logs.

Replay candidate lineages in clean containers. Ready to borrow now as a pattern. The combination of fresh container setup, ancestor replay, compilation checks, and cleanup is a strong operational boundary for self-modifying systems. If we automate risky code-side maintenance, this isolation pattern is better than mutating the live workspace directly.

Keep multiple scored ancestors alive instead of promoting one canonical latest state. Needs a use case first. The archive model preserves alternative lines of improvement and revisits them later. That could matter for competing prompt policies, retrieval policies, or workshop heuristics where local maxima are a real risk.

Curiosity Pass

"Self-improving" mostly means externally selected code mutation, not explicit self-analysis. run_meta_agent.py passes evals_folder and iterations_left into MetaAgent.forward(), but meta_agent.py does not use either value in the prompt it sends to the model. The system can still improve, because the outer harness evaluates descendants and keeps the good ones, but the learning signal is largely outside the in-repo agent logic. The impressive part is the evolutionary scaffold, not the reflective sophistication of the checked-in meta agent.

The archive is lineage memory, not semantic memory. The property it produces is replayable accumulation: later generations can start from earlier code changes. But the representation does not transform experience into explicit abstractions. A human or model must still read diffs and logs to infer why a change worked. Relative to systems like ReasoningBank or ExpeL, HyperAgents is stronger on executable persistence and weaker on explicit lesson extraction.

The "ensemble" mechanism is archive exploitation, not a real ensemble. ensemble.py picks the single highest-scoring generation in the archive and reuses that agent's stored prediction for a task. The mechanism does something useful, but the claimed property is narrower than the label suggests. This is best-agent lookup over historical runs, not voting, synthesis, or per-task model combination.
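The gap between label and mechanism is easiest to see in code. A minimal sketch of the behavior as described above, with invented field names: no voting, no synthesis, just reusing the stored prediction of the single highest-scoring archived generation.

```python
def ensemble_predict(archive, task_id):
    """Best-agent lookup: take the top-scoring generation's stored answer."""
    best = max(archive, key=lambda gen: gen["score"])
    return best["predictions"].get(task_id)

# Hypothetical archive of evaluated generations with cached predictions.
archive = [
    {"gen": 1, "score": 0.6, "predictions": {"t1": "A", "t2": "B"}},
    {"gen": 2, "score": 0.9, "predictions": {"t1": "C", "t2": "D"}},
]
```

A true ensemble would combine answers across generations (e.g. majority vote per task); this selects one generation globally and ignores the rest.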

The editable parent-selection story is ahead of the checked-in implementation. When generate_loop.py allows editing parent selection, get_readme_description() tells the agent to pursue a diversity-preserving stepping-stone strategy. But the checked-in select_next_parent.py currently ignores scores and child counts and returns a uniform random valid parent. The repo clearly knows what the stronger mechanism should be; the actual implementation is still placeholder-simple.

"Any computable task" overclaims what the current code achieves. Even if the evolutionary scaffold works perfectly, the ceiling is bounded by benchmark engineering. Every supported domain still needs custom loaders, formatting functions, score extraction, and reporting logic. HyperAgents is a general pattern for evaluator-rich domains, not a universal optimizer that can absorb arbitrary tasks without new harness work.

What to Watch

  • Whether later revisions make the meta agent actually read and reason over prior evaluation artifacts instead of relying almost entirely on outer-loop selection pressure
  • Whether select_next_parent.py and ensemble.py become richer than their current placeholder-simple implementations
  • Whether domain onboarding becomes materially more generic, or remains a bespoke harness per benchmark family
  • Whether the diff archive ever grows a second layer of explicit lessons or heuristics, combining executable lineage with semantic abstraction

Relevant Notes: