ACE
Type: note · Status: current · Tags: related-systems
ACE is a Python framework for self-improving language-model behavior through an evolving playbook. It runs a three-role loop: a generator answers with the current playbook, a reflector inspects the attempt against ground truth or environment feedback, and a curator proposes structured playbook operations. Built by ace-agent, Apache-2.0 licensed.
Repository: https://github.com/ace-agent/ace
Core Ideas
Playbook bullets are the learned substrate. ACE does not learn in weights in the reviewed repo. It learns by maintaining a text playbook whose lines carry stable IDs plus helpful and harmful counters. The learned unit is a bulletpoint, not an embedding or model delta.
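The learned unit described above can be sketched as a small data model. This is a hypothetical illustration, not ACE's actual classes; the names `Bullet` and `Playbook` are ours, assuming only what the note states: stable IDs, helpful/harmful counters, and sectioned text.

```python
from dataclasses import dataclass, field

@dataclass
class Bullet:
    bullet_id: str    # stable ID so counters can track this line over time
    text: str         # the natural-language advice itself
    helpful: int = 0  # incremented when the reflector tags it helpful
    harmful: int = 0  # incremented when the reflector tags it harmful

@dataclass
class Playbook:
    # sectioned text: section name -> ordered list of bullets
    sections: dict = field(default_factory=dict)

    def add(self, section: str, bullet: Bullet) -> None:
        self.sections.setdefault(section, []).append(bullet)
```

The point of the shape is that the ID and counters live with the text, so the same line can be cited, scored, and re-rendered across runs without an embedding store.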
Reflection is split from curation. The reflector reads question, reasoning trace, prediction, feedback, and used bullet IDs, then emits both freeform reflection text and bullet tags (helpful, harmful, neutral). The curator is a second pass that reads the reflection plus current playbook stats and proposes operations. This is a real separation of roles, not just prompt wording.
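The reflector-to-counter handoff can be sketched as a pure function: tags in, updated stats out. A minimal illustration assuming the tag vocabulary from the note (helpful, harmful, neutral); the function name and dict shapes are our invention.

```python
def apply_reflection_tags(counters: dict, tags: dict) -> dict:
    """Fold reflector tags into per-bullet counters.

    counters: bullet_id -> {"helpful": int, "harmful": int}
    tags:     bullet_id -> "helpful" | "harmful" | "neutral"
    """
    for bullet_id, tag in tags.items():
        stats = counters.setdefault(bullet_id, {"helpful": 0, "harmful": 0})
        if tag in ("helpful", "harmful"):
            stats[tag] += 1  # neutral tags leave the counters untouched
    return counters
```

Keeping this step separate from curation means the counter update needs no LLM call at all; only the proposal of new bullets does.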
The actual implemented write path is append-heavy. The curator schema advertises ADD, UPDATE, MERGE, DELETE, and CREATE_META, but the code path that mutates the playbook only implements ADD. Existing bullet counters are updated separately by the reflector tags, and optional deduplication is handled by a bulletpoint analyzer. So the repo currently implements "tag existing bullets, append new bullets, maybe merge similar ones later," not full CRUD knowledge maintenance.
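The gap between the advertised schema and the implemented write path can be made concrete with a sketch: the applier accepts all five operation types but only ADD mutates state. This is our illustrative reconstruction of the behavior the note describes, not ACE's code.

```python
def apply_operations(playbook: dict, operations: list) -> list:
    """Apply curator operations; return the ones that were skipped.

    playbook: section name -> list of {"id": ..., "text": ...} dicts
    """
    skipped = []
    for op in operations:
        if op["type"] == "ADD":
            playbook.setdefault(op["section"], []).append(
                {"id": op["id"], "text": op["text"]}
            )
        else:
            # UPDATE / MERGE / DELETE / CREATE_META exist in the schema
            # but have no implemented write path in the reviewed repo
            skipped.append(op)
    return skipped
```

Written this way, the accretion bias discussed later is visible in the control flow: every non-ADD proposal falls through to `skipped`.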
Offline and online modes share one trace-to-playbook loop. In offline mode the system trains on labeled samples with validation and optional test passes. In online mode it updates the playbook during evaluation. In both cases the learning substrate is still execution feedback flowing into the same reflector-curator pipeline.
Incremental delta updates are real, but still prompt-mediated. ACE is not rewriting the full context every step. The curator is prompted to emit only missing additions, and the playbook utility applies those operations into sectioned text. That is a genuine incremental update mechanism, but the "delta" is still natural-language content generated by an LLM rather than a verified symbolic transform.
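Applying deltas into sectioned text implies a render step that serializes the playbook back into the prompt, with IDs and counters inline so the generator can report which bullets it used. A minimal sketch under that assumption; the output format is illustrative.

```python
def render_playbook(sections: dict, counters: dict) -> str:
    """Serialize sectioned bullets into prompt-visible text."""
    lines = []
    for section, bullets in sections.items():
        lines.append(f"## {section}")
        for b in bullets:
            stats = counters.get(b["id"], {"helpful": 0, "harmful": 0})
            lines.append(
                f"[{b['id']}] {b['text']} "
                f"(helpful={stats['helpful']}, harmful={stats['harmful']})"
            )
    return "\n".join(lines)
```

Because only new bullets are appended between rounds, the diff between two renders is small even though the full text is regenerated, which is the sense in which the update is incremental but still prompt-mediated.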
Comparison with Our System
ACE is one of the closest systems yet to the artifact-learning side of our trace-derived survey. It is closer to Autocontext than to OpenClaw-RL: repeated runs, execution feedback, durable textual playbooks, and no requirement that the final learning end up in weights.
| Dimension | ACE | Commonplace |
|---|---|---|
| Trace source | Question attempts, reasoning traces, feedback, bullet usage | Human+agent editing traces, notes, links, workshop artifacts |
| Learned substrate | Sectioned playbook text with scored bullets | Notes, links, instructions, workshop artifacts |
| Promotion target | Inspectable text artifacts only in current repo | Inspectable text artifacts only |
| Learning loop | Generator -> reflector -> curator -> playbook update | Human+agent write -> connect -> validate -> mature |
| Update style | Append-heavy delta updates with counters | Manual curation and note revision |
| Oracle strength | Ground truth or environment feedback when available | Weak, mostly human judgment |
| Storage model | Files for playbooks/results, no weight update path in repo | Files in git |
Trace-derived learning placement. On axis 1, ACE fits the trajectory-run pattern: it learns from repeated attempts and their feedback rather than from one live session stream. On axis 2, it is a strong trace-derived artifact-learning system: the promoted result is an inspectable playbook, not weights. It should be added to the survey because it strengthens the artifact-learning side and shows a more explicit counter-based maintenance loop than the current examples.
ACE also sharpens one claim in the survey: artifact-learning systems differ not only by source trace and promotion target, but by whether maintenance is append-only, counter-based, or full-CRUD. ACE currently lands in the middle: richer than pure append, weaker than true curation.
Borrowable Ideas
Bullet-level usage counters. Ready now. The helpful / harmful counters are a concrete intermediate between raw recurrence and full human review. We could apply the same idea to workshop observations or distilled learnings without pretending to have a full oracle.
Separate reflection from promotion. Ready now. ACE does not ask one prompt to both diagnose and mutate the artifact. That separation would make our own workshop reflection experiments easier to reason about.
Operation-shaped curator output. Needs a use case first. Even though ACE only fully implements ADD, the curator prompt and schema force proposed changes into explicit operations. That is a cleaner handoff boundary than freeform prose revision.
Stable IDs on learned artifacts. Ready now for some workshop outputs. Bullet IDs let ACE track usage and update statistics over time. We currently lack stable identifiers for many workshop-derived learnings.
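For our own workshop-derived learnings, one cheap way to get stable identifiers is to mint an ID from a hash of the normalized text at creation time and never change it afterward. This scheme is our assumption for the Commonplace side, not anything ACE does; the `wl-` prefix and helper name are hypothetical.

```python
import hashlib

def mint_id(text: str, prefix: str = "wl") -> str:
    """Mint a stable ID for a learning from its normalized text."""
    digest = hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:8]}"
```

The ID depends only on the text at creation, so counters and tags can accumulate against it across sessions; if the text is later revised, that is a new artifact with a new ID, which keeps the usage history honest.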
Curiosity Pass
The interesting part of ACE is not the three-role architecture by itself. The mechanism that matters is the combination of bullet IDs, usage tagging, and append-only curation. That means the system's strongest claim is narrower than the README pitch: it is not a general self-improving context engine yet, but a specific pipeline for accumulating and scoring playbook bullets from feedback.
The simplest alternative to much of ACE is "rewrite the whole prompt after each round." ACE's real gain over that baseline is not mystical context engineering; it is preserving local history and attaching counters to reusable advice. That is a genuine mechanism.
The biggest ceiling is maintenance. Because the actual write path is mostly ADD, the playbook can grow and optional deduplication can merge near-duplicates, but the system does not yet have a robust implemented path for revising or retiring bad advice. Even if the reflector is perfect, the maintenance policy is still structurally biased toward accretion.
This makes ACE a useful comparison point for our survey. It is more explicit than freeform reflection-memory loops, but less mature on lifecycle than the README implies.
What to Watch
- Whether UPDATE, MERGE, and DELETE become first-class implemented operations rather than schema promises
- Whether the counter mechanism stays useful as playbooks grow, or whether it collapses into prompt bloat
- Whether ACE adds a weight-promotion path, which would move it closer to Autocontext's mixed artifact/weight position
- Whether the online mode proves robust outside benchmark settings with strong feedback signals
Relevant Notes:
- trace-derived learning techniques in related systems — extends: ACE is a strong additional artifact-learning example built from repeated run feedback and explicit playbook mutation
- Autocontext — compares: both learn from repeated runs into playbook-like artifacts, but ACE stays artifact-only while Autocontext also exports to training
- ClawVault — compares: both preserve inspectable learned artifacts, but ClawVault mines live sessions while ACE mines repeated evaluated attempts
- memory management policy is learnable but oracle-dependent — sharpens: ACE shows the same oracle dependence on the artifact side; without feedback, its counters and curator loop have little to stand on
- automating KB learning is an open problem — contrasts: ACE has a clearer verifier than open-ended KB curation, which is why its automated promotion loop can be narrower and more concrete