ACE
Type: note · Status: current · Tags: related-systems
ACE is a Python framework for self-improving language-model behavior through an evolving playbook. It runs a three-role loop: a generator answers with the current playbook, a reflector inspects the attempt against ground truth or environment feedback, and a curator proposes structured playbook operations. Built by ace-agent, Apache-2.0 licensed.
Repository: https://github.com/ace-agent/ace
Core Ideas
Playbook bullets are the learned substrate. ACE does not learn in weights in the reviewed repo. It learns by maintaining a text playbook whose lines carry stable IDs plus helpful and harmful counters. The learned unit is a bulletpoint, not an embedding or model delta.
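The learned unit described above can be sketched as a small data model. This is a hypothetical illustration, not ACE's actual classes; the names `Bullet` and `Playbook` are ours, assuming only what the note states: stable IDs, helpful/harmful counters, and sectioned text.

```python
from dataclasses import dataclass, field

@dataclass
class Bullet:
    bullet_id: str    # stable ID so counters can track this line over time
    text: str         # the natural-language advice itself
    helpful: int = 0  # incremented when the reflector tags it helpful
    harmful: int = 0  # incremented when the reflector tags it harmful

@dataclass
class Playbook:
    # sectioned text: section name -> ordered list of bullets
    sections: dict = field(default_factory=dict)

    def add(self, section: str, bullet: Bullet) -> None:
        self.sections.setdefault(section, []).append(bullet)
```

The point of the shape is that the ID and counters live with the text, so the same line can be cited, scored, and re-rendered across runs without an embedding store.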
Reflection is split from curation. The reflector reads question, reasoning trace, prediction, feedback, and used bullet IDs, then emits both freeform reflection text and bullet tags (helpful, harmful, neutral). The curator is a second pass that reads the reflection plus current playbook stats and proposes operations. This is a real separation of roles, not just prompt wording.
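The reflector-to-counter handoff can be sketched as a pure function: tags in, updated stats out. A minimal illustration assuming the tag vocabulary from the note (helpful, harmful, neutral); the function name and dict shapes are our invention.

```python
def apply_reflection_tags(counters: dict, tags: dict) -> dict:
    """Fold reflector tags into per-bullet counters.

    counters: bullet_id -> {"helpful": int, "harmful": int}
    tags:     bullet_id -> "helpful" | "harmful" | "neutral"
    """
    for bullet_id, tag in tags.items():
        stats = counters.setdefault(bullet_id, {"helpful": 0, "harmful": 0})
        if tag in ("helpful", "harmful"):
            stats[tag] += 1  # neutral tags leave the counters untouched
    return counters
```

Keeping this step separate from curation means the counter update needs no LLM call at all; only the proposal of new bullets does.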
The actual implemented write path is append-heavy. The curator schema advertises ADD, UPDATE, MERGE, DELETE, and CREATE_META, but the code path that mutates the playbook only implements ADD. Existing bullet counters are updated separately by the reflector tags, and optional deduplication is handled by a bulletpoint analyzer. So the repo currently implements "tag existing bullets, append new bullets, maybe merge similar ones later," not full CRUD knowledge maintenance.
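The gap between the advertised schema and the implemented write path can be made concrete with a sketch: the applier accepts all five operation types but only ADD mutates state. This is our illustrative reconstruction of the behavior the note describes, not ACE's code.

```python
def apply_operations(playbook: dict, operations: list) -> list:
    """Apply curator operations; return the ones that were skipped.

    playbook: section name -> list of {"id": ..., "text": ...} dicts
    """
    skipped = []
    for op in operations:
        if op["type"] == "ADD":
            playbook.setdefault(op["section"], []).append(
                {"id": op["id"], "text": op["text"]}
            )
        else:
            # UPDATE / MERGE / DELETE / CREATE_META exist in the schema
            # but have no implemented write path in the reviewed repo
            skipped.append(op)
    return skipped
```

Written this way, the accretion bias discussed later is visible in the control flow: every non-ADD proposal falls through to `skipped`.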
Offline and online modes share one trace-to-playbook loop. In offline mode the system trains on labeled samples with validation and optional test passes. In online mode it updates the playbook during evaluation. In both cases the learning substrate is still execution feedback flowing into the same reflector-curator pipeline.
Incremental delta updates are real, but still prompt-mediated. ACE is not rewriting the full context every step. The curator is prompted to emit only missing additions, and the playbook utility applies those operations into sectioned text. That is a genuine incremental update mechanism, but the "delta" is still natural-language content generated by an LLM rather than a verified symbolic transform.
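Applying deltas into sectioned text implies a render step that serializes the playbook back into the prompt, with IDs and counters inline so the generator can report which bullets it used. A minimal sketch under that assumption; the output format is illustrative.

```python
def render_playbook(sections: dict, counters: dict) -> str:
    """Serialize sectioned bullets into prompt-visible text."""
    lines = []
    for section, bullets in sections.items():
        lines.append(f"## {section}")
        for b in bullets:
            stats = counters.get(b["id"], {"helpful": 0, "harmful": 0})
            lines.append(
                f"[{b['id']}] {b['text']} "
                f"(helpful={stats['helpful']}, harmful={stats['harmful']})"
            )
    return "\n".join(lines)
```

Because only new bullets are appended between rounds, the diff between two renders is small even though the full text is regenerated, which is the sense in which the update is incremental but still prompt-mediated.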
Comparison with Our System
ACE is one of the closest systems yet to the artifact-learning side of our trace-derived survey. It is closer to Autocontext than to OpenClaw-RL: repeated runs, execution feedback, durable textual playbooks, and no requirement that the final learning end up in weights.
| Dimension | ACE | Commonplace |
|---|---|---|
| Trace source | Question attempts, reasoning traces, feedback, bullet usage | Human+agent editing traces, notes, links, workshop artifacts |
| Learned substrate | Sectioned playbook text with scored bullets | Notes, links, instructions, workshop artifacts |
| Promotion target | Inspectable text artifacts only in current repo | Inspectable text artifacts only |
| Learning loop | Generator -> reflector -> curator -> playbook update | Human+agent write -> connect -> validate -> mature |
| Update style | Append-heavy delta updates with counters | Manual curation and note revision |
| Oracle strength | Ground truth or environment feedback when available | Weak, mostly human judgment |
| Storage model | Files for playbooks/results, no weight update path in repo | Files in git |
Trace-derived learning placement. On axis 1, ACE fits the trajectory-run pattern: it learns from repeated attempts and their feedback rather than from one live session stream. On axis 2, it is a strong trace-derived artifact-learning system: the promoted result is an inspectable playbook, not weights. It should be added to the survey because it strengthens the artifact-learning side and shows a more explicit counter-based maintenance loop than the current examples.
ACE also sharpens one claim in the survey: artifact-learning systems differ not only by source trace and promotion target, but by whether maintenance is append-only, counter-based, or full-CRUD. ACE currently lands in the middle: richer than pure append, weaker than true curation.
Borrowable Ideas
Bullet-level usage counters. Ready now. The helpful / harmful counters are a concrete intermediate between raw recurrence and full human review. We could apply the same idea to workshop observations or distilled learnings without pretending to have a full oracle.
Separate reflection from promotion. Ready now. ACE does not ask one prompt to both diagnose and mutate the artifact. That separation would make our own workshop reflection experiments easier to reason about.
Operation-shaped curator output. Needs a use case first. Even though ACE only fully implements ADD, the curator prompt and schema force proposed changes into explicit operations. That is a cleaner handoff boundary than freeform prose revision.
Stable IDs on learned artifacts. Ready now for some workshop outputs. Bullet IDs let ACE track usage and update statistics over time. We currently lack stable identifiers for many workshop-derived learnings.
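For our own workshop-derived learnings, one cheap way to get stable identifiers is to mint an ID from a hash of the normalized text at creation time and never change it afterward. This scheme is our assumption for the Commonplace side, not anything ACE does; the `wl-` prefix and helper name are hypothetical.

```python
import hashlib

def mint_id(text: str, prefix: str = "wl") -> str:
    """Mint a stable ID for a learning from its normalized text."""
    digest = hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:8]}"
```

The ID depends only on the text at creation, so counters and tags can accumulate against it across sessions; if the text is later revised, that is a new artifact with a new ID, which keeps the usage history honest.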
Curiosity Pass
The interesting part of ACE is not the three-role architecture by itself. The mechanism that matters is the combination of bullet IDs, usage tagging, and append-only curation. That means the system's strongest claim is narrower than the README pitch: it is not a general self-improving context engine yet, but a specific pipeline for accumulating and scoring playbook bullets from feedback.
The simplest alternative to much of ACE is "rewrite the whole prompt after each round." ACE's real gain over that baseline is not mystical context engineering; it is preserving local history and attaching counters to reusable advice. That is a genuine mechanism.
The biggest ceiling is maintenance. Because the actual write path is mostly ADD, the playbook can grow and optional deduplication can merge near-duplicates, but the system does not yet have a robust implemented path for revising or retiring bad advice. Even if the reflector is perfect, the maintenance policy is still structurally biased toward accretion.
This makes ACE a useful comparison point for our survey. It is more explicit than freeform reflection-memory loops, but less mature on lifecycle than the README implies.
What to Watch
- Whether UPDATE, MERGE, and DELETE become first-class implemented operations rather than schema promises
- Whether the counter mechanism stays useful as playbooks grow, or whether it collapses into prompt bloat
- Whether ACE adds a weight-promotion path, which would move it closer to Autocontext's mixed artifact/weight position
- Whether the online mode proves robust outside benchmark settings with strong feedback signals
Relevant Notes:
- trace-derived learning techniques in related systems — extends: ACE is a strong additional artifact-learning example built from repeated run feedback and explicit playbook mutation
- Autocontext — compares: both learn from repeated runs into playbook-like artifacts, but ACE stays artifact-only while Autocontext also exports to training
- ClawVault — compares: both preserve inspectable learned artifacts, but ClawVault mines live sessions while ACE mines repeated evaluated attempts
- memory management policy is learnable but oracle-dependent — sharpens: ACE shows the same oracle dependence on the artifact side; without feedback, its counters and curator loop have little to stand on
- automating KB learning is an open problem — contrasts: ACE has a clearer verifier than open-ended KB curation, which is why its automated promotion loop can be narrower and more concrete