EQUIPA
Type: ../types/agent-memory-system-review.md · Status: current · Tags: trace-derived
EQUIPA is a Python coding-agent orchestrator by sbknana / Forgeborn. It treats Claude Code or Ollama-backed agents as a development team: tasks live in a SQLite "TheForge" database, specialized roles get dispatched into dev/test/security/review loops, parallel work can run in isolated git worktrees, and several self-improvement loops mine execution history into lessons, rules, prompt changes, config changes, benchmark decisions, and optional fine-tuning data. It is not just a memory library; it is an agent runtime whose memory surfaces directly affect future dispatch behavior.
Repository: https://github.com/sbknana/equipa
Reviewed revision: 279f39fdd8336fe154c458fbd6e0dab6a91d81b0
Core Ideas
The durable substrate is a SQLite operations database, not a note corpus. EQUIPA's canonical schema stores projects, tasks, project context, run telemetry, lessons, episodes, mutation logs, rubric scores, model registry entries, inter-agent messages, action traces, flow revisions, config snapshots, and session state in one local SQLite database (schema.sql, equipa/db.py). This makes the DB simultaneously an operational queue, a memory store, an evaluation log, and a control-plane ledger. The stored rows are knowledge artifacts when queried as evidence or history; many become system-definition artifacts when they route tasks, inject prompt content, tune prompts/configs, or gate merges.
Agent execution is instrumented at several different granularities. agent_runs records role, model, turns, cost, outcome, prompt version, errors, and file-change counts after each dispatch (equipa/db.py). agent_actions stores per-tool action records from the streaming runner, while agent_sessions captures bounded resume/postmortem state such as open files, changed files, recent tool calls, partial reasoning, compaction count, and soft-checkpoint path (equipa/agent_runner.py, equipa/sessions.py). Large tool outputs can be persisted to session-local files with preview references to avoid context bloat (equipa/tool_result_storage.py). These raw and lightly compacted traces are not all the same artifact: some are diagnostic logs, some are resumability state, and some feed learning loops.
Prompt-time memory is assembled from distilled lessons and episodic records. ForgeSmith extracts repeated error patterns from agent_runs into lessons_learned, while code-review and security-review findings can also become developer lessons (forgesmith.py, equipa/loops.py). Separately, agent_episodes stores approach summaries, outcomes, error patterns, reflections, q-values, embeddings, and injection counts (equipa/lessons.py). build_system_prompt then fetches, deduplicates, wraps, and injects lessons and episodes into the dynamic suffix of the role prompt, while keeping role instructions in a static cacheable prefix (equipa/prompts.py). The prompt injection path gives these rows instruction-like authority even when the rows themselves are stored as prose.
Retrieval is hybrid but compact. Episode selection starts with same-role/project/task-type filters and q-value thresholds, then reranks with recency, keyword overlap, optional Ollama embeddings, and optional PageRank over a lesson graph (equipa/lessons.py, equipa/embeddings.py, equipa/graph.py). The representational form is mixed: lessons and reflections are prose, q-values and counters are symbolic telemetry, embeddings are distributed-parametric records serialized into SQLite, and graph edges are symbolic relationships that alter ranking.
Self-improvement mutates both prompt/config artifacts and database memory. ForgeSmith can tune dispatch_config.json, append prompt patches, reset or escalate blocked tasks, log forgesmith_changes, run SIMBA, run GEPA, evaluate changes, and roll back underperformers (forgesmith.py). SIMBA analyzes successful/failing agent_episodes, asks Claude for role-specific rules, stores validated rules back into lessons_learned, evaluates before/after success rates, and prunes stale rules (scripts/forgesmith_simba.py). GEPA converts episodes into DSPy examples, evolves role prompts, writes versioned prompt files, records the mutation in forgesmith_changes, and supports A/B selection through prompt-building telemetry (forgesmith_gepa.py, equipa/prompts.py). Config snapshots add a separate rollbackable lineage layer for prompt/config files, with redaction and atomic writes (equipa/config_versions.py).
Work execution is isolated, evaluated, and partially governed. Multi-task dispatch can create per-task git worktrees and merge only successful branches; security review artifacts can block merges on critical/high findings, and missing artifacts block by default unless configured otherwise (equipa/dispatch.py, tests/test_dispatch_parallel_security_review.py). The dev-test loop records episodes, lessons, security findings, quality scores, and q-value updates as side effects of real task execution (equipa/loops.py, equipa/reflexion.py). This is stronger than a passive memory store: the system changes what agents see and whether their branches merge.
There are two export stories with different authority. The project-template exporter writes project-scoped JSONL for projects, tasks, decisions, session notes, open questions, lessons, episodes, and runs, plus selected assets, while explicitly excluding API keys and model registry rows (equipa/templates.py). Forge Arena can export agent_episodes into LoRA-ready ChatML JSONL for model training, while prepare_training_data.py builds a separate external-dataset QLoRA corpus and ingest_training_results.py records completed model runs in model_registry (tools/forge_arena.py, tools/prepare_training_data.py, tools/ingest_training_results.py). Template export transfers operational memory as data; LoRA export converts episodes into learning input with much stronger behavioral authority if a trained adapter is later deployed.
Comparison with Our System
| Dimension | EQUIPA | Commonplace |
|---|---|---|
| Primary substrate | SQLite operational database plus prompt/config files, worktrees, logs, and exports | Git-tracked markdown KB with generated indexes and validation |
| Main consumer | Acting coding agents, dispatch loops, self-improvement scripts, MCP callers | Agents and maintainers reading/writing knowledge artifacts under type contracts |
| Memory creation | Automatic from runs, reviewer findings, reflections, benchmark loops, and rule/prompt optimizers | Deliberate notes, reviews, indexes, instructions, ADRs, workshop artifacts |
| Retrieval | SQL filters, q-values, recency, keyword overlap, optional embeddings, graph PageRank | rg, descriptions, indexes, explicit links, type-guided navigation |
| Behavioral authority | Lessons and episodes inject into prompts; config/prompt mutations change future agents; gates affect merges | Notes advise; instructions/skills/validators enforce or guide through explicit files |
| Lineage | Mostly database rows, run IDs, mutation logs, snapshots, and benchmark windows | File history, frontmatter, source links, review status, authored relationships |
| Evaluation | Success rates, q-values, rubric scores, benchmark sweeps, GEPA/SIMBA rollback/pruning | Validation, semantic review bundles, human/agent curation, design review |
| Portability | Project template JSONL and optional training exports | Plain markdown repo can be inspected, forked, reviewed, and diffed directly |
EQUIPA is much more operational than commonplace. It has a live runtime, hard feedback loops, isolated work execution, merge gates, q-value updates, benchmark-driven prompt search, and optional weight-learning export. Commonplace is more explicit about artifact contracts: a note's storage, representational form, lineage, status, and intended reader are visible in the artifact and collection rules. EQUIPA's behavior-shaping state is real, but it is spread across rows, prompt files, config files, worktree state, and scripts.
The clearest design divergence is authority placement. In commonplace, stronger authority usually means promotion into an inspectable system-definition artifact: instruction, skill, validator, ADR, type spec, or command. In EQUIPA, stronger authority often means a row is active, injected, selected by a feature flag, scored by q-value, or written into prompt/config files by an optimizer. That gives faster closed-loop adaptation but makes review harder: the same lessons_learned table can hold generic ForgeSmith lessons, SIMBA rules, reviewer-derived warnings, embeddings, counters, and active/inactive lifecycle state.
Read-back: both — prompt construction injects selected lessons and episodes, and agents can query lessons/logs through MCP or CLI.
Borrowable Ideas
Artifact split for operational traces. Ready to borrow conceptually. EQUIPA usefully separates run summaries, tool actions, resumability state, inter-agent messages, lessons, episodes, mutation logs, and config snapshots. Commonplace should keep that split if it ever adds richer workshop telemetry: raw trace, compact episode, distilled lesson, system-definition mutation, benchmark result, and export artifact should not collapse into one "memory" bucket.
Injection counters and outcome-weighted memory. Worth a small experiment before adopting. times_injected, q-values, before/after success rates, and stale-rule pruning give memory a feedback channel beyond "exists in the KB." A commonplace analogue would score whether an instruction or note was loaded before successful/failed work, but only if the harness can collect that signal without creating noisy incentives.
Config/prompt snapshots as first-class lineage. Ready to borrow for agent-facing generated artifacts. EQUIPA's config_versions layer is simple and useful: snapshot small prompt/config files, redact secrets, deduplicate by aggregate hash, diff, and rollback. Commonplace generated indexes already have regeneration paths; high-authority generated instruction bundles would benefit from similarly explicit snapshot lineage.
Benchmark-gated prompt mutation kept outside the library layer. Needs a concrete target. Autoresearch, GEPA, and Forge Arena are useful because they run in a benchmark/workshop lane, not because every prompt change should be automatic. Commonplace could borrow the pattern for evaluation harnesses while keeping promoted instructions human-readable and reviewable.
Security-review findings as future developer lessons. Ready in principle, with curation. EQUIPA's path from review artifact to developer lesson is a tight trace-to-prevention loop. In commonplace, a valid version would turn recurring review findings into a note, checklist, validator, or skill only after source examples and scope are clear.
Runtime-neutral template export. Ready as a reference if consuming projects need portable seeds. EQUIPA's JSONL export of project rows plus assets is less inspectable than markdown, but it is a practical way to move operational state between runtimes while excluding secrets and host-local model registry state.
Trace-derived learning placement
EQUIPA strongly qualifies as trace-derived learning. It has multiple trace sources and several promotion targets with different behavioral authority.
Trace source. The raw signals include agent_runs execution telemetry, agent_actions tool traces, agent_sessions resume/postmortem snapshots, agent_messages between roles, task descriptions and project context, security/code review artifacts, dev-test outcomes, Reflexion text extracted from agent output, Forge Arena benchmark episodes, and autoresearch benchmark results (schema.sql, equipa/agent_runner.py, tools/forge_arena.py, scripts/autoresearch_loop.py). These should be treated separately: action logs and sessions are raw/near-raw traces, episodes are compacted trace summaries, lessons/rules are distilled prose, prompt/config changes are system-definition mutations, and LoRA JSONL is learning input.
Extraction. ForgeSmith groups repeated failures from agent_runs into generic lessons; reviewer gates extract code/security findings into developer lessons; Reflexion extracts approach/reflection/error patterns into agent_episodes; SIMBA contrasts successes and failures to generate role-specific rules; GEPA uses episode examples and failure traces to propose prompt mutations; autoresearch mutates prompts against benchmark outcomes; Forge Arena converts episodes into ChatML training examples (forgesmith.py, equipa/loops.py, scripts/forgesmith_simba.py, forgesmith_gepa.py, tools/forge_arena.py). The oracles vary: heuristic grouping, review severity, task outcome, q-value updates, before/after success rates, benchmark target percentage, and GEPA/DSPy scoring.
Storage substrate. Raw and distilled state mostly persists in SQLite: agent_runs, agent_actions, agent_sessions, agent_messages, agent_episodes, lessons_learned, lesson_graph_edges, forgesmith_changes, rubric_scores, rubric_evolution_history, config_versions, and model_registry. Some high-authority artifacts live as files: role prompts under prompts/, standing orders, dispatch/config JSON, worktree branches, security-review markdown artifacts, template export JSONL, Arena .arena-exports/*.jsonl, and training result marker JSON files.
Representational form. The system uses mixed forms. Prose appears in lessons, reflections, rules, prompt patches, standing orders, task descriptions, and review findings. Symbolic state appears in SQL rows, q-values, counters, config keys, branch names, status fields, feature flags, migration versions, and JSONL exports. Distributed-parametric state appears in Ollama embeddings stored as JSON vectors and in any LoRA/model artifacts produced from exported training data. The important operative split is not "database versus files"; it is whether a row is only evidence, prompt-time context, routing/ranking policy, configuration, evaluation input, or model-training input.
Lineage. EQUIPA has partial lineage, but it is uneven. forgesmith_changes records run ID, target file, old/new value, rationale, evidence, effectiveness score, impact assessment, and revert time. config_versions stores file blobs with parent version IDs and content hashes. agent_episodes link to task IDs and project IDs, and Forge Arena exports include episode metadata. Lessons retain source, error signature, times seen, times injected, active status, and effectiveness score, but they do not preserve full source traces inline. Regeneration is possible for many derived rows if the underlying runs remain in the DB; exported LoRA data and applied prompt/config changes are stronger artifacts that need their own retention and rollback discipline.
Behavioral authority. Raw traces are knowledge artifacts when inspected for diagnosis, reporting, or evidence. Episodes become advice-bearing knowledge artifacts when injected as "Past Experience." Lessons and SIMBA rules become system-definition artifacts at prompt time because they instruct future agents and are selected by feature flags, role, q-value, and retrieval policy. forgesmith_changes, GEPA prompt versions, config snapshots, and dispatch config are system-definition artifacts because they configure future runs. Rubric scores, benchmark results, and model registry entries have evaluation authority. LoRA training JSONL becomes learning input; if used to produce/deploy an adapter, the resulting model artifact has distributed-parametric behavioral authority.
Scope. Most memory is per project, per role, and per task type. Some tables allow cross-project transfer by role/task type when same-project episodes are sparse. Template export moves project-scoped operational memory. SIMBA and GEPA operate role-wise and can become cross-task behavior. Training export is the broadest transfer path because it can move behavior from episodes into model weights.
Timing. The runtime loop is online: prompt injection, episode recording, q-value updates, action logging, review lesson creation, and security merge gates happen during or right after task execution. ForgeSmith/SIMBA/GEPA/autoresearch are staged offline or scheduled loops over accumulated traces. Template and LoRA exports are batch operations.
Survey placement. On the trace-derived learning survey axes, EQUIPA spans all three major lanes: trajectory-to-prose lessons/rules, trajectory-to-system-definition prompt/config mutations, and trajectory-to-training-data export. It strengthens the survey claim that serious coding-agent memory systems are moving beyond retrieval into artifact and prompt evolution. It also splits the category: EQUIPA shows why raw execution traces, compacted episodes, active prose lessons, optimizer mutation logs, benchmark results, portable exports, and model-training artifacts need separate taxonomy slots.
Curiosity Pass
The memory architecture is rich but crowded. lessons_learned carries ForgeSmith lessons, SIMBA rules, reviewer findings, embeddings, counters, active flags, and effectiveness scores. That makes retrieval easy, but the artifact contract is overloaded. A future maintainer has to know source, active, times_injected, effectiveness_score, and prompt-builder behavior to understand a lesson's real authority.
The README's "self-improving agents" claim is substantially implemented, but not uniformly governed. There are rollback thresholds, impact assessments, protected prompt sections, GEPA A/B versions, SIMBA pruning, and security merge gates. Still, several loops depend on heuristic parsing and prompt-generated changes. The code has safety rails; it does not have a clean review queue separating "candidate lesson" from "active instruction" for every path.
The training story has two distinct paths that are easy to conflate. prepare_training_data.py mostly builds a corpus from external coding datasets. Forge Arena's export_lora_data is the trace-derived path from EQUIPA episodes into ChatML JSONL. ingest_training_results.py records model completion markers, not the training itself. A review should not treat all "training" files as evidence that live EQUIPA traces automatically become deployed weights.
The filesystem still matters even in a database-first system. Worktrees, prompt files, config JSON, standing orders, large tool outputs, markdown security-review artifacts, JSONL exports, backups, and training markers all sit outside SQLite. EQUIPA is best described as SQLite-centered, not database-only.
The MCP server exposes useful memory surfaces but stays narrow. It can dispatch tasks, create tasks, query lessons, query agent logs, and read project context over JSON-RPC stdio (equipa/mcp_server.py). It is a practical integration layer, but most high-authority mutation and evaluation behavior still lives in CLI/scripts rather than MCP tools.
What to Watch
- Whether
lessons_learnedsplits into separate tables or artifact contracts for lessons, reviewer findings, SIMBA rules, and active prompt instructions. - Whether GEPA/SIMBA/autoresearch changes get an explicit human-review or promotion queue before becoming active system-definition artifacts.
- Whether action traces and session snapshots become first-class evidence for lesson generation, not just observability and resumability data.
- Whether Forge Arena exports are actually used to train/deploy adapters, and whether model registry rows link back to exact episode/export hashes.
- Whether template export/import becomes a stable cross-project memory-transfer path, especially with embedding regeneration and lesson deactivation semantics.
- Whether security and code-review gates remain artifact-based and fail-closed as the parallel worktree merge path evolves.
Relevant Notes:
- Behavioral authority - defined-in: separates diagnostic run records from prompt-injected lessons, merge gates, config changes, and training inputs
- Representational form - defined-in: EQUIPA mixes prose lessons, symbolic SQL/config state, embeddings, and possible model artifacts
- Lineage - defined-in: useful for evaluating
forgesmith_changes,config_versions, exported JSONL, and prompt versions - Retained artifact - defined-in: EQUIPA's rows/files matter only when they can change later agent behavior
- CORAL - compares-with: both use multi-agent coding orchestration and worktree isolation, but EQUIPA has a larger SQLite memory and self-improvement loop
- auto-harness - compares-with: both use benchmark feedback to improve coding-agent behavior; EQUIPA generalizes the loop across roles, prompts, configs, and traces
- Meta-Harness - compares-with: both learn from coding-agent rollouts, but EQUIPA's retained artifacts include operational DB rows and prompt/config mutations as well as benchmark outputs
- AgentFly - compares-with: both convert judged agent runs into reusable case memory/training signal, but EQUIPA is a live coding orchestrator rather than a Memento research pipeline