EQUIPA

Type: agent-memory-system-review · Status: current · Tags: related-systems, trace-derived

EQUIPA is a Python/SQLite multi-agent coding orchestrator by Forgeborn, pitched as "your AI development team": a conversational entry point (MCP server) in front of a task-graph executor that dispatches role-specialised Claude/Ollama agents into per-task git worktrees, iterates through a developer-tester cycle, and writes episodes, lessons, rules, prompt mutations, config deltas, and benchmark results back into one shared SQLite database (theforge.db, 30+ tables). The repo is now v3.1 with an internal benchmarking harness (FeatureBench plus SWE-bench Verified setup), 15 advertised agent roles, and an autoresearch loop that mutates prompts against benchmark targets. It continues to read primarily as an operational control plane for repeated coding work, not a knowledge library.

Repository: https://github.com/sbknana/equipa

Core Ideas

The executor is a SQLite state machine over git worktrees. equipa/cli.py, equipa/dispatch.py, and equipa/loops.py drive tasks off the tasks table; equipa/git_ops.py creates per-task worktrees (forge-task-{id} branches) so parallel agents cannot corrupt each other or main. Every agent turn logs a row in agent_actions (tool name, input hash, success, error class, duration), every run ends as an agent_runs row, and every post-run reflection becomes an agent_episodes row with q_value, reflection, and an optional embedding. The durable substrate is operational telemetry, not prose.

Trace promotion is a ladder, not a bucket. EQUIPA keeps separate stages/scripts with mostly typed table rows across compilation levels: agent_episodes (raw reflections with Q-values), lessons_learned (recurring error signatures with effectiveness scores and embeddings, including SIMBA-generated rules, though the current retrieval path appears to have a simba_generated/simba source-label mismatch), forgesmith_changes (config and prompt mutations with rationale, impact assessment, and rollback column), and GEPA-optimised prompt variants via forgesmith_gepa.py (DSPy/CMA-ES prompt evolution with A/B rollback). forgesmith_impact.py blocks HIGH-risk mutations from auto-applying. The pipeline is explicit — COLLECT → ANALYZE → LESSONS → SIMBA → RUBRICS → APPLY → GEPA → LOG — and each step's output is a separate artifact class, with rubric scores/evolution treated as evaluation artifacts rather than promotion artifacts.

Benchmarks are the primary oracle, and they have their own subsystem. The benchmarks/ directory is the April 2026 focus: a swebench_runner.py for SWE-bench Verified (500-instance dataset, Docker-backed evaluation), featurebench_runner.py for the in-house "harder" benchmark, and CumulativeDB (benchmarks/cumulative_db.py) which extracts lessons_learned, agent_episodes, and decisions from per-container theforge.db files and seeds a master cumulative DB. On subsequent runs this accumulated knowledge is injected back into fresh containers for warm-start. This is the clearest cross-run transfer mechanism in the repo, and it is newer than the lesson layer it consumes.

Autoresearch is an outer loop that tunes prompts against benchmark targets. scripts/autoresearch_loop.py implements an ATLAS-style cycle: collect benchmark metrics per role, mutate the prompt file via a Claude subprocess, deploy the mutated prompt, reset the test project's git state, dispatch benchmark tasks, wait for results, and commit or revert based on metrics. Targets are codified as role success-rate thresholds (e.g. developer 80%, security-reviewer 85%). The claimed result is "6/7 roles at 100%" on the internal benchmark, which the README treats as a reason to prefer autoresearch over fine-tuning for most users.

Anti-drift and anti-compaction controls are architectural, not cosmetic. equipa/monitoring.py implements loop detection, monologue termination, budget tracking, and dynamic budget adjustment; equipa/checkpoints.py writes .forge-state.json soft-checkpoints for long tasks so a replacement agent can resume after compaction; equipa/bash_security.py is a 12+ regex command-injection filter with 80+ test patterns; lesson_sanitizer.py strips prompt-injection content out of lesson text and wraps injected lessons in a <task-input trust="derived"> delimiter block before they ever reach an agent prompt. The repo treats "agents drift, loop, compact, stall, or get prompt-injected via their own memory" as a normal systems problem.

Comparison with Our System

Dimension	EQUIPA	Commonplace
Primary purpose	Execute and improve coding-agent work over repeated tasks and benchmarks	Accumulate and structure durable knowledge for future reasoning and writing
Main substrate	SQLite operational store (30 tables) + prompt/config files + task worktrees	Markdown notes, indexes, instructions, ADRs, and workshop artifacts in git
Learning target	Operational behaviour: episodes, lessons, SIMBA rules, GEPA prompts, config mutations, optional fine-tuning data	Conceptual and methodological knowledge: notes, links, definitions, instructions
Primary oracle	Test outcomes, rubric scores, FeatureBench/SWE-bench success rates, effectiveness scores with rollback	Deterministic validation, semantic review bundles, human judgement
Retrieval model	Tool-mediated DB queries, Ollama embeddings with cosine similarity, PageRank over `lesson_graph_edges`	Descriptions, typed semantic links, indexes, direct file reads
Human inspectability	Medium: code and schema are readable, but most state lives behind tables and MCP tools	High: primary artifacts are directly readable files with articulated relationships
Workshop/library split	Strong workshop (task-outputs, arena results, nightly reviews), thin library	Strong library, emerging workshop layer
Self-modification	Explicit and logged (`forgesmith_changes` with rollback column, impact assessment)	None automated — mutations go through human review

EQUIPA is stronger where the task has a hard oracle and the loop can be benchmarked; commonplace is stronger where the task is curation of inspectable understanding. EQUIPA's useful asymmetry for us is that it has actually wired up three non-trivial pieces our system has not: a typed forgesmith_changes log of automated mutations with rollback, an impact-assessment gate that can refuse high-blast-radius changes before they apply, and a cumulative-knowledge export that survives across isolated benchmark containers. Our system would need analogous machinery the moment we let automation write into the KB.

Borrowable Ideas

Treat automated mutations as a typed change log with rollback, not as edits. forgesmith_changes stores change_type, target_file, old_value, new_value, rationale, evidence, effectiveness_score, reverted_at, and impact_assessment. Any automated KB writer we build later should emit an equivalent record set, not just mutate files in-place. Ready to borrow now as a design constraint even before we have an automated writer.

Gate automated mutations with an explicit blast-radius assessment. forgesmith_impact.py evaluates affected roles, task types, and risk level before apply, blocks HIGH-risk changes from auto-apply, and persists the assessment alongside the change. This is a concrete template for any future review-bundle auto-fix pipeline in commonplace: do not let a low-confidence fixer silently touch files that other workflows depend on. Ready to borrow as a pattern the next time we discuss auto-fix automation.

Cross-container knowledge transfer via an extract/merge/inject cycle. CumulativeDB is a small idea implemented well: when each run is isolated, export the learned tables at the end, merge them into a master DB, and inject them on next startup. The commonplace analog is inter-repo KB transfer (e.g. lessons discovered in a consuming project that should flow back into methodology). The shape — export typed artifacts, merge with source-of-origin tagging, inject selectively — is directly reusable. Needs a concrete use case first; the instinct to borrow this hardens once we have two KBs that should share workshop-derived observations.

Sanitise and wrap trace-derived context before reinjection. lesson_sanitizer.py plus format_lessons_for_injection wrap injected lessons in <task-input type="lessons" trust="derived"> tags with an unpredictable delimiter, then run content sanitisation against prompt-injection. Any workshop-to-library feedback loop we build that re-reads agent output should adopt the same "treat derived text as untrusted" default. Ready to borrow now for review-bundle automation that pipes LLM output back into prompts.

Make the self-improvement pipeline a named sequence with per-stage outputs. The COLLECT → ANALYZE → LESSONS → SIMBA → RUBRICS → APPLY → GEPA → LOG pipeline is long but inspectable; each stage writes to a distinct table and can be dry-run independently (--simba, --gepa, --report). When we eventually automate KB review, a comparable named pipeline with per-stage artifacts is easier to debug than a single synthesis pass. Ready to borrow as a framing tool for the review system now.

Curiosity Pass

The "self-improving team" framing still outruns what the repo closes. The episode → lesson → SIMBA-rule → prompt-patch → config-delta chain is implemented. The claim that prompt evolution can replace fine-tuning for most users is a marketing compression — forgesmith_gepa.py is real, but the GEPA run depends on DSPy, benchmark tasks, and a judge, and the published "6/7 roles at 100%" number is on FeatureBench, which is internal. The recently added SWE-bench Verified setup (April 12 2026) explicitly targets 70–80% and has only calibrated on 20 instances. The machinery is stronger than most peers in this catalog, but the ceiling is still set by oracle quality and benchmark scope, not by any one learning stage.

The weight-learning path remains mostly documentation. docs/TRAINING.md still points at train_qlora.py and train_qlora_peft.py, which are not in the repo. tools/prepare_training_data.py and tools/forge_arena.py are present, plus a new tools/ingest_training_results.py that ingests completion markers into model_registry. So the path is "generate arena data → prepare ChatML JSONL → hand off to external trainer → ingest model registry rows when done." The artifact-to-weights promotion is real-but-offloaded; the trainer is not checked in. The previous review called this split out and it has not changed materially.

The knowledge graph is still lesson-shaped even when the framing is broader. lesson_graph_edges stores src_id, dst_id, edge_type, weight with coaccessed, similarity, and sequence edge types, and equipa/graph.py implements PageRank via power iteration over that table. lessons.py reranks by PageRank. The README talks about a knowledge graph that prioritises past experiences; the implemented graph is a lesson-centric reranker with episode-level retrieval layered on top, not a general multi-entity graph. The simpler honest framing is "PageRank-weighted lesson retrieval." That is still useful; the "knowledge graph" label is aspirational.

"Zero dependencies" applies to the core, not the outskirts. The orchestrator, MCP server, DB layer, loops, monitoring, and graph code are genuinely stdlib. GEPA requires DSPy; embeddings require Ollama; training-data prep requires datasets; SWE-bench harness requires swebench and Docker. The repo's honest design is "stdlib core, optional heavy outskirts," which is still a good rule — it is worth copying the principle (commonplace has an analog: stdlib-only throwaway tooling vs. the runtime package) without importing the marketing.

The role catalog keeps expanding faster than the operational rigor behind most roles. prompts/ ships 15 role files, including game-specific testers (multiplayer-tester, story-tester, economy-tester, world-builder). The CLAUDE.md describes 9 roles. The core coding harness still centers on developer/tester/debugger/security-reviewer/planner-style roles, while the autoresearch benchmark table currently reports seven optimized roles including frontend/game-specific roles. More than a year into the project, the periphery is prompt inventory and aspiration; the center of gravity is the coding harness plus its benchmarks.

Trace-derived learning placement. EQUIPA consumes agent execution telemetry — per-tool agent_actions rows, per-run agent_runs summaries, per-agent agent_episodes with parsed reflection and failure classification — plus benchmark outcomes from FeatureBench and SWE-bench runners. Trigger boundaries are per-turn (action logging), per-run (episode write, Q-value update, quality scoring), per-nightly (forgesmith.py --auto consolidation, lesson extraction, rubric evolution), per-benchmark-round (autoresearch compare-and-commit), and per-container (CumulativeDB extract on benchmark container exit). Extraction produces five named artifact classes: flat lessons keyed by (role, error_signature) with effectiveness scores, SIMBA behavioural rules generated by Claude from high-variance episode clusters, prompt mutations proposed by GEPA through DSPy/CMA-ES, config deltas (turn budgets, concurrency, model routing) chosen by pattern-matching thresholds, and ChatML JSONL training examples from arena runs. Oracles are layered: test pass/fail and build status (hard), rubric quality scores with role-specific weights (soft), effectiveness score before/after with auto-rollback below -0.3 (meta), impact assessment (gate), benchmark success rate (outer loop). Promotion target is primarily filesystem and SQLite artifacts: prompts in prompts/*.md, config in dispatch_config.json, rules and lessons in theforge.db tables — and optionally model weights via an external fine-tuning path that ingests results into model_registry. Scope is multi-project within a single theforge.db keyed by project_id, with CumulativeDB as the only cross-database transfer mechanism; generalisation across benchmark instances is the driver for that design. Timing is online per-turn (action logging, loop detection), offline nightly for ForgeSmith and GEPA, and staged per benchmark round for autoresearch. On the survey's axes: axis 1 — service-owned trace backend (EQUIPA defines its own agent_actions/agent_runs/agent_episodes schema and consumes it), with a secondary trajectory-run mode for benchmark-driven autoresearch and GEPA. Axis 2 — split, like Autocontext: the primary path is symbolic-artifact learning (lessons, SIMBA rules, GEPA-mutated prompts, config), with an optional weights branch via arena → ChatML export → external QLoRA → model_registry ingest. EQUIPA strengthens the survey's claim that typed mutation verbs and per-stage artifact classes scale further than a single reflection buffer, and it warrants a small extension to the artifact-structure spectrum: rollback-logged file mutations as a distinct class (Autocontext has harness mutations, EQUIPA adds typed rollback provenance with impact assessment). No new subtype needed.

What to Watch

Does the SWE-bench Verified harness produce a public number, and does it hold up against published baselines, or does the internal FeatureBench number stay the main advertised result?
Does CumulativeDB gain richer lifecycle management (retirement, contradiction detection, cross-domain filtering), or stay a bulk extract/inject shim for benchmark warm-start?
Does the weight-learning path get checked in end-to-end (train_qlora*.py plus a functioning arena → trainer → registry loop), or remain a documentation-only handoff to an external trainer?
Does the forgesmith_changes log grow stronger lifecycle tooling (pruning, conflict detection across rules, per-role change limits), or accumulate into prompt and config sludge as the 15-role surface keeps widening?
Does the lesson_graph_edges substrate pick up non-lesson entities (episodes, tasks, agents) so the "knowledge graph" label becomes structurally honest, or stay a lesson reranker with PageRank framing?
Does the autoresearch loop converge on role-level success targets that generalise off-benchmark, or does prompt mutation drift toward benchmark-overfit wording?

Relevant Notes:

trace-derived learning techniques in related systems — extends: EQUIPA is a service-owned trace backend with a split promotion target; its typed forgesmith_changes log and CumulativeDB warm-start are concrete extensions of the artifact-structure spectrum
deploy-time learning is the missing middle — exemplifies: EQUIPA's primary adaptive path is durable symbolic mutation (lessons, rules, prompts, config) during deployment rather than retraining-only improvement
constraining during deployment is continuous learning — exemplifies: lessons, SIMBA rules, GEPA prompt variants, and config deltas are all deploy-time behavioural constraints derived from prior runs
a functioning knowledge base needs a workshop layer, not just a library — contrasts: EQUIPA has a strong operational workshop (task-outputs, benchmark runners, arena, nightly review) but almost no library layer, which clarifies what our stack still lacks on the execution side
automating KB learning is an open problem — contrasts: EQUIPA automates the parts with strong operational oracles (tests, benchmarks, rubric scores) and leaves the richer synthesis problem outside its scope
files beat a database for agent-operated knowledge bases — complicates: EQUIPA is a good counterexample where operational state genuinely benefits from SQLite, even though inspectability is weaker than files-first knowledge systems
CORAL — sibling: both are coding-agent harnesses with git-worktree isolation, but EQUIPA invests much more in trace-derived self-modification and owns its benchmark harness
Autocontext — sibling: both are trajectory-run outer loops that split promotion between artifacts and optional weights; EQUIPA is more repo-operational (task graph, MCP, impact-gated mutations), Autocontext more explicit about cross-generation knowledge artifacts like playbooks
ACE — contrasts: ACE splits tagging from mutation through deterministic code; EQUIPA splits mutation from rollback through a typed change log with impact assessment, a stronger provenance guarantee at the cost of a richer schema

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search