Ingest Report: SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Type: kb/sources/types/ingest-report.md

Source: skillopt-executive-strategy-self-evolving-agent-skills.md
Captured: 2026-05-28
From: https://arxiv.org/html/2605.23904v2

Classification

Type: scientific-paper -- arXiv preprint with a proposed optimization method, benchmark evaluation, baselines, ablations, and limitations.
Domains: skill-optimization, deploy-time-learning, trace-derived-learning, readable-artifacts
Author signal: multi-author research paper with benchmark tables and ablation/transfer studies, but not yet independently reproduced in this KB.

Summary

SkillOpt treats an agent skill document as trainable external state. A separate optimizer model inspects scored rollouts and proposes bounded add/delete/replace edits to a single skill, while a held-out validation split accepts only strict improvements. The loop keeps rejected edits as negative evidence, uses a textual learning-rate budget plus slow/meta updates to preserve useful strategy, and deploys only a compact best_skill.md artifact at inference. Across six benchmarks, seven target models, and direct-chat/Codex/Claude Code harnesses, the paper reports consistent gains over no-skill, human-written, one-shot, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with transfer across models, harnesses, and some benchmark pairs. The key boundary is evaluator quality: SkillOpt is strongest where scored trajectories and held-out validation are available.
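
The accept/reject loop summarized above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `GateState`, `gated_step`, and the toy proposer and validator are hypothetical names, and a real setup would replace the stand-ins with an optimizer model call and a held-out rollout scorer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GateState:
    """Hypothetical container for the loop's state (names are illustrative)."""
    best_skill: str
    best_score: float
    rejected: List[str] = field(default_factory=list)  # kept as negative evidence


def gated_step(
    state: GateState,
    propose: Callable[[str, List[str]], str],
    validate: Callable[[str], float],
) -> bool:
    """Run one propose/validate round; return True if the candidate was accepted."""
    candidate = propose(state.best_skill, state.rejected)
    score = validate(candidate)
    if score > state.best_score:  # strict improvement only; ties are rejected
        state.best_skill, state.best_score = candidate, score
        return True
    state.rejected.append(candidate)
    return False


# Toy stand-ins (assumptions): a real optimizer would be a model call, and a
# real validator would score rollouts on a held-out split.
def toy_validate(skill: str) -> float:
    return float(sum(kw in skill for kw in ("verify", "cite")))


def make_toy_proposer(candidates: List[str]) -> Callable[[str, List[str]], str]:
    queue = iter(candidates)
    return lambda current, rejected: next(queue)


initial = "Answer the question."
state = GateState(best_skill=initial, best_score=toy_validate(initial))
propose = make_toy_proposer([
    "Answer the question. Be brief.",          # no gain: rejected
    "Answer the question. verify each step.",  # gain: accepted
    "Answer quickly. verify nothing twice.",   # tie with current best: rejected
])
results = [gated_step(state, propose, toy_validate) for _ in range(3)]
```

In this sketch only `state.best_skill` would be deployed; the `rejected` list stays on the optimizer side and is passed back to the proposer on later calls.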

Connections Found

Extractable Value

  • Skill documents can be trained as readable external policy [quick-win]. SkillOpt is strong evidence for treating skill prose as a behavior-shaping artifact that can be improved without weight updates.
  • Validation gates are the practical boundary for automated skill evolution [quick-win]. The strict held-out gate is what lets text edits become a learning loop instead of uncontrolled prompt rewriting.
  • Textual learning-rate budgets and bounded edits reduce skill drift [experiment]. The add/delete/replace budget gives an operational pattern for future skill revision tools: constrain the edit surface before evaluating effects.
  • Rejected edits are retained learning artifacts [experiment]. Keeping a rejected-edit buffer turns failed proposed changes into negative evidence for later optimizer calls.
  • Slow/meta updates split runtime context from optimizer memory [deep-dive]. The deployed skill stays compact, while the optimizer can use richer history and strategy outside the runtime path.
  • Skill transfer suggests learned procedure can outlive one harness [just-a-reference]. Cross-model and Codex/Claude Code transfer support the claim that optimized skills can encode portable task procedure rather than only harness quirks.
  • Single-skill optimization does not solve skill-library governance [just-a-reference]. SkillOpt improves one domain skill; it does not address discovery, routing, retirement, provenance, or conflicts across many skills.
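
One way a future skill revision tool might constrain the edit surface is sketched below. It assumes, purely for illustration, that the budget is a cap on the number of line-level add/delete/replace operations per step; the source does not specify how SkillOpt measures its textual learning-rate budget, and `Edit` and `apply_bounded_edits` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                     # "add", "delete", or "replace"
    index: int                  # position in the ORIGINAL list of lines
    text: Optional[str] = None  # new line content for add/replace


def apply_bounded_edits(lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply edits to a copy of `lines`, refusing proposals that exceed the budget.

    Assumption: the budget counts edit operations. Other measures (characters
    or tokens changed) would fit the same pattern.
    """
    if len(edits) > budget:
        raise ValueError(f"{len(edits)} edits exceed budget of {budget}")
    if len({e.index for e in edits}) != len(edits):
        raise ValueError("at most one edit per line index is supported")
    out = list(lines)
    # Apply from the highest index down so lower indices remain valid.
    for e in sorted(edits, key=lambda e: e.index, reverse=True):
        if e.op == "add":
            out.insert(e.index, e.text or "")
        elif e.op == "delete":
            del out[e.index]
        elif e.op == "replace":
            out[e.index] = e.text or ""
        else:
            raise ValueError(f"unknown op: {e.op}")
    return out


skill = ["Read the task.", "Guess the answer.", "Report."]
edits = [
    Edit("replace", 1, "Verify the answer against the source."),
    Edit("add", 3, "Note any uncertainty."),
]
updated = apply_bounded_edits(skill, edits, budget=2)
```

Rejecting an over-budget proposal outright, rather than truncating it, keeps each evaluated candidate attributable to a small, known set of changes.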

Limitations (our opinion)

  • This is a scientific preprint; look for independent reproductions and verify implementation details before treating the numbers as settled.
  • The arXiv HTML snapshot is sufficient for conceptual ingestion but may lose exact formulas and table formatting; use the PDF for precise numeric or mathematical claims.
  • The method is strongest in domains with scored trajectories and held-out validation. Open-ended, subjective, or sparse-feedback skill domains still need stronger human or model evaluation.
  • The optimized skill can encode benchmark-specific heuristics. Held-out validation and transfer tests reduce this risk but do not eliminate it.
  • The paper optimizes a compact skill artifact, not a full skill library lifecycle with provenance, routing, retirement, and conflict management.

Write a note tentatively titled "Skill documents can be trained as readable external policy". It should connect SkillOpt to the readable-artifact loop, deploy-time learning, diagnostic richness, and verifiability gradient notes. The central claim should be that scored rollouts plus held-out validation can make prose skills into trainable external policy artifacts, with the boundary condition that the evaluator must be good enough to make text edits learnable rather than merely plausible.