Agent Skills for Context Engineering

Type: ../types/agent-memory-system-review.md · Tags: trace-learning

muratcankoylan/Agent-Skills-for-Context-Engineering, by Muratcan Koylan, is a Claude Code/Open Plugins skill marketplace for context engineering, harness engineering, evaluation, memory, filesystem context, and related agent-system design topics. At the reviewed commit it is not primarily a runtime memory server. Its standing memory surface is an authored skill corpus plus a repo-native researcher/ operating system for turning external research, benchmarks, and reviewed mechanisms into future skill changes. It also ships a trace-learning example under examples/interleaved-thinking/ that captures reasoning/tool traces, analyzes failures, optimizes prompts, and can generate reusable skills.

Repository: https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

Reviewed commit: 25e1fa79a33f0985793bcab3c64dde8d020c5132

Last checked: 2026-06-04

Core Ideas

Skills are baseline context, not accumulated memory. The plugin manifests list 15 skill directories, and each skills/*/SKILL.md is authored natural-language content with frontmatter descriptions, activation boundaries, guidance, examples, and references (.claude-plugin/marketplace.json, SKILL.md, skills/memory-systems/SKILL.md). Claude Code or another host may activate those skills by task context, but that activation is over shipped documentation, not read-back from a store that grows through use.

Context efficiency is progressive disclosure through skill routing. The README states that agents load names/descriptions first and full skill content only when relevant, and the root skill repeats the skill map as a compact collection index (README.md, SKILL.md). This bounds context volume by loading only the selected skill body. Complexity is managed by one skill per concern, but the actual selection quality depends on the host router and the natural-language descriptions.

The researcher OS makes the corpus a file-backed improvement loop. researcher/ defines source queues, run directories, rubrics, mechanism ledgers, claim provenance, a corpus index, activation fixtures, benchmark harnesses, and deterministic validators for skill-corpus health (researcher/README.md, researcher/corpus/index.json, researcher/mechanisms/registry.jsonl). This is the repository's strongest memory mechanism: it stores what the maintainers have accepted, rejected, measured, and still need to review.

The write path is gated rather than autonomous. research_loop.py can initialize runs, record retrieval/evaluation/proposal states, run novelty checks, validate readiness, promote mechanisms, and close runs; promotion requires a reviewer and passing readiness unless explicitly overridden (researcher/scripts/research_loop.py, researcher/scripts/validate_run.py). The continuous loop can fetch and stage work, but it parks runs that need evaluation or model/human action instead of silently mutating skill content (researcher/scripts/loop_step.py).

There is implemented trace-to-skill tooling, but it is example scope. The interleaved-thinking example defines ReasoningTrace, ThinkingBlock, and ToolCall records, captures model thinking and tool results, analyzes failure patterns, optimizes prompts, saves iteration artifacts, and generates SKILL.md files with reference data (examples/interleaved-thinking/reasoning_trace_optimizer/models.py, examples/interleaved-thinking/reasoning_trace_optimizer/capture.py, examples/interleaved-thinking/reasoning_trace_optimizer/skill_generator.py). The review treats this as real trace-learning machinery inside the repo, but not as the marketplace's normal activation path.

Artifact analysis

Storage substrate: files - The durable surfaces are repository files: skill Markdown, plugin manifests, researcher JSON/JSONL/Markdown records, benchmark fixtures/results, workflow definitions, example generated skills, and optional local runtime queues/runs (README.md, researcher/README.md).
Representational form: natural-language symbolic - Skill bodies, run threads, runbooks, rubrics, insights, and generated skills are natural-language; manifests, JSONL registries, run-state JSON, claim IDs, activation fixtures, benchmark goldens, scripts, and CI workflows are symbolic surfaces that validators and routers consume.
Lineage: authored imported trace-extracted - The published skill corpus is mostly authored and research-imported; researcher runs import external evidence into proposals/mechanisms/claims; the interleaved-thinking example can derive optimized prompts and generated skills from captured reasoning/tool traces.
Behavioral authority: knowledge instruction routing validation ranking learning - Skill instructions direct agents when loaded; frontmatter descriptions and manifests route activation; researcher validators and CI enforce structure; novelty checks and benchmark reports rank or compare candidates; mechanism and claim registries preserve learned corpus decisions as future guidance.

Skill packages. Storage substrate: skills/<name>/SKILL.md plus any local references/scripts. Representational form: natural-language instructions with symbolic frontmatter. Lineage: authored and later revised from research, benchmarks, and mechanism decisions. Behavioral authority: instruction for the activated agent, routing through descriptions, and knowledge/reference through examples and external citations. Effective downstream obedience is host- and model-dependent; the repository measures routing and structural health, not every task outcome.

Plugin manifests and root collection skill. Storage substrate: .claude-plugin/marketplace.json, .plugin/plugin.json, and root SKILL.md. Representational form: symbolic plugin metadata plus natural-language skill map. Lineage: authored package metadata. Behavioral authority: routing and installation authority because these files tell hosts which skill directories exist and how to describe the bundle.

Researcher OS records. Storage substrate: researcher/ files, including run directories, queues, rubrics, mechanism ledgers, claims, corpus indexes, reports, and templates. Representational form: natural-language content plus symbolic JSON/JSONL state. Lineage: imported source evidence, authored evaluations/proposals, generated validator reports, and human-reviewed promotions. Behavioral authority: knowledge for maintainers, validation for publish readiness, and learning because accepted/rejected mechanisms become institutional memory for later proposals.

Deterministic validators and CI. Storage substrate: Python scripts under researcher/scripts/ and .github/workflows/validate.yml. Representational form: symbolic executable checks. Lineage: authored harness logic. Behavioral authority: validation and enforcement over skill shape, manifest sync, mechanism registry consistency, claim provenance, activation fixtures, and benchmark scenarios (researcher/scripts/validate_repo.py, .github/workflows/validate.yml).

Benchmark and router artifacts. Storage substrate: router prompts, prompt fixtures, result reports, SDK runner code, and history JSONL under researcher/benchmarks/ and researcher/reports/. Representational form: symbolic fixtures/results plus natural-language reports. Lineage: authored fixtures and generated benchmark outputs. Behavioral authority: ranking/evaluation evidence used to rewrite activation descriptions and harden the corpus; the README records measured top-1/top-3 outcomes and deltas (README.md, researcher/benchmarks/router/README.md).

Trace-extracted example artifacts. Storage substrate: examples/interleaved-thinking/optimization_artifacts/, generated skill directories, and optional runtime output directories. Representational form: natural-language traces/reports, symbolic dataclass objects and JSON references, and generated natural-language skill files. Lineage: trace-extracted from reasoning blocks, tool calls, analyses, optimizations, and loop iterations. Behavioral authority: learning when converted into generated skills or optimized prompts; otherwise evidence/knowledge for debugging.

The system has a promotion path, but it is governance-first. Research evidence can become run artifacts, a skill proposal, a novelty result, validated readiness, promoted mechanism entries, claim provenance, corpus-index updates, and eventually revised skill instructions. The interleaved-thinking example has a separate trace-to-skill promotion path. Neither path automatically makes generated content high-authority without validation or human review.

Comparison with Our System

Agent Skills for Context Engineering and Commonplace share a file-first premise: durable knowledge should be inspectable, diffable, and searchable without a hosted service. Both also treat methodology as an accumulated corpus rather than as application-specific memory. The closest Commonplace analogue to this repository is not one review note; it is the combination of collection contracts, skills, validation commands, review gates, source snapshots, and generated indexes.

The main divergence is type discipline. Commonplace gives each durable artifact an explicit collection/type contract and uses links, schemas, and validation to constrain interpretation. Agent Skills relies more on a plugin/skill convention: every skill has frontmatter and required sections, and validate_repo.py enforces corpus health, but the content model is flatter than Commonplace's note/source/review/reference/instruction split.

The researcher OS is the strongest alignment. It records source provenance, proposed deltas, accepted/rejected mechanisms, claim volatility, run state, readiness checks, and CI-backed validation. Commonplace has similar instincts in source snapshots, review runs, gates, and archival replacement; this repo packages those instincts around one specific corpus: improving agent skills.

The read-back story is different. Commonplace agents deliberately pull from rg, indexes, links, and skills inside a repo-local KB. Agent Skills depends on the host skill router to select static skill instructions, while the accumulated researcher memory is mostly pull-oriented: maintainers or scripts inspect registries, runs, reports, and benchmarks when improving the corpus.

Borrowable Ideas

Mechanism registry as the unit of learned behavior. Commonplace could borrow a JSONL registry for accepted/rejected operational mechanisms extracted from notes, reviews, and incidents. Ready only when there is a concrete consumer, such as review triage or a note-writing gate; otherwise it risks duplicating notes.

Separate corpus-health validation from run-readiness validation. The split between validate_repo.py and validate_run.py is directly useful. Commonplace already has global validation and review gates; a clearer per-workshop or per-ingest readiness validator would reduce "globally valid but locally unfinished" ambiguity. Ready now for workflows that produce run directories.

Router benchmarks for activation descriptions. The benchmark loop treats skill descriptions as a measurable routing surface, not natural-language taste. Commonplace skills and instructions could use fixture prompts to catch boundary drift. Ready when there are enough skills with ambiguous activation boundaries.

Claim provenance for volatile numeric claims. researcher/claims/index.jsonl is a small, practical way to keep benchmark/vendor claims from rotting in prose. Commonplace could apply this to external-system reviews and comparative notes. Ready now for high-volatility claims.

Do not borrow the flat marketplace as the whole KB model. The plugin shape is good for distribution, but Commonplace should keep stronger artifact classes and authority boundaries than "all useful guidance is a skill."

Write side

Write agency: manual automatic - The skill corpus itself changes through authored repository edits and human-reviewed promotions; the researcher scripts automatically create run scaffolds, update run state, fetch sources when allowed, write validation/novelty/readiness reports, and append promotion ledger entries. The interleaved-thinking example automatically writes optimization artifacts and generated skill files when its loop/generator is run.

Curation operations: synthesize promote - The implemented automatic paths synthesize new run/proposal/report/generated-skill artifacts from sources or traces, and promote reviewed mechanism proposals into the registry/ledgers. I found deterministic overlap checks and reviewer gates, but not automatic deduplication, decay, or invalidation of existing skills as a standing maintenance process.

Trace-learning

Trace source: tool-traces session-logs - The trace example captures thinking blocks, tool calls, tool results, final responses, and session metadata in ReasoningTrace; the optimizer loop saves per-iteration trace, analysis, optimization, and final prompt artifacts (examples/interleaved-thinking/reasoning_trace_optimizer/models.py, examples/interleaved-thinking/reasoning_trace_optimizer/loop.py).

Extraction is model-judged and pattern-oriented: TraceAnalyzer prompts M2.1 to classify context degradation, tool confusion, instruction drift, hallucination, goal abandonment, missing validation, and related patterns, then PromptOptimizer and SkillGenerator turn analyses and key changes into optimized prompts or generated skills (examples/interleaved-thinking/reasoning_trace_optimizer/analyzer.py, examples/interleaved-thinking/reasoning_trace_optimizer/optimizer.py, examples/interleaved-thinking/reasoning_trace_optimizer/skill_generator.py).

Learning scope: per-task - The example optimizes for a supplied task and prompt. Generated skills can be shared later, but the extraction loop itself is task/run scoped.

Learning timing: offline staged - Traces are captured during execution, then analyzed and distilled across loop iterations before artifacts or generated skills are written.

Distilled form: natural-language symbolic - The durable outputs are optimized prompts, generated SKILL.md natural-language, JSON reference summaries, and pattern records. No parametric model update is implemented in this repository.

Survey placement: this is a trace-learning skill-generation example, not the default static skill activation path. It strengthens the survey claim that trace learning often has a raw trace stage followed by a separate distillation stage, but it should be counted as example/tooling evidence rather than proof that the shipped marketplace continuously self-improves during ordinary use.

Read-back

Read-back: pull - For retained memory that accumulates in this repository, future agents or maintainers read back mechanisms, claims, runs, reports, benchmark results, AGENTS.md preferences, and example artifacts by explicitly opening/searching files or running scripts. Static skill activation by Claude Code is baseline documentation loading and does not count as memory read-back from accumulated state.

The pull path is nevertheless agent-shaped. AGENTS.md names durable workspace facts and operating defaults for future agents, the root skill maps available skill files, the corpus index maps skills to mechanisms/claims, and validators/novelty checks consume registry files as inputs (AGENTS.md, researcher/corpus/index.json, researcher/scripts/novelty_check.py). There is no implemented repository-level hook that pushes a newly relevant mechanism or run finding into an unrelated future agent invocation before the agent or host requests it.

Curiosity Pass

The README frames static skill architecture, but the repository's most interesting memory is operational. The skill marketplace is useful, but the researcher OS has more durable learning machinery: mechanism ledgers, claims, run states, activation fixtures, and benchmark reports.

AGENTS.md is the clearest ordinary workspace memory. It records learned preferences, workspace facts, and operating defaults that future agents may load as instruction. This is simpler than the researcher OS and more directly behavior-shaping, but its update discipline is social rather than strongly validated.

The trace-learning example is more ambitious than the core marketplace. It has a true raw trace -> analysis -> prompt/skill distillation path. Because it lives under examples/, its existence should not be confused with the installed plugin's normal behavior.

The continuous loop is intentionally conservative. loop_step.py parks runs needing evaluation or model/human action. That limits autonomy, but it prevents the common failure mode where a self-improvement loop quietly weakens its own evaluators.

Router quality is treated as a measurable artifact. The benchmark results in the README and router benchmark directory are a rare example of skill-description quality being tested as routing behavior rather than edited by intuition.

What to Watch

Whether the interleaved-thinking trace optimizer is promoted from example code into the main researcher OS or marketplace workflow; that would make trace-learning central rather than ancillary.
Whether researcher/scripts/loop_step.py gains a cost-gated judge adapter for retrieved -> evaluated or proposed -> novelty_checked; that would change the write agency from conservative staging to stronger automatic curation.
Whether generated skills are required to pass the same validate_repo.py, mechanism, claim, corpus-index, and activation-fixture gates before inclusion; that determines whether trace-extracted skills become governed system-definition artifacts or just generated natural-language content.
Whether the plugin host exposes activation logs or feedback to the repository; that would create a real read/write loop from skill use back into skill routing.
Whether benchmark history begins to drive automatic description rewrites or promotion decisions; that would add automatic ranking-to-instruction feedback.

Relevant Notes:

Knowledge storage does not imply contextual activation - frames why static skill activation is not counted as memory read-back from accumulated state.
Axes of artifact analysis - applies to the split between skills, manifests, mechanisms, claims, run states, validators, and trace-extracted generated skills.
Knowledge artifact - classifies source evidence, reports, traces, and benchmark results when they advise later maintainers.
System-definition artifact - classifies skills, manifests, validators, activation fixtures, and promoted mechanisms when they route or constrain future agents.
Lineage - is central to the repository's source-to-skill workflow, claim provenance, ledgers, and trace-extracted example.
Context engineering - names the design problem this skill marketplace and researcher OS both target.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search