Skill Creator Comparison: Claude Code vs Codex
Question
Initial hypothesis:
- OpenAI/Codex treats skills as functional references: matter-of-fact technical artifacts for another Codex instance.
- Claude Code treats skills as approaches to classes of problems.
Related claim to test against the KB: skill creation is a distillation process, but the source material is broader than notes — it also includes user input, experiments, failures, and product/runtime constraints.
Source packet
Downloaded/copied on 2026-03-23:
- `sources/codex-skill-creator/` — copied from the local Codex system skill at `/home/zby/.codex/skills/.system/skill-creator`
- `sources/claude-code-skill-creator/` — copied from https://github.com/anthropics/skills, path `skills/skill-creator/`
- Theory note: distillation
Quick inventory:
| Artifact | Codex | Claude Code |
|---|---|---|
| Files in skill folder | 9 | 18 |
| `SKILL.md` lines | 416 | 485 |
| Extra agent instructions | `agents/openai.yaml` only | `agents/grader.md`, `agents/comparator.md`, `agents/analyzer.md` |
| Scripts | `init_skill.py`, `generate_openai_yaml.py`, `quick_validate.py` | eval runner, benchmark aggregation, description improver, packaging, viewer generation |
| Primary extra reference | `references/openai_yaml.md` | `references/schemas.md` |
The folder shapes already suggest the main difference: Codex packages skill construction and registration; Claude packages skill construction plus a full evaluation loop.
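Rendered as partial directory trees, the shapes look roughly like this. The layout is inferred from the inventory above; neither packet's full file list is reproduced, and the Claude script names are paraphrased rather than copied:

```
sources/codex-skill-creator/
├── SKILL.md                      (416 lines)
├── agents/openai.yaml
├── references/openai_yaml.md
└── scripts/
    ├── init_skill.py
    ├── generate_openai_yaml.py
    └── quick_validate.py

sources/claude-code-skill-creator/
├── SKILL.md                      (485 lines)
├── agents/
│   ├── grader.md
│   ├── comparator.md
│   └── analyzer.md
├── references/schemas.md
└── scripts/                      (eval runner, benchmark aggregation,
                                   description improver, packaging,
                                   viewer generation)
```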
Shared substrate
The two meta-skills agree on the core ontology of what a skill is.
- Both treat the `description` field as the primary trigger surface.
- Both use progressive disclosure: metadata always loaded, `SKILL.md` on trigger, bundled resources on demand.
- Both organize bundled resources into `scripts/`, `references/`, and `assets/`.
- Both explicitly warn against bloated instructions and prefer explaining why rather than relying on rigid all-caps rules.
The difference is not that one system treats skills as knowledge and the other doesn't. Both treat skills as compressed operational knowledge for a bounded-context agent.
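As a concrete anchor, the always-loaded metadata layer in both systems is `SKILL.md` frontmatter along these lines (assuming both follow the same frontmatter convention; the skill name and description below are invented for illustration):

```markdown
---
name: pdf-form-filler
description: Fills and flattens PDF forms. Use when the user asks to
  complete, sign, or extract fields from a PDF form.
---

# PDF Form Filler
...instructions loaded only when the skill triggers...
```

The `description` doubles as the trigger surface: it is the only part of the body the agent sees before deciding whether to load the rest.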
What Codex Distills
The Codex skill-creator is best read as a distillation of artifact construction conventions.
Its process is:
1. Understand the skill with concrete examples.
2. Plan reusable contents.
3. Initialize the folder with `init_skill.py`.
4. Edit the skill and bundled resources.
5. Validate with `quick_validate.py` (a sketch of plausible checks follows this list).
6. Iterate from real usage and optional forward-testing.
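To make the validation step concrete, here is a minimal sketch of the kind of structural checks a script like `quick_validate.py` plausibly performs. The check list is an assumption drawn from the shared anatomy described above, not a transcription of the real script:

```python
"""Hypothetical skill validator (assumed checks, not quick_validate.py's)."""
from pathlib import Path

import yaml  # PyYAML


def validate_skill(folder: Path) -> list[str]:
    """Return a list of problems; empty means the folder looks well-formed."""
    errors: list[str] = []

    skill_md = folder / "SKILL.md"
    if not skill_md.exists():
        return ["missing SKILL.md"]

    text = skill_md.read_text()
    if not text.startswith("---"):
        return ["SKILL.md has no YAML frontmatter"]

    # Frontmatter is the always-loaded layer, so it carries the trigger surface.
    meta = yaml.safe_load(text.split("---")[1]) or {}
    for required in ("name", "description"):
        if not meta.get(required):
            errors.append(f"frontmatter is missing '{required}'")

    # Bundled resources should follow the scripts/references/assets convention.
    for sub in folder.iterdir():
        if sub.is_dir() and sub.name not in {"scripts", "references", "assets", "agents"}:
            errors.append(f"unexpected subfolder: {sub.name}")

    return errors
```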
What this skill mostly teaches is:
- how to shape the skill folder
- what kinds of reusable resources belong in it
- how to keep the prompt lean
- how to generate product-specific metadata (`agents/openai.yaml`)
- where the skill should live so Codex can discover it
The methodology is aimed at producing a well-formed reusable artifact — not at running a measurement-heavy optimization loop.
The clearest product-specific signal is the supporting reference `sources/codex-skill-creator/references/openai_yaml.md`: the extra material covers UI metadata, dependency declarations, and invocation policy. The meta-skill is concerned with how a skill is packaged into the Codex product surface.
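A hedged illustration of what such packaging metadata could look like. Every key below is invented to match the three categories the reference covers; the real schema lives in `references/openai_yaml.md`:

```yaml
# Hypothetical agents/openai.yaml -- illustrative keys, not the real schema.
display_name: My Skill            # UI metadata
short_summary: One-line blurb shown in the product surface
dependencies:                     # dependency declarations
  - python3
  - pandoc
invocation:                       # invocation policy
  mode: auto                      # e.g. auto-trigger vs. explicit user opt-in
```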
What Claude Code Distills
The Claude Code skill-creator is best read as a distillation of iterative skill experimentation.
Its center of gravity is not folder initialization. It is the loop (a code sketch of its measurement core follows the list):
- capture intent
- interview and research
- write the draft
- create test prompts
- run with-skill and baseline executions
- draft assertions while runs are executing
- grade, benchmark, and review outputs
- revise
- optimize triggering with should-trigger / should-not-trigger evals
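A compressed sketch of that measurement core, with `run_task` and `grade_output` as hypothetical stand-ins for what the packet actually implements as separate scripts and agent instructions:

```python
"""Hypothetical with-skill vs. baseline benchmark loop (illustrative only)."""
from typing import Callable


def benchmark(
    prompts: list[str],
    run_task: Callable[[str, str | None], str],   # (prompt, skill) -> output
    grade_output: Callable[[str], float],         # output -> score in [0, 1]
    skill: str | None,
) -> tuple[list[dict], float]:
    """Run each prompt with and without the skill, grade both, report the delta."""
    results = []
    for prompt in prompts:
        with_skill = run_task(prompt, skill)
        baseline = run_task(prompt, None)
        results.append({
            "prompt": prompt,
            "with_skill": grade_output(with_skill),
            "baseline": grade_output(baseline),
        })
    # The benchmark delta is the headline number: did the skill actually help?
    delta = sum(r["with_skill"] - r["baseline"] for r in results) / len(results)
    return results, delta
```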
This is why the Claude packet contains far more machinery:
- grader, comparator, and analyzer agent instructions
- schemas for eval artifacts
- benchmark aggregation scripts
- trigger-description optimization scripts
- an HTML review UI
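To fix intuitions about what `references/schemas.md` standardizes, here is a hypothetical shape for a single eval record. The field names are guesses, not the packet's actual schema:

```python
"""Hypothetical eval-record shape (field names are invented)."""
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    prompt: str                                   # the test prompt
    skill_used: bool                              # with-skill or baseline run
    output: str                                   # raw agent output
    assertions: list[str] = field(default_factory=list)  # drafted during the run
    grade: float | None = None                    # filled in by the grader agent
    analyst_notes: str = ""                       # comparator/analyzer commentary
```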
The distinction is not that Claude's version is "more detailed." It distills a different source: not just "how to write a skill," but "how to learn whether the skill is actually helping."
That source is what the initial hypothesis gestured at with "approaches to classes of problems." The approach encoded here: co-develop with the user, compare against baselines, inspect outputs, quantify what you can, keep humans in the loop, and re-distill from evidence.
Comparison
The initial framing is directionally right but too coarse.
More precise:
- Codex `skill-creator` distills how to build a reusable skill artifact for another Codex instance.
- Claude Code `skill-creator` distills how to iteratively discover, test, and improve a skill with user feedback and benchmarks.
So the split is not:
- Codex = technical reference
- Claude = problem-solving philosophy
It is closer to:
- Codex = artifact-construction distillation
- Claude = experimentation-and-evaluation distillation
Both are functional. Claude's function is broader: it treats skill creation as an empirical workshop process rather than mainly a packaging task.
Distillation Implications
This comparison strengthens the KB claim that skill creation is a distillation process, but forces a refinement.
The source for skill creation is not just methodology notes. It is a mixed substrate:
- permanent guidance about skill anatomy
- product-specific constraints and metadata formats
- user intent and trigger phrasings
- concrete example tasks
- repeated work noticed across runs
- observed failures and regressions
- benchmark results and blind comparisons
- runtime constraints of the host environment
This is still distillation in the KB sense: compress knowledge so a bounded consumer can act. But the source is broader than "notes → skill":
workshop evidence + product constraints + user language + prior methodology → skill
Anthropic's meta-skill makes this explicit — the skill is repeatedly re-distilled from new evidence. Codex's meta-skill acknowledges the same pattern in lighter form through iteration and forward-testing, but does not formalize evidence collection as heavily.
Provisional claim
Skill creation is not a single distillation from methodology into instructions. It is a workshop process of repeated re-distillation from a mixed evidence base, where the final skill is one output and the eval harness / trigger examples / helper scripts are additional distillates.
This suggests a provisional, non-exhaustive two-axis model:
| Axis | Question |
|---|---|
| Artifact distillation | What reusable knowledge/resources should live inside the skill? |
| Evaluation distillation | What evidence from tests, failures, and user review should survive into the next version? |
Codex's skill-creator is stronger on the first axis. Claude Code's is stronger on the second.
A third routing/interface axis may be needed for a full framework: packaging metadata, invocation policy, and trigger optimization are adjacent to artifact distillation but not identical to it.
Implications for Commonplace
- The current note *skills derive from methodology through distillation* is directionally right and already allows artifact-sourced skills, but it does not treat evaluation traces, trigger tuning, and product constraints as equally central source material.
- A stronger formulation: skills can derive from methodology notes or directly from workshop evidence; the common operation is distillation from a larger operational substrate.
- Skill creation itself looks like a missing workshop template. A good workshop packet would likely include (one possible layout is sketched after this list):
- source prompts/examples
- trigger hypotheses
- eval prompts
- benchmark deltas
- failure cases
- repeated helper-script candidates
- final extracted skill
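One possible packet layout, invented here to make the list concrete:

```
workshop/<skill-name>/
├── sources/               # source prompts and examples
├── triggers.md            # trigger hypotheses
├── evals/
│   ├── prompts/           # eval prompts
│   └── deltas.md          # benchmark deltas
├── failures/              # failure cases
├── helpers/               # repeated helper-script candidates
└── skill/                 # final extracted skill (SKILL.md + resources)
```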
Open questions
- How much of this difference is product philosophy versus product maturity?
- Is Codex under-specifying evaluation, or deliberately separating evaluation from skill creation?
- Should a Commonplace skill-creation workshop explicitly model both axes: artifact distillation and evaluation distillation?
- Does a mature skill always need an adjacent workshop history, even if only the final `SKILL.md` is shipped?