Skill Creator Comparison: Claude Code vs Codex

Question

Initial hypothesis:

  • OpenAI/Codex treats skills as functional references: matter-of-fact technical artifacts for another Codex instance.
  • Claude Code treats skills as approaches to classes of problems.

Related claim to test against the KB: skill creation is a distillation process, but the source material is broader than notes — it also includes user input, experiments, failures, and product/runtime constraints.

Source packet

Downloaded/copied on 2026-03-23:

  • sources/codex-skill-creator/ — copied from the local Codex system skill at /home/zby/.codex/skills/.system/skill-creator
  • sources/claude-code-skill-creator/ — copied from https://github.com/anthropics/skills, path skills/skill-creator/
  • Theory note: distillation

Quick inventory:

| Artifact | Codex | Claude Code |
| --- | --- | --- |
| Files in skill folder | 9 | 18 |
| SKILL.md lines | 416 | 485 |
| Extra agent instructions | agents/openai.yaml only | agents/grader.md, agents/comparator.md, agents/analyzer.md |
| Scripts | init_skill.py, generate_openai_yaml.py, quick_validate.py | eval runner, benchmark aggregation, description improver, packaging, viewer generation |
| Primary extra reference | references/openai_yaml.md | references/schemas.md |

The folder shapes already suggest the main difference: Codex packages skill construction and registration; Claude packages skill construction plus a full evaluation loop.

Shared substrate

The two meta-skills agree on the core ontology of what a skill is.

  • Both treat the description field as the primary trigger surface.
  • Both use progressive disclosure: metadata always loaded, SKILL.md on trigger, bundled resources on demand (see the sketch after this list).
  • Both organize bundled resources into scripts/, references/, and assets/.
  • Both explicitly warn against bloated instructions and prefer explaining why rather than relying on rigid all-caps rules.
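To make the shared mechanics concrete, here is a minimal sketch of that progressive-disclosure contract. Every name in it is an assumption: neither meta-skill prescribes an API, and in both systems the model itself decides to trigger by reading the description, so the keyword matcher below is only a stand-in for that decision.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical API: neither meta-skill prescribes this code.

@dataclass
class SkillMetadata:
    """Layer 1: always in context, so it must stay small."""
    name: str
    description: str  # the primary trigger surface
    path: Path

def parse_frontmatter(skill_md: str) -> dict[str, str]:
    """Crude reader for 'key: value' lines between '---' fences."""
    fields: dict[str, str] = {}
    lines = skill_md.splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def load_metadata(skill_dir: Path) -> SkillMetadata:
    fields = parse_frontmatter((skill_dir / "SKILL.md").read_text())
    return SkillMetadata(fields.get("name", skill_dir.name),
                         fields.get("description", ""), skill_dir)

def maybe_load_body(request: str, skills: list[SkillMetadata]) -> str | None:
    """Layer 2: read the SKILL.md body only after a description matches.
    Layer 3 (scripts/, references/, assets/) is loaded later still, on demand."""
    words = set(request.lower().split())
    for skill in skills:
        if words & set(skill.description.lower().split()):
            return (skill.path / "SKILL.md").read_text()
    return None
```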

The difference is not that one system treats skills as knowledge and the other doesn't. Both treat skills as compressed operational knowledge for a bounded-context agent.

What Codex distills

The Codex skill-creator is best read as a distillation of artifact construction conventions.

Its process is:

  1. Understand the skill with concrete examples.
  2. Plan reusable contents.
  3. Initialize the folder with init_skill.py.
  4. Edit the skill and bundled resources.
  5. Validate with quick_validate.py (a sketch of this kind of check follows the list).
  6. Iterate from real usage and optional forward-testing.
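A sketch of the kind of structural check step 5 implies. The actual quick_validate.py is not reproduced here; every check below is an assumption about what a validator of this shape plausibly enforces, including the length threshold.

```python
from pathlib import Path

def validate_skill(skill_dir: Path) -> list[str]:
    """Illustrative checks only; not the real quick_validate.py."""
    errors = []
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.exists():
        return [f"{skill_dir.name}: missing SKILL.md"]
    text = skill_md.read_text()

    # Frontmatter must exist and carry the trigger-critical fields.
    parts = text.split("---")
    frontmatter = parts[1] if len(parts) >= 3 else ""
    for field in ("name", "description"):
        if f"{field}:" not in frontmatter:
            errors.append(f"frontmatter missing {field}")

    # Keep the on-trigger prompt lean (threshold is an assumption).
    if len(text.splitlines()) > 500:
        errors.append("SKILL.md is long; consider moving detail to references/")

    # Bundled resources should live in the conventional subfolders.
    for child in skill_dir.iterdir():
        if child.is_dir() and child.name not in {"scripts", "references", "assets", "agents"}:
            errors.append(f"unexpected folder: {child.name}/")
    return errors

if __name__ == "__main__":
    import sys
    problems = validate_skill(Path(sys.argv[1]))
    print("\n".join(problems) or "ok")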

What this skill mostly teaches is:

  • how to shape the skill folder
  • what kinds of reusable resources belong in it
  • how to keep the prompt lean
  • how to generate product-specific metadata (agents/openai.yaml)
  • where the skill should live so Codex can discover it

The methodology is aimed at producing a well-formed reusable artifact — not at running a measurement-heavy optimization loop.

The clearest product-specific signal is the supporting reference sources/codex-skill-creator/references/openai_yaml.md: the extra material covers UI metadata, dependency declarations, and invocation policy. The meta-skill is concerned with how a skill is packaged into the Codex product surface.

What Claude Code distills

The Claude Code skill-creator is best read as a distillation of iterative skill experimentation.

Its center of gravity is not folder initialization. It is the loop:

  1. Capture intent.
  2. Interview and research.
  3. Write the draft.
  4. Create test prompts.
  5. Run with-skill and baseline executions (sketched after this list).
  6. Draft assertions while runs are executing.
  7. Grade, benchmark, and review outputs.
  8. Revise.
  9. Optimize triggering with should-trigger / should-not-trigger evals.
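A minimal sketch of the with-skill-vs-baseline core of steps 5-7, including the benchmark aggregation. All names are hypothetical and the substring grader stands in for the grader agent; nothing here is taken from the actual scripts in the Claude packet.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical harness, only illustrating the shape of the loop.

@dataclass
class EvalCase:
    prompt: str
    assertions: list[str]  # drafted while the runs execute (step 6)

@dataclass
class RunPair:
    case: EvalCase
    with_skill: str   # output of a run with the skill loaded
    baseline: str     # output of the same prompt without it

def run_pair(case: EvalCase, run_agent: Callable[[str, bool], str]) -> RunPair:
    """Step 5: execute each test prompt with and without the skill."""
    return RunPair(case, run_agent(case.prompt, True), run_agent(case.prompt, False))

def grade(pair: RunPair) -> float:
    """Step 7, crudely: a grader agent would judge the assertions; a
    substring check stands in here. Returns with-skill minus baseline."""
    def score(output: str) -> float:
        hits = sum(a.lower() in output.lower() for a in pair.case.assertions)
        return hits / max(len(pair.case.assertions), 1)
    return score(pair.with_skill) - score(pair.baseline)

def benchmark(pairs: list[RunPair]) -> float:
    """Aggregate: mean per-case delta; positive means the skill is helping."""
    return mean(grade(p) for p in pairs)
```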

This is why the Claude packet contains far more machinery:

  • grader, comparator, and analyzer agent instructions
  • schemas for eval artifacts
  • benchmark aggregation scripts
  • trigger-description optimization scripts (illustrated below)
  • an HTML review UI
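What the should-trigger / should-not-trigger eval of step 9 plausibly measures, sketched with an assumed keyword matcher. The real optimizer scripts are not reproduced here, and real triggering is the model's decision on reading the description, not string matching.

```python
# Hypothetical trigger eval: precision/recall over labeled prompts.

SHOULD_TRIGGER = [
    "create a new skill for extracting tables from PDFs",
    "help me package this workflow as a reusable skill",
]
SHOULD_NOT_TRIGGER = [
    "summarize this PDF for me",
    "what skills does this job candidate have?",
]

def would_trigger(prompt: str, trigger_phrases: list[str]) -> bool:
    """Stand-in matcher for the model's trigger decision."""
    return any(phrase in prompt.lower() for phrase in trigger_phrases)

def score_description(trigger_phrases: list[str]) -> dict[str, float]:
    tp = sum(would_trigger(p, trigger_phrases) for p in SHOULD_TRIGGER)
    fp = sum(would_trigger(p, trigger_phrases) for p in SHOULD_NOT_TRIGGER)
    return {
        "recall": tp / len(SHOULD_TRIGGER),               # should-trigger coverage
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # resistance to false triggers
    }

# "skill" alone over-triggers on the job-candidate prompt; phrasing matters.
print(score_description(["skill"]))
print(score_description(["create a new skill", "package this workflow"]))
```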

The distinction is not that Claude's version is "more detailed." It distills a different source: not just "how to write a skill," but "how to learn whether the skill is actually helping."

That source is what the initial hypothesis gestured at with "approaches to problems." The approach encoded here: co-develop with the user, compare against baselines, inspect outputs, quantify what you can, keep humans in the loop, and re-distill from evidence.

Comparison

The initial framing is directionally right but too coarse.

More precise:

  • Codex skill-creator distills how to build a reusable skill artifact for another Codex instance.
  • Claude Code skill-creator distills how to iteratively discover, test, and improve a skill with user feedback and benchmarks.

So the split is not:

  • Codex = technical reference
  • Claude = problem-solving philosophy

It is closer to:

  • Codex = artifact-construction distillation
  • Claude = experimentation-and-evaluation distillation

Both are functional. Claude's function is broader: it treats skill creation as an empirical workshop process rather than mainly a packaging task.

Distillation implications

This comparison strengthens the KB claim that skill creation is a distillation process, but forces a refinement.

The source for skill creation is not just methodology notes. It is a mixed substrate:

  • permanent guidance about skill anatomy
  • product-specific constraints and metadata formats
  • user intent and trigger phrasings
  • concrete example tasks
  • repeated work noticed across runs
  • observed failures and regressions
  • benchmark results and blind comparisons
  • runtime constraints of the host environment

This is still distillation in the KB sense: compress knowledge so a bounded consumer can act. But the source is broader than "notes → skill":

workshop evidence + product constraints + user language + prior methodology → skill

Anthropic's meta-skill makes this explicit — the skill is repeatedly re-distilled from new evidence. Codex's meta-skill acknowledges the same pattern in lighter form through iteration and forward-testing, but does not formalize evidence collection as heavily.

Provisional claim

Skill creation is not a single distillation from methodology into instructions. It is a workshop process of repeated re-distillation from a mixed evidence base, where the final skill is one output and the eval harness / trigger examples / helper scripts are additional distillates.

This suggests a provisional, non-exhaustive two-axis model:

| Axis | Question |
| --- | --- |
| Artifact distillation | What reusable knowledge/resources should live inside the skill? |
| Evaluation distillation | What evidence from tests, failures, and user review should survive into the next version? |

Codex's skill-creator is stronger on the first axis. Claude Code's is stronger on the second.

A third routing/interface axis may be needed for a full framework: packaging metadata, invocation policy, and trigger optimization are adjacent to artifact distillation but not identical to it.

Implications for Commonplace

  • The current note "skills derive from methodology through distillation" is directionally right and already allows artifact-sourced skills, but it does not treat evaluation traces, trigger tuning, and product constraints as equally central source material.
  • A stronger formulation: skills can derive from methodology notes or directly from workshop evidence; the common operation is distillation from a larger operational substrate.
  • Skill creation itself looks like a missing workshop template. A good workshop packet would likely include (a hypothetical manifest follows this list):
      • source prompts/examples
      • trigger hypotheses
      • eval prompts
      • benchmark deltas
      • failure cases
      • repeated helper-script candidates
      • final extracted skill
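A hypothetical manifest for such a packet, purely to make the shape concrete. No such template exists yet in either meta-skill, and every field name below is an assumption.

```python
from dataclasses import dataclass

# Hypothetical packet shape mirroring the list above.

@dataclass
class WorkshopPacket:
    source_examples: list[str]          # prompts/tasks that motivated the skill
    trigger_hypotheses: list[str]       # phrasings expected to (not) trigger it
    eval_prompts: list[str]             # with-skill vs baseline test cases
    benchmark_deltas: dict[str, float]  # metric -> (with-skill minus baseline)
    failure_cases: list[str]            # observed regressions and misses
    helper_candidates: list[str]        # repeated work worth a scripts/ entry
    skill_md: str = ""                  # the final extracted distillate

packet = WorkshopPacket(
    source_examples=["extract tables from quarterly PDF reports"],
    trigger_hypotheses=["should: 'pull the tables out of this PDF'",
                        "should not: 'summarize this PDF'"],
    eval_prompts=["extract every table from the attached report"],
    benchmark_deltas={"assertion_pass_rate": 0.22},
    failure_cases=["multi-page tables split incorrectly"],
    helper_candidates=["PDF table to CSV converter"],
)
```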

Open questions

  • How much of this difference is product philosophy versus product maturity?
  • Is Codex under-specifying evaluation, or deliberately separating evaluation from skill creation?
  • Should a Commonplace skill-creation workshop explicitly model both axes: artifact distillation and evaluation distillation?
  • Does a mature skill always need an adjacent workshop history, even if only the final SKILL.md is shipped?