---
name: extend-skill-builder
description: "Shape a skill before skill-creator builds it. Discovery-first: observe how the model already behaves, then write only the delta. Use when starting a new skill or rewriting one from scratch."
---

# Extend Skill Builder

skill-creator builds, iterates, evaluates, and packages skills. This skill runs *before* that — it does discovery, classification, and scope-setting, then hands off a well-shaped draft to skill-creator.

**The doctrine:** A skill contains only what the model couldn't figure out on its own. The skill is a residual — what's left after subtracting what the unskilled model already does correctly. Length isn't a target; necessity is.

## When to stop and hand off

After step 5 below, run `skill-creator` for eval iteration, description optimization, and packaging. This skill does not perform those stages; it produces their inputs.

**Skip skill-creator entirely** when:
- Classification is **Workflow** or **Well-known domain** with no objectively gradeable output. The eval loop adds nothing; ship the draft.

Run skill-creator when:
- Classification is **API wrapper** or **Simple utility** with verifiable outputs, or any skill where you want quantitative validation and trigger optimization.

---

## Step 1: Observe baseline behavior

Before drafting, run the target task through the model with no skill loaded. Spawn a fresh subagent (or use skill-creator's baseline spawn) with 3–5 realistic inputs spanning:
- **Expected outcome:** expected-pass, expected-fail, ambiguous
- **Reasoning load:** straightforward, moderate, complex

Save outputs. For each one, label: what was right, what was wrong, what was missing.

**This is the most important step.** Everything in the final skill must trace back to a specific gap observed here. If you skip baseline observation, you'll write a skill that restates what the model already does — pure token waste.
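
One way to keep the labels traceable is to store each baseline run as a small record. This is a minimal sketch, not a required format: the `BaselineObservation` class, its field names, and the sample labels are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BaselineObservation:
    """One unskilled run, labeled by hand after reading the output."""
    input_id: str
    prompt: str
    output: str
    right: list[str] = field(default_factory=list)
    wrong: list[str] = field(default_factory=list)
    missing: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        # Only "wrong" and "missing" labels justify text in the skill.
        return self.wrong + self.missing


# Hypothetical labeled runs; real entries hold the saved raw output.
observations = [
    BaselineObservation(
        input_id="simple-pass",
        prompt="(hypothetical prompt)",
        output="(raw model output saved verbatim)",
        right=["picked the correct tool"],
        missing=["did not state the branch name"],
    ),
    BaselineObservation(
        input_id="ambiguous",
        prompt="(hypothetical prompt)",
        output="(raw model output saved verbatim)",
        right=["asked a clarifying question"],
    ),
]

# Gaps per input: the only material the skill is allowed to address.
all_gaps = {o.input_id: o.gaps() for o in observations if o.gaps()}
notes_json = json.dumps([asdict(o) for o in observations], indent=2)
print(all_gaps)  # {'simple-pass': ['did not state the branch name']}
```

An input with no gaps drops out of `all_gaps`, which is the signal that the skill needs no text for it.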

No spawn available? Interview the user: *"What would make you delete this tool?"* Each failure implies a criterion. Then flip: ask for 3–5 positive examples. Derive criteria from examples, never from abstract definitions.

## Step 2: Classify

The classification determines architecture. skill-creator treats all skills uniformly; we don't.

| Type | Definition | Architecture |
|------|-----------|--------------|
| **Workflow** | Multi-step, multi-tool orchestration | Numbered steps composing existing tools inline. No scripts. |
| **Well-known domain** | gh, aws, kubectl, docker, terraform | Body says "use [tool]" plus org-specific tips and gotchas. No scripts. |
| **API wrapper** | REST, CRUD over an external service | Single CLI with Unix subcommands + transport script. JSON stdout, stderr errors, --help. |
| **Simple utility** | Single resource, simple HTTP, deterministic | Inline curl in body. No scripts. |

**Rule:** If a well-known CLI covers 80%+ of the use case, classify as well-known domain regardless of surface complexity.
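
For the **API wrapper** row, the CLI shape might look like the sketch below. It is illustrative only: the `example-api` program name, the `get` subcommand, and the stubbed `fetch_item` transport are hypothetical, and a real skill would make the HTTP call in a separate transport script.

```python
import argparse
import json
import sys


def fetch_item(item_id: str) -> dict:
    # Hypothetical transport stub; a real wrapper would call the service here.
    if not item_id.isdigit():
        raise ValueError(f"item id must be numeric, got {item_id!r}")
    return {"id": int(item_id), "status": "ok"}


def build_parser() -> argparse.ArgumentParser:
    # argparse generates --help for the program and for each subcommand.
    parser = argparse.ArgumentParser(
        prog="example-api", description="Hypothetical API wrapper"
    )
    sub = parser.add_subparsers(dest="command", required=True)
    get = sub.add_parser("get", help="fetch one item by id")
    get.add_argument("item_id")
    return parser


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    try:
        result = fetch_item(args.item_id)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)  # errors go to stderr
        return 1
    print(json.dumps(result))  # machine-readable result on stdout
    return 0


exit_ok = main(["get", "42"])    # prints {"id": 42, "status": "ok"}
exit_bad = main(["get", "abc"])  # prints an error line to stderr
```

Keeping stdout as pure JSON and returning a non-zero exit code on failure lets the model pipe and branch on results the way it would with any Unix tool.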

## Step 3: Extract acceptance criteria

From the labeled outputs in step 1, identify the minimum bar per dimension. These become binary expectations — one plain-language statement per distinguishable failure mode.

- Hypothesis-killer first: the expectation most likely to fail if the skill is wrong.
- Stop when additional expectations wouldn't change your confidence in whether to ship. Not at a target count.
- Conflicts: classify as hard (non-negotiable) or soft (trade-off acceptable).

**Hard gate (verify red):** Run each expectation against the raw step-1 output. Every expectation must fail on the unskilled baseline. An expectation that has never been seen failing has not shown it can detect the gap: rewrite or delete it before continuing.
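
The verify-red gate can be sketched as a filter over the baseline outputs. This assumes expectations are expressed as Python predicates and that "fails on the baseline" means failing on at least one saved output. Both are assumptions of this sketch, and the sample expectations are hypothetical.

```python
from typing import Callable

Expectation = Callable[[str], bool]  # returns True when the output passes


def verify_red(
    expectations: dict[str, Expectation], baseline_outputs: list[str]
) -> list[str]:
    """Return names of expectations that never failed on any baseline output."""
    never_red = []
    for name, check in expectations.items():
        if all(check(out) for out in baseline_outputs):
            never_red.append(name)
    return never_red


# Hypothetical step-1 outputs and expectations.
baseline_outputs = ["Opened the PR.", "Done, see the branch."]
expectations = {
    "mentions_pr_url": lambda out: "https://" in out,
    "is_non_empty": lambda out: bool(out.strip()),
}

green_on_baseline = verify_red(expectations, baseline_outputs)
print(green_on_baseline)  # ['is_non_empty']
```

Anything the function returns is a candidate for rewriting or deletion, because the unskilled model already satisfies it.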

**Gaming test:** For each expectation, describe a wrong implementation that passes. If you can, the expectation is gameable. *Example: "Output contains a PR URL" is gameable by hardcoding a URL — fix by asserting the URL differs across inputs.*
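
The PR URL example can be made concrete. In this sketch the URL pattern, the sample outputs, and the assumption that distinct inputs should produce distinct PRs are all hypothetical.

```python
import re

# Hypothetical shape of a PR URL; adjust to the real host.
URL_PATTERN = re.compile(r"https://\S+/pull/\d+")


def contains_pr_url(output: str) -> bool:
    # Gameable: a hardcoded URL satisfies this on every input.
    return URL_PATTERN.search(output) is not None


def urls_differ_across_inputs(outputs: list[str]) -> bool:
    # Hardened: every output has a URL, and no two outputs share one.
    urls = [m.group(0) for out in outputs if (m := URL_PATTERN.search(out))]
    return len(urls) == len(outputs) and len(set(urls)) == len(urls)


hardcoded = ["See https://example.com/org/repo/pull/1"] * 3
honest = [f"See https://example.com/org/repo/pull/{n}" for n in (11, 12, 13)]

print(all(contains_pr_url(o) for o in hardcoded))  # True: the weak check is fooled
print(urls_differ_across_inputs(hardcoded))        # False
print(urls_differ_across_inputs(honest))           # True
```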

Save the surviving expectations to `evals/evals.json` in skill-creator format — they become the spec skill-creator validates against.

## Step 4: Draft the skill as a residual

Write only what the baseline observation showed was missing. The draft has three legitimate ingredients:

1. **Outcomes** — what success looks like, in terms the model can recognize
2. **Why** — the reasoning behind any non-obvious constraint
3. **Guardrails** — gotchas that only surface through observed failure

It does NOT contain:
- CLI syntax for tools the model already knows
- Step-by-step procedures for well-known operations
- Defensive instructions for failures that didn't happen in step 1
- Restated defaults (JSON formatting, error handling, politeness)

**Frontmatter:** Only `name` and `description` are required. Description is the primary trigger — make it specific about *when* to use, not just *what*. Add `argument-hint`, `allowed-tools`, or `disable-model-invocation` only when requirements demand them.

**Length ceiling:** Keep the SKILL.md body under ~200 lines. This is a cap, not a goal; necessity still decides what stays. If you're approaching the cap, extract domain-specific detail into `references/<topic>.md` and link from the body with a one-line pointer.

**Output location:** `~/.claude/skills/<skill-name>/` (personal) by default. Use `.claude/skills/<skill-name>/` (project) only when the skill is tightly coupled to a specific repo. Never write to a bare relative path — it resolves against the plugin cache.
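
A small helper can guarantee the write path is always absolute. This is a sketch under the directory layout described above; the `skill_dir` function and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def skill_dir(skill_name: str, project_root: Optional[Path] = None) -> Path:
    """Return an absolute directory for the skill, never a bare relative path."""
    if project_root is not None:
        # Project scope: anchor to the resolved repo root.
        base = project_root.resolve() / ".claude" / "skills"
    else:
        # Personal scope (default): anchor to the home directory.
        base = Path.home() / ".claude" / "skills"
    return base / skill_name


personal = skill_dir("demo-skill")
project = skill_dir("demo-skill", project_root=Path("."))
```

Resolving `project_root` up front means even a relative argument such as `Path(".")` yields an absolute result.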

## Step 5: Iterate the draft against the criteria

Run all step-3 expectations against the drafted skill. Note HOW each failure occurs — wrong format, wrong content, scope drift, tone. The failure mode tells you what to add.

Fix the highest-impact failure with the minimum delta: one sentence, one example, one constraint. Rerun ALL expectations — new instructions can break passing ones.

Most skills converge in 1–2 rounds. Past round 3, the criteria are probably wrong — go back to step 3.

**Compression pass when green:** Delete each instruction one at a time. Rerun expectations. Still pass? Keep it deleted. Simplify paragraphs to sentences. Every surviving token must earn its cost.
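
The compression pass is a greedy ablation loop. In the sketch below, `passes_all` stands in for rerunning every step-3 expectation against a draft built from the given instructions. The stub implementation and the sample instructions are hypothetical.

```python
from typing import Callable


def compress(
    instructions: list[str], passes_all: Callable[[list[str]], bool]
) -> list[str]:
    """Drop each instruction in turn; keep the drop if expectations stay green."""
    kept = list(instructions)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if passes_all(trial):
            kept = trial  # still green without it: keep it deleted
        else:
            i += 1        # removal broke something: restore and move on
    return kept


def passes_all(draft: list[str]) -> bool:
    # Hypothetical stub: pretend only one instruction is load-bearing.
    return "state the branch name" in draft


draft = ["use JSON output", "state the branch name", "be polite"]
print(compress(draft, passes_all))  # ['state the branch name']
```

Each trial reruns the full expectation set, so the cost grows with the number of instructions; that is another reason to keep the draft short before this pass.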

## Step 6: Hand off to skill-creator

Provide:
- The drafted `SKILL.md`
- `evals/evals.json` from step 3
- Baseline observation notes from step 1 (so skill-creator's baseline runs are calibrated)

Tell the user the literal next command. skill-creator will run the full eval loop (with-skill vs. baseline subagent runs, grader, viewer, iteration), then the description optimization loop, then packaging.

**Don't treat your own runs as validation.** You wrote both the skill and the criteria, so your step-5 results show only that the draft converged, not that it works. The point of handing off to skill-creator is that an independent agent runs the skill against the criteria you set.

---

## The two levers

When the skill fails an expectation, there are exactly two causes:

1. **The expectation is wrong** — you're measuring the wrong thing. Fix the expectation.
2. **The skill is wrong** — the expectation is right but the output isn't passing. Fix the skill.

Always ask "is the expectation right?" first, especially in early rounds when expectations are least validated. Seeing real output is often the first time you realize what you actually wanted.

## Where we diverge from skill-creator

| Topic | skill-creator | extend-skill-builder | Why |
|-------|---------------|----------------------|-----|
| Order | Draft → test → iterate | Observe → classify → criteria → draft → iterate → hand off | Skills should be residuals, not first drafts |
| Length | ~500 lines | ~200 lines | Progressive disclosure; extract to references/ |
| Scripts | No conventions | Unix-style CLIs: subcommands, JSON stdout, stderr, --help | Design for LLM priors |
| Scope philosophy | Add what the model needs | Subtract what it already does | Necessity over completeness |
| Treatment of skill types | Uniform | Classified upfront | Architecture follows from type |