---
name: experiment-set-design
description: Use when designing experiments to test whether a Claude Code skill is effective, or when planning how to validate a new or improved skill
---

# Experiment Set Design

## Core Principle

Compliance doesn't prove quality -- a skill can be followed to the letter and still make output worse. Quality requires blind comparison against a baseline.

## The Baseline Requirement

**NON-NEGOTIABLE:** Every experiment needs a baseline -- same prompt, same task, no skill loaded. No baseline = invalid experiment.
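
As a sketch of what "same prompt, same task" looks like in practice (the manifest structure and field names here are hypothetical, not a real harness format):

```python
# Hypothetical experiment manifest: every task runs twice with an
# identical prompt -- once with the skill loaded, once without.
EXPERIMENT = {
    "skill": "experiment-set-design",
    "skill_version": "a1b2c3d",  # git hash of the skill under test
    "tasks": ["cli-dedupe", "rate-limiter", "csv-pipeline"],
    "arms": [
        {"name": "treatment", "skill_loaded": True},
        {"name": "baseline", "skill_loaded": False},  # same prompt, no skill
    ],
}
```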

## Task Selection Checklist

Each task must satisfy ALL of:

- [ ] Skill's guidance should produce **different behavior** on this task
- [ ] Complex enough that **architecture decisions matter** (not a one-liner)
- [ ] **Multiple valid approaches** exist (skill's approach is distinguishable)
- [ ] Covers a **different domain or language** than other tasks in the set

Aim for 3-5 tasks per phase. Start small, add tasks when overfitting is suspected.
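
For illustration only (task names and fields are invented), a task set that satisfies all four criteria might look like:

```python
# Each task varies domain and language, and is open-ended enough that
# the skill's approach is distinguishable from plausible alternatives.
TASKS = [
    {"id": "cli-dedupe", "domain": "CLI tool", "language": "Python",
     "why_it_qualifies": "several valid module layouts; skill picks one"},
    {"id": "rate-limiter", "domain": "web handler", "language": "Go",
     "why_it_qualifies": "concurrency strategy is a genuine decision"},
    {"id": "csv-pipeline", "domain": "data pipeline", "language": "Python",
     "why_it_qualifies": "streaming vs. batch are both defensible"},
]
```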

## Three-Phase Progression

```
Phase 1: COMPLIANCE        Phase 2: STRESS              Phase 3: QUALITY
"Does it follow?"          "Does it follow              "Does following help?"
                            under pressure?"
PASS/FAIL per rule         PASS/FAIL per rule           Blind review (use
                           + competing instructions     blind-skill-assessment)
                           + edge cases                 Winner per dimension
        |                          |                            |
        v                          v                            v
   All PASS? --- no --> Fix skill, re-run phase
        |
       yes
        |
        v
   Advance to next phase
```

**Phase 1 -- Compliance.** Check each rule. Binary PASS/FAIL.

**Phase 2 -- Stress.** Same checks, plus competing instructions ("just write the whole thing"), time pressure, and edge cases that conflict with the skill's rules.

**Phase 3 -- Quality.** Run each task with and without the skill. Blind-assess using `blind-skill-assessment`. If the skill wins on aggregate, advance. If the baseline wins consistently, the skill is counterproductive -- rethink or abandon it rather than forcing further iterations.
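
One way to implement the blinding (a sketch, not part of the `blind-skill-assessment` protocol itself): randomize which transcript gets each label per task, and keep the decode key sealed until all reviews are in.

```python
import random

def blind_pair(task_id: str, skill_output: str, baseline_output: str):
    """Label outputs A/B at random so reviewers can't tell which used the skill."""
    pair = [("skill", skill_output), ("baseline", baseline_output)]
    random.shuffle(pair)
    labeled = {"A": pair[0][1], "B": pair[1][1]}  # what reviewers see
    decode_key = {task_id: {"A": pair[0][0], "B": pair[1][0]}}  # sealed until after review
    return labeled, decode_key
```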

**Advancement:** All checks must PASS before advancing. On any FAIL, improve the skill and re-run the current phase.
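
The gate, as a sketch (both callables are stand-ins for however you run tasks and edit the skill):

```python
def run_to_completion(phases, run_phase, improve_skill):
    """Advance through phases only when every check in the current phase passes."""
    for phase in phases:  # e.g. ["compliance", "stress", "quality"]
        while True:
            results = run_phase(phase)  # -> {task_id: "PASS" | "FAIL"}
            if all(r == "PASS" for r in results.values()):
                break  # all PASS: advance to the next phase
            improve_skill(results)  # any FAIL: fix the skill, re-run this phase
```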

## Task Set Diversity

Vary to prevent overfitting:

- **Domain:** CLI tool, web handler, data pipeline, algorithm
- **Complexity:** Single function, multi-module, concurrent
- **Language:** At least 2 if the skill is language-agnostic

After 2+ improvement cycles on the same task set, add new tasks.

## Revision Tracking

For each run, record: skill version (git hash), tasks run, phase, results, and what changed since the last run.
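
A sketch of one way to keep that record, assuming the skill lives in a git repo and the log is append-only JSON lines:

```python
import json
import subprocess

def record_run(log_path, phase, tasks, results, changelog):
    """Append one JSON line per run, tied to the exact skill version."""
    skill_version = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],  # hash of the skill repo
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    entry = {
        "skill_version": skill_version,
        "phase": phase,
        "tasks": tasks,
        "results": results,  # e.g. {"cli-dedupe": "PASS"}
        "changed_since_last_run": changelog,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```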

## Example: HDD Skill Phases

| Phase | Tasks | Method | Result |
|-------|-------|--------|--------|
| 1: Compliance | 5 tasks (Python, Haskell, Go) | Transcript check: holes visible? One-at-a-time? | 4/5 PASS, 1 FAIL |
| 1b: Re-run | Same 5 after skill edit | Same checks | 5/5 PASS |
| 2: Stress | 5 tasks + competing instructions | Same checks under pressure | 5/5 PASS |
| 3: Quality | 5 tasks, A vs B (blinded) | Blind 3-persona review | A won 4/5 → decode: A=skill |

## Red Flags -- STOP

- Designing experiments with no baseline comparison
- All tasks at the same difficulty or in the same domain
- Jumping to Phase 3 without passing Phase 1
- Same task set for 3+ improvement cycles without adding tasks
- No record of which skill version produced which results

**If you catch yourself doing any of these: STOP. Add baselines. Vary your tasks. Track your versions.**
