---
name: os-evolution-verifier
plugin: agent-agentic-os
description: >
  Verifies that os-architect actually causes evolution, not just words. Dispatches
  os-architect in single-shot simulation mode for a given test scenario, then checks
  for real artifact presence (new files, HANDOFF_BLOCK, plan files). Reports PASS / FAIL
  with grep evidence. Accumulates results into a test report. Use after any changes to
  os-architect, os-evolution-planner, or improvement-intake-agent.
argument-hint: "[test-scenario-file | all]"
tools: ["Bash", "Read", "Write"]
---

## Overview

After evolving os-architect or its downstream agents, you need proof that the changes
actually work. This skill dispatches os-architect in single-shot simulation mode for each
test scenario and verifies artifact presence: it does not read the transcript, it checks
that expected files exist or that expected content appears in the output.

**Evolution is verified by artifact presence, not by transcript review.**

---

## Artifact Verification Table

| Evolution Type | What to Check |
|---|---|
| Path C (Gap Fill) | `SKILL.md` present at expected path |
| Path B (Update) | `tasks/todo/-plan.md` AND `tasks/todo/copilot_prompt_.md` written |
| Path A+ (No-op) | No new files written; HANDOFF_BLOCK contains `STATUS: complete` |
| Category 3 (Lab Setup) | `improvement/run-config.json` written AND HANDOFF_BLOCK emitted |
| HANDOFF_BLOCK integrity | All 7 fields present: INTENT, TARGET, PATH, DISPATCH, STATUS, OUTPUTS, NEXT_ACTION |
| Confidence model | Low confidence prompt → clarifying question appears before Phase 2 audit |

---

## Phase 1: Resolve Test Inputs

If invoked with `all`, find test scenarios:

```bash
ls temp/os-evolution-verifier/scenarios/*.json 2>/dev/null | sort
```

If invoked with a specific file, verify that it exists and is valid JSON with the required fields:

```bash
# The path is passed as an argument so quotes or spaces in it cannot break the script.
python3 - "$SCENARIO_FILE" <<'PY'
import json, sys

with open(sys.argv[1]) as fh:
    d = json.load(fh)  # raises if the file is missing or is not valid JSON

required = ['id', 'name', 'path', 'prompt', 'expected_artifact', 'artifact_check']
missing = [f for f in required if f not in d]
if missing:
    print(f'SCHEMA ERROR: missing fields: {missing}')
    sys.exit(1)
print(f"Scenario: {d['id']} — {d['name']}")
PY
```

If no scenarios are found and no file is given, report:

> "No test scenarios found. Create scenario JSON files in `temp/os-evolution-verifier/scenarios/`
> or run the red-team-bundler to generate them from `os-architect-agent.md`."

---

## Phase 2: Dispatch os-architect (Single-Shot Simulation)

For each scenario, dispatch os-architect via Copilot CLI in simulation mode. The system prompt
is the full content of `plugins/agent-agentic-os/agents/os-architect-agent.md`. The user turn
is the scenario prompt.

```bash
# 1. Heartbeat (free model — always first)
python3 plugins/copilot-cli/scripts/run_agent.py \
  /dev/null /dev/null temp/os-evolution-verifier/heartbeat.md \
  "HEARTBEAT CHECK: Respond HEARTBEAT_OK only."

# Confirm heartbeat before dispatching
grep -q "HEARTBEAT_OK" temp/os-evolution-verifier/heartbeat.md || \
  { echo "HEARTBEAT FAILED — aborting test run"; exit 1; }

# 2. Dispatch os-architect in single-shot simulation mode
OUTPUT_FILE="temp/os-evolution-verifier/output_${SCENARIO_ID}.md"
python3 plugins/copilot-cli/scripts/run_agent.py \
  plugins/agent-agentic-os/agents/os-architect-agent.md \
  /dev/null \
  "$OUTPUT_FILE" \
  "$SCENARIO_PROMPT" \
  claude-sonnet-4.6
```

Wait for completion, then check that the output file is non-empty (expect 100+ lines for a real run):

```bash
wc -l "$OUTPUT_FILE"
```

---

## Phase 3: Artifact Verification

Run the artifact check specified in the scenario's `artifact_check` field.
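
Selecting a check from the scenario's `artifact_check` value can be sketched as a small lookup table. This is only an illustration under stated assumptions: it assumes `file_prefix` (the value used in the scenario file example) means "at least one file whose path starts with `expected_artifact` exists", and the `file_exists` check name, the function names, and the `root` parameter are all hypothetical rather than part of the skill.

```python
from pathlib import Path


def check_file_prefix(expected: str, root: Path) -> tuple[bool, str]:
    """PASS if any file under root has a path starting with `expected` (assumed meaning)."""
    prefix = root / expected
    parent = prefix.parent
    matches = []
    if parent.is_dir():
        matches = sorted(
            p for p in parent.iterdir()
            if p.is_file() and p.name.startswith(prefix.name)
        )
    if matches:
        return True, f"PASS: found {matches[0].relative_to(root)}"
    return False, f"FAIL: no file starts with {expected}"


def check_file_exists(expected: str, root: Path) -> tuple[bool, str]:
    """PASS if the exact file exists (hypothetical second check type)."""
    if (root / expected).is_file():
        return True, f"PASS: artifact found at {expected}"
    return False, f"FAIL: expected artifact not found: {expected}"


# Hypothetical registry: artifact_check value -> check function.
CHECKS = {
    "file_prefix": check_file_prefix,
    "file_exists": check_file_exists,
}


def run_artifact_check(scenario: dict, root: Path = Path(".")) -> tuple[bool, str]:
    """Dispatch on scenario['artifact_check']; unknown values fail loudly."""
    kind = scenario["artifact_check"]
    check = CHECKS.get(kind)
    if check is None:
        return False, f"FAIL: unknown artifact_check '{kind}'"
    return check(scenario["expected_artifact"], root)
```

Returning the evidence string alongside the boolean keeps the result usable as the report's evidence entry. An unknown check type fails rather than being skipped silently.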

### HANDOFF_BLOCK integrity check

```bash
# All 7 required fields must appear in output
FIELDS=("INTENT:" "TARGET:" "PATH:" "DISPATCH:" "STATUS:" "OUTPUTS:" "NEXT_ACTION:")
MISSING=()
for field in "${FIELDS[@]}"; do
  # -F treats the field name as a fixed string, not a regex
  grep -qF -- "$field" "$OUTPUT_FILE" || MISSING+=("$field")
done
if [ ${#MISSING[@]} -eq 0 ]; then
  echo "PASS: HANDOFF_BLOCK has all 7 required fields"
else
  echo "FAIL: HANDOFF_BLOCK missing: ${MISSING[*]}"
fi
```

### File existence check (Path B/C)

```bash
# Check for expected artifact files written by os-evolution-planner
EXPECTED_FILE="$ARTIFACT_PATH"
if [ -f "$EXPECTED_FILE" ]; then
  echo "PASS: Artifact found at $EXPECTED_FILE"
  wc -l "$EXPECTED_FILE"
else
  echo "FAIL: Expected artifact not found: $EXPECTED_FILE"
fi
```

### No-op check (Path A+)

```bash
# Verify STATUS: complete in HANDOFF_BLOCK and no new plan files created
grep -q "STATUS: complete" "$OUTPUT_FILE" && echo "PASS: Status is complete" || echo "FAIL: Status not complete"
# Note: -newer only matches files modified after $OUTPUT_FILE was last written
PLAN_COUNT=$(find tasks/todo -name "*.md" -newer "$OUTPUT_FILE" 2>/dev/null | wc -l)
[ "$PLAN_COUNT" -eq 0 ] && echo "PASS: No new task files written" || echo "FAIL: $PLAN_COUNT unexpected task files created"
```

### Confidence model check

```bash
# Low confidence prompt must produce a clarifying question before Phase 2
grep -q "Confidence: Low" "$OUTPUT_FILE" && echo "PASS: Confidence: Low detected" || echo "FAIL: Confidence field not Low"

# Check that Phase 2 audit was NOT started (no "Checking existing" or "audit" language before clarification)
CLARIFICATION_LINE=$(grep -n "?" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
AUDIT_LINE=$(grep -n "Checking existing\|audit\|Phase 2" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
if [ -z "$CLARIFICATION_LINE" ]; then
  # Without this guard, an output with no question and no audit would pass
  echo "FAIL: No clarifying question found"
elif [ -z "$AUDIT_LINE" ] || [ "$CLARIFICATION_LINE" -lt "$AUDIT_LINE" ]; then
  echo "PASS: Clarifying question appeared before audit"
else
  echo "FAIL: Audit started before clarifying question"
fi
```

---

## Phase 4: Record Result

Append to `temp/os-evolution-verifier/test-report.md`:

````markdown
## $SCENARIO_ID — $SCENARIO_NAME
**Status**: [PASS | FAIL]
**Path**: [A / A+ / B / C]
**Prompt**: `$SCENARIO_PROMPT`
**Artifact check**: $ARTIFACT_CHECK_COMMAND
**Evidence**:
```
[grep or file-exists output]
```
**Failure mode tested**: $FAILURE_MODE
**Time**: $ELAPSED seconds

---
````

---

## Phase 5: Summary Report

After all scenarios have run, write the summary to `temp/os-evolution-verifier/test-report.md`.
Each scenario result uses the structured EVOLUTION_VERIFICATION block:

```
## EVOLUTION_VERIFICATION
SESSION_ID: [from HANDOFF_BLOCK TARGET field or scenario id]
SESSION_COMPLETE: [true | false — false means session still in Phase 1/2, no HANDOFF_BLOCK expected]
STATUS: [complete | intentional_pause | crashed]
PATH: [A | A+ | B | C | pending]
OUTPUTS_DECLARED: [N — count of files mentioned in HANDOFF_BLOCK OUTPUTS field]
OUTPUTS_VERIFIED: [N — count that passed artifact check]
OUTPUTS_MISSING: [list of missing file paths, or "none"]
HANDOFF_BLOCK_VALID: [true | false | N/A — N/A when SESSION_COMPLETE: false]
SCAFFOLD_VALID: [true | false | N/A]
PLAN_WRITTEN: [true | false | N/A]
DISPATCH_RAN: [true | false | N/A]
VERDICT: [PASS | PARTIAL | FAIL]
NOTES: [any file-level anomalies or ordering violations]
```

**STATUS field values** (required; disambiguates `SESSION_COMPLETE: false`):

| STATUS | When to use | VERDICT |
|--------|-------------|---------|
| `complete` | SESSION_COMPLETE: true; HANDOFF_BLOCK present and valid | PASS or PARTIAL |
| `intentional_pause` | SESSION_COMPLETE: false; agent asked a clarifying question or hit a documented HARD-GATE; output > 50 lines | PASS (gate behavior is correct) |
| `crashed` | SESSION_COMPLETE: false; output < 50 lines, no clarifying question, no HANDOFF_BLOCK, or run_agent.py returned non-zero | FAIL |

When `SESSION_COMPLETE: false` and `STATUS: intentional_pause`, `HANDOFF_BLOCK_VALID` must be `N/A`:
a missing HANDOFF_BLOCK is expected behavior, not a schema violation.

When `SESSION_COMPLETE: false` and `STATUS: crashed`, `VERDICT` must be `FAIL` regardless of any
other fields. A silent crash must never be reported as PARTIAL or PASS.

Use **PARTIAL** when some outputs are present but not all. It pinpoints exactly which workstream
failed rather than collapsing everything into a binary pass/fail.

### Binary PASS/FAIL Contract

A run PASSES only if ALL of the following are true:

- At least 1 artifact is present at a declared OUTPUTS path
- HANDOFF_BLOCK contains all 7 required fields
- STATUS is not `crashed`
- EVOLUTION_VERIFICATION VERDICT is PASS (PARTIAL counts as FAIL for gating: it is logged but does not unblock the pipeline)

A run FAILS if any condition above is not met, OR if VERDICT is PARTIAL. PARTIAL means outputs are
incomplete, which is a FAIL for any gating decision even though it is logged separately for
diagnostic purposes.

**Adversarial threshold**: When running WS-N failure injection scenarios (N-01 through N-06), the
verifier must produce FAIL verdicts on at least 4 of 6 adversarial inputs. A verifier that passes
all adversarial inputs is not operational; it is only checking the happy path.

**Critical scenario requirement**: N-04 (malformed run-config), N-05 (truncated plan), and N-06
(bad evals schema) MUST ALL produce FAIL verdicts. These test structural failures, not just crashes.
A verifier that catches crashes (N-01/N-02/N-03) but misses structural failures (N-04/N-05/N-06)
has a ceiling of 3/6 and is not detecting the important failure modes.
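
The gating contract and the adversarial requirements can be expressed as two small functions. This is a minimal sketch, assuming each result has already been parsed into a dict keyed by the EVOLUTION_VERIFICATION field names. The `handoff_fields_present` key and both function names are hypothetical. The sketch applies the four contract bullets literally, so a run with no HANDOFF_BLOCK does not pass this gate.

```python
REQUIRED_HANDOFF_FIELDS = 7
CRITICAL_SCENARIOS = ("N-04", "N-05", "N-06")


def gate(result: dict) -> str:
    """Return 'PASS' or 'FAIL' for gating; PARTIAL is treated as FAIL."""
    conditions = [
        # At least 1 artifact present at a declared OUTPUTS path
        result.get("OUTPUTS_VERIFIED", 0) >= 1,
        # HANDOFF_BLOCK contains all 7 required fields (hypothetical key)
        result.get("handoff_fields_present", 0) == REQUIRED_HANDOFF_FIELDS,
        # STATUS is not crashed
        result.get("STATUS") != "crashed",
        # VERDICT must be exactly PASS; PARTIAL does not unblock the pipeline
        result.get("VERDICT") == "PASS",
    ]
    return "PASS" if all(conditions) else "FAIL"


def adversarial_ok(verdicts: dict[str, str]) -> bool:
    """Check the adversarial threshold over N-01..N-06 gate results.

    Requires FAIL on at least 4 scenarios, and FAIL on every critical one.
    """
    fail_count = sum(1 for v in verdicts.values() if v == "FAIL")
    critical_all_fail = all(verdicts.get(s) == "FAIL" for s in CRITICAL_SCENARIOS)
    return fail_count >= 4 and critical_all_fail
```

Keeping `gate` separate from the logged VERDICT means PARTIAL stays available for diagnostics while the pipeline decision remains binary.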

After the per-scenario EVOLUTION_VERIFICATION blocks, add the aggregate summary:

```
## Run Summary
Total: N scenarios
PASS: X
PARTIAL: Y
FAIL: Z

### Failed / Partial Tests
- TEST-N: —

### Evolution Gaps Found
[For each FAIL/PARTIAL: classify as spec fix / new skill needed / new eval case]

### Recommended Actions
1. [Priority: Critical] Fix in os-architect-agent.md
2. [Priority: High] Add new eval case for
3. [Priority: Medium] Create new skill for
```

---

## Phase 6: Persist to Experiment Log

After the Phase 5 summary is written, always call os-experiment-log to persist the run:

```bash
python3 scripts/experiment_log.py append \
  --report temp/os-evolution-verifier/test-report.md \
  --triggered-by os-evolution-verifier
```

This is not optional. `temp/` is ephemeral: if the log is not appended immediately after the run,
the results are lost when the shell restarts. The experiment log is the durable record.

---

## Scenario File Format

Test scenarios live in `temp/os-evolution-verifier/scenarios/`:

```json
{
  "id": "TEST-1",
  "name": "Path C — monitoring agent gap fill",
  "category": 4,
  "path": "C",
  "prompt": "There's no skill for automatically monitoring plugin health and flagging stale evals — I want to create one.",
  "expected_artifact": "tasks/todo/copilot_prompt_",
  "artifact_check": "file_prefix",
  "expected_behavior": "os-architect classifies as Cat4 Gap Fill, runs audit, proposes Path C, dispatches os-evolution-planner to write task plan + copilot_prompt file",
  "failure_mode": "agent routes to wrong category or fails to dispatch os-evolution-planner"
}
```

---

## Smoke Tests

Three fast verification cases confirm that the skill itself is working:

**Smoke 1: Heartbeat check.** Run the heartbeat only and confirm `HEARTBEAT_OK` in the output.
Expected: heartbeat.md is non-empty and contains `HEARTBEAT_OK`. Time: <30s.

**Smoke 2: Single scenario dry run.** Run `TEST-1` (Path C gap fill). Confirm the output file is >100 lines.
Time: <3 min.

**Smoke 3: HANDOFF_BLOCK field scan.** On an existing output file, run the 7-field grep.
Confirm all 7 fields are found. Time: <5s.

---

## Gotchas

- **Expect 100+ lines of output**: A real run produces 100+ lines. A Copilot CLI call that returns
  <50 lines usually means the model hit a refusal, the system prompt was too long, or the heartbeat
  was skipped. Always heartbeat first and always check `wc -l` before running artifact checks.

- **Single-shot simulation ≠ real dispatch**: os-architect in simulation mode cannot write files to
  disk (no Write tool access during simulation). Artifact checks for Path B/C test whether the agent
  PROPOSES the correct files in its output, not whether they exist on disk. Real file-existence
  checks only apply when os-architect is run with full tool access.

- **Include the trailing colon when grepping HANDOFF_BLOCK fields**: Use `grep -q "FIELD:"`, not
  `grep -q "FIELD"`. Without the colon, partial matches on word fragments produce false positives.

- **Confidence model check is order-sensitive**: The clarifying question must appear BEFORE any
  audit output. Line-number comparison is required; a simple `grep -q` is insufficient.

- **`temp/` files are ephemeral, so distinguish a shell restart from a crash**: If a run was
  interrupted by a shell restart and `temp/copilot_output_*.md` is missing, set
  `STATUS: intentional_pause`, `VERDICT: PARTIAL (inconclusive)`, because the run never completed.
  If the file is present but < 50 lines AND run_agent.py returned non-zero, set `STATUS: crashed`,
  `VERDICT: FAIL`, because the agent halted unexpectedly. Never report a silent crash as PARTIAL.

- **OUTPUTS field path normalization**: HANDOFF_BLOCK OUTPUTS lists paths relative to the project
  root. Normalize before checking (strip leading `./`, resolve `~`). A path mismatch between
  declared and actual is a schema drift signal, not a file-missing signal.

- **Category 5 tests produce two sequential dispatches**: When verifying Category 5 output, check
  that two separate PATH / TARGET pairs appear in HANDOFF_BLOCK, not one.
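
The OUTPUTS path normalization mentioned in the gotchas (strip leading `./`, resolve `~`) can be sketched as below. The function name `normalize_output_path` and the `project_root` parameter are hypothetical. Returning an absolute path for anything outside the project root is an assumption made for this illustration.

```python
import os


def normalize_output_path(declared: str, project_root: str = ".") -> str:
    """Normalize a declared OUTPUTS path to a project-root-relative form.

    Expands a leading ~, strips a leading ./ and collapses redundant
    separators. Paths outside the project root come back as absolute paths.
    """
    root = os.path.abspath(project_root)
    expanded = os.path.expanduser(declared.strip())
    if not os.path.isabs(expanded):
        expanded = os.path.join(root, expanded)
    absolute = os.path.normpath(expanded)

    relative = os.path.relpath(absolute, root)
    outside_root = relative == os.pardir or relative.startswith(os.pardir + os.sep)
    return absolute if outside_root else relative


def same_artifact(declared: str, actual: str, project_root: str = ".") -> bool:
    """True when both paths point at the same location after normalization."""
    return (
        normalize_output_path(declared, project_root)
        == normalize_output_path(actual, project_root)
    )
```

Comparing normalized forms first makes it possible to tell a spelling difference in the declared path (schema drift) apart from a file that is actually missing.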