--- description: Execute YAML test plan, stop on first failure, output rich debug prompt name: run-test-plan disable-model-invocation: true --- # Run Test Plan Execute a YAML test plan, run setup commands, health checks, and each test sequentially. Stop on first failure with rich debug output. ## Prerequisites - **agent-browser skill**: Browser tests require the `agent-browser:agent-browser` skill to be available ## Arguments - `--plan `: Path to test plan (default: `docs/testing/test-plan.yaml`) - `--skip-setup`: Skip setup commands and health checks (for re-running after failure) ## Step 1: Parse Test Plan Read and validate the test plan: ```bash # Resolve plan path from --plan (default shown) PLAN_PATH="${PLAN_PATH:-docs/testing/test-plan.yaml}" # Check file exists ls "$PLAN_PATH" || { echo "Error: Test plan not found: $PLAN_PATH"; exit 1; } # Validate YAML python3 -c "import yaml; yaml.safe_load(open('$PLAN_PATH'))" || { echo "Error: Invalid YAML: $PLAN_PATH"; exit 1; } ``` Extract from the YAML: - `setup.commands`: List of setup commands - `setup.health_checks`: List of URLs to poll - `tests`: Array of test cases ## Step 2: Run Setup (unless --skip-setup) ### 2a. Check Prerequisites If `setup.prerequisites` exists, verify each one: ```bash # For each prerequisite in setup.prerequisites || { echo "Prerequisite not met: "; exit 1; } ``` ### 2b. Set Environment Variables If `setup.env` exists, export each variable. Variables using `${VAR}` syntax should be resolved from the current environment: ```bash # For each key/value in setup.env export ="" ``` ### 2c. Build If `setup.build` exists, execute build commands sequentially: ```bash # For each command in setup.build || { echo "Build failed: "; exit 1; } ``` ### 2d. Start Services If `setup.services` exists, start long-running processes and wait for health checks: ```bash # For each service in setup.services nohup > .beagle/service-.log 2>&1 & echo $! > .beagle/service-.pid ``` For each service with a `health_check`, poll until ready: ```bash timeout= url= elapsed=0 while [ $elapsed -lt $timeout ]; do if curl -s -o /dev/null -w "%{http_code}" "$url" | grep -qE "^(200|301|302)"; then echo "✓ Health check passed: $url" break fi sleep 2 elapsed=$((elapsed + 2)) done if [ $elapsed -ge $timeout ]; then echo "✗ Health check timeout: $url" exit 1 fi ``` ### 2e. Legacy Setup Format If the plan uses the older flat format (`setup.commands` + `setup.health_checks` instead of `prerequisites`/`build`/`services`), fall back to executing `setup.commands` sequentially and polling `setup.health_checks` as before. ## Step 3: Gate — setup ready before tests Do not start **Step 4** until each condition you can check is true: 1. **Plan load:** The file from `--plan` (default `docs/testing/test-plan.yaml`) exists and parses as YAML (same checks as Step 1). 2. **Setup branch:** - If **not** using `--skip-setup`: Every `setup.prerequisites` check that exists exited 0; every `setup.build` command succeeded; every service `health_check` reached HTTP 200, 301, or 302 within its timeout **or** legacy `setup.health_checks` passed after `setup.commands`. - If using `--skip-setup`: Before TC-01, confirm anything the plan still needs is alive—at minimum one successful `curl` (or equivalent) to each URL in `setup.health_checks` or each `setup.services[].health_check.url` that the tests depend on. 3. **Evidence path:** `mkdir -p docs/testing/evidence` succeeds and the directory exists. If any gate fails, stop, fix setup or flags, and do not execute tests. ## Step 4: Execute Tests Sequentially For each test in the plan: ### 4a. Log Test Start ```markdown ## Running: TC-XX - Context: ``` ### 4b. Execute Steps For each step in `test.steps`, determine the step type and execute accordingly: **Shell commands (`run:` steps):** The most common step type. Execute the command via Bash and capture stdout, stderr, and exit code: ```bash # Execute the command, capture output and exit code 2>&1 echo "EXIT_CODE: $?" ``` Capture all output for evaluation in step 4c. Shell steps cover: - CLI binary invocations (e.g., `./target/debug/myapp status --all`) - Database queries (e.g., `psql "${DATABASE_URL}" -c "SELECT ..."`) - File inspection (e.g., `ls -la /path/to/expected/output`) - Process lifecycle checks (e.g., `timeout 5 ./myapp 2>&1 || true`) - Any other command a human would type in a terminal **curl actions (`action: curl` steps):** ```bash curl -X \ -H "Content-Type: application/json" \ \ -d '' \ "" \ -o response.json \ -w "%{http_code}" > status_code.txt # Capture response for evaluation cat response.json cat status_code.txt ``` **agent-browser CLI actions:** Steps starting with `agent-browser` are browser automation commands: ```bash # Navigate agent-browser open # Snapshot interactive elements (always do before interacting) agent-browser snapshot -i # Interact using refs from snapshot output (@e1, @e2, etc.) agent-browser fill @ "" agent-browser click @ # Wait for conditions agent-browser wait --url "" agent-browser wait --text "" agent-browser wait --load networkidle # Capture evidence agent-browser screenshot docs/testing/evidence/.png ``` **Important:** Always run `agent-browser snapshot -i` before interacting with elements to get valid refs, and re-snapshot after navigation or significant DOM changes. Save screenshots to `docs/testing/evidence/.png` ### 4c. Evaluate Result **Gate — artifacts before PASS/FAIL:** - **`run:` steps:** Stdout and stderr captured; exit code recorded (e.g. `EXIT_CODE:` line or equivalent). - **`action: curl` steps:** Response body and HTTP status captured to known paths (e.g. `response.json`, `status_code.txt` or paths the plan specifies). - **agent-browser steps:** After any navigation or DOM change, a fresh `agent-browser snapshot -i` exists before asserting; if the test records evidence, the screenshot file path is created or failure is explicit. Then, using agent reasoning, compare actual outcome against `test.expected`: - Read the expected behavior description - Compare with actual response/screenshot - Determine PASS or FAIL ### 4d. On PASS ```markdown ✓ TC-XX PASSED: ``` Continue to next test. ### 4e. On FAIL Stop immediately. Go to Step 6. ## Step 5: On All Tests Pass ```markdown ## Test Results: ALL PASSED | ID | Name | Result | |----|------|--------| | TC-01 | | ✓ PASS | | TC-02 | | ✓ PASS | | ... | ... | ... | **Total:** N/N tests passed ### Evidence Screenshots saved to `docs/testing/evidence/` ### Cleanup Stopping background services... ``` Clean up: ```bash # Kill background services for pidfile in .beagle/service-*.pid .beagle/dev-server.pid; do if [ -f "$pidfile" ]; then kill $(cat "$pidfile") 2>/dev/null rm "$pidfile" fi done ``` ## Step 6: On Failure - Generate Debug Prompt When a test fails, generate rich debug output: ### 6a. Gather Context ```bash # Get changed files relevant to the failure git diff --name-only $(git merge-base HEAD origin/main)..HEAD # Get recent changes in files mentioned in test.context git diff $(git merge-base HEAD origin/main)..HEAD -- ``` ### 6b. Output Debug Report ```markdown ## Test Failure: TC-XX - ### What Failed **Test:** **Expected:** **Actual:** ### Relevant Changes in This PR - `` (lines X-Y) - ### Evidence - Screenshot: `docs/testing/evidence/.png` - Status code:


- Response body:
```json

```

### Error Details


```

```

### Suggested Investigation


1. 
2. 
3. 

### Debug Session Prompt

Copy this to start a new Claude session:

---
I'm debugging a test failure in branch ``.

**Test:** 
**Error:** 



Relevant files:


Help me investigate why .
---
```

### 6c. Preserve Evidence

```bash
# Ensure evidence directory exists
mkdir -p docs/testing/evidence

# Save failure context
cat > docs/testing/evidence/-failure.md << 'EOF'
# Failure Report: 


EOF
```

### 6d. Cleanup and Exit

```bash
# Kill background services
for pidfile in .beagle/service-*.pid .beagle/dev-server.pid; do
  if [ -f "$pidfile" ]; then
    kill $(cat "$pidfile") 2>/dev/null
    rm "$pidfile"
  fi
done
```

## Test Results Summary Table

Always output a summary table showing progress:

```markdown
## Test Results

| ID | Name | Result |
|----|------|--------|
| TC-01 |  | ✓ PASS |
| TC-02 |  | ✗ FAIL |
| TC-03 |  | - SKIP |

**Passed:** 1/3
**Failed:** TC-02
```

Tests after a failure are marked as SKIP (not executed).

## Verification

Before completing:

```bash
# Verify evidence directory exists
ls -la docs/testing/evidence/

# List captured evidence
ls docs/testing/evidence/*.png docs/testing/evidence/*.md 2>/dev/null
```

**Pass conditions (all must be true to call the run complete):**

- **`ls -la docs/testing/evidence/`** exits 0 (directory exists).
- Every test that ran has an explicit PASS or FAIL (not only “felt right”).
- If any test failed: a failure artifact exists (`docs/testing/evidence/-failure.md` or equivalent) **or** the debug report was emitted with expected vs actual.
- Cleanup ran: no stale `.beagle/service-*.pid` pointing at live processes you started, unless the plan says to leave them up.

**Verification Checklist:**
- [ ] Setup commands executed successfully (or `--skip-setup` with live dependencies confirmed)
- [ ] Health checks passed before test execution (Step 3 gate)
- [ ] Each executed test has recorded result
- [ ] Evidence captured in `docs/testing/evidence/`
- [ ] On failure: debug prompt includes expected vs actual
- [ ] On failure: relevant PR changes listed
- [ ] Background processes cleaned up

## Gates (ordered)

1. **Plan valid:** YAML parses; required keys for your branch (`setup` / `tests`) are present.
2. **Setup healthy:** Step 3 pass conditions met before TC-01.
3. **Per-test artifacts:** Step 4c gate satisfied before marking PASS.
4. **Stop on fail:** First failure → Step 6; later tests = SKIP in the summary table.
5. **Cleanup:** Step 5 or Step 6d executed so background PIDs from this run are released.

## Rules

- Stop on first test failure (do not continue to other tests)
- Always capture evidence (screenshots, responses)
- Include file:line references in debug prompts when possible
- Use `--skip-setup` flag to re-run after fixing issues
- Never hardcode secrets - use environment variables
- Clean up background processes even on failure
- Preserve failure evidence for debugging
- Make debug prompts copy-paste ready for new sessions