---
name: eval
description: Eval-driven development — capture real failures from the ticket-implementation workflow, convert them into regression tests, track quality metrics over time, and use LLM-as-a-Judge to assess implementations. Use this skill when the user asks to capture a failure, run evals, judge an implementation, or generate a quality report.
---

# Eval Skill

**For Claude Code AI Assistant**

This skill supports eval-driven development: it captures failures, runs regression tests, judges implementations, and tracks quality over time.

## Purpose

The eval skill closes the feedback loop in the ticket-implementation workflow by:
- Capturing real failures (test failures, review rejections, missed criteria) as eval cases
- Running regression tests against accumulated eval cases
- Using LLM-as-a-Judge to assess implementation quality against acceptance criteria
- Generating quality reports with trends and actionable recommendations

## Storage Layout

```
.artifex/evals/
  cases/          # Individual eval case JSON files
  reports/        # Generated summary reports
```

## Modes of Operation

### 1. Capture Mode (`/afx-eval capture`)

After a failed review or test failure, record an eval case.

**When to use**: Immediately after a ticket implementation fails review, tests fail, or acceptance criteria are missed.

**Process**:

1. Initialize eval storage if it does not exist:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh init
   ```

2. Create an eval case skeleton:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh capture <ticket_id> <failure_reason>
   ```
   Where `<failure_reason>` is one of: `test_failure`, `review_rejection`, `acceptance_criteria_miss` (recorded as `failure_type` in the eval case JSON)

3. The script writes the skeleton as a JSON file in `.artifex/evals/cases/`. Fill in the details by reading the ticket and the failed implementation:

   - **acceptance_criteria**: Pull from the ticket database using `ticket-manager.sh show <ticket_id>`
   - **implementation_summary**: Summarize what was implemented (from the git diff or code review)
   - **failure_details**: Describe what went wrong — the specific test failure message, review comment, or missed criterion
   - **resolution**: Once the issue is fixed, record how it was resolved

4. Update the JSON file with the gathered details.

5. Confirm to the user:
   - Eval case ID
   - Ticket ID
   - Failure type
   - File location

**Eval case JSON format** (values separated by `|` list the allowed options; a real file holds exactly one):
```json
{
  "id": "EVAL-001",
  "ticket_id": "2.3",
  "timestamp": "2026-04-11T21:30:00Z",
  "failure_type": "test_failure|review_rejection|acceptance_criteria_miss",
  "acceptance_criteria": ["Given... When... Then..."],
  "implementation_summary": "What was implemented",
  "failure_details": "What went wrong",
  "resolution": "How it was fixed",
  "status": "captured|resolved|regression"
}
```
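
Step 4 of the capture process can be scripted with `jq` instead of editing the file by hand. The following is a minimal sketch, assuming the field names shown in the format above. The helper names `set_case_field` and `add_criterion` are hypothetical and are not part of `eval-manager.sh`.

```bash
# Hypothetical helpers (not part of eval-manager.sh) for filling in an
# eval case file. Each writes to a temp file first so a jq failure
# cannot truncate the original.

# Set a string field, e.g. implementation_summary or failure_details.
set_case_field() {
  local case_file="$1" field="$2" value="$3" tmp
  tmp="$(mktemp)" || return 1
  if jq --arg f "$field" --arg v "$value" '.[$f] = $v' "$case_file" > "$tmp"; then
    mv "$tmp" "$case_file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Append one acceptance criterion to the acceptance_criteria array.
add_criterion() {
  local case_file="$1" criterion="$2" tmp
  tmp="$(mktemp)" || return 1
  if jq --arg c "$criterion" '.acceptance_criteria += [$c]' "$case_file" > "$tmp"; then
    mv "$tmp" "$case_file"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Passing values through `--arg` lets `jq` handle the JSON escaping, so quotes or newlines in a review comment do not corrupt the file.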

### 2. Run Mode (`/afx-eval run`)

Execute all eval cases and report on quality.

**When to use**: Periodically to check regression status, before releases, or after significant changes.

**Process**:

1. Read all eval case files from `.artifex/evals/cases/`:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh list
   ```

2. For each eval case, assess whether the failure would still occur:
   - If the `resolution` field is empty, keep the status as `captured`
   - If the ticket's code has been fixed and `resolution` is populated, check whether the same failure pattern exists elsewhere in the codebase
   - If a similar failure is found, mark the case as `regression`
   - If the fix holds, mark the case as `resolved`

3. Collect aggregate statistics for the run report:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh stats
   ```

4. Present results to the user:
   - **Pass/fail rates**: Overall and by failure category (`test_failure`, `review_rejection`, `acceptance_criteria_miss`)
   - **Common failure patterns**: Group similar failures and identify recurring themes (e.g., "missing null checks", "incomplete error handling", "missing edge cases")
   - **Quality trends**: Compare current stats to previous reports — is the pass rate improving or degrading?

5. For any new regressions found, recommend specific actions.
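
The per-category rates in step 4 can be derived directly from the case files. This is a minimal sketch, assuming every case file follows the eval case JSON format and that a case counts as passing when its `status` is `resolved`. `summarize_cases` is a hypothetical name, and `eval-manager.sh stats` may format its output differently.

```bash
# Hypothetical sketch: per-category resolved counts across all eval cases.
summarize_cases() {
  local cases_dir="${1:-.artifex/evals/cases}"
  # Expand the glob into positional parameters; if nothing matched,
  # "$1" is the unexpanded pattern and does not exist on disk.
  set -- "$cases_dir"/*.json
  if [ ! -e "$1" ]; then
    echo "No eval cases found in $cases_dir" >&2
    return 1
  fi
  # -s slurps every case file into one array so the cases can be grouped.
  jq -s -r '
    group_by(.failure_type)[]
    | {type: .[0].failure_type,
       total: length,
       resolved: (map(select(.status == "resolved")) | length)}
    | "\(.type): \(.resolved)/\(.total) resolved"
  ' "$@"
}
```

Each output line has the form `<failure_type>: <resolved>/<total> resolved`, sorted by failure type.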

### 3. Judge Mode (`/afx-eval judge <ticket_id>`)

LLM-as-a-Judge assessment of an implementation against its acceptance criteria.

**When to use**: After implementing a ticket, before marking it done, to get an independent quality assessment.

**Process**:

1. Read the ticket's acceptance criteria:
   ```bash
   ~/.claude/skills/afx-ticket-manager/ticket-manager.sh show <ticket_id>
   ```

2. Read the implementation. Get the git diff for the ticket's commit(s):
   ```bash
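   # --fixed-strings matches the ticket ID literally, so the "." in an ID like 2.3 is not a regex wildcard
   git log --oneline --all --fixed-strings --grep="<ticket_id>" | head -5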
   git diff <commit>^..<commit>
   ```
   If no commit is found, fall back to the working tree: `git diff HEAD` shows staged and unstaged changes together.

3. For each acceptance criterion, assess:
   - **PASS**: The implementation clearly satisfies this criterion. Provide a brief explanation of how.
   - **FAIL**: The implementation does not satisfy this criterion. Explain what is missing or incorrect.

4. Calculate an overall quality score:
   - Count PASS vs FAIL criteria
   - Score = (PASS count / total criteria) * 100
   - Rate: 90% or above = Excellent, 70% to below 90% = Good, 50% to below 70% = Needs Work, below 50% = Major Issues

5. Provide suggestions:
   - For each FAIL criterion, suggest what needs to change
   - Note any code quality concerns (error handling, edge cases, naming, structure)
   - Highlight any acceptance criteria that are ambiguous and need clarification

6. Present the full assessment to the user in a clear table format.
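
The scoring in step 4 is simple enough to check mechanically. Below is a minimal sketch, assuming integer arithmetic that rounds the percentage down and treating each rating band as a lower bound. `quality_rating` is a hypothetical helper, not something the skill's scripts are known to provide.

```bash
# Hypothetical helper: turn PASS and total counts into a score and rating.
quality_rating() {
  local pass="$1" total="$2" score rating
  if [ "$total" -le 0 ]; then
    # Avoid dividing by zero when a ticket has no acceptance criteria.
    echo "No acceptance criteria to score" >&2
    return 1
  fi
  score=$(( pass * 100 / total ))  # integer percent, rounded down
  if   [ "$score" -ge 90 ]; then rating="Excellent"
  elif [ "$score" -ge 70 ]; then rating="Good"
  elif [ "$score" -ge 50 ]; then rating="Needs Work"
  else                           rating="Major Issues"
  fi
  echo "${score}% ${rating}"
}

quality_rating 4 5   # prints: 80% Good
```

Rounding down is a deliberate choice in this sketch: a ticket with 8 of 9 criteria passing scores 88% and stays in Good rather than being rounded up toward Excellent.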

### 4. Report Mode (`/afx-eval report`)

Generate a comprehensive quality summary report.

**When to use**: At milestone boundaries, sprint reviews, or when assessing overall project quality.

**Process**:

1. Gather all eval case data:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh stats
   ```

2. Generate a report:
   ```bash
   ~/.claude/skills/afx-eval/eval-manager.sh report
   ```

3. The report includes:
   - **Summary**: Total eval cases, pass rate (resolved / total), fail rate
   - **Breakdown by failure type**: How many of each type (test_failure, review_rejection, acceptance_criteria_miss)
   - **Top failure patterns**: Analyze failure_details across all cases and identify the most common themes
   - **Timeline**: When failures were captured and how quickly they were resolved
   - **Recommendations**: Based on the patterns, provide 3-5 actionable suggestions for improving future implementations

4. The report is saved to `.artifex/evals/reports/report-<date>.md`.

5. Present the report summary to the user and note the file location.
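
The timeline section depends on ordering cases by capture time. This is a minimal sketch, assuming each case has the `timestamp`, `id`, and `status` fields from the eval case format, with timestamps in the same UTC ISO 8601 form as the example so that string order matches time order. `list_timeline` is a hypothetical name, not an `eval-manager.sh` command.

```bash
# Hypothetical sketch: list eval cases in capture order.
list_timeline() {
  local cases_dir="${1:-.artifex/evals/cases}"
  # If the glob matched nothing, "$1" is the literal pattern.
  set -- "$cases_dir"/*.json
  if [ ! -e "$1" ]; then
    echo "No eval cases found in $cases_dir" >&2
    return 1
  fi
  # Sort by timestamp string; valid for same-format UTC ISO 8601 values.
  jq -s -r 'sort_by(.timestamp)[] | "\(.timestamp)  \(.id)  \(.status)"' "$@"
}
```

The format shown earlier has a single `timestamp`, so this sketch only orders captures. Measuring how quickly a case was resolved would need a second timestamp that the format does not define.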

## Available Shell Commands

All commands are run through the eval-manager script:

| Command | Description |
|---------|-------------|
| `eval-manager.sh init` | Create `.artifex/evals/` directory structure |
| `eval-manager.sh capture <ticket_id> <failure_reason>` | Create eval case JSON skeleton |
| `eval-manager.sh list` | List all eval cases with status |
| `eval-manager.sh stats` | Show aggregate statistics |
| `eval-manager.sh report` | Generate summary report to `.artifex/evals/reports/` |

## Integration with ticket-implementation Workflow

The eval skill integrates at these points in the ticket-implementation workflow:

- **After test failure**: Use `/afx-eval capture <ticket_id> test_failure` to record what went wrong
- **After review rejection**: Use `/afx-eval capture <ticket_id> review_rejection` to record the review feedback
- **After acceptance criteria miss**: Use `/afx-eval capture <ticket_id> acceptance_criteria_miss` to record which criteria were not met
- **Before marking done**: Use `/afx-eval judge <ticket_id>` to get an independent quality assessment
- **At milestones**: Use `/afx-eval report` to review quality trends

## Error Handling

The skill handles these error conditions:
- Storage not initialized: Automatically runs `init` when needed
- Missing ticket: Reports that the ticket ID was not found
- No eval cases: Reports that no cases exist yet
- Invalid failure type: Shows valid options
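
Two of these checks are easy to illustrate. This is a minimal sketch of failure-type validation and storage initialization, assuming `init` only needs to create the two directories from the storage layout. Both function names are hypothetical, and `eval-manager.sh` may implement the checks differently.

```bash
# Hypothetical sketch of two of the checks listed above.

# Return 0 for a valid failure type; otherwise print the valid options.
validate_failure_type() {
  case "$1" in
    test_failure|review_rejection|acceptance_criteria_miss)
      return 0 ;;
    *)
      echo "Invalid failure type: '$1'" >&2
      echo "Valid options: test_failure, review_rejection, acceptance_criteria_miss" >&2
      return 1 ;;
  esac
}

# Create the storage layout if it is missing (assumed equivalent of init).
# mkdir -p is a no-op when the directories already exist.
ensure_storage() {
  local root="${1:-.artifex/evals}"
  mkdir -p "$root/cases" "$root/reports"
}
```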

## Dependencies

- **Required**: jq, bash 4.0+
- **Optional**: ticket-manager skill (for pulling acceptance criteria automatically)
- **Optional**: git (for judge mode, to read implementation diffs)
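
A preflight check can surface missing dependencies before any mode runs. This is a minimal sketch, assuming the requirements listed above. `have_command` and `check_dependencies` are hypothetical names and are not part of `eval-manager.sh`.

```bash
# Hypothetical preflight check for the dependencies listed above.

# Return 0 if the named command is available on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

check_dependencies() {
  local failed=0
  # $BASH_VERSINFO without an index is the major version; it is unset
  # outside bash, so default to 0.
  if [ "${BASH_VERSINFO:-0}" -lt 4 ]; then
    echo "bash 4.0+ is required (found: ${BASH_VERSION:-unknown})" >&2
    failed=1
  fi
  if ! have_command jq; then
    echo "jq is required but was not found on PATH" >&2
    failed=1
  fi
  # git is optional: warn, but do not fail.
  if ! have_command git; then
    echo "git not found; judge mode cannot read implementation diffs" >&2
  fi
  return "$failed"
}
```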

## Version

1.0.0
