---
name: pipeline-eval
description: System-level evaluation framework for multi-stage LLM pipelines. Scores the pipeline *as a whole* across 8 dimensions — input quality, output quality, prompt design, input-coverage %, sequence optimality, parallelization, fact-grounding, and self-improvement feedback. Complements `deepeval` (which scores single content artifacts) by scoring the pipeline architecture itself. Trigger when user types `/pipeeval-*` commands or asks to "evaluate the pipeline", "audit the agent system", "find bottlenecks", "check input/output coverage", or works with pipeline.json / n8n workflows / multi-agent orchestrations. Use proactively after `/deepeval-run` if the user wants system-level (not artifact-level) findings. Also triggers on the short aliases /eval2 <prompt-or-pipeline> (prompt-design audit, same as /pipeeval-prompt) and /eval3 <pipeline.json> (full end-to-end pipeline eval, same as /pipeeval-full).
---

# pipeline-eval — System-level evaluation for LLM pipelines

## What this skill is for

`deepeval` scores **one content artifact**. `pipeline-eval` scores **the system that produced it**. Different question, different rubric, complementary.

Use `pipeline-eval` when:
- The user has a multi-stage pipeline (`pipeline.json`, n8n workflow, LangGraph, custom orchestrator)
- They want to know: *where does quality leak?* — at input, at the prompt, at sequencing, at fact-grounding, somewhere else
- They want to compare two pipeline versions (A vs B)
- They want a self-improvement loop that learns from past runs

Use `deepeval` (not this) when:
- The user has one strategic artifact and wants a BCG-style scorecard
- The output is the question, not the system

## The 8 evaluation modes

| Mode | Command | What it checks |
|---|---|---|
| 1. Input Quality | `/pipeeval-input <stage>` | Are stage inputs well-formed, complete, useful? Probes with adversarial noise. |
| 2. Output Quality | `/pipeeval-output <stage>` | Wraps `deepeval` Tier-2 BCG rubric + adds downstream-compatibility check. |
| 3. Prompt Design | `/pipeeval-prompt <prompt>` | Applies BCG 8-dim in `prompt_design_quality` mode — scores prompt text itself. |
| 4. Input Coverage | `/pipeeval-coverage <stage>` | % of required-info present; criticality map (what breaks when missing). |
| 5. Sequence Optimality | `/pipeeval-sequence <pipeline.json>` | Detects needless serialization, missing deps, redundant stages. |
| 6. Parallelization Audit | `/pipeeval-parallel <pipeline.json>` | Builds DAG, identifies parallelizable buckets, computes parallel efficiency. |
| 7. Fact Grounding | `/pipeeval-facts <output>` | Each numerical/factual claim → source. % grounded vs hallucinated. |
| 8. Self-Improvement | `/pipeeval-learn <feedback>` | Feeds outcomes back into the pipeline — A/B framework, drift detection. |

End-to-end run: `/pipeeval-full <pipeline.json>` — runs all 8 in dependency order.

## Short aliases (eval1 / eval2 / eval3)

Three short shortcut commands give a 3-tier eval ladder — pick the one matching how much you want to investigate:

| Alias | Maps to | Scope | Cost |
|---|---|---|---|
| `/eval1 <artifact>` | `deepeval` skill's `/deepeval-run` | **Quick.** Single artifact, BCG 8-dim Tier-2 verdict. | ~1 LLM call |
| `/eval2 <prompt-or-pipeline>` | `/pipeeval-prompt` (Mode 3) | **Medium.** Prompt-design audit — score one or many prompts against BCG rubric in `prompt_design_quality` mode. Parallel-first for >5 prompts. | ~1 LLM call per prompt |
| `/eval3 <pipeline.json>` | `/pipeeval-full` (all 8 modes) | **Full.** End-to-end pipeline audit — parse → DAG → per-stage input/output/prompt/coverage/facts → self-improvement rollup. | Many parallel agents |

`/eval1` is owned by the `deepeval` skill; `/eval2` and `/eval3` are owned here. The aliases are stable shortcuts — the underlying long-form commands (`/deepeval-run`, `/pipeeval-prompt`, `/pipeeval-full`) remain canonical.

## Architectural rules (inherited from deepeval)

1. **No external LLM SDK calls.** When Claude-as-judge is needed, the orchestrator prepares a prompt to disk; Claude (the runtime) reads it and emits the verdict JSON inline. No `anthropic.messages.create` / `openai.chat.completions.create` / `genai.generate_content`.
2. **BCG rubric anchors are reused VERBATIM** for the output-quality and prompt-design modes — same calibration language as `deepeval`.
3. **Cadence rule:** no metric bucket exceeds 30 days. The Self-Improvement mode uses Day/Week/30-day windows only — never 90/180/365.
4. **Parallel-first execution:** when a mode can run on N items independently (e.g., scoring 51 prompts), spawn N agents in parallel by default. Only serialize when there is an explicit dependency.

## Execution pattern per mode

Each mode follows the same 4-step pattern:

```
1. scripts/probe_<N>_<mode>.py     ← deterministic: extract structure, build manifest
2. scripts/prepare_judge_<N>.py    ← compose Claude-as-judge prompt to disk
3. Claude reads prompt + emits JSON verdict inline  (no SDK call)
4. scripts/aggregate_<N>.py        ← roll up verdicts into report
```

For modes 4, 5, 6 (coverage / sequence / parallelization) — purely deterministic: step 3 is skipped, the orchestrator's static analysis is the answer.

For modes 1, 2, 3, 7 — Claude-as-judge required.

For mode 8 — combines outputs from modes 1-7 over time + applies drift detection.

## Reference files (read before applying a mode)

| File | Read when |
|---|---|
| `references/1-input-quality.md` | running `/pipeeval-input` |
| `references/2-output-quality.md` | running `/pipeeval-output` (delegates to deepeval) |
| `references/3-prompt-design.md` | running `/pipeeval-prompt` |
| `references/4-input-coverage.md` | running `/pipeeval-coverage` |
| `references/5-sequence-optimality.md` | running `/pipeeval-sequence` |
| `references/6-parallelization-audit.md` | running `/pipeeval-parallel` |
| `references/7-fact-grounding.md` | running `/pipeeval-facts` |
| `references/8-self-improvement.md` | running `/pipeeval-learn` |
| `references/output-format.md` | always — defines the universal verdict schema |

## Output discipline

- Concise. Each mode emits JSON. Combine into one `pipeline-verdict.md` at the end.
- Russian for user-facing chat; English for code, configs, file content (project rule).
- Every claim in a verdict requires an `evidence` field with a quote from the source (same anti-fabrication rule as deepeval).

## Composability with deepeval

`pipeline-eval` and `deepeval` are designed to compose:
- `/deepeval-run <artifact>` → scores one artifact (content-quality)
- `/pipeeval-output <stage>` → wraps the above + adds downstream-compatibility check
- `/pipeeval-full <pipeline.json>` → calls `/deepeval-run` for every stage's output automatically

A user who has installed `deepeval` first gets the artifact-level eval; installing `pipeline-eval` adds the system-level eval. Neither replaces the other.