---
name: cli-eval
description: Create and run evaluation suites, watch live benchmark progress, view scorecards, compare model performance, and integrate eval runs with CI workflows from the CLI.
---
<!-- generated by src/lib/agentSkills/generator.ts; manual edits will be overwritten -->

## Overview

Create and run evaluation suites, watch live benchmark progress, view scorecards, compare model performance, and integrate eval runs with CI workflows from the CLI.

## Quick install

```bash
npm install -g omniroute   # or: npx omniroute
omniroute --version
```

## Subcommands

### `eval`

**Example:**

```bash
omniroute eval
```

### `eval suites`

**Example:**

```bash
omniroute eval suites
```

### `eval list`

**Example:**

```bash
omniroute eval list
```

### `eval get <suiteId>`

**Example:**

```bash
omniroute eval get <suiteId>
```

### `eval create`

**Flags:**

- `--file <path>`

**Example:**

```bash
omniroute eval create
```

### `eval run <suiteId>`

**Flags:**

- `-m, --model <id>`
- `--combo <name>`
- `--concurrency <n>`
- `--tag <tag>`
- `--watch`

**Example:**

```bash
omniroute eval run <suiteId>
```

### `eval list`

**Flags:**

- `--suite <id>`
- `--status <s>`
- `--since <ts>`
- `--limit <n>`

**Example:**

```bash
omniroute eval list
```

### `eval get <runId>`

**Example:**

```bash
omniroute eval get <runId>
```

### `eval results <runId>`

**Flags:**

- `--failed`

**Example:**

```bash
omniroute eval results <runId>
```

### `eval cancel <runId>`

**Flags:**

- `--yes`

**Example:**

```bash
omniroute eval cancel <runId>
```

### `eval scorecard <runId>`

**Example:**

```bash
omniroute eval scorecard <runId>
```

### `simulate [prompt]`

**Flags:**

- `--file <path>`
- `-m, --model <id>`
- `--combo <name>`
- `--reasoning-effort <level>`
- `--thinking-budget <n>`
- `--explain`

**Example:**

```bash
omniroute simulate [prompt]
```

<!-- skill:custom-start -->
<!-- Migrated from skills/omniroute-cli-eval/SKILL.md (preserved curated content) -->

# OmniRoute — CLI Evals

Requires the `omniroute` CLI. See [CLI entry-point skill](https://raw.githubusercontent.com/diegosouzapw/OmniRoute/main/skills/omniroute-cli/SKILL.md) for install + global flags.

## What are evals?

Evals are automated test suites that score LLM outputs against expected answers or rubrics. OmniRoute stores suites and run results in its local database.

## Eval suites

```bash
omniroute eval suites list                       # List all eval suites
omniroute eval suites list --json                # JSON output

omniroute eval suites get <suiteId>              # Full suite definition
```

### Create a suite

```bash
omniroute eval suites create \
  --name "code-quality" \
  --rubric "exact-match" \
  --samples-file ./samples.jsonl                 # JSONL: {input, expected_output}
```

Rubric options: `exact-match`, `contains`, `llm-judge`, `regex`.

`--samples-file` format (one JSON object per line):

```jsonl
{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to Spanish", "expected_output": "hola"}
```

## Run an eval

```bash
omniroute eval suites run <suiteId> \
  --model claude-sonnet-4-6                      # Run suite against a specific model

omniroute eval suites run <suiteId> \
  --model gpt-4o \
  --watch                                        # Live TUI progress (EvalWatch)
```

The run is asynchronous. Use `--watch` for a live terminal dashboard or poll manually:

```bash
RUN_ID=$(omniroute eval suites run <suiteId> --model claude-sonnet-4-6 --output json | jq -r '.id')
omniroute eval get $RUN_ID
```

## Manage runs

```bash
omniroute eval list                              # List all eval runs
omniroute eval list --json

omniroute eval get <runId>                       # Run details (status, model, score)
omniroute eval results <runId>                   # Per-sample results
omniroute eval scorecard <runId>                 # Full scorecard with pass/fail per sample
omniroute eval cancel <runId>                    # Cancel a running eval
```

## Scorecard output

```bash
omniroute eval scorecard <runId> --output json
```

Response fields per sample:

```json
{
  "id": "sample-1",
  "score": 0.95,
  "passed": true,
  "input": "What is 2+2?",
  "output": "4",
  "expected": "4"
}
```

## Comparing models

Run the same suite against multiple models and compare:

```bash
for MODEL in claude-sonnet-4-6 gpt-4o gemini-2.0-flash; do
  omniroute eval suites run $SUITE_ID --model $MODEL --output json | jq '{model: .model, score: .score}'
done
```

## CI integration

```bash
# Run and fail CI if score drops below threshold
SCORE=$(omniroute eval suites run $SUITE_ID --model claude-sonnet-4-6 --output json | jq -r '.score')
python3 -c "import sys; score=float('$SCORE'); sys.exit(0 if score >= 0.90 else 1)"
```

## Errors

- `suites create` fails with `invalid rubric` → use one of: `exact-match`, `contains`, `llm-judge`, `regex`
- `suites run` returns `model not found` → verify model ID with `omniroute models --search <name>`
- `eval get` shows `status: failed` → check `omniroute logs --search eval` for error details
- `scorecard` returns empty results → the run may still be `running`; poll `omniroute eval get <runId>` until `status` is `completed`
<!-- skill:custom-end -->
