---
name: benchguard-audit
description: Audit scientific or agent benchmarks using BenchGuard's cross-artifact verification methodology. Use when the user asks Codex to inspect benchmark tasks for instruction, ground-truth, evaluator, environment, or scoring defects; to produce BenchGuard-style JSON/Markdown audit reports; or to review a standard-format benchmark without installing the Python package.
---

# BenchGuard Audit

Use BenchGuard's recall-oriented benchmark-audit methodology to find defects in
the benchmark itself, not in the agents being evaluated. Prefer concrete,
evidence-backed findings and treat possible issues as `WARNING` unless scoring or
correctness harm is demonstrated.

## Inputs

The user should provide a benchmark path. If no path is provided, ask for it.
Optional inputs may include task IDs and an output directory; the output
directory defaults to `./benchguard-output/`.

This skill is optimized for BenchGuard standard-format benchmarks: immediate
child task directories containing `task.toml`, with optional `instruction.md`,
`solution/`, `tests/`, `environment/`, `domain_knowledge.md`, and
`data_description.md`. Treat task-owned notebooks, analysis scripts, grader
utilities, grader prompts, rubrics, and reference traces as definition artifacts
when they determine the intended answer or scoring behavior.
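
As a rough illustration of the layout described above, the following Python sketch finds immediate child directories that contain `task.toml`, sorts their names lexicographically, and applies the "auditable" rule from the Workflow section. The helper names `discover_tasks` and `is_auditable` are hypothetical and are not part of the BenchGuard package. Counting files nested below `solution/` or `tests/` is an assumption about what "has files" means.

```python
from pathlib import Path


def discover_tasks(root, task_ids=None):
    """Return sorted IDs of immediate child directories containing task.toml."""
    root = Path(root)
    found = sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "task.toml").is_file()
    )
    if task_ids is not None:
        # Keep only the IDs the user asked for; unknown IDs are ignored here.
        wanted = set(task_ids)
        found = [task_id for task_id in found if task_id in wanted]
    return found


def is_auditable(task_dir):
    """True when solution/ or tests/ exists and holds at least one file.

    Assumption: files in nested subdirectories count as content.
    """
    task_dir = Path(task_dir)
    for name in ("solution", "tests"):
        sub = task_dir / name
        if sub.is_dir() and any(p.is_file() for p in sub.rglob("*")):
            return True
    return False
```

A call such as `discover_tasks("path/to/benchmark", ["task_a"])` would return only the requested IDs that actually exist as task directories. Tasks for which `is_auditable` is false still belong in the report, with no findings.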

## References

Load only what is needed:

- `references/taxonomy.md`: category, subcategory, severity, and confidence rules.
- `references/definition-audit.md`: audit methodology and false-positive guardrails.
- `references/standard-format.md`: task discovery and file composition rules.
- `references/output-format.md`: report and per-task JSON schemas.

Always read `taxonomy.md` and `definition-audit.md` before proposing findings.
Read `standard-format.md` before discovery if the layout is unfamiliar. Read
`output-format.md` before writing reports.

## Workflow

1. Discover tasks.
   - Use fast file search (`rg --files`, `find`, or equivalent) to locate
     immediate child `task.toml` files under the benchmark root.
   - The task ID is the task directory name. Sort task IDs lexicographically.
   - If the user supplied task IDs, filter to those IDs.
   - Load root-level `benchguard_hints.yaml` if present. Treat invalid or
     non-mapping YAML as empty hints and emit a warning.

2. Decide audit scope.
   - A task is auditable when `solution/` or `tests/` exists and has files.
   - Tasks without auditable content should still appear in `task_ids_audited`
     and per-task output with no findings.
   - Audit every task independently. When there are many tasks, batch them only
     if each task still receives its own full artifact context, candidate list,
     and validation pass. Do not replace the per-task audit with a single
     benchmark-wide skim. Use subagents only when the user explicitly requested
     delegated or parallel agent work; otherwise audit locally in small task
     batches.

3. Audit each task.
   - Read `instruction.md` if present. If it contains wrapper instructions
     around a smaller task question, extract the core task question first and
     treat surrounding workflow text as agent-facing procedure, not necessarily
     the scientific/statistical specification.
   - Read all task-owned reference-solution artifacts. In `solution/`, include
     text/source files and readable notebooks (`.ipynb` JSON cells, `.py`,
     `.R`, `.Rmd`, `.qmd`, shell scripts, etc.). `solve.sh` is often only an
     answer-emitting wrapper; never stop there when a notebook or analysis
     script is available. When assembling context, label each file's content
     with its relative path.
   - Read text files in `tests/`, prioritizing `test.sh`, then alphabetical
     order. Include local grader helpers, rubric prompts, upstream grader
     utilities, and imported task-local files when they determine scoring.
   - Read text files in `environment/`, prioritizing `Dockerfile`, then
     alphabetical order.
   - Read `domain_knowledge.md`, `data_description.md`, `task.toml`, and other
     small top-level task-owned reference files when they define the intended
     answer, method, data interpretation, or evaluator behavior. Do not read
     large data files unless the user explicitly asks for data-level checking.
   - Build a per-task fact table before proposing findings:
     `task asks`, `reference solution computes`, `evaluator checks`,
     `environment/runtime assumptions`, and `hidden choices implied by artifacts`.
     Use this table to compare semantic choices such as methods, thresholds,
     covariates, filters, grouping rules, metric definitions, units, output
     formats, tolerances, and software-specific behavior.
   - Apply hints from `benchguard_hints.yaml`.
   - Report only issues supported by evidence from task artifacts.
   - Keep findings atomic: one independently fixable root cause per finding.

4. Validate and filter candidate findings.
   - Treat the first pass as a candidate queue, not the final answer.
   - For every candidate, perform a validity check before keeping it:
     - The candidate must identify a concrete benchmark-definition defect in
       the task artifacts: a prompt/ground-truth/rubric/evaluator conflict, a
       hidden answer-changing rule, an evaluator that can accept/reject valid
       answers incorrectly, or a demonstrated environment definition problem.
     - The evidence must cite the artifact path and, when possible, the exact
       line(s) or snippets that prove the issue.
     - Drop candidates that are not benchmark-owned defects. Transient network
       access, public data/API availability, missing optional convenience
       tools, missing runtime-mounted files, import/path errors, and
       agent-strategy failures are not definition bugs by themselves.
     - Drop candidates that are only generic auditability concerns,
       maintainability concerns, missing executable-reference complaints,
       rubric-only-evaluation complaints, or "the process rubric/tool workflow
       is hidden" complaints. Keep a process-rubric finding only when it proves
       scoring harm or a hidden semantic requirement that changes the valid
       final answer.
     - Drop candidates based only on plausibility, domain intuition, or model
       suspicion unless the local artifacts prove the mismatch. If confirming
       a candidate requires checking an external database or source, keep it
       as a low-confidence `WARNING` only when it is still useful as a review
       item; otherwise drop it.
     - Drop candidates where another reasonable interpretation or solution
       approach remains valid and the evaluator/gold does not demonstrably
       reject it.
     - Downgrade to `WARNING` when the issue is real but limited to formatting,
       reproducibility, or unclear wording without demonstrated scoring harm.
     - Do not let generic evaluator metadata mismatches, wrapper-script
       complaints, or LLM-judge concerns crowd out the semantic definition
       audit. Even when such a warning is real, still check the reference
       solution and evaluator for hidden answer-changing methods, thresholds,
       filters, covariates, grouping rules, metrics, units, and tolerances.
   - Deduplicate semantically before reporting:
     - Use the one-fix test. If one edit would fix multiple candidate findings,
       keep one finding and merge the evidence/recommendation.
     - Prefer the most specific root cause over downstream symptoms. For
       example, one prompt-vs-rubric mismatch should not appear separately as
       both a ground-truth issue and an evaluation issue.
     - Keep the version with the strongest evidence, highest confidence, and
       clearest recommendation.

5. Normalize findings.
   - Validate `category`, `subcategory`, `severity`, and `finding_type` against
     the taxonomy.
   - Set `task_id`, `protocol: "definition"`, and ISO 8601 UTC `timestamp`.
   - Derive `confidence_level`: `CONFIRMED` for `>=0.8`, `LIKELY` for `>=0.55`,
     `POSSIBLE` for `>=0.3`; drop findings below `0.3`.
   - Resolve `evidence_quality`: line numbers imply `line_cited`; file/snippet
     without lines implies `snippet_or_file`; otherwise `generic`.
   - Deduplicate findings again using the one-fix test after normalization.

6. Write reports.
   - Create `<output_dir>/<benchmark_slug>/<model_slug>_<timestamp>/`.
   - Write `report.json`, `report.md`, and `per_task/<task_slug>.json` for every
     audited task.
   - Use the schemas and sort order in `references/output-format.md`.
   - Use `metadata.tool: "benchguard-codex-skill"`, `audit_mode: "definition"`,
     and `version: "0.1.0"`.

7. Summarize for the user.
   - State task counts, BUG/WARNING counts, and the highest-impact findings.
   - Include the report paths.
   - Make clear that this is a recall-oriented expert-review queue, not an
     automatic final judgment.
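
The normalization rules in step 5 can be sketched in Python as follows. This is a minimal illustration, not the package's implementation. The numeric field name `confidence` and the helper names are assumptions; the authoritative schema is in `references/output-format.md`.

```python
from typing import Optional


def confidence_level(confidence: float) -> Optional[str]:
    """Map a numeric confidence to a level; None means the finding is dropped."""
    if confidence >= 0.8:
        return "CONFIRMED"
    if confidence >= 0.55:
        return "LIKELY"
    if confidence >= 0.3:
        return "POSSIBLE"
    return None


def evidence_quality(has_line_numbers: bool, has_file_or_snippet: bool) -> str:
    """Line numbers win; a file or snippet without lines is the middle tier."""
    if has_line_numbers:
        return "line_cited"
    if has_file_or_snippet:
        return "snippet_or_file"
    return "generic"


def normalize(findings):
    """Attach confidence_level and drop findings below the 0.3 threshold.

    Assumption: each finding is a dict with a numeric "confidence" key.
    """
    kept = []
    for finding in findings:
        level = confidence_level(finding["confidence"])
        if level is None:
            continue
        kept.append({**finding, "confidence_level": level})
    return kept
```

The thresholds are inclusive lower bounds, so a confidence of exactly `0.55` maps to `LIKELY` and exactly `0.3` maps to `POSSIBLE`.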

## Reporting Discipline

Default to `WARNING`. Upgrade to `BUG` only when there is concrete evidence of a
wrong algorithm, contradiction, crash, or accept/reject scoring harm. Do not flag
missing runtime-mounted data, public data downloads, LLM judge APIs, or generic
auditability concerns as `BUG` unless task artifacts demonstrate one of those
harms.
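
The severity default can be expressed as a small Python sketch. The evidence labels below (`wrong_algorithm`, `contradiction`, `crash`, `scoring_harm`) are hypothetical names chosen for illustration; they are not taxonomy values from `references/taxonomy.md`.

```python
# Hypothetical labels for the kinds of concrete evidence that justify BUG.
BUG_EVIDENCE = frozenset(
    {"wrong_algorithm", "contradiction", "crash", "scoring_harm"}
)


def resolve_severity(evidence_kinds):
    """Default to WARNING; upgrade to BUG only on concrete evidence of harm."""
    if BUG_EVIDENCE.intersection(evidence_kinds):
        return "BUG"
    return "WARNING"
```

For example, a finding supported only by a label such as `"unclear_wording"` stays a `WARNING`, while one that also carries `"scoring_harm"` becomes a `BUG`.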