---
name: explore-dnn-model
description: Manual invocation only; use only when the user explicitly requests `explore-dnn-model` by name. Explore how to run a given DNN model checkpoint in the current Python environment by locating weights + upstream source code, resolving dependencies with user confirmation, running reproducible experiments under `tmp/`, and producing reports about I/O contracts, timing, and profiling.
---

# Explore DNN Model

## Minimum Required Inputs (Hard Requirement)

To use this skill, the user must provide:
- A model checkpoint / model file(s) as a **local** file or directory path (it may be outside the workspace).

If the user provides only the checkpoint path (no model name, repo link, or source code), proceed by:
1) Attempting to identify the model name/family from the checkpoint file/dir itself (filenames, adjacent configs/README, embedded metadata, `state_dict` key patterns, etc.).
2) Searching for the implementation in the workspace and/or alongside the checkpoint directory (e.g., nearby Python packages, inference scripts, config files).
3) If still not found, using the best-guess model name/family to search online for the canonical implementation, then cloning the upstream source into `tmp/<experiment-dir>/refs/` for investigation (prefer shallow clone; record URL + commit/tag used).
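
As a concrete starting point, here is a minimal inspection sketch for steps 1 and 2 above, assuming a PyTorch-style checkpoint; the path and key-name heuristics are placeholders, not a definitive recipe:

```
# Sketch only: gather identification hints from a PyTorch-style checkpoint and its
# neighbouring files. Adapt the path and heuristics to the actual checkpoint.
from pathlib import Path

import torch  # assumes PyTorch is already importable in the current environment

ckpt_path = Path("checkpoints/model.pt")  # hypothetical path; use the user-provided one

# Files next to the checkpoint (configs/READMEs) often name the model family.
for sibling in sorted(ckpt_path.parent.iterdir()):
    print("sibling:", sibling.name)

# `weights_only=True` avoids executing arbitrary pickled code; some checkpoints may
# still require a full (less safe) load.
obj = torch.load(ckpt_path, map_location="cpu", weights_only=True)
state_dict = obj.get("state_dict", obj) if isinstance(obj, dict) else obj

if isinstance(state_dict, dict):
    print("sample parameter names:")  # key prefixes often reveal the architecture
    for name in list(state_dict)[:20]:
        print("  ", name)
```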

## Goals

This skill has three goals:

1) Verify that the given DNN model can work (inference or training; default focus is **inference**) in the *current* Python environment of the workspace.
2) Determine how to use it (inference or training; default is **inference**) by reading the upstream source code and producing minimal, reproducible runs.
3) Produce two reports:
   - **Experiment report** (programmatic): generated from `tmp/<experiment-dir>/outputs/` with minimal/no reasoning.
   - **Stakeholder report** (agent-written): generated by the agent from the experiment report + outputs/logs, with deeper analysis and recommendations.

The reports cover:
   - Input and output contracts (formats, shapes, dtypes, preprocessing/postprocessing)
   - Benchmarks and performance profiling (latency/throughput/memory, device details)
   - User-provided metrics/targets (e.g., accuracy, mAP, IoU, F1, latency budget), and whether/how they are met

Before changing anything, detect how the environment is managed by checking for:
- `pixi.toml` and/or `pyproject.toml` (Pixi-managed project)
- `.venv/` (venv-managed project)

## Dependency Policy (Ask Once, Then Apply)

If any dependency is missing:
- Do **not** install it automatically *without user confirmation*.
- List the missing packages (and versions/constraints if known) and ask the developer how to proceed.
- Provide clear options, let the developer choose, then proceed with the chosen approach.
- Once the developer confirms an approach, apply it for **all** newly required packages (no need to ask approval per package).

### Version Strategy

- First attempt: use the **latest versions** resolved by the selected package manager (`pixi`, `pip`, `uv`).
- If that fails (import/runtime errors, incompatibilities): fall back to the **specific versions/constraints** documented by the model’s upstream source code or docs.

### Preferred Options (in order)

**Pixi-managed env**
- Ask the user to choose one:
  - Modify the current Pixi environment by adding deps to the relevant manifest (`pixi.toml` / `pyproject.toml`).
  - Create a new Pixi environment specifically to test this model.
- Then use `pixi install`/`pixi run ...` to execute.
- Prefer **PyPI** packages over **conda-forge** when both are available.
- Avoid direct `pip install ...` into the Pixi environment unless the developer explicitly requests it.

**`.venv`-managed env**
- Ask the user to choose one:
  - Install deps via `pip` (or `uv pip`) into the current `.venv`.
  - Create a new venv specifically for this model (keeps the repo venv clean).

## Inputs to Collect (ask if missing)

- Model name and/or upstream repo link and/or source code path (optional but speeds up identification)
- Model task/modality if unclear (classification/detection/segmentation/embedding/audio/video/etc.)
- Checkpoint path (file/dir) and format (`.pt`, `.pth`, `.onnx`, `.engine`, etc.)
- Any known I/O contract details (expected resolution, channel order, normalization, label mapping), if the user has them
- CPU-only requirement (only if the user explicitly requests CPU-only)
- Optional: user-provided metrics/targets to evaluate (quality and/or performance)

Notes:
- Determine framework/runtime automatically from checkpoint type + upstream code/docs + what’s available in the current Python environment.
- If hardware is unspecified, default to using hardware acceleration when available (CUDA GPU, ROCm GPU, Apple MPS, etc.). Use CPU-only only if the user requested it.
- If unspecified, the default objective is to confirm the model runs end-to-end from input → output (prefer real inputs found in the workspace; synthesize as a fallback) and record end-to-end timing.

## Core Workflow

### 0) Confirm artifacts and pick the target environment

- Confirm the minimum required inputs are present:
  - Checkpoint/model path is accessible locally (file/dir exists). It may be outside the workspace.
  - If model name/repo/source path is not provided, start by inferring it from the checkpoint and nearby files; if needed, locate it online and clone into `tmp/<experiment-dir>/refs/`.
- Detect environment type:
  - If both Pixi and `.venv` exist, ask the user which one should be treated as the “current” environment for this exploration.
- Device default:
  - If the user did not request CPU-only, use hardware acceleration when available (CUDA/ROCm/MPS/etc.).
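
If PyTorch is the runtime, device selection could follow a sketch like the one below (other runtimes expose their own device queries; `pick_device` is a hypothetical helper name):

```
# Sketch (PyTorch assumed): prefer hardware acceleration unless the user asked for CPU-only.
import torch

def pick_device(cpu_only: bool = False) -> str:
    if cpu_only:
        return "cpu"
    if torch.cuda.is_available():  # covers both CUDA and ROCm builds of PyTorch
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"

print("selected device:", pick_device())
```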

### 1) Locate and read the upstream source code/docs

- First try to find the implementation locally:
  - Search the workspace and the checkpoint directory for source code, inference scripts, configs, and docs.
  - Prefer local source if it appears to be the canonical/official implementation for the checkpoint.
- If local source is not available or is clearly incomplete, use online search to find the canonical implementation:
  - Official GitHub repo, paper, model card, or vendor docs.
  - Check out the upstream repo under `tmp/<experiment-dir>/refs/<repo-name>` using a shallow clone (`--depth=1`), pinning a tag/commit when possible.
- From the chosen source (local or checked out), identify:
  - The exact inference entrypoints (scripts/modules), model class, preprocessing, postprocessing, and label mapping.
  - Any config files required to construct the model (YAML/JSON/TOML).
- Do not “guess” preprocessing/postprocessing: confirm from code and/or reference examples.

### 2) Derive required dependencies

Before running the model or changing the environment, determine the minimal dependencies required to run the model by using (in priority order):
- Upstream source code (setup files, `requirements*.txt`, `pyproject.toml`, import graph).
- Upstream docs/model card (pinned versions, known-good combos).
- Checkpoint type (e.g., `.onnx` implies ONNX Runtime; `.pt/.pth` implies PyTorch; `.engine` implies TensorRT).
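
As a rough illustration of the checkpoint-type heuristic (the mapping is indicative only; always confirm against the upstream code/docs):

```
# Illustrative suffix -> runtime hint; real projects may wrap or rename formats.
from pathlib import Path

SUFFIX_TO_RUNTIME = {
    ".pt": "torch",
    ".pth": "torch",
    ".onnx": "onnxruntime",
    ".engine": "tensorrt",
}

def guess_runtime(checkpoint: str) -> str:
    return SUFFIX_TO_RUNTIME.get(Path(checkpoint).suffix.lower(), "unknown")

print(guess_runtime("checkpoints/model.onnx"))  # -> "onnxruntime"
```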

Make a concise dependency list covering:
- Runtime/framework (e.g., `torch`, `onnxruntime`, `opencv-python`)
- Model-specific libs (e.g., `ultralytics`, `timm`, `transformers`, `mmengine`, etc.)
- Utility deps used by the official inference path (e.g., `numpy`, `Pillow`, `pyyaml`)
- Optional acceleration deps (CUDA/TensorRT) separated from the CPU baseline

### 3) Resolve missing dependencies (with user choice)

- Check whether each required dependency is available in the current environment.
- If anything is missing, ask the user which path to take:
  - **Pixi:** modify current manifest to add deps, or create a new Pixi env for this model.
  - **Venv:** install into current `.venv`, or create a new venv for this model.
- After the user confirms, apply the decision for all required packages (no per-package prompts).
- Use the **Version Strategy** above (latest first; fall back to pinned versions if needed).
- After dependency changes, run a quick smoke test:
  - Imports for the core runtime stack
  - Minimal “load model” path (without a full benchmark yet)
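
A minimal smoke-test sketch covering both points, assuming a PyTorch checkpoint (the path is a placeholder):

```
# Smoke test sketch: core imports plus a minimal "load model" path, no benchmarking yet.
import numpy as np
import torch

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("numpy:", np.__version__)

ckpt = torch.load("checkpoints/model.pt", map_location="cpu", weights_only=True)
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
if isinstance(state_dict, dict):
    n_params = sum(t.numel() for t in state_dict.values() if hasattr(t, "numel"))
    print(f"checkpoint loaded: {len(state_dict)} tensors, ~{n_params / 1e6:.1f}M parameters")
```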

### 4) Ensure the checkpoint exists locally

- Do **not** download checkpoints automatically.
- Developers must provide checkpoints/model files (local file/dir paths).
- If the checkpoint is missing or only a URL is provided, ask the developer to download it and provide the local path.
- If the developer wants a conventional location, prefer `checkpoints/` (gitignored).
- Record provenance in a short note (based on what the developer provides):
  - Claimed source URL(s) or repo, version/commit/tag (if known), file size, and (if feasible) SHA256.
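
A sketch of such a provenance note; the experiment path and file name are illustrative, and the claimed-source fields simply echo whatever the developer reports:

```
# Compute size + SHA256 for a developer-provided checkpoint and save a provenance note.
import hashlib
import json
from pathlib import Path

ckpt = Path("checkpoints/model.pt")  # hypothetical path
exp = Path("tmp/<experiment-dir>")   # replace with the real experiment dir

sha256 = hashlib.sha256()
with ckpt.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

provenance = {
    "file": str(ckpt),
    "size_bytes": ckpt.stat().st_size,
    "sha256": sha256.hexdigest(),
    "claimed_source_url": None,      # fill in from the developer's answer
    "claimed_commit_or_tag": None,
}
(exp / "reports").mkdir(parents=True, exist_ok=True)
(exp / "reports" / "checkpoint-provenance.json").write_text(json.dumps(provenance, indent=2))
```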

### 5) Create an experiment workspace under `tmp/`

Default experiment directory:

`<workspace>/tmp/<experiment-slug>-<timestamp>`

If the user specifies a different location/name, use the user-provided one instead.

Create the standard directory layout:

```
tmp/<experiment-dir>/
  README.md     # experiment intent + directory guide (keep updated)
  refs/         # checked-out upstream repos (use shallow clone for online checkouts)
    README.md
  scripts/      # throwaway but reproducible scripts (committed if useful)
    README.md
  inputs/       # downloaded/synthesized test inputs
    README.md
  outputs/      # artifacts + machine-readable stats (e.g., `stats.json`)
    README.md
  logs/         # logs (stdout/stderr, profiling traces, command transcripts)
    README.md
  reports/      # markdown notes: what was tried, params, results
    README.md
    figures/    # images embedded in reports
    experiment-report.md
    stakeholder-report.md
```

Shell safety note (avoid accidental directory names):
- Do **not** use bash brace expansion to create these folders (e.g., `mkdir -p "$exp"/{refs,scripts,...}`), because quoting/spacing mistakes can create literal directories like `{refs,scripts,...}`.
- Prefer a simple loop or explicit `mkdir -p` calls, for example:

  ```
  exp="tmp/<experiment-dir>"
  mkdir -p "$exp"
  for d in refs scripts inputs outputs logs reports reports/figures; do
    mkdir -p "$exp/$d"
  done
  ```

Conventions:
- Use relative paths from `tmp/<experiment-dir>` in scripts so the folder is movable.
- Keep scripts small and single-purpose (`01_download_inputs.py`, `10_infer.py`, `20_visualize.py`, …).
- Run Python via the selected environment manager:
  - Pixi: `pixi run python ...`
  - Venv: use the venv’s Python (avoid system Python)

README requirements:
- Create `tmp/<experiment-dir>/README.md` to describe:
  - The intention of the experiment (what model, what checkpoint, what question you’re answering)
  - How to reproduce (one-line pointer to the primary script(s))
  - A brief map of what each top-level subdir contains
- Each top-level subdir must have its own `README.md` that:
  - Describes what belongs in the folder
  - Notes any important changes (append a short “Changes” section as you iterate)

### 6) Collect or synthesize inputs

- First try to find suitable inputs already present in the workspace (e.g., under `datasets/`, `downloads/`, or other project-specific data dirs) based on what you learned from the checkpoint/source code (task, modality, expected resolution, file types).
- If no suitable inputs exist locally, synthesize minimal inputs that satisfy the model contract (e.g., generated images, random tensors saved in the expected container format, short synthetic video).
- Save all chosen/generated inputs under `tmp/<experiment-dir>/inputs/`.
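
If synthesis is needed, a minimal sketch for an image model might look like this (the resolution, file name, and experiment path are placeholders for the contract learned from the upstream code):

```
# Generate one small synthetic image matching an assumed HxWx3 uint8 contract.
from pathlib import Path

import numpy as np
from PIL import Image

inputs_dir = Path("tmp/<experiment-dir>/inputs")  # replace with the real experiment dir
inputs_dir.mkdir(parents=True, exist_ok=True)

height, width = 640, 640            # illustrative; use the model's expected size
rng = np.random.default_rng(0)      # fixed seed for reproducibility
img = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
Image.fromarray(img).save(inputs_dir / "synthetic_000.png")
```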

### 7) Run minimal, traceable inference experiments (default: inference + end-to-end timing)

- Start with a single known-good example (from the upstream repo) if available.
- Save every “input → output” mapping:
  - Inputs: the exact file(s) used + preprocessing parameters.
  - Outputs: raw model outputs + any decoded/visualized artifacts.
  - Command line + environment notes (device, precision, batch size).
- Measure end-to-end timing by default:
  - At minimum: one cold run + a small number of warm runs (record mean/median).
- Persist stats that will appear in the report:
  - For any timing/profiling/memory/throughput numbers you plan to put into the report, also write a JSON version under `tmp/<experiment-dir>/outputs/` (e.g., `outputs/stats.json`).
- Capture logs by default:
  - Save stdout/stderr and command transcripts under `tmp/<experiment-dir>/logs/`.
- If the model is accessed via HTTP/gRPC, save request/response payloads (sanitized) under `reports/` and/or `outputs/`.
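
A timing-and-stats sketch for the points above; `run_inference` is a placeholder for the model's actual preprocess → forward → postprocess call, and the experiment path is illustrative:

```
# One cold run plus a few warm runs, persisted to outputs/stats.json for the report.
import json
import statistics
import time
from pathlib import Path

def run_inference():
    ...  # placeholder: load input -> preprocess -> model forward -> postprocess

outputs_dir = Path("tmp/<experiment-dir>/outputs")  # replace with the real experiment dir
outputs_dir.mkdir(parents=True, exist_ok=True)

t0 = time.perf_counter()
run_inference()                      # cold run (includes first-call overheads)
cold_s = time.perf_counter() - t0

warm_s = []
for _ in range(5):                   # small number of warm runs
    t0 = time.perf_counter()
    run_inference()
    warm_s.append(time.perf_counter() - t0)

stats = {
    "cold_run_s": cold_s,
    "warm_runs_s": warm_s,
    "warm_mean_s": statistics.mean(warm_s),
    "warm_median_s": statistics.median(warm_s),
}
(outputs_dir / "stats.json").write_text(json.dumps(stats, indent=2))
```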

### 7b) (Optional) Training sanity check

If the user asks to validate training (or if inference alone is not enough to confirm the model works):
- Start with a minimal configuration (single batch / tiny subset) to confirm the forward + backward pass runs.
- Record key configs (optimizer, LR, batch size, mixed precision) and any dataset assumptions.
- Do not run long trainings unless the user explicitly requests it.
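
A minimal sanity-check sketch, assuming a PyTorch model; the tiny linear model and synthetic batch stand in for the real upstream model and data:

```
# Confirm a single forward + backward + optimizer step runs on a tiny batch.
import torch

model = torch.nn.Linear(16, 4)       # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

batch = torch.randn(2, 16)           # tiny synthetic batch
targets = torch.randint(0, 4, (2,))

logits = model(batch)
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()
print("forward + backward + step OK, loss =", loss.item())
```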

### 8) Produce reports

#### 8a) Ensure machine-readable report inputs exist (in `outputs/`)

Write/collect machine-readable files in `tmp/<experiment-dir>/outputs/` that the report generator can consume, at minimum:
- `stats.json` (timing/throughput/memory/profile numbers)
- A JSON describing key parameters used (preprocess/postprocess/runtime thresholds)
- A JSON describing the I/O contract (input expectations + output structure)
- A JSON listing key artifacts produced (paths to representative inputs/outputs)

Keep these JSON files as the source of truth for anything that will appear as “final stats” in the experiment report.

#### 8b) Generate `reports/experiment-report.md` programmatically

- Generate `tmp/<experiment-dir>/reports/experiment-report.md` by reading only `tmp/<experiment-dir>/outputs/` (and optionally `logs/` for pointers), with minimal/no reasoning.
- If images are part of the inputs/outputs, copy representative images into `tmp/<experiment-dir>/reports/figures/` and embed them in the markdown via relative paths (e.g., `figures/<name>.png`).
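
A report-generator sketch; it assumes the `stats.json` and `figures/` conventions above and should be adapted to whatever the experiment actually wrote into `outputs/`:

```
# Build reports/experiment-report.md purely from files under outputs/ and reports/figures/.
import json
from pathlib import Path

exp = Path("tmp/<experiment-dir>")   # replace with the real experiment dir
outputs_dir, reports_dir = exp / "outputs", exp / "reports"

lines = ["# Experiment report", ""]

stats_file = outputs_dir / "stats.json"
if stats_file.exists():
    stats = json.loads(stats_file.read_text())
    lines += ["## Stats", ""]
    lines += [f"- **{key}**: {value}" for key, value in stats.items()]
    lines.append("")

figures = sorted((reports_dir / "figures").glob("*.png"))
if figures:
    lines += ["## Figures", ""]
    lines += [f"![{p.stem}](figures/{p.name})" for p in figures]
    lines.append("")

(reports_dir / "experiment-report.md").write_text("\n".join(lines) + "\n")
```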

#### 8c) Write `reports/stakeholder-report.md` (agent-written)

- Read `reports/experiment-report.md` plus relevant `outputs/` and `logs/`.
- Produce `tmp/<experiment-dir>/reports/stakeholder-report.md` with deeper analysis that requires reasoning:
  - Interpret results vs expectations/targets
  - Call out risks, assumptions, and failure modes
  - Recommend next experiments and concrete integration guidance (if requested)
  - Summarize “go/no-go” criteria and what remains unknown

Also include:
- **Benchmark & profiling** results:
  - CPU/GPU model, RAM/VRAM, OS, Python version, key library versions
  - Latency breakdown if possible (preprocess / model / postprocess)
  - Throughput (items/s) and peak memory/VRAM
- **Stats JSON**:
  - For any stats included in the report, ensure the same values exist in a JSON file under `tmp/<experiment-dir>/outputs/` (e.g., `outputs/stats.json`).
- **User metrics** (if provided):
  - The metric definition + measurement method
  - Results on the chosen evaluation inputs
  - Any deltas vs the user’s targets and suggested next experiments

## Guardrails

- Do not commit large checkpoints or huge outputs; keep them under gitignored paths (`checkpoints/`, `tmp/`).
- Respect upstream licenses; record the repo URL + commit/tag in `reports/`.
- Avoid modifying runtime code under `src/` unless the user explicitly requests integration; keep exploration isolated to `tmp/<experiment-dir>`.
