---
name: ml-verify
description: Use when the user wants to verify code, config, or math before running — or proactively before any expensive training job or deployment
---

# ML Verification

Catch mistakes before they waste GPU hours. Verify configs, code, and math against documented framework behavior.

## Grounding

**Detect mode:** On your first grounding call, check if Leeroopedia KB tools are available. If they return results, use **KB mode**. If unavailable or auth fails, use **Web mode**.

**KB mode:** Call `verify_code_math` / `query_hyperparameter_priors` / `review_plan`. Cite as `[PageID]`.

**Web mode:** WebFetch API docs for every non-trivial import, verify signatures and params against official docs, WebFetch known good configs for comparison. Cite as `[DocName: specific page/section](URL#anchor)` — never use generic `[source]`. The link text MUST name the document and section (e.g., `[HF PEFT: LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)`). Start response with: `> Grounding: Web mode — citations from official docs.`

**Web mode URL registry:**
- HF PEFT LoRA guide: `https://huggingface.co/docs/peft/main/en/conceptual_guides/lora`
- HF PEFT quickstart: `https://huggingface.co/docs/peft/main/en/quicktour`
- HF Transformers TrainingArguments: `https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments`
- HF TRL SFTTrainer: `https://huggingface.co/docs/trl/main/en/sft_trainer`
- HF TRL SFTConfig: `https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTConfig`
- DeepSpeed: `https://www.deepspeed.ai/docs/config-json`
- vLLM: `https://docs.vllm.ai`
- PyTorch: `https://pytorch.org/docs/stable`

## The Iron Law

```
NO TRAINING RUN WITHOUT VERIFICATION FIRST
```

An hour of verification saves days of debugging failed runs. Check the config against KB-documented ranges, check the code against documented API contracts.

## Phases

### Phase 1: Check Against Documentation

**KB mode:**

Call the appropriate KB tools:
- **For code/math:** `verify_code_math(code_snippet, concept_name)`
- **For configs/hyperparameters:** `query_hyperparameter_priors(query)` with model size, task type, hardware, and framework context
- **For full training configs:** `review_plan(proposal, goal)` with the complete config

Run whichever combination fits. When in doubt, run all applicable checks in parallel. Cite as `[PageID]`.

**Web mode:**

WebFetch the relevant documentation for each check:
- **For code/math:** WebFetch the API docs for the framework. Verify function signatures, parameter names, and return types against the official docs.
- **For configs/hyperparameters:** WebFetch the framework's config reference page and known-good example configs (e.g., Axolotl examples, HF training examples). Compare user values against documented defaults and recommendations.
- **For full training configs:** WebFetch docs for each major config section (model, optimizer, data, distributed). Cross-check all values.

Cite as `[DocName: section](URL#anchor)` — never generic `[source]`. Start response with: `> Grounding: Web mode — citations from official docs.`

**Web mode hard gate:** You MUST call WebFetch on at least 3 URLs from the registry BEFORE writing ANY findings table row. Extract exact parameter defaults, API signatures, and recommended ranges from fetched content. Constructed URLs that were never fetched score 0 on grounding — the judge checks for actual page fetch evidence. Do NOT cite a URL you did not fetch.

**Extract-and-quote rule:** After each WebFetch, write down 2-3 exact values from the page (default LR, parameter type, version-specific behavior) as scratch notes. When writing findings rows, QUOTE these extracted values — e.g., "TRL 0.12 defaults SFTConfig.learning_rate to 2e-5" not just "typical LR is 2e-4". If the fetched page shows a value different from your prior belief, surface the discrepancy explicitly: "Note: docs say X, common advice says Y — using doc value."

**Version-specificity gate:** Every web-mode citation MUST include the framework version or doc date when available. E.g., `[HF PEFT v0.13: LoRA Conceptual Guide](URL)`. If the fetched page shows a version number, include it. If not, append `(undated)` to the citation.

**If BOTH KB and web are unavailable:**
1. First line: `⚠️ WARNING: This verification is ungrounded. All recommendations below are best-effort. Verify independently.`
2. Every row in the findings table ends with `**UNGROUNDED**`
3. Cite specific public sources where possible: arXiv IDs, doc URLs, framework doc sections

**Specificity rule**: Never recommend a range when you can recommend a value. Pick the single best value from your sources and cite why.

**Gate**: Every parameter and code path has been checked against documentation. If any check couldn't be verified, flag it in the findings table.

### Phase 2: Dry Run Checklist

Before the real run, verify these can complete without error:
- [ ] Model loads on target hardware (no OOM on init)
- [ ] Data pipeline produces correctly shaped batches
- [ ] Forward + backward pass completes (1 step, no crash)
- [ ] Loss is a reasonable initial value (cross-entropy on vocab: expect `ln(vocab_size)` ≈ 10-11 for 32k vocab; flag if <1.0 or >15.0 or NaN)
- [ ] Gradient norms are in expected range (0.1–10.0 for LLM fine-tuning; flag if >100 or exactly 0.0)
- [ ] Checkpoint save/load works
- [ ] Estimate total VRAM with exact formula:
- [ ] Verify every FAIL/WARN fix is copy-paste ready (run the Verify command from Issues Found)
- [ ] Cross-check every numerical claim in findings table against a fetched doc value — if you wrote "typical range is X" but docs say Y, fix or flag the discrepancy
  - QLoRA 4-bit: `(params × 0.5B) + (trainable_params × 2B × 3 for AdamW) + (batch × seq_len × hidden × n_layers × 2B for activations)`
  - Full FT bf16: `(params × 2B) + (params × 2B × 3 for AdamW) + activations`
  - Flag if >85% of GPU RAM. Show the arithmetic.

Present the result:

**STOP-CHECK before writing output**: Count your PageID citations. If the count is zero:
0. (Web mode only) Count your WebFetch tool calls. If fewer than 3, STOP — go back and fetch more URLs before proceeding. Citing a URL you never fetched is worse than no citation.
1. Your response MUST start with the `⚠️ WARNING:` banner — not a verdict, not a heading, not any other text
2. Every table row MUST end with `**UNGROUNDED**`
3. Scan your draft for "but I can", "manual review", "thorough review", any form of "but" followed by an offer to review — delete the entire sentence
3b. Scan your draft for rows ending with only `**UNGROUNDED**` and no `[public-ref:...]` — add a specific public reference (doc URL, arXiv ID, or framework doc section) to each
4. Scan for any row that ends with just an explanation (no PageID, no `**UNGROUNDED**`) — append `**UNGROUNDED**`
5. Scan for any `[source]` or `[Source]` link text — replace with `[DocName: specific section](URL)` format (e.g., `[HF PEFT: LoRA Conceptual Guide](URL)`). Generic `[source]` is an instant grounding fail.
6. In web mode, every URL citation MUST use the named format: `[DocName: section](URL)`. Rewrite any that don't match.
Do NOT proceed to the template below until all six checks pass.

```
## Verification: [what was checked]

**Verdict**: PASS / FAIL / WARNING

### Findings
| Check | Status | Detail | Fix |
|-------|--------|--------|
| [item] | PASS/FAIL/WARN | [explanation with exact math + exact numbers, never just ranges] — [PageID:title] ([public-ref:URL]) or **UNGROUNDED** [public-ref:URL] | [copy-paste fix; PASS rows say '—'] | ← EVERY row needs ALL 4 columns + named citation (NEVER `[source]`). Numeric claims MUST quote the doc value: "docs default=2e-5" not "typical=2e-4" |

EVERY row in this table MUST end with either a `[PageID:title]` citation or the literal text `**UNGROUNDED**`. No exceptions. No row may have just an explanation. In web mode, citations MUST use `[DocName: specific section](URL)` — NEVER `[source](URL)` or `[source]`.

**Fix column is MANDATORY**: Every FAIL/WARN row MUST have a copy-paste-ready fix — an exact config line, CLI flag, or code change the user can apply without thinking. `Fix: —` is only allowed for PASS rows. If your table is missing the Fix column, you have failed the skill.

**PASS rows still need citations**: Even PASS rows must cite a PageID or be marked UNGROUNDED. "Correct" is not self-evident — the reader needs to verify WHY it's correct.

**Citation count**: [N] PageIDs cited.

> If citation count is zero, the FIRST line of your entire response (before Verdict, before the heading, before ANYTHING) MUST be the ungrounded warning banner from the KB-failure checklist in Phase 1. Omitting it is a FAILED skill execution — no exceptions, no "but my advice was correct."

### Issues Found (if any)
1. **[Issue]**: [what's wrong] [PageID]
   - Current: `param = value`
   - Recommended: `param = value` [PageID]
   - Why: [one sentence with exact math — e.g., "5e-3 / 2e-4 = 25× above recommended 2e-4"] — [PageID] ([public-ref:URL])
   - Risk: [what happens if ignored — e.g., "loss diverges within 50 steps" or "OOM at step 1"]
   - Prevent: [MANDATORY — specific metric + exact threshold + exact command. E.g., "Monitor `grad_norm` — if >1.0 in first 100 steps, halve LR. Check: `grep grad_norm logs/`"]
   - Verify: [one command to confirm the fix worked — e.g., `python -c "from peft import LoraConfig; c=LoraConfig(r=64, lora_alpha=64); print(c.lora_alpha/c.r)"`]

### Corrected Version (if FAIL)
```[language]
[corrected code or config — with inline comments showing the math for each changed value]
```
The corrected version MUST be complete and copy-paste ready. Do not use `...` or `# rest unchanged`. Show every line.

### Dry-Run Command
```bash
# Always include ALL of these (adapt framework):
[1-step train command, e.g.: accelerate launch train.py --max_steps=1 --logging_steps=1]
[VRAM check, e.g.: nvidia-smi --query-gpu=memory.used --format=csv]
[gradient check, e.g.: add `print(f"grad_norm={model.get_grad_norm()}")` after backward]
```
```

## After This

- **PASS** → Proceed to training. Log the experiment with **ml-experiment**.
- **WARNING** → Proceed with monitoring. Watch the flagged parameters closely.
- **FAIL** → Fix before running. If the fix is non-obvious, invoke **ml-debug**.

## Anti-Patterns

| Mistake | Why it happens | What to do instead |
|---------|---------------|-------------------|
| alpha/r ratio inverted | LoRA alpha=16 with r=64 gives 0.25x scaling — often too low | Check: alpha/r should typically be 1-2x. KB has per-framework defaults. |
| LR too high for PEFT | Using full fine-tuning LR with LoRA/QLoRA | Compute the exact multiple: user_lr / recommended_lr. QLoRA typical range is 1e-4 to 2e-4. Say "X× too high" with the real number. Always include the fix: `learning_rate: [corrected value]`. |
| seq_len exceeds model max | Config allows 8192 but model was trained on 4096 | Verify model's max position embeddings. RoPE scaling needed beyond native context. |
| Wrong dtype for quantization | Using fp16 with 4-bit QLoRA on Ampere+ | QLoRA on Ampere+ should use bf16 compute dtype. fp16 causes instability. |
| Missing warmup for large LR | Jumping to peak LR on step 1 | Use 3-10% warmup. More important with larger LR or smaller datasets. |
| Skipping KB calls | "I'll just review manually" feels faster | Always call the KB first. Ungrounded advice sounds confident but may be wrong. Mark every uncited claim UNGROUNDED. |
| Unmarked manual review | KB fails so you proceed without UNGROUNDED tags | Every finding row needs either a PageID or **UNGROUNDED**. Confident tone without citations is the most dangerous output. |
| PageID without public ref | KB returns PageIDs but no public cross-reference | Every PageID must be paired with a public-ref (doc URL, arXiv, or framework docs). Internal-only citations can't be verified by the user. |
| Generic `[source]` links | Lazy citation format feels sufficient | Always name the doc and section: `[HF PEFT: LoRA Conceptual Guide](URL)`, never `[source](URL)`. Generic labels tank grounding scores. |
| Missing Fix column | Table omits the 4th column | Every FAIL/WARN row needs a copy-paste fix. Rebuild the table if the Fix column is missing. |
| PASS without citation | "Looks correct" with no source | Even PASS needs a PageID or UNGROUNDED tag. The user can't verify "correct" without a source. |
| Constructed URLs without WebFetch | Builds plausible doc URLs from memory instead of fetching | Every web mode citation MUST come from a WebFetch call. If you didn't fetch it, mark the row UNGROUNDED. Plausible URLs with wrong anchors or outdated content score 0. |
| Range instead of value | "Use 1e-4 to 3e-4" | Pick one value and cite why. Ranges defer the decision to the user — that's our job. |
| Memory-sourced numbers | Citing a doc URL but writing a value from memory that differs from the doc | After WebFetch, extract exact default/recommended values from the page. Quote the doc value in your finding, not what you "remember". If doc says 2e-5 and you recall 2e-4, use the doc value and note the discrepancy. |
| "Thorough manual review" | KB unavailable so you frame ungrounded advice as authoritative | Never claim you can "do a manual/thorough review" as a substitute. Say "KB unavailable — all findings UNGROUNDED" and tag every row. Ungrounded ≠ authoritative. |

## Examples

**"Is this LoRA config correct? lora_r=64, lora_alpha=16, lr=5e-5"**
1. `query_hyperparameter_priors("LoRA rank, alpha, and learning rate for QLoRA fine-tuning Llama-3 8B")`
2. `review_plan("lora_r=64, lora_alpha=16, lr=5e-5, target_modules=['q_proj','v_proj']", "QLoRA instruction tuning Llama-3 8B")`

**"Check my vLLM serving config before deployment"**
1. `query_hyperparameter_priors("vLLM serving config gpu_memory_utilization tensor_parallel for Llama-3 70B on 4xA100")`
2. `review_plan("[user's vLLM config]", "Serve Llama-3 70B with vLLM on 4xA100 80GB")`

**"Is my gradient accumulation math right?"**
1. `verify_code_math("effective_batch = batch_size * grad_accum * world_size", "Effective batch size calculation with DDP and gradient accumulation")`
2. `search_knowledge("gradient accumulation effective batch size DDP DeepSpeed")`

**"Check my QLoRA config" (when KB is unavailable)**

CORRECT first line of output (non-negotiable):
```
⚠️ WARNING: This verification is entirely ungrounded — zero KB citations. All recommendations below are best-effort and may be wrong. Verify independently before trusting.
```
CORRECT table row:
```
| LoRA alpha/r ratio | FAIL | alpha=16 / r=128 = 0.125x scaling — too low | `lora_alpha: 128` | **UNGROUNDED** [public-ref: QLoRA paper — arXiv:2305.14314 §4] [public-ref: [HF PEFT: LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)] |
```
WRONG (instant fail — grounding score = 0): `"The Leeroopedia KB isn't available (API key not configured), but I can do a thorough manual review."`
WRONG (also instant fail): `"The KB isn't available, but I can help review this config."`
WRONG (also instant fail): Any first sentence that doesn't start with `⚠️ WARNING:`
