---
name: agentv-trace-analyst
description: >-
  Analyze AgentV evaluation traces and result JSONL files using `agentv inspect` and `agentv compare` CLI commands.
  Use when asked to inspect AgentV eval results, find regressions between AgentV evaluation runs,
  identify failure patterns in AgentV trace data, analyze tool trajectories, or compute cost/latency/score statistics
  from AgentV result files.
  Do NOT use for benchmarking skill trigger accuracy, analyzing skill-creator eval performance,
  or measuring skill description quality — those tasks belong to the skill-creator skill.
---

# AgentV Trace Analyst

Analyze evaluation traces headlessly using the `agentv inspect` and `agentv compare` primitives plus `jq`.

## Primitives

```bash
# List result files (most recent first)
agentv inspect list [--limit N] [--format json|table]

# Show results with trace details
agentv inspect show <result-file> [--test-id <id>] [--tree] [--format json|table]

# Percentile statistics
agentv inspect stats <result-file> [--group-by target|suite|test-id] [--format json|table]

# A/B comparison between runs
agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]
```

## Analysis Workflow

### 1. Discover results

```bash
agentv inspect list
```

Pick the result file to analyze. Most recent is first.

### 2. Get overview

```bash
agentv inspect stats <result-file>
```

Read the percentile table. Key signals:
- **score p50 < 0.8**: Significant quality issues
- **latency p90 > 30s**: Performance bottleneck
- **cost p99 spike**: Outlier cost tests to investigate
- **tool_calls p90 >> p50**: Some tests are much chattier
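
The `stats` table is the primary view; as a cross-check, the same signals can be approximated from the raw per-test records. A rough sketch (thresholds and the nearest-rank percentile math here are illustrative):

```bash
# Approximate the key signals from raw records (thresholds are illustrative)
agentv inspect show <result-file> --format json \
  | jq '{
      score_p50: (sort_by(.score) | .[(length/2|floor)].score),
      latency_p90_ms: (sort_by(.duration_ms) | .[((length*0.9)|floor)].duration_ms),
      slow_tests: [.[] | select(.duration_ms > 30000) | .test_id]
    }'
```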

### 3. Investigate failures

```bash
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8)
         | {test_id, score,
            assertions: [.assertions[] | select(.passed | not)],
            trace: {tools: (.trace.tool_calls // {} | keys)},
            duration_ms, cost_usd}]'
```

For each failing test, examine:
- **assertions (failed)**: What criteria were not met? (filter for `passed: false`)
- **trace.tool_calls**: Did the agent use expected tools?
- **duration_ms**: Did it time out or run too long?
- **reasoning**: Why did the grader score it low?
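
To read just the grader's explanation for the low scorers, a narrower query works. This sketch assumes the explanation is exposed as a top-level `reasoning` field, as referenced above; adjust the path if your result schema nests it elsewhere:

```bash
# Grader explanations for low-scoring tests
# (assumes a top-level `reasoning` field; adjust if nested differently)
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8) | {test_id, score, reasoning}]'
```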

### 4. Inspect specific tests

```bash
# Flat view with trace summary
agentv inspect show <result-file> --test-id <id>

# Tree view (if output messages available)
agentv inspect show <result-file> --test-id <id> --tree
```

The tree view shows the agent's execution path — LLM calls interspersed with tool invocations. Look for:
- **Excessive tool calls**: Agent looping or exploring unnecessarily
- **Missing tools**: Expected tool not called
- **Long durations**: Specific tool calls that are slow
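
To pick which tests are worth opening in the tree view, a quick filter on total tool-call counts helps (the threshold of 20 is illustrative):

```bash
# Tests with an unusually high total tool-call count (threshold is illustrative)
agentv inspect show <result-file> --format json \
  | jq '[.[]
         | {test_id, total_tool_calls: ([.trace.tool_calls // {} | .[]] | add // 0)}
         | select(.total_tool_calls > 20)]
       | sort_by(-.total_tool_calls)'
```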

### 5. Compare runs

```bash
agentv compare <baseline.jsonl> <candidate.jsonl>
```

Look for:
- **Wins vs losses**: Net improvement or regression?
- **Mean delta**: Overall direction of change
- **Per-test deltas**: Which tests regressed?
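
If you also want these numbers as JSON for further processing, the `.matched[]` / `.delta` shape used in the regression query later in this skill can be reduced with jq:

```bash
# Wins, losses, and mean delta from a comparison
agentv compare <baseline.jsonl> <candidate.jsonl> --format json \
  | jq '{
      wins:       ([.matched[] | select(.delta > 0)] | length),
      losses:     ([.matched[] | select(.delta < 0)] | length),
      mean_delta: ([.matched[].delta] | add / length)
    }'
```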

### 6. Group analysis

```bash
# By target provider
agentv inspect stats <result-file> --group-by target

# By suite
agentv inspect stats <result-file> --group-by suite
```

Compare providers side-by-side: which is cheaper, faster, more accurate?
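
A hedged sketch of the same side-by-side view computed from the raw records, assuming each record carries a per-test `target` field (the `stats --group-by target` output above is the authoritative view):

```bash
# Per-target averages from raw records
# (assumes a per-record `target` field; prefer `stats --group-by target` when available)
agentv inspect show <result-file> --format json \
  | jq 'group_by(.target) | .[] | {
      target: .[0].target,
      avg_score: ([.[].score] | add / length),
      avg_cost_usd: ([.[].cost_usd] | add / length),
      avg_duration_ms: ([.[].duration_ms] | add / length)
    }'
```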

## Advanced Queries with jq

All commands support `--format json` for piping to `jq`:

```bash
# Top 3 most expensive tests
agentv inspect show <result-file> --format json \
  | jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'

# Tests where token usage exceeds 10k
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'

# Score distribution by suite
agentv inspect show <result-file> --format json \
  | jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'

# Tool usage frequency across all tests
agentv inspect show <result-file> --format json \
  | jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'

# Find regressions > 0.1 between two runs
agentv compare baseline.jsonl candidate.jsonl --format json \
  | jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'
```

## Reasoning Patterns

When analyzing traces, think about the patterns below (hedged jq sketches for patterns 1-4 follow the list):

1. **Efficiency**: Are tool calls/tokens proportional to task complexity? A high tokens-per-tool-call ratio may indicate verbose prompts or unnecessary context.

2. **Error patterns**: Do failures cluster by target, suite, or tool usage? Common patterns:
   - Tool errors → agent can't access required resources
   - High LLM calls with low tool calls → agent stuck in reasoning loop
   - Missing tool calls → wrong tool routing

3. **Cost optimization**: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare `--group-by target` stats.

4. **Latency distribution**: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.

5. **Regression detection**: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
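
For pattern 1, a rough tokens-per-tool-call ranking built from the `token_usage` and `trace.tool_calls` fields used earlier (the top-5 cut is illustrative):

```bash
# Pattern 1: tokens per tool call, highest first
agentv inspect show <result-file> --format json \
  | jq '[.[]
         | {test_id,
            tokens: (.token_usage.input + .token_usage.output),
            tool_calls: ([.trace.tool_calls // {} | .[]] | add // 0)}
         | select(.tool_calls > 0)
         | . + {tokens_per_call: ((.tokens / .tool_calls) | round)}]
       | sort_by(-.tokens_per_call) | .[0:5]'
```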
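
For pattern 2, cluster failures by suite to see where they concentrate (swap `.suite` for another grouping field as needed; the 0.8 cutoff matches the workflow above):

```bash
# Pattern 2: where do failures concentrate?
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.score < 0.8)] | group_by(.suite)
        | .[] | {suite: .[0].suite, failures: length}'
```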
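
For pattern 3, list tests that pass comfortably but cost a lot, which are candidates for a cheaper target (both thresholds are illustrative):

```bash
# Pattern 3: high cost, acceptable score
agentv inspect show <result-file> --format json \
  | jq '[.[] | select(.cost_usd > 0.5 and .score >= 0.8)
         | {test_id, cost_usd, score}] | sort_by(-.cost_usd)'
```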
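
For pattern 4, compute the p50 vs p99 spread from raw durations and pull the slowest tests to investigate; `agentv inspect stats` reports the same percentiles, so this sketch is mainly useful for getting the outlier test IDs alongside:

```bash
# Pattern 4: latency spread plus the slowest tests
agentv inspect show <result-file> --format json \
  | jq '{
      p50_ms: ([.[].duration_ms] | sort | .[(length/2|floor)]),
      p99_ms: ([.[].duration_ms] | sort | .[((length*0.99)|floor)]),
      slowest: (sort_by(-.duration_ms) | .[0:3] | map({test_id, duration_ms}))
    }'
```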
