---
name: methodology
description: >
  Analyze captured HTTP traffic, design CLI architecture, and implement the Python
  CLI package. Covers Phase 2 of the pipeline: parse raw-traffic.json, identify
  protocol type, map endpoints, design Click command groups, implement with parallel
  subagents.
  TRIGGER when: "analyze traffic", "design CLI", "implement CLI", "build CLI from
  network traffic", "generate API wrapper", "reverse engineer web API", "start Phase 2",
  raw-traffic.json exists and capture is complete, or after the capture skill finishes.
  DO NOT trigger for: traffic recording (use capture), test writing (use testing),
  or quality checks (use standards).
version: 0.2.0
---

# CLI-Anything-Web Methodology (Phase 2)

Analyze captured traffic, design the CLI command structure, and implement the
complete Python CLI package. This skill owns the core transformation from raw
HTTP traffic to a production-ready CLI.

---

## Prerequisites (Hard Gate)

Do NOT start unless:
- [ ] `raw-traffic.json` exists (with WRITE operations, or GET-only traffic for a read-only site)
- [ ] Auth state was captured during Phase 1 (if the site requires auth)

If `raw-traffic.json` is missing, or lacks WRITE operations for a site that supports
writes, invoke the `capture` skill first.

**Exception for read-only sites:** If the site is genuinely read-only (search engine,
dashboard, analytics viewer with no create/update/delete), the trace may contain only
GET requests. In this case, note "read-only site — no write operations" in `<APP>.md`
and proceed. The generated CLI will have read-only commands (list, get, search) but
no create/update/delete commands. This is valid.

**No-auth sites:** If the target site requires no authentication (public API,
no login needed), the "Auth state captured" prerequisite does not apply. Note
"no-auth site" in `<APP>.md` and proceed.

---

## Step A: Analyze (API Discovery)

**Goal:** Map raw traffic to a structured API model.

**Process:**

0. **Read `traffic-analysis.json` first** (if it exists alongside `raw-traffic.json`).
   This file is auto-generated by `parse-trace.py`, or by the `mitmproxy-capture.py` → `analyze-traffic.py` pipeline, and contains
   pre-detected protocol type, auth pattern, endpoint grouping, GraphQL operations,
   batchexecute RPC IDs, and suggested CLI commands. Use it as a starting point —
   verify its findings and fill in anything marked "unknown" by reading `raw-traffic.json`
   manually.

   **Enhanced analysis (v1.3.0, when captured via mitmproxy-capture.py):**
   - `request_sequence`: Timeline-ordered requests with auth flow detection (login → token → API calls)
   - `session_lifecycle`: Cookie inventory, auth cookie identification, session pattern (cookie_auth/token_refresh/no_session)
   - `endpoint_sizes`: Response body size classification per endpoint (small/medium/large) and total data transferred
   These fields are only present when `mitmproxy-capture.py` was used. If missing (`has_timestamps: false`), rely on manual analysis.

   If `traffic-analysis.json` doesn't exist, run the analyzer:
   ```bash
   python ${CLAUDE_PLUGIN_ROOT}/scripts/analyze-traffic.py \
     <app>/traffic-capture/raw-traffic.json --summary
   ```

1. Parse `raw-traffic.json` (for details the analyzer couldn't extract)
2. Group requests by base path (e.g., `/api/v1/boards/`, `/api/v1/items/`)
3. For each endpoint group, identify:
   - HTTP method (GET/POST/PUT/DELETE/PATCH)
   - URL pattern (extract path parameters like `:id`)
   - Query parameters and their types
   - Request body schema (JSON fields, types, required/optional)
   - Response body schema
   - Authentication method (Bearer token, cookie, API key)
   - Rate limiting signals (429 responses, retry-after headers)

4. **Identify the protocol type** -- classify the API transport:

   | Protocol | Detection Signal | Client Pattern |
   |----------|-----------------|----------------|
   | REST | Resource URLs (`/api/v1/boards/:id`), standard HTTP methods | `client.py` with method-per-endpoint |
   | GraphQL | Single `/graphql` endpoint, `query`/`mutation` in body | `client.py` with query templates |
   | gRPC-Web | `application/grpc-web` content type, binary payloads | Proto-based client |
   | Google batchexecute | `batchexecute` in URL, `f.req=` body, `)]}'\n` prefix | `rpc/` subpackage (see `references/google-batchexecute.md`) |
   | Custom RPC | Single endpoint, method name in body, proprietary encoding | Custom codec module |
   | Public REST API | Documented `/api/` endpoints, OpenAPI spec, JSON responses | Standard `client.py` with httpx |
   | Plain HTML (no framework) | No SPA root, no framework globals, data in `<table>`/`<div>` | `client.py` with httpx + BeautifulSoup4 |

   This determines client architecture in Step B -- REST uses simple `client.py`,
   non-REST protocols need a dedicated `rpc/` subpackage with encoder/decoder/types.

5. Detect data model:
   - Entity types (boards, items, users, projects...)
   - Relationships (board has many items, item belongs to board)
   - ID formats (UUID, numeric, slug)

6. Detect auth pattern:
   - Cookie-based sessions
   - Bearer/JWT tokens
   - OAuth refresh flow
   - API key headers
   - Browser-delegated auth: tokens embedded in page JavaScript (e.g., `WIZ_global_data`),
     not in HTTP headers. Requires CDP for initial cookies, HTTP for token extraction.
     See `references/auth-strategies.md` "Browser-Delegated Auth" section.
   - No auth / public access: fully public API, no login required. CLI may
     optionally support API key auth for write operations (e.g., dev.to).

7. Write `<APP>.md` -- software-specific SOP document

**Output:** `<APP>.md` with API map, data model, auth scheme.

**References:** `traffic-patterns.md`, `google-batchexecute.md`, `ssr-patterns.md`
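The endpoint grouping in steps 1–3 can be sketched with the stdlib alone. The `method`/`url` field names below are assumptions for illustration — align them with the actual `raw-traffic.json` schema:

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def group_endpoints(entries):
    """Group captured requests by (method, normalized URL pattern)."""
    groups = defaultdict(list)
    for entry in entries:
        path = urlparse(entry["url"]).path
        # Collapse numeric and UUID-like path segments into :id placeholders
        pattern = re.sub(r"/(\d+|[0-9a-f-]{36})(?=/|$)", "/:id", path)
        groups[(entry["method"], pattern)].append(entry)
    return groups

entries = [
    {"method": "GET", "url": "https://x.test/api/v1/boards/42"},
    {"method": "GET", "url": "https://x.test/api/v1/boards/99"},
    {"method": "POST", "url": "https://x.test/api/v1/boards"},
]
groups = group_endpoints(entries)
# Both GETs collapse into the key ("GET", "/api/v1/boards/:id")
```

Each resulting key is a candidate endpoint group; the entries under it are the samples to mine for query parameters and body schemas.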

---

## Step B: Implement (Code Generation)

### Study Existing CLIs First (Critical for Accuracy)

Before implementing, **read an existing CLI that uses the same protocol** as your
target. These are battle-tested implementations that solved the same problems you'll face.

| Protocol | Reference CLI | Key files to read |
|----------|--------------|-------------------|
| **Google batchexecute** | `notebooklm/agent-harness/cli_web/notebooklm/` | `core/rpc/encoder.py`, `core/rpc/decoder.py`, `core/client.py`, `core/auth.py` |
| **GraphQL + WAF** | `booking/agent-harness/cli_web/booking/` | `core/client.py` (curl_cffi + GraphQL), `core/auth.py` (WAF tokens) |
| **HTML scraping** | `futbin/agent-harness/cli_web/futbin/` | `core/client.py` (httpx + BS4), `commands/players.py` |
| **HTML + Cloudflare** | `producthunt/agent-harness/cli_web/producthunt/` | `core/client.py` (curl_cffi impersonate) |
| **REST API** | `unsplash/agent-harness/cli_web/unsplash/` | `core/client.py`, `commands/photos.py` |
| **Simple HTML** | `gh-trending/agent-harness/cli_web/gh_trending/` | Minimal structure example |

**How to use reference CLIs:**

1. Read the reference CLI's `core/client.py` — understand the request/response pattern
2. Read `core/auth.py` — copy the login_browser() pattern exactly for Google apps
3. Read `core/rpc/` (for batchexecute) — understand encoder/decoder, DO NOT reinvent
4. Read `commands/` — see how Click commands are structured, how --json works
5. Read `utils/helpers.py` — see handle_errors(), _resolve_cli(), repl patterns

**For batchexecute apps specifically**, the notebooklm CLI is your bible:
- Copy the encoder/decoder architecture (don't reinvent the batchexecute wire format)
- Copy the auth token extraction pattern (CSRF, session ID, build label)
- Copy the cookie domain priority logic (critical for Israeli/international users)
- Adapt the RPC method IDs and param structures to your target app
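To see why the decoder should be copied rather than reinvented: batchexecute responses carry the `)]}'` anti-XSSI prefix and wrap each RPC payload as a JSON string inside a `wrb.fr` frame. A simplified decode sketch — real responses may interleave chunk-length lines between frames, so verify against the notebooklm `core/rpc/decoder.py` before relying on this:

```python
import json

ANTI_XSSI_PREFIX = ")]}'"

def decode_batchexecute(raw: str) -> dict:
    """Strip the anti-XSSI prefix and pull RPC payloads out of the envelope."""
    body = raw
    if body.startswith(ANTI_XSSI_PREFIX):
        body = body[len(ANTI_XSSI_PREFIX):]
    # Simplification: assumes a single JSON envelope with no chunk-length
    # lines -- see references/google-batchexecute.md for the full format.
    envelope = json.loads(body)
    results = {}
    for chunk in envelope:
        if chunk and chunk[0] == "wrb.fr":  # RPC response frame
            rpc_id, payload = chunk[1], chunk[2]
            # The inner payload arrives double-encoded as a JSON string
            results[rpc_id] = json.loads(payload) if payload else None
    return results

raw = ')]}\'\n[["wrb.fr","AbCdEf","[\\"ok\\"]"]]'
print(decode_batchexecute(raw))  # {'AbCdEf': ['ok']}
```

If `wrb.fr` strings survive into your CLI output, the decoder stopped at the envelope and never unwrapped the inner payload — the first red flag in the smoke check below.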

The agent implementing the CLI MUST read these files before writing code. Use the
`Agent` tool to dispatch a research agent that reads
the reference implementation while you design the command structure.

### Design Before You Code

Before writing any code, note the command structure in `<APP>.md` (10 minutes max):

- Map each API endpoint group to a Click command group:
  - `/api/v1/boards/*` → `boards` command group
  - `/api/v1/items/*` → `items` command group
- Map CRUD operations to subcommands (GET list → `list`, GET single → `get`,
  POST → `create`, PUT/PATCH → `update`, DELETE → `delete`)
- Note auth design: `auth login`, `auth status`, `auth refresh`; credentials at
  `~/.config/cli-web-<app>/auth.json`
- Note REPL design: bare command enters REPL, branded banner via `repl_skin.py`

**Goal:** Generate the complete Python CLI package.

### Package Structure

See HARNESS.md "Generated CLI Structure" for the complete package template.
Key points: `cli_web/` namespace (NO `__init__.py`), `<app>/` sub-package (HAS `__init__.py`),
`core/`, `commands/`, `utils/`, `tests/` directories.

### Step B.0: Scaffold Core Modules

Run the scaffold generator script to create all boilerplate files:

```bash
python ${CLAUDE_PLUGIN_ROOT}/scripts/scaffold-cli.py <app>/agent-harness \
  --app-name <app> \
  --protocol <rest|graphql|html-scraping|batchexecute> \
  --http-client <httpx|curl_cffi> \
  --auth-type <none|cookie|api-key|google-sso> \
  --resources <comma-separated-resources> \
  [--has-polling] [--has-context] [--has-partial-ids]
```

This generates exceptions.py, client.py skeleton, helpers.py, config.py, output.py,
the CLI entry point with REPL, setup.py, conftest.py, repl_skin.py, and (for
batchexecute) the rpc/ subpackage.

> **Fallback**: If the script is unavailable, read `${CLAUDE_PLUGIN_ROOT}/skills/boilerplate/SKILL.md`
> and follow its instructions to scaffold manually.

After scaffolding, review the generated files and customize `client.py` with actual
endpoint methods from `<APP>.md`.

### Implementation Rules

- **`exceptions.py`** -- implement first. Required types: `AppError` (base), `AuthError` (recoverable), `RateLimitError` (retry_after), `NetworkError`, `ServerError` (status_code), `NotFoundError`. See `references/exception-hierarchy-example.py` for the complete template.

- **`client.py`** -- HTTP client with exception mapping and auth retry:
  - **HTTP library choice:**
    - `httpx` (default) — for most sites (REST, GraphQL, batchexecute)
    - `curl_cffi` — for Cloudflare-protected sites. Uses Chrome TLS fingerprint
      impersonation to bypass bot detection without cookies or auth:
      ```python
      from curl_cffi import requests as curl_requests
      resp = curl_requests.get(url, impersonate="chrome")
      ```
      Use `curl_cffi` when Phase 1 detects Cloudflare (`cf-ray` header, challenge page).
      Add `curl_cffi` and `beautifulsoup4` to `setup.py` instead of `httpx`.
  - Centralized auth header/cookie injection
  - Automatic JSON parsing with response body verification
  - **Status code → exception mapping**: 401/403→`AuthError`, 404→`NotFoundError`, 429→`RateLimitError`, 5xx→`ServerError`
  - **Auth retry (3-attempt auto-refresh)**: On 401/403: attempt 0 = try current cookies, attempt 1 = reload from `auth.json` on disk, attempt 2 = headless browser refresh via `refresh_auth()` in `auth.py`. See HARNESS.md "Token Auto-Refresh" for the full pattern. The `auth.py.tpl` and `client_rest_*.py.tpl` templates generate this by default.
  - Exponential backoff for rate limits (see `references/polling-backoff-example.py`)
  - For apps with 3+ resource types: split into namespaced sub-clients (`client.notebooks.list()`, `client.sources.add()`)
  - See `references/client-architecture-example.py` for the full pattern

- **`auth.py`** -- handles token storage, refresh, expiry. Implementation depends on auth type:

  **For no-auth sites:** DO NOT create `auth.py`, `session.py`, or auth command groups.
  These files are dead code for public APIs and confuse users. The CLI should have
  NO auth-related files or commands. The only exception is if the site has optional
  auth (e.g., API key for write operations) — in that case, implement a minimal
  auth module.

  **For browser-delegated auth (Google, Microsoft, etc.):** Full playwright-cli login flow
  with cookie domain priority for international users.

  See `references/auth-strategies.md` for all patterns (browser login, cookie priority, API key, env var, context commands).
  Store cookies at `~/.config/cli-web-<app>/auth.json` with chmod 600.

- **Anti-bot resilient client construction** (when detected in Phase 2):
  - Extract session tokens via CDP first (cookies), then HTTP GET + HTML parsing (CSRF, session IDs)
  - **Never hardcode** build labels (`bl`), session IDs (`f.sid`), or CSRF tokens -- extract dynamically at runtime
  - Replicate same-origin headers captured during Phase 1 traffic (e.g., `x-same-domain: 1` for Google apps)
  - Implement auto-retry on 401/403: re-fetch homepage -> re-extract tokens -> retry once
  - See `references/google-batchexecute.md` for the complete Google pattern

- **RPC codec subpackage** (for non-REST protocols like batchexecute):
  When the API uses a non-REST protocol, add `core/rpc/` with:
  - `types.py` -- method ID enum, URL constants
  - `encoder.py` -- request encoding (protocol-specific format)
  - `decoder.py` -- response decoding (strip prefix, parse chunks, extract results)
  The `client.py` still exists but delegates encoding/decoding to `rpc/`.

- **Progress feedback** -- Use `rich>=13.0` spinners for operations >2s (suppress in --json mode). See `references/rich-output-example.py`.

- **JSON error output** -- `--json` mode errors are JSON too, not plain text. Standard codes: AUTH_EXPIRED, RATE_LIMITED, NOT_FOUND, SERVER_ERROR, NETWORK_ERROR. Implement via `utils/output.py` json_error().

- **All commands use `handle_errors(json_mode)` context manager** — centralizes error handling, exit codes (1=user, 2=system, 130=interrupt), and JSON errors. See `references/helpers-module-example.py`.

- **Generation commands support `--wait`, `--retry N`, `--output path`** — for agent-scriptable end-to-end workflows. See `references/polling-backoff-example.py`.

- **Windows UTF-8 fix** — Add at the top of `<app>_cli.py` before any imports that print:
  ```python
  import sys
  if sys.stdout.encoding and sys.stdout.encoding.lower() not in ("utf-8", "utf8"):
      try:
          sys.stdout.reconfigure(encoding="utf-8", errors="replace")
      except AttributeError:
          pass
  ```
- **HTML table parsers MUST extract ALL visible columns** — not just name/price,
  because missing fields in `--json` output make the CLI useless for filtering and analysis.
  If the site shows version, club, nation, stats, skills, weak foot — parse all of them.
  Empty fields in `--json` output = incomplete parser.
- Entry point: `cli-web-<app>` via setup.py console_scripts
- Namespace: `cli_web.*`
- Copy `repl_skin.py` from plugin for consistent REPL experience
- **`utils/helpers.py`** -- shared CLI helpers (generate for every CLI):
  - `resolve_partial_id(partial, items)` — prefix-match UUIDs for get/rename/delete
  - `handle_errors(json_mode)` — context manager replacing try/except in all commands
  - `require_notebook(notebook_arg)` — gets notebook ID from arg or persistent context
  - `sanitize_filename(name)` — safe filenames from artifact titles
  - `poll_until_complete(check_fn)` — exponential backoff polling
  - `get_context_value(key)` / `set_context_value(key, value)` — persistent context.json
  See `references/helpers-module-example.py` for the complete module.

> **Not all helpers apply to every CLI.** Include only what the CLI uses:
> `handle_errors` and `print_json` are always needed. `resolve_partial_id` only
> for UUID-based apps. `require_notebook`/context helpers only for apps with
> persistent context. `poll_until_complete` only for generation/async operations.
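The status-code mapping and rate-limit backoff rules above reduce to a pattern like this — a condensed sketch of what `references/exception-hierarchy-example.py` and `references/polling-backoff-example.py` spell out in full (`NetworkError` omitted for brevity):

```python
import time

class AppError(Exception): pass
class AuthError(AppError): pass
class NotFoundError(AppError): pass

class RateLimitError(AppError):
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

class ServerError(AppError):
    def __init__(self, status_code: int):
        super().__init__(f"server error {status_code}")
        self.status_code = status_code

def raise_for_status(status: int, headers: dict) -> None:
    """Map HTTP status codes onto the exception hierarchy."""
    if status in (401, 403):
        raise AuthError("authentication rejected")
    if status == 404:
        raise NotFoundError("resource not found")
    if status == 429:
        raise RateLimitError(float(headers.get("retry-after", 1)))
    if status >= 500:
        raise ServerError(status)

def with_rate_limit_retry(send, max_attempts: int = 3):
    """Retry a request callable on 429 with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(e.retry_after * (2 ** attempt), 30))
```

Commands never see raw status codes — they catch these exceptions inside `handle_errors(json_mode)` and map them to exit codes and JSON error payloads.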

### REPL Implementation Rules (Critical)

These three bugs appear in almost every generated REPL. Get them right the first time:

**1. Use `shlex.split()`, never `line.split()`**

```python
# ✓ Correct — handles quoted args: players search "messi" -> ['players', 'search', 'messi']
import shlex
args = shlex.split(line)

# ✗ Wrong — produces: ['players', 'search', '"messi"'] — quotes become part of the value
args = line.split()
```

**2. Never pass `**ctx.params` to `cli.main()` in REPL dispatch**

```python
# ✓ Correct — preserve --json flag by prepending to args
repl_args = ["--json"] + args if ctx.obj.get("json") else args
cli.main(args=repl_args, standalone_mode=False)

# ✗ Wrong — ctx.params = {"json_mode": False} gets passed to Context.__init__()
# which doesn't accept it → TypeError: Context.__init__() got an unexpected
# keyword argument 'json_mode'
cli.main(args=args, standalone_mode=False, **ctx.params)
```

**3. Keep `_print_repl_help()` in sync with the actual command surface**

The `_print_repl_help()` function in `<app>_cli.py` is the user's first discovery surface — it's what they see when they type `help` in the REPL. It must mirror the real commands, including all key options. A REPL that shows outdated or incomplete help is confusing and makes the CLI feel broken.

```python
# ✓ Correct — help lists actual options users can pass
def _print_repl_help():
    _skin.info("Available commands:")
    print("  players list [OPTIONS]")
    print("    --position <GK|ST|CM|...>    Filter by position")
    print("    --rating-min N --rating-max N  Rating range")
    print("    --cheapest                   Sort cheapest first")

# ✗ Wrong — stale help doesn't mention new --position, --rating-min, etc.
def _print_repl_help():
    print("  players list [--min-price N]   List players with filters")
```

Rule: **every time you add options to a command, update `_print_repl_help()` in the same commit**.


**4. Use `@click.argument` for positional REPL params, not `@click.option("--x", required=True)`**

REPL commands show `players search <query>` in help. If `query` is a `--query` option,
users typing `players search messi` get "Error: Missing option '--query'".
Use positional arguments for natural command-line style:

```python
# ✓ Correct — users type: players search messi  OR  players get 21610
@players.command()
@click.argument("query")
def search(query): ...

@players.command()
@click.argument("player_id", type=int)
def get(player_id): ...

# ✗ Wrong — users get an error unless they type: players search --query messi
@players.command()
@click.option("--query", required=True)
def search(query): ...
```

Rule of thumb: if a command takes a single required value that would be a positional arg
in a shell command (`git checkout main`, `grep pattern`), use `@click.argument`.
Use `@click.option` only for optional or named parameters (`--rating-min`, `--platform`).
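Rules 1 and 2 above combine into a single argument-building helper — a minimal sketch of the tokenize-then-prepend step the REPL loop performs before calling `cli.main(args=..., standalone_mode=False)`:

```python
import shlex

def build_repl_args(line: str, json_mode: bool) -> list[str]:
    """Tokenize a REPL line and re-apply the session's --json flag."""
    args = shlex.split(line)           # handles quoted values correctly
    if json_mode and args and "--json" not in args:
        args = ["--json"] + args       # prepend the flag; never pass **ctx.params
    return args

print(build_repl_args('players search "lionel messi"', json_mode=True))
# ['--json', 'players', 'search', 'lionel messi']
```

Because the `--json` state is re-applied as an argument, the dispatched command goes through Click's normal parsing and the quoted query arrives as a single positional value.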

### Parallel Implementation (dispatch independent modules as subagents)

When the CLI has 3+ command groups (e.g., notebooks, sources, chat, artifacts),
dispatch parallel subagents -- one per command module. Each agent gets:
- The `<APP>.md` API spec for its resource
- The `client.py` and `auth.py` interfaces it depends on
- Clear scope: "Implement `commands/notebooks.py` with list, get, create, delete"

**Parallelization opportunities:**

| Independent from each other | Dispatch in parallel |
|----------------------------|---------------------|
| `commands/notebooks.py`, `commands/sources.py`, `commands/chat.py` | Yes -- each command file only depends on `client.py` |
| `rpc/encoder.py` and `rpc/decoder.py` | Yes -- encoder doesn't depend on decoder |
| `auth.py` and `models.py` | Yes -- no shared logic |
| `client.py` and `commands/*` | **No** -- commands depend on client |
| `<app>_cli.py` (entry point) | **Last** -- imports all commands, write after they're done |

**Implementation order (with maximum parallelism):**

```
Phase A (sequential): Write core foundation
  exceptions.py → client.py → auth.py (if needed) → models.py

Phase B (parallel): Dispatch ALL independent work simultaneously
  ┌─ Agent 1: commands/notebooks.py
  ├─ Agent 2: commands/sources.py
  ├─ Agent 3: commands/chat.py
  ├─ Agent 4: commands/artifacts.py
  ├─ Agent 5: rpc/encoder.py + rpc/decoder.py (if non-REST)
  └─ Agent 6 (background): test_core.py (unit tests for core modules)
  All run concurrently — each only depends on Phase A modules

Phase C (sequential): Wire everything together
  utils/helpers.py → <app>_cli.py → __main__.py → setup.py → copy repl_skin.py
```

**Key parallelism rules:**
- Dispatch independent command modules as parallel subagents (one per `commands/*.py` file)
- Start unit test writing as a background agent during command implementation
- Entry point (`<app>_cli.py`, `setup.py`) must come last (depends on all commands)

---

## Mandatory Smoke Check (Before Testing Phase)

Before invoking testing, install (`pip install -e .`) and verify:
1. `cli-web-<app> --help` loads
2. `cli-web-<app> auth status --json` shows valid (if auth-required)
3. `cli-web-<app> <resource> list --json` returns real data
4. One WRITE command works (if applicable)

**Red flags — fix before testing:**
- `wrb.fr`, `af.httprm` in output → decoder broken
- `[]` or `null` where data expected → wrong params or client-side operation
- Wrong field values (e.g., "3" instead of prompt text) → parser index mismatch
- Null write response → may be client-side, see `references/google-batchexecute.md` "Client-Side Operations"
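The first two red flags can be checked mechanically. A hypothetical helper (not part of the generated package) that scans a command's `--json` output:

```python
import json

DECODER_RESIDUE = ("wrb.fr", "af.httprm")

def smoke_check_json(output: str) -> list[str]:
    """Return a list of red flags found in a command's --json output."""
    flags = [f"decoder residue: {m}" for m in DECODER_RESIDUE if m in output]
    try:
        data = json.loads(output)
    except ValueError:
        return flags + ["output is not valid JSON"]
    if data in ([], None, {}):
        flags.append("empty payload where data was expected")
    return flags

print(smoke_check_json("[]"))  # ['empty payload where data was expected']
```

An empty flag list is not proof of correctness — the value and index checks (wrong field values, parser index mismatch) still need eyeballing against real data.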

Update phase state:
```bash
python ${CLAUDE_PLUGIN_ROOT}/scripts/phase-state.py complete <app> \
  --phase methodology --output <app>/agent-harness/
```

## Next Step

When implementation is complete and the smoke check passes, invoke the `testing`
skill to plan and write tests.

Do NOT skip testing -- every CLI must have comprehensive tests before publishing.

---

## Companion Skills

| Skill | When it activates |
|-------|------------------|
| `capture` | Phase 1 -- traffic recording (prerequisite for this skill) |
| `testing` | Phase 3 -- test writing, documentation |
| `standards` | Phase 4 -- publish, verify, smoke test |

---

## Integration

| Relationship | Skill |
|-------------|-------|
| **Preceded by** | `capture` (Phase 1) |
| **Followed by** | `testing` (Phase 3) |
| **References** | `traffic-patterns.md`, `auth-strategies.md`, `google-batchexecute.md`, `ssr-patterns.md`, `exception-hierarchy-example.py`, `client-architecture-example.py`, `polling-backoff-example.py`, `rich-output-example.py` |

---

## Reference Files

- **`references/traffic-patterns.md`** -- Common API patterns (REST, GraphQL, RPC)
- **`references/auth-strategies.md`** -- Auth implementation strategies
- **`references/google-batchexecute.md`** -- Google batchexecute RPC protocol spec
- **`references/ssr-patterns.md`** -- SSR framework patterns and data extraction strategies
- **`references/exception-hierarchy-example.py`** -- Complete exception hierarchy with HTTP status mapping
- **`references/client-architecture-example.py`** -- Namespaced sub-client pattern with auth retry
- **`references/polling-backoff-example.py`** -- Exponential backoff polling and rate-limit retry
- **`references/rich-output-example.py`** -- Rich progress bars, JSON error responses, table formatting
