---
name: tunnel-doctor
description: Diagnoses and fixes conflicts between Tailscale and proxy/VPN tools (Shadowrocket, Clash, Surge) on macOS. Covers five conflict layers - (1) route hijacking, (2) HTTP proxy env var interception, (3) system proxy bypass, (4) SSH ProxyCommand double tunneling, and (5) VM/container runtime proxy propagation (OrbStack/Docker). Includes SOP for remote development via SSH tunnels with proxy-safe Makefile patterns. Use when Tailscale ping works but SSH/HTTP times out, when browser returns 503 but curl works, when git push fails with "failed to begin relaying via HTTP", when Docker pull times out behind TUN/VPN, when setting up Tailscale SSH to WSL instances, or when bootstrapping remote dev environments over Tailscale.
allowed-tools: Read, Grep, Edit, Bash
---

# Tunnel Doctor

Diagnose and fix conflicts when Tailscale coexists with proxy/VPN tools on macOS, with specific guidance for SSH access to WSL instances.

## Five Conflict Layers

Proxy/VPN tools on macOS create conflicts at five independent layers. Layers 1-3 affect Tailscale connectivity; Layer 4 affects SSH git operations; Layer 5 affects VM/container runtimes:

| Layer | What breaks | What still works | Root cause |
|-------|-------------|------------------|------------|
| 1. Route table | Everything (SSH, curl, browser) | `tailscale ping` | `tun-excluded-routes` adds `en0` route overriding Tailscale utun |
| 2. HTTP env vars | `curl`, Python requests, Node.js fetch | SSH, browser | `http_proxy` set without `NO_PROXY` for Tailscale |
| 3. System proxy (browser) | Browser only (HTTP 503) | SSH, `curl` (both with/without proxy) | Browser uses VPN system proxy; DIRECT rule routes via Wi-Fi, not Tailscale utun |
| 4. SSH ProxyCommand double tunnel | `git push/pull` (intermittent) | `ssh -T` (small data) | `connect -H` creates HTTP CONNECT tunnel redundant with Shadowrocket TUN; landing proxy drops large/long-lived transfers |
| 5. VM/Container proxy propagation | `docker pull`, `docker build` | Host `curl`, running containers | VM runtime (OrbStack/Docker Desktop) auto-injects or caches proxy config; removing proxy makes it worse (VM traffic via TUN → TLS timeout) |

## Diagnostic Workflow

### Step 1: Identify the Symptom

Determine which scenario applies:

- **Browser returns HTTP 503, but `curl` and SSH both work** → System proxy bypass conflict (Step 2C)
- **`local.<domain>` fails in browser/default `curl`, but direct/no-proxy request works** → Local vanity domain proxy interception (Step 2C-1)
- **Tailscale ping works, SSH works, but curl/HTTP times out** → HTTP proxy env var conflict (Step 2A)
- **Tailscale ping works, SSH/TCP times out** → Route conflict (Step 2B)
- **Remote dev server auth redirects to `localhost` → browser can't follow** → SSH tunnel needed (Step 2D)
- **`make status` / scripts curl to localhost fail with proxy** → localhost proxy interception (Step 2E)
- **`git push/pull` fails with `FATAL: failed to begin relaying via HTTP`** → SSH double tunnel (Step 2F)
- **`docker build` `RUN apk/apt` fails with `Connection refused` instantly** → OrbStack transparent proxy + TUN conflict (Step 2G-1, fix: `--network host`)
- **`docker pull` fails with `TLS handshake timeout`** → VM proxy misconfiguration (Step 2G-2, fix: `docker.json` with `host.internal`)
- **Container healthcheck `(unhealthy)` but app runs fine** → Lowercase proxy env var leak (Step 2G-4, fix: clear `http_proxy`+`HTTP_PROXY`)
- **`docker build` can't fetch base images** → VM/container proxy propagation (Step 2G)
- **`git clone` fails with `Connection closed by 198.18.x.x`** → TUN DNS hijack for SSH (Step 2H)
- **SSH connects but `operation not permitted`** → Tailscale SSH config issue (Step 4)
- **SSH connects but `be-child ssh` exits code 1** → WSL snap sandbox issue (Step 5)
- **TCP port 22 reachable (`nc -z` succeeds) but SSH fails with `kex_exchange_identification: Connection closed`** → Tailscale SSH proxy intercept on WSL (Step 5A)
- **`tailscale ssh` returns "not available on App Store builds"** → Wrong Tailscale distribution on macOS (Step 5B)

**Key distinctions**:
- SSH does NOT use `http_proxy`/`NO_PROXY` env vars. If SSH works but HTTP doesn't → Layer 2.
- `curl` uses `http_proxy` env var, NOT the system proxy. Browser uses system proxy (set by VPN). If `curl` works but browser doesn't → Layer 3.
- If `tailscale ping` works but regular `ping` doesn't → Layer 1 (route table corrupted).
- If `ssh -T git@github.com` works but `git push` fails intermittently → Layer 4 (double tunnel).
- If host `curl https://...` works but `docker pull` times out → Layer 5 (VM proxy propagation).
- If `docker pull` works but `docker build` `RUN apk add` fails instantly with `Connection refused` → OrbStack transparent proxy broken by TUN (Step 2G-1).
- If container healthcheck shows `(unhealthy)` but app works → lowercase `http_proxy` leaked into container (Step 2G-4).
- If DNS resolves to `198.18.x.x` virtual IPs → TUN DNS hijack (Step 2H).
- If `nc -z` succeeds on port 22 but SSH gets no banner (`kex_exchange_identification`) → Tailscale SSH proxy intercept (Step 5A). Confirm with `tcpdump -i any port 22` on the remote — 0 packets means Tailscale intercepts above the kernel.
- If `tailscale ssh` fails with "not available on App Store builds" → install Standalone Tailscale (Step 5B).

### Fast Path: Run Automated Checks

For common macOS conflicts (env proxy, system proxy exceptions, direct/proxy path split, local TLS trust), run:

```bash
python3 scripts/quick_diagnose.py --host local.claude4.dev --url https://local.claude4.dev/health
```

Optional route ownership check for a Tailscale destination:

```bash
python3 scripts/quick_diagnose.py --host <target-host> --url http://<target-host>:<port>/health --tailscale-ip <100.x.x.x>
```

Interpretation:
- `direct=PASS` + `forced_proxy=FAIL` = host must bypass proxy (`skip-proxy` + `NO_PROXY`).
- `strict_tls=FAIL` + `direct=PASS` = path is reachable; trust issue only (install/trust local CA).
- `host in scutil exceptions: no` = browser/system clients still likely proxied.

### Step 2A: Fix HTTP Proxy Environment Variables

Check if proxy env vars are intercepting Tailscale HTTP traffic:

```bash
env | grep -i proxy
```

**Broken output** — proxy is set but `NO_PROXY` doesn't exclude Tailscale:
```
http_proxy=http://127.0.0.1:1082
https_proxy=http://127.0.0.1:1082
NO_PROXY=localhost,127.0.0.1          ← Missing Tailscale!
```

**Fix** — add Tailscale MagicDNS domain + CIDR to `NO_PROXY`:

```bash
export NO_PROXY=localhost,127.0.0.1,.ts.net,100.64.0.0/10,192.168.*,10.*,172.16.*
```

| Entry | Covers | Why |
|-------|--------|-----|
| `.ts.net` | MagicDNS domains (`host.tailnet.ts.net`) | Matched before DNS resolution |
| `100.64.0.0/10` | Tailscale IPs (`100.64.*` – `100.127.*`) | Precise CIDR, no public IP false positives |
| `192.168.*,10.*,172.16.*` | RFC 1918 private networks (`172.16.*` matches only the first /16 of `172.16.0.0/12`) | LAN should never be proxied |

**Two layers complement each other**: `.ts.net` handles domain-based access, `100.64.0.0/10` handles direct IP access.
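To make the two-layer matching concrete, here is a minimal Python sketch of how a client that understands both domain suffixes and CIDR blocks might evaluate `NO_PROXY`. The function name `bypasses_proxy` is hypothetical, and real clients differ in the details (glob entries such as `192.168.*` are deliberately not interpreted here), so treat it as an illustration rather than the behavior of any specific tool.

```python
import ipaddress


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` matches an entry in a comma-separated NO_PROXY value.

    Handles exact names, leading-dot domain suffixes, and CIDR blocks.
    """
    name = host.lower()
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None  # host is a DNS name, not an IP literal

    for entry in (e.strip().lower() for e in no_proxy.split(",")):
        if not entry:
            continue
        # CIDR layer: only applies when the URL host is a raw IP
        if addr is not None and "/" in entry:
            try:
                if addr in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                pass  # entry is not a valid CIDR block
        # Name layer: exact match, or suffix match for ".ts.net"-style entries
        if name == entry or (entry.startswith(".") and name.endswith(entry)):
            return True
    return False
```

With `NO_PROXY=localhost,127.0.0.1,.ts.net,100.64.0.0/10`, a MagicDNS name is caught by the suffix rule and a raw `100.x` address by the CIDR rule; dropping either entry leaves one access style proxied.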

**NO_PROXY syntax pitfalls** — see [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) for the compatibility matrix.

**NO_PROXY CIDR caveat**: Not every client honors CIDR notation in `NO_PROXY`. Go's `net/http` (`ProxyFromEnvironment`) has accepted CIDR entries since Go 1.11 and curl since 7.86.0, and Python `requests` accepts them as well. Older binaries, and tools that implement their own proxy handling, silently ignore `100.64.0.0/10` and still send Tailscale traffic through the proxy. If a tool keeps proxying requests to a raw Tailscale IP, use MagicDNS hostnames (e.g., `workstation-4090-wsl`) instead of raw IPs, or add explicit hostnames and IPs to `NO_PROXY`:

```bash
# UNRELIABLE: clients without CIDR support silently ignore this entry
NO_PROXY=100.64.0.0/10 go-program http://100.101.102.103:8002/health  # → may go through proxy

# ROBUST: add the hostname (matched as suffix) or the explicit IP
export NO_PROXY=localhost,127.0.0.1,.ts.net,workstation-4090-wsl,100.101.102.103,192.168.*,10.*,172.16.*
```

This is especially relevant for tools you cannot upgrade or rebuild (e.g., custom CLIs, test suites hitting remote APIs).

Verify the fix:

```bash
# Both must return HTTP 200:
NO_PROXY="...(new value)..." curl -s --connect-timeout 5 http://<host>.ts.net:<port>/health -w "HTTP %{http_code}\n"
NO_PROXY="...(new value)..." curl -s --connect-timeout 5 http://<tailscale-ip>:<port>/health -w "HTTP %{http_code}\n"
```

Then persist in shell config (`~/.zshrc` or `~/.bashrc`).

### Step 2B: Detect Route Conflicts

Check if a proxy tool hijacked the Tailscale CGNAT range:

```bash
route -n get <tailscale-ip>
```

**Healthy output** — traffic goes through Tailscale interface:
```
destination: 100.64.0.0
interface: utun7    # Tailscale interface (utunN varies)
```

**Broken output** — proxy hijacked the route:
```
destination: 100.64.0.0
gateway: 192.168.x.1    # Default gateway
interface: en0           # Physical interface, NOT Tailscale
```

**Important**: Not all `utun` interfaces are Tailscale's. Verify which utun belongs to Tailscale before concluding the route is correct:

```bash
# Find Tailscale's utun interface (has a 100.x.x.x IP)
ifconfig | grep -A2 'inet 100\.'
```

Quick indicators by MTU:
- **MTU 1280** → typically Tailscale
- **MTU 4064** → typically Shadowrocket TUN

If `route -n get` shows traffic going to a utun with MTU 4064, it is hitting Shadowrocket's TUN, not Tailscale — this is still a route conflict even though the interface name starts with `utun`.
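The interface check can also be scripted. The sketch below parses `ifconfig` text and returns every interface holding an address inside `100.64.0.0/10`, together with its MTU. It assumes the macOS `ifconfig` layout (an unindented `name: flags=... mtu N` header followed by indented `inet` lines); the function name `tailscale_interfaces` is hypothetical.

```python
import ipaddress
import re
from typing import List, Tuple

CGNAT = ipaddress.ip_network("100.64.0.0/10")


def tailscale_interfaces(ifconfig_output: str) -> List[Tuple[str, int]]:
    """Return (name, mtu) for interfaces with an address in 100.64.0.0/10."""
    found = []
    name, mtu = None, None
    for line in ifconfig_output.splitlines():
        header = re.match(r"^(\S+): flags=.*\bmtu (\d+)", line)
        if header:
            # New interface block starts here
            name, mtu = header.group(1), int(header.group(2))
            continue
        inet = re.match(r"^\s+inet (\d+\.\d+\.\d+\.\d+)", line)
        if inet and name is not None:
            try:
                address = ipaddress.ip_address(inet.group(1))
            except ValueError:
                continue  # malformed address, skip it
            if address in CGNAT:
                found.append((name, mtu))
    return found
```

Feed it the output of `ifconfig` (for example via `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`). Unlike `grep 'inet 100\.'`, it checks the exact CGNAT range, so a public `100.x` address outside `100.64.0.0/10` is not reported.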

Confirm with full route table:

```bash
netstat -rn | grep 100.64
```

Two competing routes indicate a conflict:
```
100.64/10  192.168.x.1   UGSc  en0       ← Proxy added this (wins)
100.64/10  link#N        UCSI  utun7     ← Tailscale route (loses)
```

**Root cause**: On macOS, `UGSc` (Static Gateway) takes priority over `UCSI` (Cloned Static Interface) for the same prefix length.
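A small parser can flag the competing-route pattern automatically. This is a sketch that assumes the `Destination Gateway Flags Netif` column order shown above; the helper names are hypothetical. It only detects routes that leave via a non-`utun` interface, so the Shadowrocket-utun case still needs the MTU check.

```python
from typing import List, Tuple


def competing_routes(netstat_output: str, prefix: str = "100.64/10") -> List[Tuple[str, str]]:
    """Return (flags, interface) for every route whose destination is `prefix`."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == prefix:
            routes.append((fields[2], fields[3]))
    return routes


def has_route_conflict(netstat_output: str) -> bool:
    """True if the CGNAT prefix is routed through any non-utun interface."""
    return any(
        not iface.startswith("utun")
        for _flags, iface in competing_routes(netstat_output)
    )
```

Pass it the text of `netstat -rn`; a `True` result corresponds to the `en0` line in the broken output above.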

### Step 2C: Fix System Proxy Bypass (Browser 503)

**Symptom**: Browser shows HTTP 503 for `http://<tailscale-ip>:<port>`, but both `curl --noproxy '*'` and `curl` (with proxy env var) return 200. SSH also works.

**Root cause**: The browser uses the system proxy configured by the VPN profile (Shadowrocket/Clash/Surge). The proxy matches `IP-CIDR,100.64.0.0/10,DIRECT` and tries to connect directly — but "directly" means via the Wi-Fi interface (en0), NOT through Tailscale's utun interface. The proxy process itself doesn't have a route to Tailscale IPs, so the connection fails with 503.

**Diagnosis**:

```bash
# curl with proxy env var works (curl connects to proxy port, but traffic flows differently)
curl -s -o /dev/null -w "%{http_code}" http://<tailscale-ip>:<port>/
# → 200

# Browser gets 503 because it goes through the VPN system proxy, not http_proxy env var
```

**Fix** — add Tailscale CGNAT range to `skip-proxy` in the proxy tool config:

For Shadowrocket, in `[General]`:
```
skip-proxy = 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12, 100.64.0.0/10, localhost, *.local, captive.apple.com
```

`skip-proxy` tells the system "bypass the proxy entirely for these addresses." The browser then connects directly through the OS network stack, where Tailscale's routing table correctly handles the traffic.

**Why `skip-proxy` works but `tun-excluded-routes` doesn't**:
- `skip-proxy`: Bypasses the HTTP proxy layer only. Traffic still flows through the TUN interface and Tailscale utun handles it. Safe.
- `tun-excluded-routes`: Removes the CIDR from the TUN routing entirely. This creates a competing `en0` route that overrides Tailscale. Breaks everything.

#### Step 2C-1: Fix Local Vanity Domain Interception (`local.<domain>`)

**Symptom**: `https://local.<domain>` fails in browser or default `curl`, but succeeds with direct/no-proxy command:

```bash
env -u http_proxy -u https_proxy curl -k -I https://local.<domain>/health
# -> 200
curl -I https://local.<domain>/health
# -> proxy CONNECT then TLS reset/failure
```

**Root cause**: The domain is routed through system/shell proxy instead of local direct path.

**Fix**:
1. Add domain to proxy app bypass list (`skip-proxy` for Shadowrocket).
2. Add domain to shell bypass list (`NO_PROXY`/`no_proxy`).
3. If local TLS uses internal CA, trust the local root certificate.

```bash
# ~/.zshrc
export NO_PROXY=localhost,127.0.0.1,.ts.net,100.64.0.0/10,192.168.*,10.*,172.16.*,local.<domain>,www.local.<domain>
export no_proxy="$NO_PROXY"
```

**Verification**:

```bash
python3 scripts/quick_diagnose.py --host local.<domain> --url https://local.<domain>/health
```

Expected:
- `host in NO_PROXY: yes`
- `host in scutil exceptions: yes`
- `ambient=PASS` and `direct=PASS`

### Step 2D: Fix Auth Redirect for Remote Dev (SSH Tunnel)

**Symptom**: Dev server runs on a remote machine (e.g., Mac Mini via Tailscale). You access `http://<tailscale-ip>:3010` in the browser. Login/signup works, but after auth, the app redirects to `http://localhost:3010/` which fails — `localhost` on your machine isn't running the dev server.

**Root cause**: The app's `APP_URL` (or equivalent) is set to `http://localhost:3010`. Auth libraries (Better-Auth, NextAuth, etc.) use this URL for callback redirects. Changing `APP_URL` to the Tailscale IP introduces Shadowrocket proxy conflicts and breaks local development on the remote machine.

**Fix** — SSH local port forwarding. This avoids conflict Layers 1-3 entirely, because the browser only ever talks to `localhost`:

```bash
# Forward local port 3010 to remote machine's localhost:3010
ssh -NL 3010:localhost:3010 <tailscale-ip>

# Or with autossh for auto-reconnect (recommended for long sessions)
autossh -M 0 -f -N -L 3010:localhost:3010 \
    -o "ServerAliveInterval=30" \
    -o "ServerAliveCountMax=3" \
    -o "ExitOnForwardFailure=yes" \
    <tailscale-ip>
```

Now access `http://localhost:3010` in the browser. Auth redirects to `localhost:3010` → tunnel → remote dev server → works correctly.

**Why this is the best approach**:
- No `.env` changes needed — `APP_URL=http://localhost:3010` works everywhere
- No Shadowrocket conflicts — `localhost` is always in `skip-proxy`
- No code changes — same behavior as local development
- Industry standard — VS Code Remote SSH, GitHub Codespaces use the same pattern

**Install autossh**: `brew install autossh` (macOS) or `apt install autossh` (Linux)

**Kill background tunnel**: `pkill -f 'autossh.*<tailscale-ip>'`
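Before opening the browser, it can help to confirm that the forwarded port is actually listening. The following sketch uses only the Python standard library; the function name `port_is_open` is hypothetical, and port `3010` in the usage note simply mirrors the example above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False
```

For example, `port_is_open("127.0.0.1", 3010)` should return `True` once the tunnel is up. Note that this only proves the local end is listening; the remote dev server must still be running for requests to succeed.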

### Step 2E: Fix localhost Proxy Interception in Scripts

**Symptom**: Makefile targets or scripts that `curl` localhost (health checks, warmup routes) fail or timeout when `http_proxy` is set globally in the shell.

**Root cause**: `http_proxy=http://127.0.0.1:1082` is set in `~/.zshrc` but `no_proxy` doesn't include `localhost`. All curl commands send localhost requests through the proxy.

**Fix** — add `--noproxy localhost` to all localhost curl commands in scripts:

```makefile
# WRONG — fails when http_proxy is set
@curl -sf http://localhost:9000/minio/health/live && echo "OK"

# CORRECT — always bypasses proxy for localhost
@curl --noproxy localhost -sf http://localhost:9000/minio/health/live && echo "OK"
```

Alternatively, set `no_proxy` globally in `~/.zshrc`:

```bash
export no_proxy=localhost,127.0.0.1
```
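The same idea applies to Python scripts that shell out to `curl` or use an HTTP client: build the child environment explicitly instead of relying on the user's shell. This is a sketch with a hypothetical helper name, `env_with_local_bypass`; it sets both spellings of the variable because clients disagree on which one they read.

```python
import os
from typing import Dict, Mapping, Optional

LOCAL_BYPASS = ("localhost", "127.0.0.1", "::1")


def env_with_local_bypass(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `env` whose no_proxy/NO_PROXY include the loopback hosts."""
    result = dict(os.environ if env is None else env)
    existing = result.get("no_proxy") or result.get("NO_PROXY") or ""
    entries = [e.strip() for e in existing.split(",") if e.strip()]
    for host in LOCAL_BYPASS:
        if host not in entries:
            entries.append(host)
    # Set both spellings so upper- and lowercase readers agree
    result["no_proxy"] = result["NO_PROXY"] = ",".join(entries)
    return result
```

Usage: `subprocess.run(["curl", "-sf", "http://localhost:9000/minio/health/live"], env=env_with_local_bypass())`. The caller's own environment is left untouched.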

### Step 2F: Fix SSH ProxyCommand Double Tunnel (git push/pull failures)

**Symptom**: `ssh -T git@github.com` succeeds consistently, but `git push` or `git pull` fails intermittently with:

```
FATAL: failed to begin relaying via HTTP.
Connection closed by UNKNOWN port 65535
```

Small operations (auth, fetch metadata) work; large data transfers fail.

**Root cause**: When Shadowrocket TUN is active, it already routes all TCP traffic through its VPN tunnel. If SSH config also uses `ProxyCommand connect -H`, data flows through two proxy layers — the landing proxy drops large/long-lived HTTP CONNECT connections.

**Diagnosis**:

```bash
# 1. Confirm Shadowrocket TUN is active
ifconfig | grep '^utun'

# 2. Check SSH config for ProxyCommand
grep -A5 'Host github.com' ~/.ssh/config

# 3. Confirm: removing ProxyCommand fixes push
GIT_SSH_COMMAND="ssh -o ProxyCommand=none" git push origin main
```

**Fix** — remove ProxyCommand and switch to `ssh.github.com:443`. See [references/proxy_conflict_reference.md § SSH ProxyCommand and Git Operations](references/proxy_conflict_reference.md) for the full SSH config, why port 443 helps, and fallback options when VPN is off.
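To audit an SSH config for leftover `ProxyCommand` entries, a rough parser is usually enough. The sketch below is an assumption-laden simplification: it understands plain `Host` blocks only (no `Match` or `Include`), and the name `hosts_with_proxycommand` is hypothetical.

```python
import re
from typing import Dict


def hosts_with_proxycommand(ssh_config_text: str) -> Dict[str, str]:
    """Map each Host pattern to its ProxyCommand, for blocks that define one."""
    result: Dict[str, str] = {}
    current_hosts = []
    for raw in ssh_config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # ssh_config allows "Keyword value" and "Keyword=value"
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        keyword = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "host":
            current_hosts = value.split()
        elif keyword == "proxycommand" and value.lower() != "none":
            for host in current_hosts:
                result[host] = value
    return result
```

Run it on the contents of `~/.ssh/config`; any entry for `github.com` whose command starts with `connect -H` is a candidate for the double tunnel described above.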

### Step 2G: Fix VM/Container Runtime Proxy Propagation (Docker pull/build failures)

**Symptom**: `docker pull` or `docker build` fails with `net/http: TLS handshake timeout`, `Connection refused` from Alpine/Debian repos, or `Internal Server Error` from `auth.docker.io`, while host `curl` to the same URLs works fine.

**Applies to**: OrbStack, Docker Desktop, or any VM-based Docker runtime on macOS with Shadowrocket/Clash TUN active.

**Root cause**: VM-based Docker runtimes (OrbStack, Docker Desktop) run the Docker daemon inside a lightweight VM. The VM's outbound traffic takes a different network path than host processes:

```
Host process (curl):   Process → TUN (Shadowrocket) → landing proxy → internet ✅
VM process (Docker):   Docker daemon → VM bridge → host network → TUN → ??? ❌
```

The TUN handles host-originated traffic correctly but may drop or delay VM-bridged traffic (different TCP stack, MTU, keepalive behavior).

**Critical distinction: `docker pull` vs `docker build` use different proxy paths**:

| Operation | Proxy source | What controls it |
|-----------|-------------|------------------|
| `docker pull` | Docker daemon config | `~/.orbstack/config/docker.json` (effective value shown by `docker info`) |
| `docker build` (`RUN apt/apk`) | Build container env | `--build-arg http_proxy=...` or `--network host` |
| `docker run` | Container env | `-e http_proxy=...` or inherited from daemon |

Fixing `docker.json` alone will NOT fix `docker build` — the `RUN` commands inside the build container don't inherit daemon proxy settings.

**Diagnosis** — identify which sub-problem:

```bash
# 1. Can the Docker daemon pull images?
docker pull --quiet alpine:latest 2>&1

# 2. Can a RUN command inside a build reach the internet?
docker build --no-cache - <<'EOF' 2>&1
FROM alpine:latest
RUN apk update && echo "APK OK"
EOF

# 3. Can a running container reach the internet?
docker run --rm alpine:latest sh -c "apk update 2>&1 | head -3"
```

**Four sub-problems and their fixes**:

#### 2G-1: `docker build` fails but host works (most common with OrbStack + Shadowrocket)

**Symptom**: `RUN apk add` or `RUN apt-get install` inside `docker build` fails with `Connection refused` instantly (< 0.2s), even though host `curl` to the same URL works.

**Root cause**: OrbStack's `network_proxy: auto` creates a transparent proxy inside the VM that intercepts all HTTPS traffic. When Shadowrocket TUN is also active, the transparent proxy's upstream connection breaks — it redirects HTTPS to `127.0.0.1` inside the VM, which has nothing listening.

**Diagnosis**:

```bash
# Verify: inside the container, HTTPS goes to 127.0.0.1 (broken transparent proxy)
docker run --rm alpine:latest sh -c "wget -q --timeout=5 -O /dev/null https://dl-cdn.alpinelinux.org/ 2>&1"
# → "wget: can't connect to remote host (127.0.0.1): Connection refused"
#                                        ^^^^^^^^^^^^ This is the smoking gun

# Verify: --network host bypasses the VM bridge and works
docker run --rm --network host alpine:latest sh -c "apk update 2>&1 | head -3"
# → "v3.23.x ... OK: 27431 distinct packages available"  ← Works!
```

**Fix** — use `--network host` for docker build:

```bash
docker build --network host -f Dockerfile -t myimage .
```

This bypasses OrbStack's VM network bridge entirely. The build container uses the host's network stack directly, where Shadowrocket TUN correctly handles traffic.

**Trade-off**: `--network host` disables build-time network isolation. For CI/CD, prefer fixing the proxy config (2G-2). For local development, `--network host` is the pragmatic fix.

**Permanent fix** — if all your builds need this, add to `~/.docker/daemon.json` or use a shell alias:

```bash
# Shell alias (add to ~/.zshrc)
alias docker-build='docker build --network host'
```

#### 2G-2: OrbStack auto-detects and caches proxy config

OrbStack's `network_proxy: auto` reads `http_proxy` from the shell environment and configures the Docker daemon. The config is stored in `~/.orbstack/config/docker.json`.

**Key behaviors**:
- `network_proxy: auto` — OrbStack reads host env, creates transparent proxy in VM
- `network_proxy: none` — Disables transparent proxy, but VM bridge traffic still routes through TUN (may timeout)
- `docker.json` — Controls `docker pull` proxy, NOT `docker build` RUN commands

**Diagnosis**:

```bash
# Check all three layers
echo "=== OrbStack config ==="
orbctl config get network_proxy

echo "=== docker.json (daemon proxy) ==="
cat ~/.orbstack/config/docker.json

echo "=== Docker info (effective proxy) ==="
docker info | grep -iE "proxy|No Proxy"
```

**Fix** — configure `docker.json` with `host.internal` (OrbStack resolves this to the host IP):

```bash
python3 -c "
import json, os
config = {
    'proxies': {
        'http-proxy': 'http://host.internal:1082',
        'https-proxy': 'http://host.internal:1082',
        'no-proxy': 'localhost,127.0.0.1,::1,192.168.128.0/24,100.64.0.0/10,host.internal,*.local'
    }
}
path = os.path.expanduser('~/.orbstack/config/docker.json')
json.dump(config, open(path, 'w'), indent=2)
print('Written:', path)
"

# Full restart required
orbctl stop && sleep 3 && orbctl start
```

**Important**: Use `host.internal` (OrbStack-specific), NOT `127.0.0.1` (points to VM loopback) and NOT `host.docker.internal` (may not resolve in all contexts).

**Why NOT remove the proxy**: When TUN is active, removing the Docker proxy means VM traffic goes directly through the bridge → TUN path, which causes TLS handshake timeouts. The proxy provides a working outbound channel.
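The `host.internal` rule is easy to get wrong when editing `docker.json` by hand, so a quick lint can help. The sketch below checks only the three keys used in the example config above; the function name `docker_proxy_problems` and the wording of the messages are hypothetical.

```python
import json
from typing import List
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def docker_proxy_problems(docker_json_text: str) -> List[str]:
    """Return human-readable problems found in a docker.json proxy section."""
    proxies = json.loads(docker_json_text).get("proxies", {})
    problems = []
    for key in ("http-proxy", "https-proxy"):
        url = proxies.get(key)
        if not url:
            problems.append(f"{key} is not set")
            continue
        host = urlsplit(url).hostname
        if host in LOOPBACK_HOSTS:
            # Inside the VM, loopback is the VM itself, not the macOS host
            problems.append(f"{key} points at {host}, which is the VM loopback")
    if not proxies.get("no-proxy"):
        problems.append("no-proxy is empty, so local probes will be proxied")
    return problems
```

An empty list means the file follows the pattern recommended here; it does not prove the proxy port is reachable from the VM.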

#### 2G-3: Removing proxy makes Docker worse (counter-intuitive)

| Docker config | Traffic path | Result |
|---------------|-------------|--------|
| Proxy ON (`127.0.0.1`), no `no-proxy` | Docker → VM proxy → ??? | `docker pull` may work, localhost probes ❌ |
| Proxy ON (`host.internal`), + `no-proxy` | External: Docker → host proxy → internet; Local: direct | **Both work ✅** |
| Proxy OFF (`network_proxy: none`) | Docker → VM bridge → host → TUN → internet | TLS timeout ❌ |
| **`--network host` (build only)** | **Build container → host network → TUN → internet** | **Build works ✅** |

**Decision tree**:
- `docker pull` broken → Fix `docker.json` with `host.internal` proxy (2G-2)
- `docker build` broken → Use `--network host` (2G-1) OR pass `--build-arg http_proxy=http://host.internal:1082`
- Both broken → Fix both: `docker.json` + `--network host`

#### 2G-4: Deploy scripts and container healthchecks probe localhost through proxy

Deploy scripts that `curl localhost` inside containers or Docker healthchecks that use `wget http://localhost` will route through the proxy if env vars leak into the container.

**Common symptoms**:
- Container healthcheck shows `(unhealthy)` but the app inside is running fine
- `wget: can't connect to remote host (127.0.0.1): Connection refused` in healthcheck logs (proxy port, not app port)

**Root cause**: Docker inherits uppercase AND lowercase proxy env vars from the host. Many tools only clear uppercase (`HTTP_PROXY=`) but forget lowercase (`http_proxy=http://127.0.0.1:1082`). The healthcheck `wget` uses lowercase.

**Fix in docker-compose.yml** — clear BOTH cases:

```yaml
environment:
  # Must clear both uppercase and lowercase — wget/curl check different vars
  - HTTP_PROXY=
  - HTTPS_PROXY=
  - http_proxy=
  - https_proxy=
  - NO_PROXY=*
  - no_proxy=*
```

**Fix in deploy scripts**:

```bash
_local_bypass="localhost,127.0.0.1,::1"
export NO_PROXY="${_local_bypass}${NO_PROXY:+,${NO_PROXY}}"
export no_proxy="$NO_PROXY"

# Use 127.0.0.1 instead of localhost in probe URLs (some proxy implementations
# only match exact string "localhost" in no-proxy, not the resolved IP)
curl http://127.0.0.1:3001/health   # ✅ bypasses proxy
curl http://localhost:3001/health    # ❌ may still go through proxy
```

**Verify the fix**:

```bash
# Docker proxy check (should show proxy + no-proxy)
docker info | grep -iE "proxy|No Proxy"

# Pull test
docker pull --quiet hello-world

# Build test (the real verification)
docker build --network host --no-cache - <<'EOF'
FROM alpine:latest
RUN apk update && echo "BUILD OK"
EOF

# Container env check (no proxy leak)
docker exec <container> env | grep -i proxy
# Expected: all empty or not set
```
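The container env check can be automated so that a leak in either letter case fails a deploy script. This is a minimal sketch; `leaked_proxy_vars` is a hypothetical helper that expects the text produced by `docker exec <container> env`.

```python
from typing import Dict

PROXY_VARS = ("http_proxy", "https_proxy", "all_proxy")


def leaked_proxy_vars(env_output: str) -> Dict[str, str]:
    """Return proxy variables (upper- or lowercase) that carry a non-empty value."""
    leaked = {}
    for line in env_output.splitlines():
        name, sep, value = line.partition("=")
        # Compare case-insensitively so http_proxy and HTTP_PROXY are both caught
        if sep and name.lower() in PROXY_VARS and value.strip():
            leaked[name] = value.strip()
    return leaked
```

An empty dict means both spellings are cleared. `NO_PROXY` and `no_proxy` are intentionally ignored, since setting them is part of the fix.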

### Step 2H: Fix TUN DNS Hijack for SSH/Git (198.18.x.x virtual IPs)

**Symptom**: `git clone/fetch/push` fails with `Connection closed by 198.18.0.x port 443`. `ssh -T git@github.com` may also fail. DNS resolution returns `198.18.x.x` addresses instead of real IPs.

**Root cause**: Shadowrocket TUN intercepts all DNS queries and returns virtual IPs in the `198.18.0.0/15` range. It then routes traffic to these virtual IPs through the TUN for protocol-aware proxying. HTTP/HTTPS works because the landing proxy understands these protocols, but SSH-over-443 (used by GitHub) gets mishandled — the TUN sees port 443 traffic, expects HTTPS, and drops the SSH handshake.

**Diagnosis**:

```bash
# DNS returns virtual IP (TUN hijack)
nslookup ssh.github.com
# → 198.18.0.26  ← Shadowrocket virtual IP, NOT real GitHub IP

# Direct IP works (bypasses DNS hijack)
ssh -o HostName=140.82.112.35 -o Port=443 git@github.com
# → "Hi user! You've successfully authenticated"
```
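The virtual-IP check is a plain range test, which makes it easy to script. The sketch below uses the `198.18.0.0/15` range named above; the function names are hypothetical, and `resolves_to_virtual_ip` relies on the system resolver, which is exactly the path the TUN intercepts.

```python
import ipaddress
import socket

TUN_VIRTUAL_RANGE = ipaddress.ip_network("198.18.0.0/15")


def is_tun_virtual_ip(address: str) -> bool:
    """True if `address` falls inside the 198.18.0.0/15 virtual-IP range."""
    return ipaddress.ip_address(address) in TUN_VIRTUAL_RANGE


def resolves_to_virtual_ip(hostname: str) -> bool:
    """True if the system resolver returns a virtual IP for `hostname`."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return any(is_tun_virtual_ip(info[4][0]) for info in infos)
```

If `resolves_to_virtual_ip("ssh.github.com")` returns `True`, SSH to that name will go through the TUN's protocol-aware path.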

**Fix** — use direct IP in SSH config to bypass DNS hijack:

```bash
# ~/.ssh/config
Host github.com
    HostName 140.82.112.35    # GitHub SSH server real IP (bypasses TUN DNS hijack)
    Port 443
    User git
    ServerAliveInterval 60
    ServerAliveCountMax 3
    IdentityFile ~/.ssh/id_ed25519
```

**GitHub SSH server IPs** (as of 2026, verify with `dig +short ssh.github.com @8.8.8.8`):
- `140.82.112.35` (primary)
- `140.82.112.36` (alternate)

**Trade-off**: Hardcoded IPs break if GitHub changes them. Monitor `ssh -T git@github.com`; if it starts failing, update the IP. A cron job can record the currently published IP so you can compare it against your SSH config:

```bash
# Weekly check (add to crontab)
0 9 * * 1 dig +short ssh.github.com @8.8.8.8 | head -1 > /tmp/github-ssh-ip.txt
```

**Alternative** (if you control Shadowrocket rules): Add GitHub SSH IPs to DIRECT rule so TUN passes them through without protocol inspection:

```
IP-CIDR,140.82.112.0/24,DIRECT
IP-CIDR,192.30.252.0/22,DIRECT
```

This is more robust but requires proxy tool config access.

### Step 3: Fix Proxy Tool Configuration

Identify the proxy tool and apply the appropriate fix. See [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) for detailed instructions per tool.

**Key principle**: Do NOT use `tun-excluded-routes` to exclude `100.64.0.0/10`. This causes the proxy to add a `→ en0` route that overrides Tailscale. Instead, let the traffic enter the proxy TUN and use a DIRECT rule to pass it through.

**Universal fix** — add this rule to any proxy tool:
```
IP-CIDR,100.64.0.0/10,DIRECT
IP-CIDR6,fd7a:115c:a1e0::/48,DIRECT
```

After applying fixes, verify:

```bash
route -n get <tailscale-ip>
# Should show Tailscale utun interface, NOT en0
```

### Step 4: Configure Tailscale SSH ACL

If SSH connects but returns `operation not permitted`, the Tailscale ACL may require browser authentication for each connection.

At [Tailscale ACL admin](https://login.tailscale.com/admin/acls), ensure the SSH section uses `"action": "accept"`:

```json
"ssh": [
    {
        "action": "accept",
        "src": ["autogroup:member"],
        "dst": ["autogroup:self"],
        "users": ["autogroup:nonroot", "root"]
    }
]
```

**Note**: `"action": "check"` requires browser authentication each time. Change to `"accept"` for non-interactive SSH access.
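If you keep the policy in version control, a short script can flag rules that would still prompt for browser authentication. This sketch assumes the policy is strict JSON; Tailscale policy files may contain comments or trailing commas, which `json.loads` rejects, so strip those first. The name `interactive_ssh_rules` is hypothetical.

```python
import json
from typing import Any, Dict, List


def interactive_ssh_rules(acl_text: str) -> List[Dict[str, Any]]:
    """Return the SSH rules whose action is "check" (browser re-authentication)."""
    policy = json.loads(acl_text)
    return [rule for rule in policy.get("ssh", []) if rule.get("action") == "check"]
```

An empty result means no SSH rule forces an interactive check.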

### Step 5: Fix WSL Tailscale Installation

If SSH connects and ACL passes but fails with `be-child ssh` exit code 1 in tailscaled logs, the snap-installed Tailscale has sandbox restrictions preventing SSH shell execution.

**Diagnosis** — check WSL tailscaled logs:

```bash
# For snap installs:
sudo journalctl -u snap.tailscale.tailscaled -n 30 --no-pager

# For apt installs:
sudo journalctl -u tailscaled -n 30 --no-pager
```

Look for:
```
access granted to user@example.com as ssh-user "username"
starting non-pty command: [/snap/tailscale/.../tailscaled be-child ssh ...]
Wait: code=1
```

**Fix** — replace snap with apt installation:

```bash
# Remove snap version
sudo snap remove tailscale

# Install apt version
curl -fsSL https://tailscale.com/install.sh | sh

# Start with SSH enabled
sudo tailscale up --ssh
```

**Important**: The new installation may assign a different Tailscale IP. Check with `tailscale status --self`.

### Step 5A: Fix Tailscale SSH Proxy Silent Failure on WSL

**Symptom**: TCP port 22 is reachable (`nc -z -w 5 <ip> 22` succeeds), but SSH fails immediately with:

```
kex_exchange_identification: Connection closed by remote host
```

No SSH banner is ever received. This happens even with apt-installed Tailscale (not snap).

**Root cause**: When `tailscale up --ssh` is enabled on WSL, Tailscale intercepts port 22 connections at the application layer (above the kernel network stack). If Tailscale's built-in SSH proxy malfunctions, it accepts the TCP connection but immediately closes it before sending the SSH banner.

**Key diagnostic** — on the WSL instance:

```bash
# Capture for up to 15 seconds while you retry SSH from the client.
# "0 packets captured" means the connection never reached the kernel.
sudo timeout 15 tcpdump -i any port 22 -c 5 -w /dev/null 2>&1
```

Zero packets means Tailscale is intercepting connections before they reach the kernel network stack. The kernel's `sshd` never sees the connection.

**Distinction from Step 5**: Step 5 covers snap sandbox issues where `be-child ssh` fails. This is a different problem — Tailscale's SSH proxy itself silently fails, regardless of installation method.

**Fix** — disable Tailscale's SSH proxy and use regular sshd:

```bash
# On the WSL instance:
sudo tailscale up --ssh=false

# Verify sshd is running
sudo service ssh status
# If not running:
sudo service ssh start

# Verify from the client machine:
ssh -o ConnectTimeout=10 <user>@<tailscale-ip> 'echo SSH_OK'
```

After disabling Tailscale SSH, connections go through the kernel network stack to `sshd` as normal. The Tailscale ACL `"action": "accept"` in Step 4 is no longer relevant — authentication is handled by `sshd` using SSH keys or passwords.

**When to keep `--ssh` enabled**: Only if you specifically need Tailscale's SSH features (ACL-based access control, no SSH key management). If standard sshd works, prefer `--ssh=false` for reliability.

### Step 5B: Fix App Store Tailscale on macOS (Missing `tailscale ssh`)

**Symptom**: Running `tailscale ssh` returns:

```
The 'tailscale ssh' subcommand is not available on macOS builds
distributed through the App Store or TestFlight.
```

**Root cause**: The App Store version of Tailscale for macOS is sandboxed and does not include the `tailscale ssh` subcommand.

**Fix** — install the Standalone version:

1. Uninstall the App Store version (delete from /Applications)
2. Download the Standalone build from https://pkgs.tailscale.com/stable/#macos
3. Install to /Applications

**Post-install CLI setup**: The standalone `tailscale` CLI binary is embedded inside the app bundle. Add an alias to your shell config:

```bash
# ~/.zshrc
alias tailscale="/Applications/Tailscale.app/Contents/MacOS/Tailscale"
```

Verify:

```bash
source ~/.zshrc
tailscale version
tailscale ssh <user>@<hostname>   # Should work now
```

### Step 6: Verify End-to-End

Run a complete connectivity test:

```bash
# 1. Check route is correct (must show Tailscale's utun, not en0 or Shadowrocket's utun)
route -n get <tailscale-ip>
# Also confirm which utun is Tailscale's:
ifconfig | grep -A2 'inet 100\.'

# 2. Test TCP connectivity
nc -z -w 5 <tailscale-ip> 22

# 3. Test SSH
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no <user>@<tailscale-ip> 'echo SSH_OK && hostname && whoami'
```

All three must pass. If one fails:

- Step 1 fails, or shows the wrong utun (e.g., Shadowrocket's utun with MTU 4064 instead of Tailscale's with MTU 1280): this is a route conflict. Revisit Step 3.
- Step 2 fails: check WSL sshd or the firewall.
- Step 2 passes but step 3 fails with `kex_exchange_identification`: revisit Step 5A (Tailscale SSH proxy intercept).
- Step 3 fails with other errors: revisit Steps 4-5.
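
The route check in step 1 can be scripted rather than read by eye. The sketch below assumes the macOS output formats of `route -n get` and `ifconfig`, and that the only interface with a `100.x` address is Tailscale's; the function names are hypothetical:

```bash
#!/usr/bin/env bash
# Extract the interface name from `route -n get <ip>` output read on stdin.
route_iface() {
  awk '$1 == "interface:" { print $2; exit }'
}

# Print the first interface in `ifconfig` output (stdin) that carries a 100.x address.
# Assumption: that interface is Tailscale's utun.
tailscale_iface() {
  awk '/^[a-z0-9]+:/ { iface = $1; sub(/:$/, "", iface) }
       /inet 100\./  { print iface; exit }'
}

# Usage on macOS (placeholder as in the steps above):
#   want="$(ifconfig | tailscale_iface)"
#   got="$(route -n get <tailscale-ip> | route_iface)"
#   [ "$got" = "$want" ] && echo "route OK ($got)" || echo "route conflict: got $got, expected $want"
```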

## SOP: Remote Development via Tailscale

Proactive setup guide for remote development over Tailscale with proxy tools. Follow these steps **before** encountering problems.

### Prerequisites

- Tailscale installed and running on both machines
- Proxy tool (Shadowrocket/Clash/Surge) configured with Tailscale compatibility (see Step 3 above)
- SSH access working: `ssh <tailscale-ip> 'echo ok'`

### 1. Proxy-Safe Makefile Pattern

Any Makefile target that curls `localhost` must use `--noproxy localhost`. This is required because `http_proxy` is often set globally in `~/.zshrc` (common in China), and Make inherits shell environment variables.

```makefile
## ── Health Checks ─────────────────────────────────────

status:                ## Health check dashboard
	@echo "=== Dev Infrastructure ==="
	@docker exec my-postgres pg_isready -U postgres 2>/dev/null && echo "PostgreSQL: OK" || echo "PostgreSQL: FAIL"
	@curl --noproxy localhost -sf http://localhost:9000/minio/health/live >/dev/null 2>&1 && echo "MinIO: OK" || echo "MinIO: FAIL"
	@curl --noproxy localhost -sf http://localhost:3001/api/status >/dev/null 2>&1 && echo "API: OK" || echo "API: FAIL"

## ── Route Warmup ──────────────────────────────────────

warmup:                ## Pre-compile key routes (run after dev server is ready)
	@echo "Warming up dev server routes..."
	@printf '  /api/health → ' && curl --noproxy localhost -s -o /dev/null -w '%{http_code} (%{time_total}s)\n' http://localhost:3010/api/health
	@printf '  /            → ' && curl --noproxy localhost -s -o /dev/null -w '%{http_code} (%{time_total}s)\n' http://localhost:3010/
	@echo "Warmup complete."
```

**Rules**:
- Every `curl http://localhost` call MUST include `--noproxy localhost`
- Docker commands (`docker exec`) are unaffected by `http_proxy` — no fix needed
- `redis-cli`, `pg_isready` connect via TCP directly — no fix needed
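
The first rule is easy to break when new targets are added. A minimal lint sketch, assuming one `curl` call per line and the `http://localhost` URL form used above; the function name `find_unsafe_curl` is hypothetical:

```bash
#!/usr/bin/env bash
# Print (with line numbers) curl calls to localhost that lack --noproxy.
# Exit status is 0 when at least one unsafe call is found, 1 when the file is clean.
find_unsafe_curl() {
  grep -nE 'curl .*https?://localhost' "$1" | grep -v -- '--noproxy'
}

# Usage:
#   find_unsafe_curl Makefile && echo "Add --noproxy localhost to the lines above"
```

This only checks that `--noproxy` appears on the same line, so a call split across lines with `\` would need a manual look.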

### 2. SSH Tunnel Makefile Targets

Add these targets for remote development via Tailscale SSH tunnels:

```makefile
## ── Remote Development ────────────────────────────────

REMOTE_HOST    ?= <tailscale-ip>
TUNNEL_FORWARD ?= -L 3010:localhost:3010

tunnel:                ## SSH tunnel to remote machine (foreground)
	ssh -N $(TUNNEL_FORWARD) $(REMOTE_HOST)

tunnel-bg:             ## SSH tunnel to remote machine (background, auto-reconnect)
	autossh -M 0 -f -N $(TUNNEL_FORWARD) \
		-o "ServerAliveInterval=30" \
		-o "ServerAliveCountMax=3" \
		-o "ExitOnForwardFailure=yes" \
		$(REMOTE_HOST)
	@echo "Tunnel running in background. Kill with: pkill -f 'autossh.*$(REMOTE_HOST)'"
```

**Design decisions**:

| Choice | Rationale |
|--------|-----------|
| `?=` (conditional assign) | Allows override: `make tunnel REMOTE_HOST=100.x.x.x` |
| `TUNNEL_FORWARD` as variable | Supports multi-port: `make tunnel TUNNEL_FORWARD="-L 3010:localhost:3010 -L 9000:localhost:9000"` |
| `autossh -M 0` | Disables autossh's own monitoring port; relies on `ServerAliveInterval` instead (more reliable through NAT) |
| `ExitOnForwardFailure=yes` | Fails immediately if port is already bound, instead of silently running without tunnel |
| Kill hint uses `autossh.*$(REMOTE_HOST)` | Precise pattern — won't accidentally kill other SSH sessions |

**Install autossh**: `brew install autossh` (macOS) or `sudo apt install autossh` (Linux/WSL)

### 3. Multi-Port Tunnels

When the project requires multiple services (dev server + object storage + API gateway):

```bash
# Forward multiple ports in one tunnel
make tunnel TUNNEL_FORWARD="-L 3010:localhost:3010 -L 9000:localhost:9000 -L 3001:localhost:3001"

# Or define a project-specific default in Makefile
TUNNEL_FORWARD ?= -L 3010:localhost:3010 -L 9000:localhost:9000
```

Each `-L` flag is independent. If one port is already bound locally, `ExitOnForwardFailure=yes` will abort the entire tunnel — fix the port conflict first.
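
When the local and remote ports are the same, as in the examples above, the `-L` flags can be generated from a port list. A small sketch; the helper name `build_forwards` is hypothetical:

```bash
#!/usr/bin/env bash
# Build "-L <port>:localhost:<port>" for every port passed as an argument.
build_forwards() {
  local flags="" port
  for port in "$@"; do
    # Add a separating space only once flags is non-empty.
    flags="${flags:+$flags }-L ${port}:localhost:${port}"
  done
  printf '%s\n' "$flags"
}

# Usage:
#   make tunnel TUNNEL_FORWARD="$(build_forwards 3010 9000 3001)"
```

To see which process holds a port before starting the tunnel, `lsof -nP -iTCP:3010 -sTCP:LISTEN` lists the listener.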

### 4. SSH Non-Login Shell Setup

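**This is a frequent source of "it works interactively but fails in scripts" bugs.** `ssh user@host "command"` runs the command in a non-interactive, non-login shell. zsh reads `~/.zshrc` only for interactive shells, and the stock `~/.bashrc` on Debian/Ubuntu (including WSL) returns early when the shell is non-interactive, so tools installed via nvm, Homebrew, uv, cargo, or any shell-level manager won't be in `$PATH`. Proxy env vars set in `~/.zshrc` also won't be loaded.

This affects **all** remote commands run via `ssh user@host "command"`, including CI/CD pipelines, cron-triggered SSH, and Makefile remote targets. Prefix all remote commands with `source ~/.zshrc 2>/dev/null;` (macOS) or `source ~/.bashrc 2>/dev/null;` (Linux/WSL). On a stock Debian/Ubuntu `~/.bashrc`, the non-interactive guard near the top also stops an explicit `source`, so keep the required `PATH` exports above that guard.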

**Common failure**: `ssh user@host "uv run ..."` or `ssh user@host "node ..."` returns `command not found` even though the command works in an interactive SSH session.

See [references/proxy_conflict_reference.md § SSH Non-Login Shell Pitfall](references/proxy_conflict_reference.md) for details and examples.

For Makefile targets that run remote commands:

```makefile
REMOTE_CMD = ssh $(REMOTE_HOST) 'source ~/.zshrc 2>/dev/null; $(1)'

remote-status:         ## Check remote dev server status
	$(call REMOTE_CMD,curl --noproxy localhost -sf http://localhost:3010/api/health && echo "OK" || echo "FAIL")
```
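
Outside Make, the same prefix can be wrapped in a shell function. This is a sketch under assumptions: `remote_wrap`, `remote_run`, and the `REMOTE_RC` variable are hypothetical names, and the rc file path is left unexpanded so that the remote shell resolves `~`:

```bash
#!/usr/bin/env bash
# Build the remote command line with the rc file sourced first.
# REMOTE_RC defaults to ~/.zshrc (macOS remote); set REMOTE_RC='~/.bashrc' for Linux/WSL.
remote_wrap() {
  local rc="${REMOTE_RC:-}"
  [ -n "$rc" ] || rc='~/.zshrc'   # literal tilde, expanded on the remote side
  printf 'source %s 2>/dev/null; %s\n' "$rc" "$*"
}

# Run a command on the remote host through the wrapper.
remote_run() {
  local host="$1"
  shift
  ssh "$host" "$(remote_wrap "$@")"
}

# Usage:
#   remote_run <tailscale-ip> uv run main.py
#   REMOTE_RC='~/.bashrc' remote_run <tailscale-ip> node --version
```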

### 5. End-to-End Workflow

#### First-time setup (remote machine)

```bash
# 1. Clone repo and install dependencies
ssh <tailscale-ip>
cd /path/to/project
git clone git@github.com:user/repo.git && cd repo
pnpm install  # Add --registry https://registry.npmmirror.com if in China

# 2. Copy .env from local machine (run on local)
scp .env <tailscale-ip>:/path/to/project/repo/.env

# 3. Start Docker infrastructure (steps 3-5 run on the remote, inside the repo)
make up && make status

# 4. Run database migrations
bun run db:migrate

# 5. Start dev server
bun run dev
```

#### Daily workflow (local machine)

```bash
# 1. Start tunnel
make tunnel-bg

# 2. Open browser
open http://localhost:3010

# 3. Auth, coding, testing — everything works as if local

# 4. When done, kill tunnel
pkill -f 'autossh.*<tailscale-ip>'
```

#### Why this works

```
Browser → localhost:3010 → SSH tunnel → Remote localhost:3010 → Dev server
                                     ↓
                              Auth redirects to localhost:3010
                                     ↓
                              Browser follows redirect → same tunnel → works
```

The key insight: `APP_URL=http://localhost:3010` in `.env` is correct for **both** local and remote development. The SSH tunnel makes the remote server's localhost accessible as the local machine's localhost. Auth callback redirects to `localhost:3010` always resolve correctly.

### 6. Checklist

Before starting remote development, verify:

- [ ] Tailscale connected: `tailscale status`
- [ ] SSH works: `ssh <tailscale-ip> 'echo ok'`
- [ ] Proxy tool configured: `[Rule]` has `IP-CIDR,100.64.0.0/10,DIRECT`
- [ ] `skip-proxy` includes `100.64.0.0/10`
- [ ] `tun-excluded-routes` does NOT include `100.64.0.0/10`
- [ ] `NO_PROXY` includes `.ts.net,100.64.0.0/10`
- [ ] `autossh` installed: `which autossh`
- [ ] Makefile curl commands have `--noproxy localhost`
- [ ] Remote dev server running: `ssh <tailscale-ip> 'source ~/.zshrc 2>/dev/null; curl --noproxy localhost -sf http://localhost:3010/'`
- [ ] Tunnel works: `make tunnel-bg && curl --noproxy localhost -sf http://localhost:3010/`
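
Parts of this checklist can be run as a script. The sketch below covers only a generic pass/fail wrapper and the `NO_PROXY` item; `check` and `no_proxy_covers_tailscale` are hypothetical helper names, and the `NO_PROXY` test assumes a comma-separated value without spaces:

```bash
#!/usr/bin/env bash
# Run a command quietly and print a pass/fail line for it.
# Usage: check "label" command [args...]
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf '[ok]   %s\n' "$label"
  else
    printf '[FAIL] %s\n' "$label"
    return 1
  fi
}

# True if NO_PROXY (or no_proxy) contains both entries from the checklist.
no_proxy_covers_tailscale() {
  local value="${NO_PROXY:-${no_proxy:-}}"
  case ",$value," in *,.ts.net,*) ;; *) return 1 ;; esac
  case ",$value," in *,100.64.0.0/10,*) ;; *) return 1 ;; esac
}

# Usage (placeholder as in the checklist):
#   check "SSH works"                 ssh -o ConnectTimeout=10 <tailscale-ip> true
#   check "NO_PROXY covers Tailscale" no_proxy_covers_tailscale
#   check "autossh installed"         command -v autossh
```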

## References

- [references/proxy_conflict_reference.md](references/proxy_conflict_reference.md) — Per-tool configuration (Shadowrocket, Clash, Surge), NO_PROXY syntax, SSH ProxyCommand, and conflict architecture
