---
name: fork-intelligence
description: Discover valuable GitHub fork divergence beyond stars. TRIGGERS - fork analysis, fork intelligence, find forks, valuable forks, fork divergence, fork discovery, upstream forks.
allowed-tools: Read, Bash, Grep, Glob
---

# Fork Intelligence

Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.

Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).

> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

## FIRST — TodoWrite Task Templates

**MANDATORY**: Select and load the appropriate template before any fork analysis.

### Template A — Full Analysis (new repository)

```
1. Get upstream baseline (stars, forks, default branch, last push)
2. List all forks with pagination, note timestamp clusters
3. Filter to unique-timestamp forks (skip bulk mirrors)
4. Check default branch divergence (ahead_by/behind_by)
5. Check non-default branches for all forks with recent push or >1 branch
6. Evaluate commit content, author emails, tags/releases
7. Cross-reference upstream PR history from fork owners
8. Tier ranking and cross-fork convergence analysis
9. Produce report with actionable recommendations
```

### Template B — Quick Scan (triage only)

```
1. Get upstream baseline
2. List forks, filter by timestamp clustering
3. Check default branch divergence only
4. Report forks with ahead_by > 0
```

### Template C — Targeted Fork Evaluation (specific fork)

```
1. Compare fork vs upstream on all branches
2. Examine commit messages and changed files
3. Check for tags/releases, open issues, PRs
4. Assess cherry-pick viability
```

---

## Signal Priority Order

Ranked by empirical reliability across 10 repositories. See [signal-priority.md](./references/signal-priority.md) for details.

| Rank | Signal                          | Reliability | What It Catches                                      |
| ---- | ------------------------------- | ----------- | ---------------------------------------------------- |
| 1    | **Branch-level divergence**     | Highest     | Work on feature branches (50%+ of substantive forks) |
| 2    | **Upstream PR cross-reference** | High        | Rebased/force-pushed work invisible to compare API   |
| 3    | **Tags/releases on fork**       | High        | Independent maintenance intent                       |
| 4    | **Commit email domains**        | High        | Institutional contributors (`@company.com`)          |
| 5    | **Timestamp clustering**        | Medium      | Eliminates 85%+ mirror noise                         |
| 6    | **Cross-fork convergence**      | Medium      | Reveals unmet upstream demand                        |
| 7    | **Stars**                       | Lowest      | Often anti-correlated with actual value              |

---

## Pipeline — 7 Steps

### Step 1: Upstream Baseline

```bash
UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'
```

### Step 2: List All Forks + Timestamp Clustering

```bash
# List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'
```

**Timestamp clustering**: Forks sharing exact `pushed_at` with upstream are bulk mirrors created by GitHub's fork mechanism and never touched. Group by `pushed_at` — forks with unique timestamps warrant investigation. This alone eliminates 85%+ of noise.

```bash
# Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'
```

### Step 3: Default Branch Divergence

```bash
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

The `status` field meanings:

- `identical` — pure mirror, skip
- `behind` — stale mirror, skip
- `diverged` — has original commits AND is behind (interesting)
- `ahead` — has original commits, up-to-date with upstream (rare, most valuable)

**Important**: Always compare from the upstream repo's perspective (`repos/UPSTREAM/compare/...`). The reverse direction (`repos/FORK/compare/...`) returns 404 for some repositories.

### Step 4: Non-Default Branch Analysis (CRITICAL)

**This is the single biggest methodology improvement.** Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.

Examples:

- flowsurface/aviu16: 7,000-line GPU shader heatmap only on `shader-heatmap`
- ArcticDB/DerThorsten: 147 commits across `conda_build`, `clang`, `apple_changes`
- pueue/FrancescElies: Duration display only on `cesc/duration`
- barter-rs: 6 of 12 top forks had work only on feature branches

```bash
# List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20

# Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

**Heuristics for which forks need branch checks**:

- Any fork with `pushed_at` more recent than upstream but `ahead_by == 0` on default branch
- Any fork with more than 1 branch
- Branch count > 10 is suspicious — likely non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)

### Step 5: Commit Content Evaluation

```bash
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
  --jq '.commits[] | {sha: .sha[:8], message: .commit.message | split("\n")[0], date: .commit.committer.date[:10], author: .commit.author.email}'
```

**What to look for**:

- Commit email domains reveal institutional contributors (`@man.com`, `@quantstack.net`)
- Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
- Build system changes (`CMakeLists.txt`, `Cargo.toml`, `pyproject.toml`) indicate platform enablement
- Protobuf schema changes indicate architectural-level features
- Test files alongside source changes signal production-intent work

### Step 6: Fork-Specific Signals

```bash
# Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5

# Open issues on the fork (signals independent project maintenance)
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq 'length'

# Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'
```

| Signal                    | Strength                  | Example                                 |
| ------------------------- | ------------------------- | --------------------------------------- |
| Tags/releases on fork     | Highest                   | pueue/freesrz93 had 6 releases          |
| Open PRs against upstream | High                      | Formal proposals with review context    |
| Open issues on the fork   | High                      | Independent project maintenance         |
| Repo renamed              | Medium                    | flowsurface/sinaha81 became volume_flow |
| Build config changes      | High (compiled languages) | Cargo.toml, CMakeLists.txt diff         |
| Description changed       | Weak                      | Many vanity renames with no code        |

### Step 7: Cross-Fork Convergence + Upstream PR History

```bash
# Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```

**Cross-fork convergence**: When multiple forks independently solve the same problem, it signals unmet upstream demand:

- firecrawl: 3 forks adopted Patchright for anti-detection
- flowsurface: 3 forks added technical indicators independently
- kokoro: 2 independent batched inference implementations
- barter-rs: 4 forks added Bybit support

**Upstream PR cross-reference catches**:

- Rebased/force-pushed work invisible to compare API
- Work that was merged upstream (fork shows 0 ahead but was historically significant)
- Declined PRs with valuable code that the fork still maintains

---

## Tier Classification

After running the pipeline, classify forks into tiers:

| Tier                          | Criteria                                                  | Action                                  |
| ----------------------------- | --------------------------------------------------------- | --------------------------------------- |
| **Tier 1: Major Extensions**  | New features, architectural changes, >10 original commits | Deep evaluation, cherry-pick candidates |
| **Tier 2: Targeted Features** | Focused additions, bug fixes, 2-10 commits                | Cherry-pick individual commits          |
| **Tier 3: Infrastructure**    | CI/CD, packaging, deployment, docs                        | Evaluate if relevant to your setup      |
| **Tier 4: Historical**        | Merged upstream or stale but once significant             | Note for context, no action needed      |

---

## Domain-Specific Patterns

Different codebases exhibit different fork behaviors. See [domain-patterns.md](./references/domain-patterns.md) for full details.

| Domain                      | Key Pattern                                                                | Example                                               |
| --------------------------- | -------------------------------------------------------------------------- | ----------------------------------------------------- |
| **Scientific/ML**           | Researchers fork-implement-publish-vanish, zero social engagement          | pymoo: 300-file fork with 0 stars                     |
| **Trading/Finance**         | Exchange connectors dominate; best forks are private                       | barter-rs: 4 independent Bybit impls                  |
| **Infrastructure/DevTools** | Self-hosting/SaaS-removal is the dominant theme                            | firecrawl: devflowinc/firecrawl-simple (630 stars)    |
| **C++/Python Mixed**        | Feature work lives on branches; email domains reveal institutions          | ArcticDB: @man.com, @quantstack.net                   |
| **Node.js Libraries**       | Check npm publication as separate packages                                 | dukascopy-node: kyo06 published `dukascopy-node-plus` |
| **Rust CLI**                | Cargo.toml diff is reliable quick filter; "superset" forks add subcommands | pueue: freesrz93 added 7 subcommands                  |

---

## Quick-Scan Pipeline (5-minute triage)

For rapid triage of any new repo:

```bash
UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# 1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'

# 2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'

# 3. Check ahead_by for each candidate
# (loop over candidates from step 2)

# 4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```

---

## Known Limitations

| Limitation                                      | Impact                                 | Workaround                                                        |
| ----------------------------------------------- | -------------------------------------- | ----------------------------------------------------------------- |
| GitHub compare API 250-commit limit             | Highly divergent forks may truncate    | Use `gh api repos/FORK/commits?per_page=1` to get total count     |
| Private forks invisible                         | Trading firms keep best work private   | Accepted limitation                                               |
| Force-pushed branches break compare API         | Shows 0 ahead despite significant work | Cross-reference upstream PR history                               |
| Renamed forks may break API calls               | Old URLs may 404                       | Use `gh api repos/FORK_OWNER/REPO --jq '.name'` to detect renames |
| Rate limiting on large fork ecosystems          | >1000 forks = many API calls           | Use timestamp clustering to reduce calls by 85%+                  |
| Maintainer dev forks look like independent work | Branch names 1:1 with upstream PRs     | Cross-reference branch names against upstream PR branch names     |

---

## Report Template

Use this structure for the final analysis report:

```markdown
# Fork Analysis Report: OWNER/REPO

**Repository**: OWNER/REPO (N stars, M forks)
**Analysis date**: YYYY-MM-DD

## Fork Landscape Summary

| Metric                                | Value  |
| ------------------------------------- | ------ |
| Total forks                           | N      |
| Pure mirrors                          | N (X%) |
| Divergent forks (ahead on any branch) | N      |
| Substantive forks (meaningful work)   | N      |
| Stars-only miss rate                  | X%     |

## Tiered Ranking

### Tier 1: Major Extensions

(fork details with ahead_by, key features, files changed)

### Tier 2: Targeted Features

...

### Tier 3: Infrastructure/Packaging

...

## Cross-Fork Convergence Patterns

(themes that multiple forks independently implemented)

## Actionable Recommendations

- Cherry-pick candidates
- Feature inspiration
- Security fixes
```

---

## Post-Change Checklist

After modifying THIS skill:

1. [ ] YAML frontmatter valid (no colons in description)
2. [ ] Trigger keywords current in description
3. [ ] All `./references/` links resolve
4. [ ] Pipeline steps numbered consistently
5. [ ] Shell commands tested against a real repository
6. [ ] Append changes to [evolution-log.md](./references/evolution-log.md)


## Post-Execution Reflection

After this skill completes, reflect before closing the task:

0. **Locate yourself.** — Find this SKILL.md's canonical path before editing.
1. **What failed?** — Fix the instruction that caused it.
2. **What worked better than expected?** — Promote to recommended practice.
3. **What drifted?** — Fix any script, reference, or dependency that no longer matches reality.
4. **Log it.** — Evolution-log entry with trigger, fix, and evidence.

Do NOT defer. The next invocation inherits whatever you leave behind.
