How I indexed 69,000 Claude Code skills (and what I learned doing it)

If you've never written a Claude Code skill: it's a Markdown file with YAML frontmatter that gives Anthropic's Claude Code agent specialized behavior. Drop it in ~/.claude/skills/<name>/SKILL.md and Claude can invoke it as a slash command. Think of it like a Vim plugin or a VSCode extension, except the contract is "instructions in English" rather than "code in Lua / TypeScript."

The format is brand-new. The official spec doesn't ship a catalog. The awesome-* lists I could find at the time covered maybe 300 hand-picked entries. Meanwhile, GitHub's code search showed thousands of public repos with SKILL.md files in them. The long tail of the ecosystem was completely invisible. That's the gap I set out to close.

The shape of the problem

Here's what I knew going in:

Discovery was broken. A skill author would push their SKILL.md to GitHub and ... nothing. No directory, no aggregator, no search surface. The only way another developer found it was Twitter, Discord, or stumbling onto the repo.
Quality varied wildly. Some skills were 200-line operator-grade tools with pricing tables, anti-trigger sections, and structured examples. Others were 4-line stubs that read like "TODO: write a skill that does X." Both were indexable; neither was distinguishable from the outside.
The format itself was changing fast. The frontmatter spec gained fields monthly — allowed-tools, user-invokable, model, metadata.api_base. Yesterday's "good" SKILL.md could be tomorrow's missing-required-field.
There was no good API surface. If you wanted to build something on top of the skill ecosystem (a tool for evaluating skills, a recommender, an installer), you had to scrape GitHub yourself.

I wanted a catalog that fixed all four. Open data, daily refresh, free API, free dataset. No pay-to-list, no listing fees, no ranking-for-money. The only paid product would be an evaluation layer for end-users (a quality score in the desktop app), never anything skill authors had to opt into. Anti-rent-seeking by construction.

The miner — 24 sources, every night

The catalog is built by a single Python script that runs nightly at 01:00 local time on a Mac mini in my office. It crawls 24 public sources looking for SKILL.md files:

Source	What it discovers
GitHub code search (`filename:SKILL.md`)	The bulk of the catalog — 101 query variants covering language hints, frontmatter fields, and date-bounded slices to defeat the 1000-result hard cap
GitHub Topics (`topic:claude-code-skills`) + 31 variants	Topic-tagged repos
GitHub Gists	Single-file skills posted as gists (most catalogs miss these)
Awesome-list READMEs (32 lists)	Anything the existing curators picked
GitLab, Codeberg	Skills outside GitHub
HuggingFace	Skills uploaded as datasets
Reddit, HackerNews, Bluesky, Mastodon, dev.to, YouTube, Telegram	Mentions in posts/comments — text-blob scan for repo URLs
Wayback Machine CDX API	Renamed / deleted repos still discoverable via archive.org
Stargazer graph mining	Once we find one good skill repo, mine who starred it — they often have skills too
Author repo enumeration	When we admit one of an author's skills, scan their other repos
Topic co-occurrence	Topics tagged alongside `claude-code-skills` get crawled for next run
VSCode + Open VSX marketplaces	Some extensions ship with SKILL.md companions
Brave Search API	Web-search-anchored discovery
LLM query expansion	Claude generates next week's search queries based on what's been found
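
The date-bounded slicing mentioned in the first row can be sketched roughly as follows. This is an illustration under stated assumptions, not the miner's actual code: the function names are hypothetical, and which search qualifier the resulting range gets attached to depends on what the search endpoint in question accepts.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping (lo, hi) windows covering start..end inclusive."""
    if days < 1:
        raise ValueError("days must be >= 1")
    lo = start
    while lo <= end:
        # Clamp the last window so it never runs past the end date.
        hi = min(lo + timedelta(days=days - 1), end)
        yield lo, hi
        lo = hi + timedelta(days=1)


def range_expr(lo: date, hi: date) -> str:
    """Format a window as an ISO date range such as 2025-01-01..2025-01-07."""
    return f"{lo.isoformat()}..{hi.isoformat()}"


# Hypothetical example: split twenty days into weekly slices.
windows = list(date_windows(date(2025, 1, 1), date(2025, 1, 20), days=7))
for lo, hi in windows:
    print(range_expr(lo, hi))
```

Each slice is queried separately, so a result cap applies per window rather than to the whole date range. If a single window still hits the cap, the same function can be called again on that window with a smaller `days` value.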

Each source returns candidate repo URLs. The miner fetches the SKILL.md, validates the YAML frontmatter, runs admission scoring (more on this below), categorizes by domain (Engineering / Security / Growth / etc. — 10 categories total), tags across ~100 orthogonal dimensions (language, framework, AI provider, cloud, integration type), and writes a static HTML page at /skills/<slug>/.
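
To make the fetch-and-validate step concrete, here is a minimal sketch of splitting a SKILL.md into frontmatter and body and deriving a page path. The helper names, the "name and description are required" rule, and the slug rule are assumptions for illustration. The parser only reads flat `key: value` lines, where a real pipeline would hand the block to a full YAML parser.

```python
import re
from typing import Dict, Optional, Tuple

# Frontmatter is the block between the first two '---' lines of the file.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?(.*)\Z", re.DOTALL)


def split_frontmatter(text: str) -> Optional[Tuple[Dict[str, str], str]]:
    """Return (fields, body), or None when the file has no frontmatter block."""
    match = FRONTMATTER_RE.match(text)
    if match is None:
        return None
    fields: Dict[str, str] = {}
    for line in match.group(1).splitlines():
        # Simplification: skip nested, indented, and comment lines.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")
    return fields, match.group(2)


def is_valid(fields: Dict[str, str]) -> bool:
    """Assumed minimum bar: non-empty name and description."""
    return bool(fields.get("name")) and bool(fields.get("description"))


def slugify(name: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


# Hypothetical skill file used only for this example.
SAMPLE = """---
name: PDF Form Filler
description: Fills PDF forms from structured data.
allowed-tools: Read, Write
---
# PDF Form Filler

Use this skill when the user asks to fill in a PDF form.
"""

parsed = split_frontmatter(SAMPLE)
assert parsed is not None
fields, body = parsed
print(is_valid(fields), f"/skills/{slugify(fields['name'])}/")
```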

The miner is bounded: per-source caps prevent any one source from draining the GitHub API budget; every section runs inside a _safe_section() try-block so a single broken endpoint can't kill the run.
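
A guard like `_safe_section()` can be written as a small context manager. The sketch below shows one possible shape, not the script's real implementation: the signature, the `run_sources` helper, and the choice to count the cap in candidate URLs (rather than API calls) are all assumptions.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

log = logging.getLogger("miner")


@contextmanager
def _safe_section(name: str, failures: List[str]) -> Iterator[None]:
    """Run one source's section; log and record any exception instead of raising."""
    try:
        yield
    except Exception as exc:  # deliberately broad: one bad source must not stop the run
        log.warning("section %s failed: %s", name, exc)
        failures.append(name)


def run_sources(
    sources: Dict[str, Callable[[], Iterable[str]]], cap: int
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Collect at most `cap` candidate URLs per source, isolating failures."""
    results: Dict[str, List[str]] = {}
    failures: List[str] = []
    for name, fetch in sources.items():
        with _safe_section(name, failures):
            urls: List[str] = []
            for url in fetch():
                if len(urls) >= cap:
                    break  # per-source cap reached
                urls.append(url)
            results[name] = urls
    return results, failures


# Hypothetical sources: one that fails, one that returns more than the cap.
def broken() -> Iterable[str]:
    raise TimeoutError("endpoint down")


def healthy() -> Iterable[str]:
    return (f"https://example.com/repo{i}" for i in range(10))


found, failed = run_sources({"broken": broken, "healthy": healthy}, cap=3)
print(len(found["healthy"]), failed)
```

Recording the failed section names, rather than only logging them, makes it easy to report at the end of a run which sources contributed nothing.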

A full run takes about 4 hours. New skills appear on the live catalog the same day they're discovered.

Admission — content signals only, no popularity

This is the part I'm most opinionated about. Ranking can't be bought. The moment a paid signal influences who appears in the catalog (or in what order), the value proposition collapses — nobody pays for "objective evaluation" when it isn't objective.

So the catalog admits skills based on a content score derived from the SKILL.md itself:

Anti-trigger discipline — does the SKILL.md have a "when NOT to use" or "out of scope" section? That's a +4 per pattern, capped at +16. Strong negative-space marking is the single best signal that the author thought about edge cases.
Pricing / quota transparency — does it document costs, rate limits, or expected API spend? +10.
Frontmatter depth — beyond name: and description:, how many other fields are present (model:, tags:, version:, license:, allowed-tools:, metadata.*)? Capped at 10 distinct keys to prevent gaming.
Length × structure — is the body substantive (>800 chars in description:, multiple code blocks, headings)?
Filler-phrase penalty — // TODO, Lorem ipsum, generic templated phrases → minus 5.
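
The signals above could be combined along these lines. The +4 per pattern with a +16 cap, the +10 for pricing, the 10-key cap, and the minus 5 come from the list. Everything else is an assumption made for the sake of a runnable example: the specific regex patterns, one point per extra frontmatter key, and a flat +5 for the length and structure check.

```python
import re
from typing import Dict

# Patterns and any point values not stated in the article are assumptions.
ANTI_TRIGGER_PATTERNS = [
    r"when not to use",
    r"out of scope",
    r"do not use this",
    r"not suitable for",
]
PRICING_PATTERN = r"pricing|rate limit|quota|api spend"
FILLER_PATTERN = r"\btodo\b|lorem ipsum"
BASE_KEYS = {"name", "description"}
FENCE = "`" * 3  # a Markdown code fence marker


def content_score(fields: Dict[str, str], body: str) -> int:
    """Score a skill from its own text only; no popularity inputs exist here."""
    text = body.lower()
    score = 0

    # Anti-trigger discipline: +4 per matching pattern, capped at +16.
    hits = sum(1 for pattern in ANTI_TRIGGER_PATTERNS if re.search(pattern, text))
    score += min(4 * hits, 16)

    # Pricing / quota transparency: flat +10.
    if re.search(PRICING_PATTERN, text):
        score += 10

    # Frontmatter depth: one point per extra key (assumed), capped at 10 keys.
    score += min(len(set(fields) - BASE_KEYS), 10)

    # Length x structure: assumed +5 for a long description, at least two
    # code blocks, and at least one heading.
    long_description = len(fields.get("description", "")) > 800
    code_blocks = body.count(FENCE) // 2
    has_heading = re.search(r"^#{1,6} ", body, re.MULTILINE) is not None
    if long_description and code_blocks >= 2 and has_heading:
        score += 5

    # Filler-phrase penalty: flat -5.
    if re.search(FILLER_PATTERN, text):
        score -= 5
    return score


# Hypothetical inputs: a structured skill and a stub.
polished = content_score(
    {"name": "pdf-filler", "description": "Fills PDF forms.", "model": "sonnet", "license": "MIT"},
    "## When NOT to use\nOut of scope: scanned PDFs.\n\n## Limits\nRate limit: 60 requests per minute.\n",
)
stub = content_score({"name": "x", "description": "y"}, "TODO: write a skill that does X")
print(polished, stub)
```

The function takes only the frontmatter fields and the body, so there is no argument through which stars, forks, or follower counts could enter the score.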

The score never weighs stars, forks, install counts, GitHub follower count, or any other popularity signal. A skill written by a developer with 0 GitHub followers and a clear anti-trigger section beats a flashy skill by a 50k-follower influencer that's just frontmatter-and-vibes. That's the bar.

For ranking inside the desktop app's Pro tier — a separate evaluation layer — the formula is the same content-only structural score plus frontmatter-completeness, rescaled to [50, 100]. Still no popularity signals.
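
The rescaling step can be illustrated with a simple linear mapping. This sketch assumes a min-max rescale with clamping and a made-up raw range of 0 to 40. The article does not say which mapping or raw bounds the app actually uses.

```python
def rescale(raw: float, raw_min: float, raw_max: float,
            lo: float = 50.0, hi: float = 100.0) -> float:
    """Map raw scores in [raw_min, raw_max] linearly onto [lo, hi], clamping outliers."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must be greater than raw_min")
    clamped = min(max(raw, raw_min), raw_max)
    return lo + (hi - lo) * (clamped - raw_min) / (raw_max - raw_min)


# Hypothetical raw range of 0..40 for the structural score.
print(rescale(20, 0, 40), rescale(-5, 0, 40), rescale(40, 0, 40))
```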

This costs me about 30% of what an unconstrained "rank by stars" catalog would surface. I'm OK with that trade.

What surprised me

1. The catalog is dominated by a handful of prolific authors. One contributor has 3,446 admitted skills (yes, really). The top 25 authors account for ~30% of the catalog. There's a Pareto distribution underneath the long tail.

2. Sales-category skills score highest on content quality. Counter-intuitive — I expected Engineering or Security to be most polished. Turns out sales-focused skill authors over-index on structure (anti-trigger sections, scope discipline, pricing transparency) because that's their professional habit. Engineering authors more often skip the "when NOT to use" section because they assume it's obvious.

3. Vendor-side adoption is still 0. The catalog has zero skills with author_url pointing at anthropic.com, openai.com, or any other large AI vendor. Every entry is independent. The ecosystem is fully community-driven.

4. The SKILL.md format is leaking sideways. I found skills in repos tagged cline-skills, cursor-rules, aider-skills, windsurf-rules. The format is becoming a portable agent-skill standard, not just a Claude Code thing. The catalog admits these too — they're SKILL.md files, the agent that loads them is the user's choice.

5. The biggest discovery surface isn't GitHub code search. It's the stargazer graph. When a skill repo reaches a few hundred stars, 30% or more of the people who starred it have a SKILL.md of their own somewhere in their account. Mining the graph yields skills the code-search queries don't find.
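
The stargazer mining in the last point amounts to a breadth-first walk over a repo-to-user-to-repo graph. Here is a rough sketch with the two lookups injected as plain callables, so it runs without any network access. The function name, the `max_users` budget, and the toy data are all hypothetical. A real version would back the callables with API requests.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def expand_via_stargazers(
    seed_repos: Iterable[str],
    stargazers_of: Callable[[str], Iterable[str]],
    skill_repos_of: Callable[[str], Iterable[str]],
    max_users: int,
) -> List[str]:
    """Walk repo -> stargazers -> their skill repos, breadth-first, within a user budget."""
    seeds = list(seed_repos)
    seen_repos: Set[str] = set(seeds)
    seen_users: Set[str] = set()
    discovered: List[str] = []
    queue = deque(seeds)
    while queue:
        repo = queue.popleft()
        for user in stargazers_of(repo):
            if user in seen_users:
                continue
            if len(seen_users) >= max_users:
                return discovered  # budget spent: stop before another lookup
            seen_users.add(user)
            for found in skill_repos_of(user):
                if found not in seen_repos:
                    seen_repos.add(found)
                    discovered.append(found)
                    queue.append(found)  # newly found repos are mined in turn
    return discovered


# Toy graph standing in for real API responses.
STARS: Dict[str, List[str]] = {"a/seed": ["u1", "u2"], "u1/skill": ["u3"]}
REPOS: Dict[str, List[str]] = {"u1": ["u1/skill"], "u2": [], "u3": ["u3/skill", "a/seed"]}

result = expand_via_stargazers(
    ["a/seed"],
    lambda repo: STARS.get(repo, []),
    lambda user: REPOS.get(user, []),
    max_users=10,
)
print(result)
```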

What's free

Everything the catalog produces is open:

Public catalog at claudskills.com — browseable + searchable.
Open dataset at github.com/claudskills/catalog-public — daily refresh in 6 formats (JSON, NDJSON, CSV, Parquet, Atom feed, README). CC BY 4.0.
HuggingFace mirror at huggingface.co/datasets/claudskills/skills — same data, parquet-native, suitable for LLM training.
Public REST API at claudskills.com/api/v1/ — read-only, no auth, CORS-open, edge-cached. OpenAPI 3.1 spec covers every endpoint. Paginated /skills, single-skill /skills/<slug>, /categories, /tags, /stats. The catalog API itself is ~300 LOC of Cloudflare Worker code; the heavy lifting is the daily miner.
Embeddable skill card at claudskills.com/embed/<slug>.js — one-line <script> tag that injects a styled card into any blog post or doc page. The card you'd drop into your own writeup of a favorite skill.
Shields.io-style badge at claudskills.com/badge/<slug>.svg — for skill authors to drop into their GitHub READMEs.
Daily Skill-of-the-Day archive at /sotd/ — every UTC day picks one skill via a date-hash that stays consistent across mobile push, social posts, and the web.
Per-category, per-tag, per-author, and per-use-case landing pages — about 2,800 hub pages total covering the catalog from every browsing angle.
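
The date-hash pick behind Skill-of-the-Day can be sketched as below. This shows one way to get a selection that every client computes identically, assuming SHA-256 over the ISO date and a sorted slug list. The site's actual hash and ordering are not documented here.

```python
import hashlib
from datetime import date, datetime, timezone
from typing import Optional, Sequence


def skill_of_the_day(slugs: Sequence[str], day: Optional[date] = None) -> str:
    """Pick one slug deterministically from the UTC date."""
    if not slugs:
        raise ValueError("catalog is empty")
    if day is None:
        day = datetime.now(timezone.utc).date()
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(slugs)
    # Sort first so the pick does not depend on the order the caller supplies.
    return sorted(slugs)[index]


# Hypothetical three-skill catalog.
catalog = ["pdf-form-filler", "sql-reviewer", "changelog-writer"]
today_pick = skill_of_the_day(catalog, date(2025, 1, 1))
print(today_pick)
```

Because the index is taken modulo the catalog size, the pick can shift if the catalog grows during the day. Freezing the slug list once per UTC day would avoid that.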

What I'd change if starting over

A few things I learned the hard way:

Build the public dataset first, the website second. I focused on the consumer-facing site early — should have shipped the open dataset first. Researchers and tool-builders pick up CC BY 4.0 data within days of finding it; consumer-facing UIs take weeks to build word-of-mouth.
Cloudflare Workers + R2 + Netlify together are more reliable than any one of them alone. The site has 64,000+ per-skill HTML pages, which would blow through Netlify's deploy-prep budget at scale. So the per-skill HTML files live in Cloudflare R2, with a Netlify rewrite serving them from claudskills.com/skills/<slug>/. The API, embed, and badge endpoints are Cloudflare Workers bound to the same domain. The homepage and static pages are served directly from Netlify. Each layer does what it's best at.
Excluding popularity signals was the hardest decision and the most important one. Every time I evaluate a candidate change to the ranking algorithm, the test is "would skill authors pay to influence this?" If yes, the change doesn't ship. The discipline pays off once you have a Pro subscription product: the pitch is "pay $9/month for the multi-signal Quality Score in the desktop app," and I never have to defend why the score is honest. It's honest by construction.

What's next

The next quarter is about distribution — the catalog exists, now developers need to find it. The roadmap:

25 awesome-list PRs (live next week)
A weekly catalog-growth report cross-posted to dev.to / Hashnode / Medium / LinkedIn
Embed cards in third-party blog posts (the API is ready; the inbound demand will tell us if the embed surface gets traction)
iOS and Android companion apps for discovery (already in App Store review at the time of writing)

If you've written a SKILL.md, it's probably already in the catalog — search for your repo name at claudskills.com. If you haven't, the catalog will pick it up within 24 hours of you pushing to a public GitHub repo. If you want to fast-track it, there's a submit form on the homepage.

If you're a researcher, a tool-builder, or an LLM-pipeline operator who wants to ingest the data: the public dataset refreshes daily, and the API is rate-limit-free for normal use. Build something cool — I'd love to hear about it.
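
For anyone starting from the API, a minimal client might look like this. The base path and endpoint names are the ones listed earlier in this post. The `https` scheme, the JSON response format, and the helper names are assumptions, and the OpenAPI spec remains the authority on parameters and response shapes.

```python
import json
from typing import Any
from urllib.parse import quote, urljoin
from urllib.request import Request, urlopen

API_BASE = "https://claudskills.com/api/v1/"  # scheme assumed


def endpoint(*parts: str) -> str:
    """Build an API URL from path segments, percent-encoding each one."""
    path = "/".join(quote(part, safe="") for part in parts)
    return urljoin(API_BASE, path)


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET a URL and decode the body as JSON (assumed response format)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building URLs needs no network access; the slug here is hypothetical.
print(endpoint("stats"))
print(endpoint("skills", "pdf-form-filler"))
```

Calling `fetch_json(endpoint("stats"))` would then issue the actual request. It is left out of the example so the snippet runs offline.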

The catalog is at claudskills.com. The dataset is at github.com/claudskills/catalog-public. Comments + questions to [email protected].

ClaudSkills is an independent community catalog. Claude™ is a trademark of Anthropic PBC; ClaudSkills is not affiliated with, endorsed by, or sponsored by Anthropic.
