The Claude Code skill quality rubric — what the scorer actually checks
Every SKILL.md the ClaudSkills miner crawls runs through a six-axis structural rubric that produces a 0-100 score. Skills scoring ≥80 land in the daily_eligible bucket — the top ~12% of the catalog, source pool for the mobile app's daily push, the Skill of the Day picker, and the social drafter. This post walks through the actual axes, with point ranges and one practical authoring tightening per axis.
The six axes at a glance
| Axis | Max points | Reads as |
|---|---|---|
| Description depth | 20 | How much context can Claude derive from the description alone |
| Vocabulary diversity | 5 | Filler-language penalty proxy |
| Name + body + interface | ~31 | Substance vs stub |
| Anti-trigger discipline | 16 | Does the skill name when it should NOT be invoked |
| Pricing / quota disclosure | 10 | API-driven skills surface their cost |
| Custom frontmatter depth | 15 | Goes beyond name+description |
| Sum | 97 | Plus the -5 filler-phrase penalty (see Axis 7) |
A baseline skill that does the obvious things — 200-char description, 500-char body, a tools field, an arg-hint — lands around 60. Skills demonstrating scope discipline + pricing transparency + rich frontmatter land 80-95. Stub skills (just name + a vague one-liner) land 15-35. The rubric clusters in the middle deliberately; admission threshold is 50, so everything below that doesn't enter the catalog at all.
Axis 1 — Description depth (0-20)
Description depth 0 → 20 points
| Description length (chars) | Points |
|---|---|
| ≥ 800 | 20 |
| ≥ 400 | 17 |
| ≥ 200 | 13 |
| ≥ 80 | 8 |
| ≥ 40 | 4 |
| < 40 | 0 |
Tightening: aim for ≥200 chars. The jump from 80-char to 200-char descriptions is 5 points; from 200 to 400 is another 4. Claude reads the description to decide whether to invoke the skill — a 40-char description gives the model almost no signal.
Axis 2 — Vocabulary diversity (0-5)
Vocabulary diversity in description 0 → 5 points
| Distinct meaningful words (length > 2) | Points |
|---|---|
| ≥ 8 | 5 |
| ≥ 5 | 2 |
| < 5 | 0 |
Tightening: avoid stop-word stuffing. A description like "this skill helps with code" has 5 words but ~2 distinct meaningful ones. Compare "Audits Terraform plans for security misconfigurations and IAM privilege escalations" — same length, 10 distinct meaningful words. The rubric is a proxy for "does the description actually describe the skill."
Axis 3 — Name + body + interface (0-31)
Substance axis 0 → ~31 points
This axis stacks several sub-checks additively. None is individually large, but they compound.
| Sub-check | Points |
|---|---|
| Skill has a name distinct from description, ≥4 chars | +5 |
| Body ≥ 2,000 chars | +15 |
| Body ≥ 500 chars (and < 2,000) | +8 |
| Body ≥ 100 chars (and < 500) | +3 |
Has allowed-tools declared | +5 |
Has argument-hint declared | +3 |
| Has any examples / inputs / outputs section | +3 |
Tightening: the 500 → 2,000-char body jump is +7 points — the single most lucrative edit on this axis. Most stub skills clock around 200-400 chars of body; doubling to 800-1,000 chars (one section of real usage examples) usually crosses the threshold.
Axis 4 — Anti-trigger discipline (0-16)
Anti-trigger discipline 0 → 16 points
This axis is the single highest-leverage edit you can make to a SKILL.md. The scorer counts how many distinct anti-trigger patterns appear anywhere in the body, capped at 4 patterns = 16 points (4 each).
Detected patterns (case-insensitive substring match):
when not to usewhen not to invokedo not usedo not invokeskip this skillout of scopenot foravoid usinganti-pattern/anti patternthis skill is notthis is not the right skill
Tightening: add a 50-100 word "When NOT to use this skill" section. The named-section style is the easiest path to 4 anti-trigger pattern matches:
## When NOT to use this skill
Skip this skill if:
- You need a one-off ad-hoc analysis — that's out of scope for this skill.
- You're working with a binary file (PDF, image) — this skill is not for those formats.
- You want speculative refactoring suggestions — use the related code-review skill instead.
That single section typically lands 4 distinct anti-trigger matches = max 16 points.
Axis 5 — Pricing / quota disclosure (0-10)
Pricing / quota disclosure 0 → 10 points
Two detectors stack additively (capped at 10):
- Dollar-bearing tables — markdown tables with
$followed by digits. Matches a pricing table. - Quota keywords —
rate limit,requests per second/minute/hour/day/month,tokens per …,quota,tier 1 / tier 2,free tier,paid tier.
Tightening: only applies to skills that wrap an external paid API (OpenAI, Anthropic, Stripe, AWS, etc.). For these, add a 4-line pricing table:
| Provider | Free tier | Paid tier | Rate limit |
|---|---|---|---|
| Foo | 100/day | $5/1k requests | 60 rpm |
Worth 10 points on a skill that would otherwise score 0 on this axis.
Axis 6 — Custom frontmatter depth (0-15)
Custom frontmatter depth 0 → 15 points
The rubric counts distinct frontmatter keys beyond the standard name + description pair. Each extra key adds 1.5 points, capped at 10 extras = 15 points.
Worth-including extras (any of these count):
---
name: my-skill
description: Two-sentence summary.
model: claude-3-5-sonnet
tags: [type:audit, lang:python, cloud:aws]
version: 1.2.0
license: MIT
allowed-tools: ["read", "edit", "bash"]
user-invokable: true
category: security
subcategory: cloud
argument-hint: ""
metadata:
api_base: https://api.example.com
rate_limit_rpm: 60
author: jane-doe
repository: https://github.com/jane/my-skill
---
This block has 13 extra keys (caps at 10 = max 15 points). Most SKILL.md files in the catalog have 0-2 extra keys; getting to 5-7 is straightforward, and lands ~8-10 points.
Tightening: the highest-EV adds are tags, allowed-tools, and license — they're useful to downstream consumers anyway, and the scorer treats them equally with the more esoteric fields.
The filler-phrase penalty (−5)
If the body literally contains the phrase "this skill" or "a skill that" or "use this skill" or "this is a skill" anywhere, the rubric subtracts 5 points. The penalty is intentionally crude but catches a common authoring tic: starting every paragraph with self-referential framing instead of content. The fix is to delete the phrase — usually the sentence reads better without it.
daily_eligible: true on the public skills.json. That's the bucket the mobile app's daily push, the Skill of the Day picker, and the social drafter all read from. Today, 7,285 of 69,016 skills clear that threshold (10.6%).
End-to-end worked example
Take a baseline skill with no discipline applied — 80-char description, 300-char body, name + description only in frontmatter, no anti-trigger section, no pricing table, includes the phrase "This skill helps you…":
| Axis | Score | Notes |
|---|---|---|
| Description depth | 8 | 80 chars → +8 |
| Vocabulary diversity | 2 | ~6 distinct words |
| Name + body + interface | 8 | name diff (5) + body 300 (3) |
| Anti-trigger | 0 | — |
| Pricing/quota | 0 | — |
| Frontmatter | 0 | — |
| Filler penalty | −5 | "this skill" present |
| Sum | 13 | Below admission threshold (50) |
This skill would NOT be admitted to the catalog. Now apply the four cheapest tightenings — bump description to 200 chars, body to 800 chars, add a 4-anti-trigger section, add 5 frontmatter keys (tags, license, allowed-tools, model, version):
| Axis | Score | Notes |
|---|---|---|
| Description depth | 13 | 200 chars → +13 |
| Vocabulary diversity | 5 | ≥8 distinct words |
| Name + body + interface | 18 | name (5) + body 800 (8) + tools (5) |
| Anti-trigger | 16 | 4 distinct patterns matched |
| Pricing/quota | 0 | (N/A for this skill) |
| Frontmatter | 8 | 5 extras × 1.5 |
| Filler penalty | 0 | removed |
| Sum | 60 | Admitted, mid-tier |
~30 minutes of editing took the skill from below admission to mid-tier admitted. Two more tightenings (push body to 2,000 chars, add 5 more frontmatter keys) would land it at ~78, the cusp of daily_eligible.
Things the rubric does NOT check
- Code quality. The scorer doesn't parse code in the body. Examples are evaluated only for length contribution.
- Author identity. No signal whatsoever on who wrote the skill. Anthropic's reference skills score the same way as a first-time contributor's.
- Popularity. Zero stars, forks, install counts, or share metrics. Permanent design decision documented in the engine spec.
- Topicality / category fit. A perfect Sales skill scores the same as a perfect Engineering skill. Category is set separately by the miner's classification heuristic.
- Recency. A skill written 18 months ago that still applies cleanly today scores the same as one written yesterday.
- Language. SKILL.md in English, Japanese, Portuguese, or Russian score by the same rubric. Anti-trigger patterns are English-only currently — a known gap. Non-English skills route through the description-depth + body-length + frontmatter axes only.
If your skill scores 75-79 today
You're close. The two highest-leverage edits to cross 80:
- Add a "When NOT to use this skill" section. Worth up to 16 points alone. If you have a vague 2-line "consider scope" comment, formalising it into a named section with 4 distinct anti-trigger patterns is the single biggest jump available.
- Add 2-3 frontmatter fields you don't have.
model,version,allowed-tools,license,tags,argument-hint— pick three you can fill in honestly. Each is 1.5 points.
Those two together typically move a 75-79 skill to 85-92. The score recomputes at the next miner cycle.
Methodology
Source: scripts/quality_score.py in the ClaudSkills site repo. Every constant in this article is reproducible from that file. The rubric was tuned against a hand-rated reference set of ~50 skills during the Phase 3.4 sprint (2026-05-07). Spearman correlation against human ground truth: ρ = +0.778 — strong positive. The rubric is intentionally conservative: false positives (a low-quality skill scoring high) are rarer than false negatives (a high-quality skill scoring lower than it should). Add the patterns the rubric wants, and the rubric responds.
FAQ
- Is the quality score the same for free and Pro users?
- No. Free pages display an unrated catalog — no quality score is shown on free skill pages. Pro users see a numeric Quality Score on every skill in the desktop app. Public
skills.jsoncarries adaily_eligible: trueflag for skills scoring ≥80; the raw numeric score is server-side only. - Does the rubric look at popularity signals?
- No. Popularity signals were permanently dropped on 2026-05-07. Pro Quality Score is 100% content-derived: 80% this structural rubric + 20% metadata depth.
- How often does the rubric re-evaluate my skill?
- Every miner cycle (nightly at 01:00 local, typically finishes by ~05:00). Edits to the upstream
SKILL.mdreflect within ~24 hours. - Are there secret bonus signals the scorer rewards?
- No. The full source is at
scripts/quality_score.pyin the public site repo. Every constant and threshold is visible. - What if my skill ALREADY scores well but isn't in daily_eligible?
- The threshold is quality_pro ≥ 80, applied after structural + metadata are blended. If you're at 75-79, the highest-leverage tightenings are: add an explicit "When NOT to use" section (worth up to 16 points alone), and add 2-3 frontmatter fields you don't have.