Why Most Claude Code Skills Don't Deserve to Be Installed

Published 1 June 2026 · 11 min read · By a long-time Claude Code practitioner

The dirty secret of every open catalog — including this one — is that most entries shouldn't be installed. Not because the authors are careless, but because the bar for "a skill exists" is far lower than the bar for "a skill is worth invoking inside a Claude Code session."

This piece is the honest tour: the median problem nobody talks about, the five failure modes I see again and again when I audit skills before installing them, the patterns that distinguish the keepers, and a 90-second test you can run yourself. It's pointed in places. It's also the post I wish someone had written before I installed my first forty skills and quietly uninstalled thirty of them.

In this guide

The dirty secret of open catalogs
The median problem
Five failure modes to watch for
What a should-install skill looks like
The honest framing on catalog incentives
The 90-second install test
In defence of the open-catalog model
Advice to authors: less, sharper

The dirty secret of open catalogs

Every open catalog of Claude Code skills — the registry-style sites, the awesome-list GitHub repos, the curated indexes, all of them — operates under the same implicit promise: here is the universe of available skills, browse freely. What the promise doesn't say is that the universe of available skills includes a lot of work that shouldn't have been published, was published anyway, and now sits in the listing looking exactly like the genuinely useful entries beside it.

This isn't a criticism of catalog operators. The job of an open catalog is to ingest what's out there. Curation is a different job, and the two compete: heavy curation shrinks the catalog and loses entries that some users would have wanted; light curation keeps everything and ships you a lot of noise to filter.

The reason this matters for you, the person about to type an install command, is that the catalog can't make the judgement call for you. Whether a particular skill is worth running inside your session depends on what you're building, what your team's tolerance for half-finished tooling is, and whether you trust the author. A catalog can sort by signals — recency, structure, frontmatter completeness, repo activity — but it can't tell you that the skill you're about to install will, on its third invocation, decide that what your codebase really needs is to be migrated to a different testing framework.

So the burden of the final filter sits with the reader. The catalog gets you to a shortlist; you decide what crosses the threshold into your ~/.claude/skills/ directory. The rest of this piece is a working practitioner's view of where the noise comes from, how to spot it fast, and what the genuine keepers look like once you've trained your eye.

One quick note on tone before we keep going. I'm going to be direct about failure modes. I'm not going to name specific skills or authors, because the failure modes are structural, not personal. The same author who shipped a one-paragraph stub last month might ship a brilliant, narrowly-scoped tool next month. The patterns are what's worth learning. The author roster shifts every week.

The median problem

When you sample a few hundred random skills from any large open catalog, a few things become obvious very quickly. The median skill is shorter than people assume: often a few paragraphs of description and a fenced shell command. The scope is usually vague: a skill called code-reviewer that turns out to do everything from style nits to architectural feedback, with no guidance on when to invoke it versus when to leave Claude alone. The testing is almost always absent. There is rarely an explicit list of cases where the skill should not fire.

None of this makes the median skill unusable. It does make it unreliable. The unreliable skill is the worst category to have installed, because it loads on every relevant trigger word, behaves inconsistently, and slowly erodes your trust in the whole skill system. Skills that never fire are easy to remove. Skills that fire when you didn't want them, do something nearly-right, and produce output you have to read carefully to evaluate — those are the ones that cost you real time.

The median problem also has a self-reinforcing dynamic. Authors who treat skill publishing as a portfolio activity ship many small skills, because volume looks better than depth on a contributor page. Authors who treat it as craft ship fewer skills, more slowly, and write them carefully. Both end up in the same catalog. The first group is over-represented in raw counts; the second group is over-represented in skills you'd actually want to install.

This is why the brute approach of "install everything in a category and see what sticks" works badly. You'll spend hours triaging things that should never have made it onto your machine in the first place. The smarter approach is to start sceptical, raise the bar deliberately, and let the small number of skills that clear it earn their place. A good rule of thumb: if you couldn't explain to a colleague in one sentence what a skill does and when it fires, don't install it yet.

The catalog is doing its job by showing you the full inventory. Your job is to remember that the full inventory is not the same thing as a recommendation.

Five failure modes to watch for

After enough audits, the failures cluster into a small number of recurring shapes. Five of them cover almost everything I reject.

Over-broad descriptions. The skill's description: frontmatter field is what Claude reads to decide whether to invoke. A description like "helps with code review" will fire on practically any session that touches a pull request, including ones where the user just wanted a quick syntax check. The skills I keep have descriptions that read like clauses in a contract: "use when the user pastes a unified diff and asks for review feedback; do not use for single-file linting or commit-message review." Specificity is a feature, not a flaw.

No anti-trigger discipline. A skill that says only what it does has half the instructions it needs. The other half is what it should not do, what it should defer to a different skill for, and what it should hand back to the user. Skills without an explicit "When NOT to use" section will over-fire. The presence of that section is one of the most reliable signals of a careful author.

Instructions that fight Claude's defaults. Some skills read as if the author was annoyed at Claude's baseline behaviour and tried to override it through aggressive prompting — "never apologise, never explain, always output JSON, never add commentary." This works for the first few turns and then breaks in interesting ways. The skills I trust extend Claude's defaults rather than fighting them. If a skill needs to suppress Claude's normal explanations, the author should explain why, not just demand it.

Allowed-tools that grant write access without need. A skill that lists Write, Edit, Bash in its allowed-tools: when its actual job is read-only analysis is a yellow flag. The author either didn't think carefully about the minimum required permissions, or wanted the option to mutate things and didn't tell you. Either way, you don't want it loaded by default. The good skills request the smallest tool set that makes the job possible.

Copy-paste forks with no real differentiation. Some skills are visibly cloned from a small number of popular templates with the name swapped and a few words rewritten. They tend to share the same scaffold, the same examples, sometimes the same typos. If a skill reads like it could be any of fifty other skills, it probably is. The catalog shows them all; you don't need to install them all.

If you see two or more of these in a single skill, that's usually enough to skip it. One on its own is worth a closer look — sometimes there's a good reason, often there isn't.
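Two of the five failure modes above lend themselves to a mechanical check, and the "two or more" rule is easy to encode. The sketch below is illustrative only: the function names, the generic-opener pattern, and the set of mutating tools are assumptions made for the example, and whether a skill's job is read-only remains a human judgement passed in by the caller.

```python
import re

# Tools named in the text above as granting the ability to change things.
MUTATING_TOOLS = {"Write", "Edit", "Bash"}

# Assumed heuristic: a description that opens with "helps with ..." is generic.
GENERIC_OPENER = re.compile(
    r"^\s*(this skill\s+)?(helps?|assists?)\s+with\b", re.IGNORECASE
)


def description_is_overbroad(description):
    """True for generic openers such as "helps with code review"."""
    return bool(GENERIC_OPENER.match(description))


def unneeded_write_tools(allowed_tools, read_only_job):
    """Return the mutating tools requested by a skill whose job is read-only."""
    if not read_only_job:
        return []
    return sorted(t for t in allowed_tools if t in MUTATING_TOOLS)


def verdict(flags):
    """Rule of thumb from the text: two or more flags skip, one needs a look."""
    if len(flags) >= 2:
        return "skip"
    if len(flags) == 1:
        return "closer look"
    return "no flags"


flags = []
if description_is_overbroad("helps with code review"):
    flags.append("over-broad description")
if unneeded_write_tools(["Read", "Write", "Bash"], read_only_job=True):
    flags.append("write access without need")
print(verdict(flags))  # prints "skip"
```

The other three failure modes (missing anti-triggers, instructions that fight Claude's defaults, copy-paste forks) can be appended to the same list after a manual read; the verdict function only counts what it is given.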

What a should-install skill looks like

Inverting the failure modes gets you most of the way there, but the genuinely strong skills share three positive patterns that are worth naming directly.

Narrow scope, defended explicitly. The best skills do one thing in a domain where that one thing is genuinely hard. A skill that converts OpenAPI specs to typed client code is a good shape. A skill that "helps with backend development" is not. The narrow ones tell you in the first paragraph what they refuse to do — not as a disclaimer, but as a positioning statement. "This is not a general-purpose linter. It will not rewrite your import order. It will not enforce style. It diagnoses a specific class of N+1 query antipattern in Django ORM code, and it tells you the line numbers." A skill that talks like that is usually written by someone who has actually used it under pressure.

Pricing, quota, and external-service honesty. If the skill calls an external API, the good ones tell you upfront: which provider, which endpoint, the rough cost of a typical invocation, what happens when the quota is exhausted, what happens when the provider returns 429. The skills that hide this usually hide it because they didn't think about it. The first time you hit a rate limit on a skill that doesn't gracefully degrade, you'll wish the author had been more explicit.

A worked example or two, in the SKILL.md. Not a marketing screenshot — an actual transcript or a fenced code block showing input and output. Skills that ship with a small examples/ section are taking the trouble to show you what the skill is meant to do, which is a strong indicator that the author has actually tested it. Here's the kind of frontmatter and structure I look for as a positive signal:

---
name: openapi-to-typed-client
description: |
  Use when the user provides an OpenAPI 3.x spec and asks for a typed
  TypeScript client. Do NOT use for client generation from non-OpenAPI
  sources, for runtime validation library generation, or for spec
  authoring/editing tasks.
allowed-tools: [Read, Write]
model: claude-sonnet-4-5
version: 1.4.0
license: MIT
---

That frontmatter alone gives you scope, defaults, anti-triggers, permissions, and version discipline. Pair it with a few hundred words of focused prose and one or two real input/output examples, and you have a skill that earns its place on your machine.
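If you want to pull those fields out for inspection, here is a minimal sketch using only the Python standard library. The parse_frontmatter helper is hypothetical and deliberately narrow: it handles just the three shapes that appear in the example above (plain key: value, a | block scalar, and an inline [a, b] list). It is not a YAML implementation, and a real tool would be better served by a proper YAML library.

```python
def parse_frontmatter(text):
    """Parse the simple frontmatter shapes used in the example above.

    Illustrative subset only: `key: value`, `key: |` blocks, `key: [a, b]`.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter at all
    fields = {}
    block_key = None   # key whose block scalar is currently being read
    block_lines = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter ends the frontmatter
        # Indented or blank lines continue an open block scalar.
        if block_key is not None and (line[:1] in (" ", "\t") or not line.strip()):
            block_lines.append(line.strip())
            continue
        if block_key is not None:
            fields[block_key] = " ".join(p for p in block_lines if p)
            block_key, block_lines = None, []
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line; ignored in this sketch
        key, value = key.strip(), value.strip()
        if value == "|":
            block_key = key
        elif value.startswith("[") and value.endswith("]"):
            fields[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            fields[key] = value
    if block_key is not None:
        fields[block_key] = " ".join(p for p in block_lines if p)
    return fields


EXAMPLE = """\
---
name: openapi-to-typed-client
description: |
  Use when the user provides an OpenAPI 3.x spec and asks for a typed
  TypeScript client. Do NOT use for client generation from non-OpenAPI
  sources, for runtime validation library generation, or for spec
  authoring/editing tasks.
allowed-tools: [Read, Write]
version: 1.4.0
---
"""

fields = parse_frontmatter(EXAMPLE)
print(fields["allowed-tools"])  # ['Read', 'Write']
```

With the fields in a dictionary, the description and the tool list can be read side by side, which is the comparison that matters when judging whether the permissions fit the stated job.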

You won't find hundreds of skills like this in any catalog. You'll find tens. The good news is that tens is plenty: most working practitioners use a handful of skills heavily and a long tail occasionally. Find your handful, keep them current, and ignore the rest until you have a specific use case.

The honest framing on catalog incentives

Here is the structural tension that any open catalog operator has to navigate, and that you as a reader should be aware of when you browse one.

More entries make a catalog look bigger. Bigger catalogs feel more authoritative, get linked to more often, and rank better in the places people go to find skills. The incentive, at every level, is to admit more rather than fewer.

Fewer entries — or at least, sharper filtering between the keepers and the noise — make a catalog more useful. A reader who lands on a catalog and finds the first ten entries they browse are all worth installing learns to trust the catalog. A reader whose first ten installs include two genuinely useful skills and eight that get uninstalled within a week learns to be sceptical of the whole listing, not just the eight that didn't work out.

These two pressures pull against each other. The honest way to handle them is to publish the full inventory and give the reader the tools to do the second-level filtering themselves. Sortable signals, frontmatter previews, source attribution, last-updated dates, repo-activity indicators, anti-trigger highlights — anything that helps the reader spot the keepers without having to install ten skills to find the one. None of this replaces your judgement, but it shortens the time you spend exercising it.

What you should be wary of, in any catalog (this one included), is the implicit framing that catalog inclusion is itself a recommendation. It isn't. Inclusion means "the catalog's miner found this skill, parsed it, and didn't reject it at admission." It does not mean "the catalog operator vouches for this skill in your specific use case." The catalog is the bookstore; you still have to pick the book.

The corollary is that if a catalog appears to vouch for everything — if every entry has a polished landing page, a high-sounding description, and no honest signals about quality variance — you should trust the catalog less, not more. The catalogs worth using are the ones that admit, somewhere visible, that not every entry is for everyone, and that surface the signals you'd need to tell.

This catalog tries to be in the second group. The median problem still applies. Browse with that in mind, and you'll get more out of it.

The 90-second install test

Here is the filter I run on every skill before I install it. It takes about 90 seconds and catches the vast majority of skills that wouldn't have earned their place. You don't need a special tool for this; you need the skill's SKILL.md file open in front of you.

Read the description: field, out loud if you can. Can you state, in one sentence, when this skill should fire? If the description is generic ("helps with X"), or so long that you can't summarise it, skip. Good descriptions are surgical.
Search the SKILL.md for the strings "do not", "skip", "not for", "out of scope". If none of them appear, the author probably hasn't thought about anti-triggers. That's a yellow flag; you'll be the one who finds the over-fire conditions in production.
Check allowed-tools: against the skill's actual job. A skill that reads code and reports findings shouldn't need Write. A skill that runs tests doesn't need Bash with sudo. If the permissions feel wider than the description warrants, ask why — and if there's no answer in the SKILL.md, skip.
Look for a worked example. A real input and a real output, ideally in a fenced code block. Skills without examples have rarely been tested by anyone other than the author, and the author had the full context loaded.
Check the author and the source repo. Is there a real GitHub link? An author handle that's published other skills you trust? A last-commit date in the past few months? None of these guarantee quality, but the absence of all of them is a strong signal that the skill is abandoned scaffolding.
Sanity-check the install command. Does it just copy the SKILL.md, or does it run an arbitrary shell pipe? Anything that pipes curl into bash for a skill install is over the line. Skill installs should be file copies, full stop.

Run this filter and you'll skip eight out of ten skills you would have otherwise installed. The two that pass will be better than the average of the ten you would have installed without it. Over a year, the time saved by not having to uninstall the eight is more than the time spent running the filter.
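Two steps of the test are plain string checks and can be sketched in a few lines of Python. The phrase list is the one named in the step about anti-triggers; the function names and the pipe-to-shell pattern are assumptions for illustration, and the example URL is a placeholder. A match or a miss is a prompt to read more closely, not a verdict.

```python
import re

# The four strings named in the anti-trigger step of the test above.
ANTI_TRIGGER_PHRASES = ("do not", "skip", "not for", "out of scope")

# Assumed pattern: a download piped straight into a shell, e.g. `curl ... | bash`.
PIPE_TO_SHELL = re.compile(
    r"\b(curl|wget)\b[^|\n]*\|\s*(sudo\s+)?(ba|z)?sh\b"
)


def anti_trigger_hits(skill_md):
    """Return which of the anti-trigger phrases appear in the SKILL.md text."""
    lowered = skill_md.lower()
    return [p for p in ANTI_TRIGGER_PHRASES if p in lowered]


def pipes_download_to_shell(install_command):
    """True when the install command pipes a download into a shell."""
    return bool(PIPE_TO_SHELL.search(install_command))


print(anti_trigger_hits("Do NOT use for single-file linting."))  # ['do not']
print(pipes_download_to_shell(
    "curl -fsSL https://example.invalid/install.sh | bash"
))  # True
```

The substring search is intentionally crude: "skip" also matches "skipped", so a hit still deserves a glance at the surrounding sentence.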

If you want a longer, structured version of the same exercise, the Claude Code skill quality checklist goes through the same material at more depth, with the criteria spelled out as a yes/no walkthrough.

In defence of the open-catalog model

Given everything above, you might reasonably ask: would a walled garden be better? An invitation-only, heavily-curated list of skills vetted by humans before publication?

For some users, yes. For the ecosystem as a whole, no — and the reasons are worth being honest about.

A walled garden's curation bottleneck is also its choke point. Whoever decides what gets admitted shapes what the user base believes a skill is, what good skills look like, and which problem spaces are legitimate. That shaping power is large, and it tends to entrench whatever the curator's first cohort believed was best practice. New idioms, new tooling integrations, and skills that solve problems the curator doesn't personally have get filtered out — not maliciously, just because the curator doesn't recognise them as worth admitting.

Open catalogs make a different trade. They admit too much, the median is mediocre, and the reader has to filter. In exchange, the catalog reflects what the community is actually building, not what one curator's taste says should be built. The interesting new patterns — the skills that turn out to be excellent in retrospect — show up first in open catalogs, often months before any walled garden would have noticed them. The cost is the filtering burden you carry as a reader. The benefit is that you see the full surface of what's possible.

The other defence of the open model is correction speed. When a walled garden makes a bad admission decision, that skill is endorsed for as long as the curator hasn't gotten around to revisiting it. When an open catalog publishes a bad skill, every reader's filter catches it independently, and the bad skill gets quietly ignored within weeks. The error correction is distributed, which means it's faster than any one curator could manage.

The pragmatic stance is to use both. Open catalogs are where you discover. Curated lists — including editorial picks within an open catalog — are where you save time when you have a specific problem and don't want to filter from scratch. Neither is the right answer alone. The combination is what gets you to a skill set that's broad enough to be useful and selective enough to stay reliable.

The takeaway: don't avoid open catalogs because of the median problem. Use them with the filtering muscle that the median problem demands. The skill of reading skill metadata sceptically is, itself, one of the most valuable things you can learn this year.

Advice to authors: less, sharper

If you publish skills — or you're thinking about it — here is the version of this piece that applies to you.

Shipping volume hurts you more than it helps. Every additional skill under your name is one more chance for a reader to install something of yours, have a bad experience, and remember the experience as "that author's stuff doesn't really work." The portfolio framing — "I have published thirty skills" — feels like an asset and isn't. What you actually want is a small number of skills that the readers who use them love, recommend to others, and keep installed across the months when their needs change.

Sharper skills earn that loyalty. Narrow the scope until you can defend the description in one sentence. Write the "When NOT to use" section before you write anything else; it forces you to define the boundary. Test the skill in three or four real sessions before you publish, not after. If you can't think of a real session it would have helped you with, don't publish it yet.

The SKILL.md file is the entirety of your reader's impression of you. They will not click through to your GitHub profile, watch your demo video, or read your blog post. They will read the first paragraph of the SKILL.md, scan the frontmatter, and decide. Spend more time on that file than feels reasonable. The amount of polish a skill needs to look credible is higher than it used to be, because readers have seen enough catalog noise to be sceptical by default.

If you're new to writing skills and want a step-by-step on the file format itself, writing a SKILL.md file walks through the structure, the YAML conventions, and the formatting patterns that read as competent rather than amateur. The mechanical part of writing a good skill is fast to learn. The judgement part — knowing when to publish, when to keep iterating, when to delete the draft and walk away — takes longer, and is what separates the authors people remember.

One last thing. Most of the authors whose skills I trust have between two and six published skills, all in adjacent domains, all updated within the last few months. The authors with fifty skills are usually published-and-forgotten. Decide which kind of author you want to be early. It's much easier to grow a small, sharp catalog than to shrink an overgrown one.

Frequently asked questions

Are you saying the catalog you're publishing this on is mostly noise?

I'm saying the median entry in any open catalog, this one included, is weaker than the entries you'd actually want to install. That's a structural property of open ingestion, not a flaw in any particular catalog. The reader still has to do the second-level filtering, and the catalog should give you the signals to do it efficiently.

What's the single most reliable signal that a skill is worth installing?

An explicit "When NOT to use" section in the SKILL.md. Authors who write that section have already thought about anti-triggers and scope, which correlates with everything else you care about. Skills that lack it tend to over-fire and erode your trust in the system.

How long does the 90-second test actually take in practice?

For experienced readers, about a minute once you've internalised the steps. For your first ten or twenty audits, expect closer to three or four minutes per skill. The investment pays back quickly because every skill you don't install is a skill you don't have to uninstall later.

Should I just stick to skills from authors I already trust?

That's the right default, but don't make it a hard rule. New authors ship excellent skills regularly, and you'll miss them if you only browse familiar names. Use author reputation to lower your filtering threshold, not as a binary gate.

Is there a way to bulk-evaluate skills programmatically?

You can lint frontmatter (description length, allowed-tools width, presence of anti-trigger sections), and that catches the most obvious failure modes. But the harder judgement calls (scope discipline, example honesty, whether the author has actually used the skill under load) still require human reading. Don't outsource the whole evaluation to a script.

What about skills that look thin but are actually deep?

They exist. A genuinely good skill in a narrow domain can be short, because the domain doesn't need many words. The tell is whether the brevity is disciplined or lazy: disciplined brevity covers scope, anti-triggers, and at least one example in few words; lazy brevity skips all three. Read carefully: short isn't bad, but vague is.

How often should I re-audit skills I've already installed?

Every few months, and immediately if a skill misbehaves in a session. Authors update; skills that were tight at publication can drift if the author keeps adding scope. A short quarterly sweep of your installed skills, removing the ones you haven't invoked in months, keeps the active set sharp.
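For the bulk-evaluation question, a first pass over a directory of skills might look like the sketch below. It assumes one subdirectory per skill, each holding a SKILL.md, and it checks only two things: whether the file opens with frontmatter and whether it has a heading-style "When NOT to use" line. The audit_tree name and the heading pattern are hypothetical, and the result is a shortlist for human reading rather than an evaluation.

```python
import re
from pathlib import Path

# Assumed pattern: a line that starts (optionally after # marks) with the phrase.
WHEN_NOT_HEADING = re.compile(
    r"^\s*#*\s*when not to use\b", re.IGNORECASE | re.MULTILINE
)


def audit_tree(root):
    """Yield (skill_name, has_frontmatter, has_when_not_section) per skill.

    Assumes one directory per skill, each containing a SKILL.md file.
    """
    for skill_md in sorted(Path(root).expanduser().glob("*/SKILL.md")):
        text = skill_md.read_text(encoding="utf-8", errors="replace")
        yield (
            skill_md.parent.name,
            text.lstrip().startswith("---"),
            bool(WHEN_NOT_HEADING.search(text)),
        )


if __name__ == "__main__":
    # ~/.claude/skills/ is the install directory mentioned earlier in the piece.
    for name, has_fm, has_when_not in audit_tree("~/.claude/skills"):
        print(f"{name}: frontmatter={has_fm} when-not-section={has_when_not}")
```

Skills that fail both checks are the cheapest candidates to drop; anything that passes still needs the scope and example checks that a script cannot make.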
