Your Claude Skill Is a Pile of Markdown. Ship It Like Software.
Most Claude skills ship without a software lifecycle. This post documents the six-phase one I used on session-handoff: behavioral eval, packaged distribution, independent code review, CI for drift and portability, end-to-end install verification (with a supply-chain trust model), and governance. Each phase has a specific cost, a specific payoff, and catches a specific class of bug the author cannot catch themselves.
Tags: dev, ai, claude, workflow, eval, plugins
Claude skills are software. They have inputs, outputs, dependencies, distribution, upgrade paths, and users who installed them from a link they clicked once. Most skills I’ve read online were written, hand-tested once, and committed. That’s the quality bar for a gist, not for something you invite other people to install into their editor.
This post documents the software development lifecycle I used for one skill, all the way through. It’s not long, not heavy, not enterprise. It’s about two hours of work spread across six phases, each of which catches a class of bug the phase before it cannot. The skill is session-handoff — it generates a brief for resuming work in a fresh Claude Code session after /clear. It’s the running example; the discipline is the point.
Working code. The full skill, plugin manifest, 36-assertion structural test suite, three eval scenarios, and CI drift-check job all live in the companion repo neurot1cal/bdigital-public (PR #10 shipped the sample; PR #11 added the installable plugin). Read the source at plugins/session-handoff/skills/session-handoff/SKILL.md — or install in two commands inside Claude Code:
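The two commands, assembled from the qualified plugin name used later in this post (run them inside Claude Code, not in a shell):

```
/plugin marketplace add neurot1cal/bdigital-public
/plugin install session-handoff@bdigital-public
```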
Each phase below: what the discipline is, what it caught on this skill, what it costs to run, and what it’s actually testing that the author alone cannot.
Phase 1: Behavioral evaluation
What it tests. When the skill fires, does it produce different (better) output than baseline Claude would without the skill? Not “does the skill work in isolation” — that question grades on its own curve. The question is whether the skill’s existence changes anything.
How to run it. skill-creator ships a behavioral pipeline. Write 3–5 scenarios covering the shapes your skill should handle. Fire one subagent per scenario per condition — once with the skill loaded, once without. Assert programmatically on the outputs.
For session-handoff I wrote three scenarios, each baked with a simulated session history:
[
{"name": "trivial-session",
"prompt": "Q&A session, no code changes...generate a handoff",
"expected": "Short-circuit. Decline. Do NOT produce a template."},
{"name": "multi-feature",
"prompt": "90 minutes, three threads (new skill auth, docs fix, half-finished overrides)...",
"expected": "Rich handoff covering all three threads."},
{"name": "deep-debugging",
"prompt": "2-hour debug session, rejected hypotheses, root cause, partial fix, CI still running...",
"expected": "What Didn't Work, Key Findings, Decisions, Next Steps. No source dumps."}
]
Fourteen programmatic assertions across the three scenarios: section presence, specific phrases, word-count budgets, absence of full source dumps.
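A minimal sketch of what one such grader can look like in Python. The section names are taken from the deep-debugging scenario above; the 600-word budget and the 30-line dump threshold are illustrative assumptions, not the repo's actual values:

```python
import re

def check_handoff(output: str, max_words: int = 600) -> list[str]:
    """Return the failed-assertion names for one eval transcript."""
    failures = []
    # Section presence: canonical headers the skill's template prescribes
    for section in ["Key Findings", "What Didn't Work", "Decisions", "Next Steps"]:
        if section.lower() not in output.lower():
            failures.append(f"missing-section:{section}")
    # Word budget: a handoff is a brief, not a transcript
    if len(output.split()) > max_words:
        failures.append("over-word-budget")
    # No full source dumps: a long fenced block suggests a pasted file
    for block in re.findall(r"```.*?```", output, flags=re.S):
        if block.count("\n") > 30:
            failures.append("source-dump")
            break
    return failures
```

Run it once per scenario per condition; a with-skill/without-skill pair of failure lists is what makes an assertion discriminating or not.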
What it caught. The aggregate numbers looked fine:
| Metric | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Pass Rate | 100% ± 0% | 89% ± 19% | +11pp |
| Time | 24.6s | 18.4s | +6.2s |
| Tokens | 21,525 | 17,915 | +3,609 |
The per-eval breakdown told the real story:
Ten of the fourteen assertions — every assertion on the substantive scenarios — pass regardless of whether the skill is loaded. Baseline Claude, given “generate a session handoff” with a reconstructed session context, produces a serviceable handoff on its own. It names the three threads. It captures rejected hypotheses. It flags follow-up items. The skill’s structural scaffolding is non-discriminating.
The only assertion that discriminated between conditions was on the trivial scenario: “output under 100 words (short-circuited, not padded).”
I had built the skill believing its value was structure — canonical section names, a word budget, a Git Workflow footer. The eval said no. Its value is judgment: knowing when not to write a handoff. Default-helpful is a failure mode, and the skill’s short-circuit is the load-bearing behavior.
What it’s testing that you can’t test alone. Your mental model of what the skill is for. I couldn’t have learned “the sections don’t matter; the short-circuit does” from reading the SKILL.md or hand-running the skill on a few sessions. I could only learn it from the counterfactual — what does baseline Claude produce, and where does that differ from what the skill produces? That’s the Executor/Grader discipline applied to your own output: let a separate process tell you where the skill is actually earning its keep.
Phase 2: Packaged distribution
What it tests. Can a stranger install your skill in one or two commands they can copy-paste from your blog post? If the answer involves “copy this file to that path, then edit this config,” you have a reference, not a release.
How to run it. Claude Code supports two distribution shapes. A sample is a readable, copy-paste reference under your repo’s samples/ directory — readers study or vendor it into their own repo. A plugin is installable in one command via /plugin marketplace add and /plugin install, with a versioned manifest and a proper distribution path.
For session-handoff I shipped both, from the same repo:
.claude-plugin/
└── marketplace.json # makes the repo a marketplace
The source: "./plugins/session-handoff" line is the critical one. It tells Claude Code where to find the plugin directory relative to the marketplace root. For plugins vendored in the same repo, the bare-string form works; for external plugins the object form with url and sha pins the exact revision.
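Put together, a minimal marketplace.json using the bare-string form might look roughly like this; the name and owner fields are assumptions about the manifest schema beyond the source line quoted above:

```json
{
  "name": "bdigital-public",
  "owner": { "name": "neurot1cal" },
  "plugins": [
    { "name": "session-handoff", "source": "./plugins/session-handoff" }
  ]
}
```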
What it caught. The shape discipline forced a choice I had been avoiding: is this skill a study material or an installable tool? The right answer for most skills is “both” — so long as the two copies of SKILL.md stay byte-identical. That drift risk creates a downstream requirement (see Phase 4).
It also caught the gap between “it works on my machine” and “it works on a fresh install.” My original install docs told readers to cp samples/session-handoff/SKILL.md ~/.claude/skills/session-handoff/. The plugin path reduces that to two slash commands. If the blog post tells readers to run a command, the command should work out of the box.
What it’s testing that you can’t test alone. Distribution is inherently external to the author’s machine. The manifest is either valid or it isn’t; the source field either resolves or it doesn’t. This is the first phase where “does it work?” has a binary answer another machine can verify.
Phase 3: Independent code review
What it tests. Everything that’s obvious in hindsight but invisible to the author.
How to run it. Spawn a fresh subagent with the superpowers:code-reviewer (or equivalent) role. Give it the PR branch, the diff, and a punch-list prompt. Critically: no conversation context from the session that produced the PR. The value is the clean read.
On PR #11 I ran a review pass that came back with 8 items across manifest correctness, duplication risk, README accuracy, security posture, and CONTRIBUTING.md alignment.
What it caught on this skill. Five findings were blockers or near-blockers:
allowed-tools included unused tools. My SKILL.md declared Glob and Grep permissions that the procedure never uses. For a user-invocable skill anyone can install, granting tool access you don’t need is a security posture problem. Fixed by scoping down to the minimum set.
Log path collision risk. The skill saved handoff logs to ~/.claude/skills/session-handoff/logs/. That path is outside the plugin install root, so it survives /plugin uninstall and collides with any other marketplace that ships a plugin named session-handoff. Fix: save to ~/.claude/data/session-handoff/logs/ — a user-data namespace both plugins and manual installs respect.
LICENSE not reachable from the plugin directory. My plugin README linked ../../LICENSE, which works on GitHub but breaks once the plugin is vendored into ~/.claude/plugins/cache/bdigital-public/.... MIT requires the license notice travel with the code. Fix: copy LICENSE into plugins/session-handoff/LICENSE so it ships alongside.
Install commands untested end-to-end. My PR body claimed /plugin marketplace add neurot1cal/bdigital-public worked — which I had not actually run against my branch. Shipping install instructions you haven’t tested is how blog-post install commands embarrass the author in public.
/plugin uninstall needs a marketplace qualifier. session-handoff is a generic name. If a second marketplace ships a plugin with the same name, the bare uninstall command becomes ambiguous. Fix: document the qualified form session-handoff@bdigital-public from day one.
Three more findings (drift risk, missing plugin-authoring guidance in CONTRIBUTING.md, missing plugin: prefix in pr-hygiene) were also legitimate and got fixed in the same pass.
The reviewer got one thing wrong: they flagged my TaskList tool declaration as invalid, claiming the tool is called TodoWrite. TaskList is a real tool in current Claude Code; TodoWrite was real in older versions. Fix: declare both in allowed-tools for cross-version compatibility. Lesson: review-generated findings are signal, not gospel — verify each one.
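For context, the post-review SKILL.md frontmatter might look roughly like this sketch. The Read and Write entries are assumptions; the post only confirms that Glob and Grep were removed and that both TaskList and TodoWrite are declared:

```yaml
---
name: session-handoff
description: Generate a brief for resuming work in a fresh session after /clear.
allowed-tools: Read, Write, TaskList, TodoWrite
---
```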
What it’s testing that you can’t test alone. The reviewer saw the diff without the narrative that produced it. Everything I would have agreed with if pointed out is exactly what needs pointing out. The LICENSE link, the log path collision, the untested install command — none were subtle. They were invisible to me because I’d been staring at the surrounding context for an hour. A reviewer with no context sees the surface straight through.
Phase 4: CI for drift and portability
What it tests. The stuff that happens between your machine and a contributor’s machine — unsynced file copies, shell version differences, package manager quirks.
How to run it. A few lines of YAML in a GitHub Actions workflow. For session-handoff, two steps:
session-handoff:
  name: session-handoff skill integrity
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Verify samples/ and plugins/ SKILL.md are identical
      run: |
        diff -q samples/session-handoff/SKILL.md \
                plugins/session-handoff/skills/session-handoff/SKILL.md
    - name: Run structural regression suite
      # 36 assertions against the plugin copy; the script path is illustrative
      run: bash plugins/session-handoff/tests/structural-test.sh
Step one: diff -q fails the build if the two copies of SKILL.md drift. Step two: the 36-assertion structural regression suite runs against the plugin copy (the samples copy already passes by virtue of the diff check).
What it caught on first run. The CI job failed immediately — for a reason neither I nor the reviewer had anticipated.
The structural test script had been passing locally against macOS bash 3.2 for weeks. In CI, under Ubuntu’s bash 5.2 running as /usr/bin/bash -e, it aborted after printing the first section header. No assertions ran.
The culprit:
pass() { ((PASS++)); printf " ✓ %s\n" "$1"; }
((PASS++)) returns the pre-increment value as its exit status. When PASS starts at 0, the expression evaluates to 0 — which bash maps to exit status 1. Under set -e, that aborts on first call. Older bash versions don’t enforce this; newer ones do. The one-line fix:
pass() { PASS=$((PASS+1)); printf " ✓ %s\n" "$1"; }
VAR=$((VAR+1)) always exits 0. CI went green on run two.
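Both behaviors can be reproduced directly in a modern bash (4.1 or later, where arithmetic commands are subject to errexit):

```shell
# ((PASS++)) post-increments, so the expression evaluates to 0 on first call;
# bash maps an arithmetic result of 0 to exit status 1, and set -e aborts.
bash -c 'set -e; PASS=0; ((PASS++)); echo "reached"'          # prints nothing, exits 1

# The assignment form always exits 0, so execution continues.
bash -c 'set -e; PASS=0; PASS=$((PASS+1)); echo "reached"'    # prints "reached"
```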
What it’s testing that you can’t test alone. Your machine is one machine. The CI job runs your tests in a different environment — different shell, different OS, different tool versions — and tells you whether your work actually generalizes. The bash strict-mode bug was invisible on my laptop. It was caught on first contact in CI. That’s the entire value proposition: some bugs only exist at the seam between two environments, and CI is the only place that seam lives in your development loop.
Phase 5: End-to-end install verification
What it tests. Whether the command you told readers to type actually works. Nothing more.
How to run it. Tear down any local copy of the skill. Run the exact commands from the blog post. Confirm the skill loads.
For session-handoff, the test looked like this:
$ rm -rf ~/.claude/skills/session-handoff
(Skill gone. No filesystem copy that could be masking a broken install.)
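The install itself is the pair of published commands, run inside Claude Code (the reload output, which reports the plugin count, is not reproduced here):

```
/plugin marketplace add neurot1cal/bdigital-public
/plugin install session-handoff@bdigital-public
```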
The 9 plugins count was one higher than the previous reload. In the available-skills list, session-handoff now appeared as session-handoff:session-handoff — plugin-namespaced, confirming it loaded from the plugin cache rather than a loose local file.
The attempt also verified a separate contract: skill names matter. I fat-fingered sessoin-handdoff twice before getting the spelling right. Both typos returned Plugin "sessoin-handdoff" not found in any marketplace — clean failure, no partial install, clear error.
What it’s testing that you can’t test alone. The literal published command. Every one of the prior phases was preparatory. This is the only one that answers “will the blog-post install command work for a reader I’ll never meet?” — and it answers it in ten seconds.
This is also where my global rule “never claim live without verification” applies. No “installed, live, go” until both CI is green AND the install worked end-to-end. The commit message can say verified end-to-end; the blog-post subtitle can claim the install works. Before both checks return green, neither is accurate.
Phase 5b: Trust model (for both directions)
What it tests. Whether installers understand what they’re opting into, and whether your own repo’s marketplace is resistant to supply-chain manipulation.
Why it’s its own concern. /plugin install is equivalent in spirit to curl | bash. It loads executable instructions (SKILL.md) and tool grants (allowed-tools) from an arbitrary git repo. Claude Code’s permission prompts narrow the blast radius but don’t eliminate it. A skill with allowed-tools: Bash, Write, WebFetch can do anything a shell can do — and once installed, its instructions become part of every session until uninstalled.
The attack surface has several components: the marketplace manifest, the plugin's executable instructions (SKILL.md), and its tool grants (allowed-tools) all arrive from a git repo the installer doesn't control.
What to do as a publisher. Two concrete moves:
Publish a trust model. Your plugin README should state what the skill reads, what it writes, what it does not do (no network, no destructive commands, no secret exfil), and how to verify the install is coming from the right marketplace. session-handoff’s README is the template I landed on.
SHA-pin external marketplace entries. If your marketplace ever ships a plugin that lives in someone else’s repo, the source field must use the object form with a pinned sha:
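A sketch of that object form (the entry name and the sha value are placeholders; only the url and sha field names come from the text above):

```json
{
  "name": "external-plugin",
  "source": {
    "url": "https://github.com/upstream-owner/external-plugin",
    "sha": "<full-commit-sha>"
  }
}
```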
Without the sha, every /plugin marketplace update is a trust-me-bro moment with the upstream maintainer. Bumping the SHA for a new upstream release is a normal PR to your marketplace and goes through your own review.
What to do as an installer. Five habits:
Read the SKILL.md before installing. The install command doesn’t prompt you. 60 seconds of reading catches the obvious attacks.
Verify the marketplace owner. /plugin marketplace add neurot1cal/bdigital-public — note the owner. Any other owner is a typosquat.
Prefer pinned marketplaces. External plugins in a trustworthy marketplace should be SHA-pinned. If you can’t tell, assume they aren’t.
Don’t click through permission prompts. A newly-installed skill that wants WebFetch on first invocation has earned a pause.
Scope narrowly. A plugin declaring Bash, Write, WebFetch together with no matching explanation in its README is a smell.
What it’s testing that you can’t test alone. The trust model between author and installer is bilateral. An author can’t see whether their install command is being clicked by a careful reader or by someone running through a tutorial on autopilot. An installer can’t see whether the marketplace manifest is pinned or drifting. Both sides write down what they’d want the other to read. The effort on either side is small; the effort to recover from a successful supply-chain attack is not.
Phase 6: Governance
What it tests. Whether the next contributor — you in six months, or a stranger opening their first PR — goes through the same discipline.
How to run it. Make the discipline structural, not folkloric.
For this repo that meant four edits after the first plugin landed:
plugin: added to the pr-hygiene allowlist. Conventional-commit enforcement in the repo’s pr-hygiene Action already covered feat:, fix:, sample:, skill:, etc. plugin: didn’t exist. Adding it means future PRs that touch plugins/ get a canonical prefix.
“New plugin” checkbox added to the PR template. Plus a new checklist row: “If a plugin under plugins/ changed, .claude-plugin/plugin.json is valid JSON and the marketplace entry matches its name; LICENSE is reachable from the plugin directory.”
Plugin-authoring guide added to CONTRIBUTING.md. Layout requirements, marketplace entry shape, allowed-tools scoping, user-data path conventions, the samples-vs-plugins duplication policy, and the verification steps before opening a PR. The next person adding a plugin has a spec to follow.
CI drift job is a template. Any future plugin that also ships a samples/ copy can clone the session-handoff skill integrity job, change the two paths, and have drift detection for free.
What it’s testing that you can’t test alone. Your future self. None of the above matters if the next PR silently invents a different plugin shape, renames log paths, or drops the LICENSE file. The governance layer is what keeps the discipline from decaying the moment you stop paying attention.
Running this on your own skill
The playbook is short:
Evaluate. /plugin marketplace add anthropics/skill-creator. Write 3–5 scenarios covering different shapes your skill should handle (including a decline case if the skill should know when not to fire). Fire with-skill and without-skill runs in parallel. Look at the per-assertion breakdown, not just the mean. Non-discriminating assertions are telling you that part of the skill isn’t pulling its weight.
Package. Add .claude-plugin/marketplace.json at the repo root, plugins/<name>/.claude-plugin/plugin.json as the plugin manifest, and a README.md with install instructions. Copy LICENSE into the plugin directory.
Review. Spawn a code-review subagent against the PR branch with no conversation context. Get a punch list. Fix every legitimate item; push back on the wrong ones after verifying (not before).
CI. If you ship a samples/ copy alongside the plugin, add a diff -q job. Run your skill’s structural test suite in a fresh container, not just on your laptop.
Verify install. Remove any local copy of the skill, run the blog-post install commands, confirm the skill loads. Only after that: claim the install works.
Govern. Every discipline you just used, write into CONTRIBUTING.md and the PR template so the next plugin goes through the same loop without you.
Closing
Skills aren’t gists. If you’re inviting someone to install something into their editor, the install experience is part of the release — and “the install command works” is a test with a binary answer that someone other than you can verify.
The total cost of the loop above was about two hours, most of which was the review + fix cycle. The eval was 60 seconds of subagent calls. The CI drift job was 30 lines of YAML. The install verification was three slash commands.
Each phase caught a class of bug the phase before it couldn’t. Behavioral eval reframed what the skill was for. Packaging surfaced the duplication and LICENSE-distribution questions. Review caught the log-path collision and untested install claims. CI caught a portability bug on first contact. Install verification confirmed the blog-post command worked against a torn-down-and-rebuilt install. Governance is what keeps the discipline from being a one-off.
Engineering discipline is not a toy for large codebases. It’s what separates software someone uses from software the author says works.