
How Smart Are Your Claude Code Skills?

I built a self-improving loop for Claude Code slash commands that observes failures, finds patterns, and proposes fixes. Skills that get better over time instead of quietly degrading.


I have about 15 slash commands in Claude Code. They write blog posts, generate proposals, cross-post to forums, create Instagram carousels, and handle deployments. They work well on day one. Then they start failing in ways I don’t notice until I’m staring at bad output and manually fixing things.

It’s not that the skills are poorly written. Skills are static files in a directory while everything around them changes. The codebase evolves. The platforms I’m posting to update their requirements. I learn what works and what doesn’t through actual use. But the .md file sitting in .claude/commands/ has no idea any of that happened.

I spent an afternoon building a feedback loop that fixes this. My skills now observe their own failures, accumulate evidence, and propose targeted improvements based on real usage patterns.

The Failure That Started This

I have a content generation skill that writes product review articles for submission to an external platform. The skill knows their exact format: 16:9 landscape photos, a rating scorecard, specific section structure, casual-knowledgeable tone. It’s about 200 lines of detailed instructions.

Last week I used it to write a tire review. Three things went wrong.

First, the skill tried to auto-crop my portrait photos to 16:9 landscape using sharp. It used the attention crop mode, which is supposed to focus on the most interesting region. Instead it produced a photo of nothing but sky and a truck roof. Then it overcorrected and cut the tire in half. After three attempts I gave up and cropped everything manually in Lightroom.

Second, I only had five photos of the product but seven H2 sections in the article. The skill filled the gap by dropping in photos of a different product as placeholders. A product review where half the photos show the wrong product. Anyone reading it would notice.

Third, the skill generated a rating scorecard as part of the HTML output. I spent fifteen minutes trying to figure out where to paste it in the WordPress editor before realizing the scorecard isn’t HTML at all. It’s rendered by a WordPress plugin through custom fields. The skill had no idea.

Three failures in one session. All fixable. All invisible to the skill.

The Observe-Inspect-Amend Loop

Credit goes to a post by Vasilije at Cognee about self-improving agent skills. The idea is simple: skills should have a memory of what happened when they ran. Without observation, failure is invisible. Without evidence, improvement is guessing.

I built two new slash commands and a log file.

/skill-log records what happened after any skill runs. It takes natural language input and structures it into a JSON-lines log file at .claude/skill-logs.jsonl. Each entry captures the skill name, what was attempted, whether it succeeded or failed, what the user corrected, and a severity level.

After that session, I logged three entries:

{"skill":"content-review","outcome":"partial","issue":"automated sharp crops produced bad framing","correction":"use Lightroom instead of automated cropping","category":"output-quality","severity":"high"}
{"skill":"content-review","outcome":"partial","issue":"wrong photos used as placeholders","correction":"only use photos of the actual product being reviewed","category":"wrong-content","severity":"medium"}
{"skill":"content-review","outcome":"partial","issue":"scorecard unclear whether HTML or WordPress custom fields","correction":"scores entered via WordPress custom fields not HTML","category":"missing-step","severity":"medium"}

That’s the observation layer. Just structured data about what went wrong.
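The whole observation layer fits in a few lines. Here's a minimal sketch of the append step, assuming the field names from the entries above plus a `ts` timestamp; the function name and defaults are illustrative, since the real structuring is done by the /skill-log prompt:

```python
import json
import pathlib
from datetime import datetime, timezone

LOG_PATH = pathlib.Path(".claude/skill-logs.jsonl")

def log_observation(skill, issue, correction, outcome="partial",
                    category="output-quality", severity="medium"):
    """Append one structured observation to the skill log."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "skill": skill,
        "outcome": outcome,
        "issue": issue,
        "correction": correction,
        "category": category,
        "severity": severity,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    # Append-only: one JSON object per line, never rewrite old entries.
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

log_observation(
    skill="content-review",
    issue="automated sharp crops produced bad framing",
    correction="use Lightroom instead of automated cropping",
    severity="high",
)
```

Append-only matters: entries are evidence, and evidence shouldn't be edited after the fact.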

/skill-improve reads the logs, identifies patterns, maps each pattern to a specific section of the skill file, and proposes targeted edits with evidence.

When I ran /skill-improve on the content review skill, it produced a report:

  • Pattern 1: Automated photo cropping fails on portrait images (1 log, severity high). The skill specifies 1024x576 dimensions but says nothing about how to get there. Proposed fix: add an explicit instruction that portrait-to-landscape crops require human judgment, include Lightroom export specs.

  • Pattern 2: Placeholder photos don’t match the product (1 log, severity medium). The skill says “all photos on YOUR actual truck” but doesn’t say “all photos must show the product being reviewed.” Proposed fix: require every photo to show the actual product, flag gaps instead of substituting.

  • Pattern 3: Scorecard submission method missing (1 log, severity medium). The skill generates scorecard HTML but doesn’t explain that the platform uses WordPress custom fields. Proposed fix: add a note that scores go in custom fields, not the HTML body.

Each proposal pointed to the exact lines in the skill file that needed to change, quoted the current text, and showed the replacement. I approved all three. Three surgical edits, each grounded in evidence from a real failure.
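The mechanical half of that analysis, grouping a skill's entries by category and ranking the groups by severity and frequency, is straightforward. A sketch (`patterns_for` is a hypothetical name; the real /skill-improve adds the step of mapping each group to a section of the skill file):

```python
import json
from collections import defaultdict

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

def patterns_for(skill, log_lines):
    """Group one skill's log entries by category, then rank the groups:
    worst severity first, ties broken by how often the issue recurs."""
    groups = defaultdict(list)
    for line in log_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("skill") == skill:
            groups[entry["category"]].append(entry)

    def priority(item):
        category, entries = item
        worst = max(SEVERITY_RANK[e["severity"]] for e in entries)
        return (worst, len(entries))

    # Returns [(category, entries), ...] in fix-first order.
    return sorted(groups.items(), key=priority, reverse=True)
```

Feeding it the three entries from the tire-review session puts output-quality (severity high) at the top, which is exactly the order the report above presented them in.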

Why This Beats Manual Editing

I could have just opened the skill file and fixed these things by hand. I’ve done that dozens of times. The problem with manual fixes is threefold.

You fix what you remember. After a long session with fifteen corrections, you remember the last two. The other thirteen fade. The log captures all of them.

You fix symptoms, not patterns. “The crop was bad” is a symptom. “The skill has no guidance on how to handle portrait-to-landscape conversion” is the pattern. /skill-improve maps symptoms to the structural gap in the instructions.

You don’t know when to stop. Without data, every edit looks equally important. With logs showing severity levels and frequency counts, you fix the high-severity recurring issues first and leave the edge cases alone.

The CLAUDE.md Half-Life Problem

Todd Saunders wrote about this recently: CLAUDE.md has a half-life. It starts clean and focused. Then you patch mistakes. “Don’t use this import.” “Never use this folder.” After a few weeks it’s 200 lines of negation patches and the important directives are diluted.

CLAUDE.md gets injected into every single interaction. Every extra line shrinks the context window available for reasoning about your actual code. A 50-line file with clear architectural intent gives Claude a better understanding of your project than a 1,000-line file full of “don’t do X” patches.

My CLAUDE.md was at 399 lines. Eleven negation patches. Entire sections duplicated between the file and my Claude memory system. Detailed feature documentation that belongs in memory, not in every prompt. Proposal rates and forum posting instructions that belong in skills, not in the global context.

I built a /claude-md-refactor skill that applies the same observe-and-improve philosophy to the CLAUDE.md file itself. It categorizes every section into four buckets:

  • KEEP: Architectural intent that affects every interaction (tech stack, deployment, project structure)
  • MOVE TO MEMORY: Learned knowledge that doesn’t need to be in every prompt (feature details, pricing strategy, gear lists)
  • MOVE TO SKILLS: Edge cases that only matter during specific tasks (proposal rates go to /proposal, forum config goes to the cross-posting skill)
  • DELETE: Outdated, redundant, or already captured elsewhere

I ran it today. 399 lines dropped to 85. The proposal section (31 lines) was already fully covered by the /proposal skill. The forum cross-posting section (29 lines) was already in its own skill. The video clip sales documentation (42 lines) was duplicated word-for-word in my memory system. The 18-item Lessons Learned section was mostly things already encoded in the code itself or captured in memory.

Every line I removed is context window I got back for actual reasoning.

How to Build This

It’s three files and a convention.

.claude/skill-logs.jsonl is an append-only JSON Lines file. One JSON object per line. Each entry has: timestamp, skill name, task description, outcome (success/partial/failure), issue, correction, category, and severity. No database. No dependencies. Just a text file.
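Because the file is hand-fed and append-only, a quick sanity check before analysis is cheap insurance. Here's a sketch assuming the full field set just listed (`check_log` is a hypothetical helper, not part of any skill):

```python
import json

REQUIRED = {"timestamp", "skill", "task", "outcome",
            "issue", "correction", "category", "severity"}
OUTCOMES = {"success", "partial", "failure"}

def check_log(text):
    """Validate a skill-logs.jsonl payload: every non-blank line must be
    a JSON object with the expected fields and a known outcome.
    Returns a list of (line_number, problem) pairs; empty means clean."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append((n, "not valid JSON"))
            continue
        missing = REQUIRED - entry.keys()
        if missing:
            problems.append((n, f"missing fields: {sorted(missing)}"))
        if entry.get("outcome") not in OUTCOMES:
            problems.append((n, f"unknown outcome: {entry.get('outcome')!r}"))
    return problems
```

Run it before /skill-improve trusts the log, and a malformed line becomes a line number instead of a silently skipped observation.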

.claude/commands/skill-log.md is a lightweight skill that takes natural language (“log that the carousel skill cut off the text on slide 3”) and parses it into a structured log entry. It infers the outcome from context, categorizes the issue, and appends to the JSONL file.

.claude/commands/skill-improve.md is the analysis skill. It reads logs for a specific skill, groups by category, identifies patterns, reads the current skill file, maps issues to specific sections, and proposes diffs with evidence and rationale. Never auto-applies. Always shows the evidence and asks for approval.

The convention is simple: after any skill runs and you give feedback (corrections, complaints, or praise), log it. Over time the logs accumulate patterns. When a skill starts underperforming, run /skill-improve and let the evidence guide the fix.

What I’d Build Next

Right now, the loop is semi-manual. I log observations by hand and run improvement analysis on demand. The next step is making observation automatic. If a skill produces output and the user immediately edits it, that’s a signal. If the user says “no, not like that” right after a skill runs, that’s a signal. Capturing those signals without requiring explicit logging would close the loop further.

Another missing piece is evaluation. Right now I eyeball the proposed changes and approve them. A proper evaluation step would run the amended skill against past inputs and compare outputs. Did the change actually improve things? The /skill-creator skill already has grader and comparator agents for exactly this. Wiring them into /skill-improve as an optional step would make the loop more rigorous.

But even without those additions, the basic loop works. Log what went wrong. Let the system find patterns. Apply targeted fixes grounded in evidence. Skills that improve themselves instead of quietly degrading.

That’s the whole idea. Three files and a habit.