
Self-Improving Claude Code Skills, Part 2: Full Automation

The self-improving Claude Code skills loop from Part 1 now runs itself. A SessionEnd hook and daily GitHub Action generate improvement PRs automatically.


In Part 1 I built a manual feedback loop for Claude Code skills. Two slash commands and a log file. /skill-log records what went wrong, /skill-improve analyzes patterns and proposes fixes. It works. But the last section of that post was titled “What I’d Build Next” and it described exactly the gap: the loop is semi-manual. You have to remember to log feedback. You have to remember to run the analysis. Nobody remembers to do either of those things after a long session.

Today I closed that gap. The system now observes its own failures, generates improvement proposals without being asked, and opens pull requests with evidence. It runs on the same infrastructure I use for the main site and the truck build blog. No manual logging. No manual analysis. Human review happens in the GitHub PR interface, which means I can approve changes from my phone while standing in line at the grocery store.

Here is what I built, how I tested it, and the three bugs it found in itself on day one.

The Observer Hook

The first piece is a SessionEnd hook. Claude Code fires this when any session ends. A Node.js script reads the session transcript, which is a JSONL file containing every message exchanged during the conversation.

The script walks through the transcript looking for two things. First, it finds every Skill tool invocation and extracts the skill name and task description. Then it scans the messages that follow each skill invocation for correction signals.

Correction signals come in two forms. Textual signals are phrases like “no”, “wrong”, “fix”, “not like that”, “actually”, or “redo” in the user’s follow-up messages. Tool-based signals are Edit or Write tool calls that happen after a skill produces output, which means the user or Claude is manually fixing what the skill generated.
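A minimal sketch of that detection logic, with an illustrative phrase list and function names of my own (not the actual implementation):

```javascript
// Phrases that count as textual correction signals (illustrative list).
const CORRECTION_PHRASES = ["no", "wrong", "fix", "not like that", "actually", "redo"];

// True if a user message reads like a correction. Word-boundary matching
// keeps "no" from firing inside words like "note" or "normal".
function hasTextualSignal(text) {
  const lower = text.toLowerCase();
  return CORRECTION_PHRASES.some((p) =>
    new RegExp(`\\b${p.replace(/ /g, "\\s+")}\\b`).test(lower)
  );
}

// True if a message contains an Edit or Write tool call, which (when it
// appears after a skill invocation) counts as a tool-based signal.
function hasToolSignal(message) {
  const blocks = message.content || [];
  return blocks.some(
    (b) => b.type === "tool_use" && (b.name === "Edit" || b.name === "Write")
  );
}
```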

For each skill execution where corrections were detected, the observer appends a structured entry to .claude/skill-logs.jsonl with "source": "auto-observer" to distinguish it from manual /skill-log entries.
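The exact schema isn't shown here, but an auto-observed entry might look something like this — only the source field (and the category and severity defaults described later) come from the post; the other field names are a plausible sketch:

```javascript
// Illustrative shape of an auto-observed log entry. Only "source" and the
// category/severity defaults are confirmed; the rest is an assumption.
const entry = {
  timestamp: new Date().toISOString(),
  sessionId: "abc123",           // used for deduplication
  skill: "/write",
  task: "blog post generation",
  correction: "User manually edited skill output (detected corrective tool use)",
  category: "other",             // default when no keyword pattern matches
  severity: "medium",            // default severity for generic detections
  source: "auto-observer",       // distinguishes from manual /skill-log entries
};

// JSONL: one JSON object per line.
const line = JSON.stringify(entry) + "\n";
```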

The constraints are tight. Pure Node.js, no npm dependencies. Must finish in under 10 seconds. Append-only writes so concurrent sessions don’t corrupt each other. Session ID deduplication so running the hook twice on the same transcript doesn’t create duplicate entries.

The Analyzer

The second piece is a deterministic Node.js script that reads accumulated log entries and generates improvement proposals. No LLM calls. Pattern matching and heuristics only.

It reads a marker file to find the last processed timestamp, filters the logs to unprocessed entries, groups them by skill, and applies a threshold: a skill needs at least two log entries OR one high-severity entry before it gets proposals. This prevents noise. A single medium-severity correction isn’t enough evidence to change a skill file.
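The threshold logic is simple enough to sketch. Field names are assumed from the log-entry description; the two-entries-or-one-high-severity rule is straight from the analyzer:

```javascript
// Group unprocessed entries by skill name.
function groupBySkill(entries) {
  const groups = new Map();
  for (const e of entries) {
    if (!groups.has(e.skill)) groups.set(e.skill, []);
    groups.get(e.skill).push(e);
  }
  return groups;
}

// A skill qualifies for proposals with at least two entries
// OR a single high-severity entry.
function meetsThreshold(entries) {
  return entries.length >= 2 || entries.some((e) => e.severity === "high");
}

// Entries newer than the last-processed marker timestamp.
// ISO-8601 timestamps compare correctly as plain strings.
function unprocessed(entries, markerTimestamp) {
  return entries.filter((e) => e.timestamp > markerTimestamp);
}
```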

The proposals are conservative and additive. Warning notes for output quality and technical accuracy issues. Checklist items for missing steps. Format constraints for output problems. Every proposal includes the evidence that triggered it and the exact text to insert. Nothing gets removed from existing skills.

The analyzer writes its proposals to docs/skill-improvements/YYYY-MM-DD.md as a changelog that doubles as a PR body.

The Daily PR

A GitHub Action runs at 6am Arizona time every day. It checks for unprocessed log entries, runs the analyzer, and if proposals were generated, creates a branch, commits the modified skill files and changelog, and opens a pull request with the evidence report as the body.

The PR includes an evidence table showing every log entry that contributed to each proposal, the exact text changes, the category and severity breakdown, and a risk assessment. All proposals are additive, so the risk section says the same thing every time: “No existing behavior removed.”

I review the PR on my phone, merge or close it, and the skills are updated. If I want to make richer edits before merging, I can. The automation handles the analysis and the proposal. The human handles the judgment call.

First Test: Real Data

I ran the observer against three recent session transcripts to see what it would capture from real usage.

It found five skill executions with correction signals across three sessions:

Session           Skill             Signal Type
Older session     /write            Corrective tool use after blog post generation
Older session     /nano-banana      Corrective tool use after OG image generation
Older session     cross-post skill  Corrective tool use after forum cross-post
Previous session  /skill-improve    Corrective tool use after analysis
Previous session  /write            Corrective tool use after blog post generation

All five observations were triggered by tool-based correction signals. The observer detected that Edit or Write tools were used after skill output, which means someone was manually fixing what the skill produced. None were triggered by negative textual signals from the user, meaning I made the corrections without explicitly complaining about them.

The deduplication works. Running the observer a second time against the same transcripts produced zero new entries.

Then I ran the analyzer in dry-run mode against the accumulated logs. It correctly processed the three existing manual entries for a content generation skill (the ones from Part 1) and generated one proposal. The output-quality category had a single entry with severity: high, which triggered the high-severity threshold. The other two categories each had one medium-severity entry, correctly falling below the evidence threshold.

The proposal was an additive warning note about photo cropping. Exactly the kind of conservative, evidence-grounded change the system is designed to produce.

The Bugs It Found in Itself

Here is where it gets interesting. Testing the system revealed three bugs, and the nature of those bugs is exactly the kind of thing the system itself is designed to catch.

Bug 1: Double period in generated warnings. The analyzer builds warning text by joining correction strings with periods and appending a final period. But if a correction already ends with a period, you get “let user crop manually in Lightroom instead of automated cropping..” with two periods. The fix: strip trailing periods from each correction before joining them.
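The fix is essentially a one-liner (a sketch; the function name is mine):

```javascript
// Strip trailing periods from each correction, then join and terminate once.
function joinCorrections(corrections) {
  return corrections.map((c) => c.replace(/\.+$/, "")).join(". ") + ".";
}
```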

Bug 2: Misleading dry-run output. The analyzer prints “Marker updated to: 2026-03-14T13:21:00Z” even in dry-run mode. The marker file is not actually modified because the write is correctly gated behind the DRY_RUN flag, but the console output makes it look like it was. The fix: wrap the log message in the same DRY_RUN check.

Bug 3: Indented commit message in GitHub Action. The heredoc body inside the git commit -m block was indented to match the surrounding YAML. Since it uses <<'EOF' (not <<-'EOF'), those leading spaces become part of the commit message. Every automated commit would have had ten spaces before the title. The fix: left-align the heredoc content.

All three were caught during the first test run. All three were fixed before pushing. And here is the part I find satisfying: the corrections I made to fix these bugs happened within a Claude Code session. When this session ends, the observer hook will fire, read the transcript, and detect that I made edits after using skills. Those corrections will be logged automatically. The system will eventually analyze its own bug fixes as evidence for future improvements.

That is the loop closing on itself.

Three Skills Fixed in One Day

The automation went live and within hours I had corrected three different skills based on real-world feedback. None of these required running /skill-improve. They were direct fixes triggered by things breaking in production.

A content submission skill generated HTML that the target platform rejected. The skill was adding anchor IDs to headings and generating a hand-coded Table of Contents. The platform already handles ToC generation via a plugin. The extra markup looked like bloated code to the editor reviewing the submission. The fix was two changes: remove the ToC from the skill’s section structure, and add a “clean HTML” rule specifying which tags are allowed.

A cross-posting skill linked to images that got renamed. I renamed an image directory on my site for a cleaner slug. The problem: a forum post on another platform embeds those images using absolute URLs. Forum posts can’t be bulk-updated. Four broken images. I tried a symlink first — that doesn’t survive the Astro build and Cloudflare Workers deploy pipeline. The actual fix was copying the files back to the old path. Then I added a permanent rule to the skill: image paths are permanent once an external post references them. If you reorganize, copy files to preserve the old URLs.

This is the kind of lesson you learn once and forget twice. Now it’s encoded in the skill file where it can’t be forgotten.

A blog writing skill published posts without generating audio files. Every post has a read-aloud player component, but the TTS audio files hadn’t been generated. Play buttons appeared, nothing played. I had to go back, set up a Python venv, download 337MB of model files, and generate all four audio files retroactively. The fix: the writing skill now generates TTS audio automatically as part of its workflow, not as an afterthought. If the model files are missing, it tells you how to set them up instead of silently skipping the step.

All three fixes share the same pattern. Something broke in production. I fixed the immediate problem. Then I fixed the skill so it wouldn’t happen again. Without the skill improvement habit, I would have fixed the submission, the images, and the audio — and the underlying skills would keep producing the same failures on the next run.

The observer captured corrections from this session automatically. The log now has eight entries across five different skills. When the daily GitHub Action runs tomorrow morning, the analyzer will process them and generate proposals if any patterns have enough evidence. The system accumulates knowledge from every session whether I remember to log feedback or not.

What the Auto-Observer Misses

The observer has a blind spot worth acknowledging. All five auto-captured entries from the test run were generic: “User manually edited skill output (detected corrective tool use).” The observer knows that corrections happened but not what was corrected. The category defaults to other and the severity defaults to medium because the generic detection message doesn’t match any specific keyword patterns.

Manual /skill-log entries are still more informative. When I type something like /skill-log the content skill generated bloated HTML with id attributes and manual ToC, the log captures the specific issue, the specific correction, and the right category. The auto-observer captures the signal. The manual logger captures the nuance.

Both have value. The auto-observer catches corrections you forgot to log. The manual logger provides richer evidence when you remember to use it. The analyzer treats them equally when generating proposals.

A future improvement would extract the actual diff from Edit tool results to populate the correction field with specifics about what changed. That would close the quality gap between auto and manual observations.
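Assuming Edit tool calls in the transcript carry their old_string and new_string inputs (those match the Edit tool's parameter names, but the transcript message shape sketched here is an assumption), the extraction could look like:

```javascript
// Summarize Edit tool calls into a concrete correction string, so the
// log says what changed rather than just that something did. The
// message/content shape is an assumption about the transcript format.
function extractEditSummary(message) {
  const edits = (message.content || []).filter(
    (b) => b.type === "tool_use" && b.name === "Edit"
  );
  const clip = (s) => (s && s.length > 60 ? s.slice(0, 60) + "…" : s || "");
  return edits.map((b) => {
    const { file_path, old_string, new_string } = b.input || {};
    return `${file_path}: "${clip(old_string)}" -> "${clip(new_string)}"`;
  });
}
```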

The Self-Improving Loop

Use a skill during a session
     |
Session ends -> observer reads transcript
     |
Corrections detected -> appended to skill-logs.jsonl
     |
6am next morning -> GitHub Action fires
     |
Analyzer reads new logs -> generates proposals
     |
PR created with evidence report
     |
Review on phone -> merge
     |
Skill improved, changelog committed

Every correction feeds back into the system. The skills I use most accumulate the most evidence. The skills that fail most frequently get the most proposals. The ones that work well generate no logs and receive no changes.

It is not a complicated system. It is a log file, two scripts, and a cron job. But it turns corrections into evidence, evidence into proposals, and proposals into improvements. That is the difference between skills that degrade and skills that get better with use.

The whole system is five files: two scripts (observer and analyzer, ~380 lines combined), a GitHub Action workflow, a hook config, and an append-only log. Total CI cost is about 30 seconds per day, well within GitHub’s free tier. The code is in the bdigital.media repo.

Part 1 covered the manual observation loop. This is Part 2: the automation that makes the loop run itself. If you are building something similar or want to talk about Claude Code workflows, I wrote about the broader content pipeline and remote mobile workflow that these skills plug into.