Part 1: Designing Automated PR Reviews with Claude Skills
Part 1 of 3. Building an AI-powered PR reviewer as structured Claude skills: the anatomy of a review skill, why markdown structure beats prompt blobs, and the foundation that makes iteration possible.
This is Part 1 of a three-part series on designing AI-powered automated PR review. Part 2 covers diagnosing false positives with cluster analysis. Part 3 covers the four structural fix patterns and the AI-auditing-AI problem.
Working code. Every snippet in this series has a runnable counterpart in
bdigital-public, a companion repo with MIT-licensed samples, open-source-standard structure, and a drop-in GitHub Actions workflow. File-specific links appear throughout.
Picture this. Every pull request across your organization gets a thorough code review inside of two minutes. A senior engineer’s judgment applied to missing test coverage, inconsistent patterns, latent maintainability issues, security smells. Not at 10 AM on a Wednesday when the reviewer finally gets to their queue. Two minutes after the PR is opened, on every commit push, for every team, with zero human scheduling. That is the pitch for AI-powered PR review, and it is real.
Getting there is an engineering problem, not a prompt engineering problem. That distinction matters. A reviewer that works well in a demo and falls apart on real PRs is not a tool; it is a science project that erodes trust every time a developer dismisses one of its findings. The gap between “this works on my three cherry-picked examples” and “this works on the 200 PRs that landed last week” is where the real design effort lives.
This series is about closing that gap. Closing it starts with a structural choice about how the reviewer itself is built: not as a monolithic prompt, not as a black-box vendor tool, but as a composition of Claude skills, which are named, sectioned markdown files that Claude Code interprets as reviewer instructions. That structural choice sounds like a detail. In fact it carries the weight of everything that follows: how failures get diagnosed in Part 2, how fixes get applied in Part 3, and how the whole system earns trust with the developers it was built to help.
Why build review as Claude skills at all
A Claude skill is a markdown file with a defined shape. Frontmatter at the top (name, description, purpose). A system prompt that sets the reviewer’s role. Named sections below that list detection rules, exclusion categories, evidence requirements, scope filters, and output format. Claude Code loads the skill, interprets its sections, and runs the reviewer on a pull request’s diff.
Compare that to three alternatives teams reach for:
A monolithic prompt. One enormous prompt that tries to encode every review rule in prose. This works at first. By the time you have ten reviewers, the prompt is unreadable, every edit risks breaking something unrelated, and there is no way to ask “which rule fired this finding?” without re-reading the whole thing.
A vendor black box. A third-party AI review tool that flags issues but does not let you inspect or modify how it decides. This moves fast in the short term and becomes a trust liability the moment the tool’s false positive rate crosses any of the thresholds that Part 2 examines in detail. You cannot fix what you cannot see inside.
Pure classifier models. Fine-tuned models that emit structured findings. Powerful, but slow to iterate, hard to audit, and hard to extend with new rules without retraining. Teams that start here often end up bolting a prompt layer on top anyway.
A skill-based approach sits in the middle deliberately. Each reviewer is a versioned file in the same repo the developers work in, with named sections that map cleanly onto the kinds of edits code reviewers need to make: adjust a detection rule, add an exclusion, refine an evidence requirement. When a reviewer misfires, the fix lives in a specific named chunk of markdown. That structural property is the reason the cluster analysis in Part 2 produces surgical fixes instead of vibes.
Anatomy of a PR review skill
Every review skill in a well-designed system has the same shape, because reviewers with the same shape can share tooling, testing, and improvement workflows. A representative skill file looks like this:
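A condensed sketch of that shape, with illustrative rule text (the real files in the companion repo are longer, and their exact section wording may differ):

```markdown
---
name: test-adequacy-reviewer
description: Flags changed behavior that lands without a test that would fail on regression.
---

## System prompt
You are a senior engineer reviewing a pull request for missing test coverage.

## Detection rules
- Flag: a changed public function that has no corresponding test change.
- Flag: coverage configuration that excludes a directory containing application source code.

## Examples
<!-- Two or three small diffs, each paired with the expected finding
     (or an empty array for a negative case). -->

## Exclusion categories
- Skip files whose path identifies them as non-production code.

## Evidence requirement
Cite only files that exist and commit hashes present in `git log`.
If a claim cannot be verified, emit nothing.

## Scope filter
If all changed files match infrastructure-as-code paths, skip code-quality checks entirely.

## Output format
One structured finding per issue: severity, file, line, rationale, code citation.
```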
Live examples of this shape live in samples/pr-review/.claude/skills/: correctness-reviewer.md, test-adequacy-reviewer.md, design-fit-reviewer.md, readability-reviewer.md, and breaking-change-reviewer.md. Each file shows frontmatter plus seven named sections in a production-ready configuration.
Frontmatter plus seven named sections, each doing one job:
Frontmatter declares the skill’s identity: its name, a short description of what it flags, and any metadata the rest of the system needs to find or categorize it. Think of this as the reviewer’s job title.
System prompt sets the role. “You are a senior engineer reviewing a pull request for missing test coverage.” Short, direct, and it establishes the reviewer’s scope. This section rarely changes after the first draft.
Detection rules list what to flag, in declarative bullet points. “Flag: a changed public function that has no corresponding test change.” “Flag: coverage configuration that excludes a directory containing application source code.” This is the main payload, and it is where the bulk of a skill’s intent lives.
Examples give the skill two or three inline reference cases, each a small diff paired with the expected finding (or an empty array for a negative case). These calibrate the detection rules at inference time the way a handful of labeled examples calibrate a teammate’s judgment during onboarding. Keep the count small; the full labeled corpus lives in a separate eval fixture (see the evaluation-harness discussion under “Getting the foundation right” below), not inside the skill markdown.
Exclusion categories name what to skip, grouped by reason rather than enumerated individually. Not “skip test/foo.py, skip test/bar.py, skip scripts/baz.sh” but “skip files whose path identifies them as non-production code.” Grouping by reason lets a reviewer handle novel cases it has not seen before.
Evidence requirement tells the reviewer what it must verify before emitting a finding. If the reviewer cites a file, the file has to exist. If it cites a commit hash, the hash has to appear in the git log. If the reviewer cannot verify a claim, it stays silent. This section is the single biggest lever against hallucinated findings.
Scope filter runs before any detection rule fires, as a deterministic path-based gate. “If all changed files match infrastructure-as-code paths, this is a config-only PR; skip code-quality checks entirely.” Scope filters do work the model cannot be trusted to do reliably, because they require zero judgment.
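Because the scope filter requires zero judgment, it can run as plain code in the wrapper before any model call. A minimal sketch, assuming path conventions like the ones below (the patterns are illustrative, not a canonical list):

```javascript
// Deterministic scope gate: returns true when every changed file is
// infrastructure-as-code, meaning code-quality skills should be
// skipped for this PR. Patterns are illustrative assumptions.
const IAC_PATTERNS = [
  /^terraform\//,
  /\.tf$/,
  /^helm\//,
  /^\.github\/workflows\//,
];

function isConfigOnlyPR(changedFiles) {
  return changedFiles.length > 0 &&
    changedFiles.every((f) => IAC_PATTERNS.some((p) => p.test(f)));
}

console.log(isConfigOnlyPR(['terraform/vpc.tf', 'helm/values.yaml'])); // true
console.log(isConfigOnlyPR(['src/app.js', 'terraform/vpc.tf']));       // false
```

Running the gate in the wrapper, not the prompt, means a config-only PR costs zero tokens and the skip decision is reproducible byte-for-byte.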
Output format specifies the shape of each finding: severity, file, line, concise rationale, code citation. A structured output format is what lets downstream tooling route findings into dashboards, PR comments, or merge gates without parsing prose.
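What a single finding might look like under such a format (the field names here match the workflow script later in this article, but are otherwise this article’s assumption; align them with whatever your downstream tooling expects):

```json
{
  "skill": "test-adequacy-reviewer",
  "severity": "medium",
  "file": "src/billing/invoice.js",
  "line": 142,
  "rationale": "applyDiscount changed behavior but no test exercises the new branch.",
  "citation": "if (discount > 1) return total;"
}
```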
Why section structure is load-bearing
This section shape is not aesthetic. It is what makes the rest of the system possible.
Consider what happens when a reviewer emits a bad finding and a developer complains. In a monolithic prompt, the triage conversation sounds like “the prompt should probably not do that” and ends with someone adding a sentence to the bottom of a wall of text, hoping the model will notice. In a section-oriented skill, the same triage produces a specific, testable claim: the detection rules section matched this code, and the exclusion list should have suppressed the finding but did not. The fix is an edit to the exclusion section, and the improvement can be verified against the specific audit cases that triggered the complaint.
Section structure also makes automated improvement possible. If you can point to which section an audited failure touches, you can bulk-edit that section across a portfolio of reviewers. “Every reviewer that emits findings about configuration files needs an updated scope filter.” That is a two-line grep and a batch edit, because every reviewer has a scope-filter section in the same place.
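That audit itself is scriptable. A sketch, assuming skill files use a markdown section header like the one below (the header text is an assumption about your conventions, not a fixed standard):

```javascript
// Report which skill files are missing a named section, so a batch
// edit can target exactly those files. The header string passed in
// is whatever convention your skills use, e.g. '## Scope filter'.
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

function skillsMissingSection(dir, header) {
  return readdirSync(dir)
    .filter((f) => f.endsWith('.md'))
    .filter((f) => !readFileSync(join(dir, f), 'utf8').includes(header));
}

// Example: list every reviewer without a scope-filter section.
// console.log(skillsMissingSection('.claude/skills', '## Scope filter'));
```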
Perhaps most importantly, section structure changes what you measure when you measure a reviewer’s accuracy. Failure modes that Part 2 examines (ignored exclusion lists, hallucinated evidence, configuration treated as production code) each map to a specific section of a skill file. That mapping is what turns “our reviewer is too noisy” into an engineering backlog instead of a philosophical debate.
Getting the foundation right
Three decisions land on the table as soon as a team starts designing review skills. All three are easier to get right on day one than to retrofit.
One skill per reviewer role, not one skill per team. There is a temptation to give every team a single omnibus reviewer that checks everything. Resist it. An omnibus reviewer becomes the monolithic prompt problem in disguise. One skill for test coverage, one for codebase consistency, one for maintainability, one for intent. Each skill has a clear job and a clear audit story. Composition happens at the orchestration layer, not inside the prompt.
Version the skills as code, not as config. Skills live in a git repo alongside the code they review. Changes go through pull requests and code review, because skills are code that runs in production. Treating them as config, sitting in a settings UI or a database, makes them invisible to the normal software engineering workflow and removes the history that makes debugging possible.
Build the evaluation harness before the first reviewer. Part 2’s cluster analysis needs hundreds of findings labeled true- or false-positive with a short justification per entry. The sample’s eval system has two layers: two or three inline reference cases in each skill’s Examples section, plus a separate eval fixture per skill (evals/<skill>.json) with additional positive and negative cases. A small runner (run-evals.mjs) loops through each case, calls the skill as an Executor, then calls an independent Grader with a different system prompt (and, by default, a different model tier) to decide pass or fail. Anthropic’s skill-creator toolkit formalizes this Executor/Grader split; the sample mirrors it so audit bias is bounded by the grader’s independence rather than hidden by shared context with the reviewer.
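The runner’s core loop can be sketched independently of any API client. Here the Executor and Grader are injected functions, which is a simplification of what the actual run-evals.mjs does (in the real system each would call a model, with the Grader on a different system prompt and model tier):

```javascript
// Minimal Executor/Grader eval loop. `execute` runs a skill against a
// case's diff; `grade` independently judges the findings against the
// expected ones. Both are injected so the loop itself stays testable.
async function runEvals(cases, execute, grade) {
  const results = [];
  for (const c of cases) {
    const findings = await execute(c.skill, c.diff);  // Executor
    const pass = await grade(c.expected, findings);   // independent Grader
    results.push({ id: c.id, pass });
  }
  const passed = results.filter((r) => r.pass).length;
  return { results, passRate: passed / results.length };
}
```

The point of the injection is the independence: the grader never shares context with the executor, so a reviewer cannot grade itself leniently by construction.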
Six skill categories to run at a minimum
One skill per reviewer role is the right structure. Next question: which reviewer roles actually belong in the minimum viable system? Ship too few and the tool misses the categories developers care about. Ship too many and the review output becomes the same noise wall the cluster analysis in Part 2 is designed to fix.
Cross-referencing Google’s engineering-practices code review guide, Sonar’s Clean Code taxonomy, the OWASP Code Review Guide, GitHub Advanced Security, Snyk, Qlty (formerly CodeClimate), and Anthropic’s Claude Code best practices produces a consistent top six for what LLM-powered review adds over deterministic tooling. Sadowski et al.’s “Modern Code Review” study at Google is the durable empirical anchor: reviewers in practice most often flag design, readability, and functional correctness, and the categories below trace directly to those findings.
- **Correctness and logic bugs.** Intent-versus-implementation review. Google lists it under “Functionality” as a primary reviewer concern, Sonar calls it Reliability, and Anthropic’s Claude Code guidance treats correctness/functional review as the primary reviewer task. An LLM reasoning about diff intent catches what linters cannot: off-by-one errors, wrong branch taken, misread spec, missing edge cases.
- **Security-sensitive patterns.** Authentication and authorization checks, injection vectors, unsafe deserialization, crypto misuse. OWASP, Sonar, GitHub Advanced Security, and Snyk all center on this category. An LLM adds taint-tracing reasoning over diff context that deterministic SAST rules miss, especially for business-logic authorization that cannot be expressed as a generic rule.
- **Test adequacy.** Whether the tests added actually exercise the changed behavior and would fail on regression. Coverage percentage alone does not answer this, and it is one of the few categories where Google and Anthropic both explicitly say humans (and LLMs standing in for them) beat tools.
- **Design fit and over-engineering.** Does this change belong here, now, at this scope? Google’s top-ranked reviewer concern. Nothing deterministic can answer it, which makes it the LLM skill with the biggest payoff in the set and the one teams consistently underinvest in because it cannot be graded against a number.
- **Readability: naming, comments, cognitive complexity.** Formatters handle whitespace. Metric tools flag complexity numbers. An LLM is needed to judge whether a name or comment actually communicates intent to the next developer who reads the file, and readability is the category most often cited in reviewer-behavior studies as what reviewers actually spend their time on.
- **Breaking-change and public-contract impact.** API signatures, schemas, migrations, doc drift. Highest cost-to-revert risk on the list, and the only category whose scope is the diff itself rather than the code broadly. Best positioned as its own skill because the exclusion rules differ substantially from the others; most files are not part of a public contract. Unlike the first five, this category is our synthesis; none of the cited sources names it as a distinct reviewer role.
What to leave to deterministic tooling
Four categories look essential but are better handled by existing deterministic tools than by any LLM skill. An AI reviewer that duplicates them wastes tokens and adds nit-noise that shows up in Part 2’s cluster analysis as pure false positive volume. Name the boundary explicitly:
- Style and formatting belong to Prettier, gofmt, Black, rustfmt.
- Known-CVE dependency scanning and license compliance belong to Dependabot, Snyk Open Source, OSV. Database lookup, not reasoning.
- Secret scanning belongs to GitHub secret scanning, TruffleHog, or gitleaks. Regex and entropy beat LLM judgment for this.
- Deterministic lint rules (unused imports, dead code, type errors) belong to ESLint, ruff, tsc. An LLM skill should read their output when it needs to, not replicate it.
Two rules of thumb come out of this division:
An LLM skill earns its keep where reasoning beats rules. Correctness, security business logic, test adequacy, design, readability, breaking-change impact. Each requires diff-aware judgment that a deterministic tool cannot replicate.
Do not ship a seventh category before the first six are stable. Each additional skill multiplies the review surface, the audit volume, and the false positive count. Holding the line at six means cluster analysis stays tractable, and the fix patterns in Part 3 land on a smaller, better-understood set of reviewers.
Where this runs
Everything above described skills as markdown. What it did not describe is where those markdown files get executed on a real pull request. For most teams, GitHub Actions is the right reference example: universal, free on public repos, and the execution shape translates cleanly to Jenkins, GitLab CI, Buildkite, or any internal self-hosted runner setup.
Execution shape is the same across every CI system:
- Trigger on PR open or push.
- Check out the diff.
- Invoke each review skill with the diff as input.
- Post findings back as PR review comments.
A minimal GitHub Actions workflow that runs every skill in .claude/skills/ on each new PR looks like this:
```yaml
name: AI PR Review

on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      # Full-repo checkout is non-negotiable. The reviewer needs to
      # read files referenced by the diff (imports, callers, tests,
      # config) to verify evidence. Diff-only context cannot do that.
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Compute PR diff
        run: git diff origin/${{ github.base_ref }}...HEAD > /tmp/pr.diff

      - name: Run review skills
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # Wrapper invokes Claude Code CLI inside the cloned repo, so
        # the skill can Read any file to verify claims against actual
        # code, not just the lines in the diff buffer.
        run: |
          node scripts/run-reviews.mjs \
            --skills .claude/skills \
            --diff /tmp/pr.diff \
            --repo . \
            --out /tmp/findings.json

      - name: Post findings as PR comments
        uses: actions/github-script@v7
        with:
          script: |
            const findings = require('/tmp/findings.json');
            for (const f of findings) {
              await github.rest.pulls.createReviewComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                pull_number: context.issue.number,
                body: `**${f.skill}**: ${f.rationale}`,
                path: f.file,
                line: f.line,
                commit_id: context.payload.pull_request.head.sha
              });
            }
```

That run-reviews.mjs script invokes the Claude Code CLI in headless mode inside the cloned repo. For each skill markdown file, it passes the skill content as the system prompt and the diff as the user message, but because the CLI runs with its default Read/Grep/Glob tools enabled, the skill can read any file in the repo during its reasoning. This is what makes the evidence verification gate from Part 3 actually possible: when a finding claims “this function has no test,” the skill can grep for the function name and check. When it cites a commit SHA, it can confirm the SHA exists in git log. Without full-repo access, those verifications degrade into “the model hopes it is right.”
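The wrapper can also enforce part of the evidence requirement mechanically, before anything gets posted: drop any finding that cites a file absent from the working tree. A sketch, assuming the finding shape from the workflow above (`dropUnverifiableFindings` is a hypothetical helper, not a function in the companion repo):

```javascript
// Mechanical evidence gate: discard findings that cite files which do
// not exist in the checked-out repo. This backstops the skill's own
// evidence requirement; it catches phantom files, not wrong lines.
import { existsSync } from 'node:fs';
import { join } from 'node:path';

function dropUnverifiableFindings(findings, repoRoot) {
  return findings.filter((f) => f.file && existsSync(join(repoRoot, f.file)));
}
```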
The full workflow with per-skill error handling, Node 20 setup, and PR-comment posting lives at samples/pr-review/.github/workflows/pr-review.yml. The wrapper script is at samples/pr-review/scripts/run-reviews.mjs. Both are copy-paste-ready and MIT licensed.
Five details matter for keeping the workflow trustworthy in production:
- Full-repo checkout is a correctness requirement, not a performance optimization. `fetch-depth: 0` gets you the full history AND the full working tree. Shallow clones break evidence verification because the skill cannot read files outside the diff. This is the single most common mistake when teams port a workflow from another repo.
- Grant the reviewer read-only filesystem access to the repo, not network access to external services. The Claude Code CLI’s default tool set (Read, Grep, Glob, Bash) is the right minimum.
- `pull-requests: write` is the minimum GitHub permission needed to post comments. Leave `contents: read` and never give the reviewer write access to code.
- `ANTHROPIC_API_KEY` should be a repo-scoped secret with no broader access than the workflow needs, and rotated if the repo ever becomes public.
- Run each skill in its own step when the count is small. That keeps the Actions log navigable when a skill misfires, because diagnosis gets harder fast when everything runs in one command and only the aggregate output is visible.
Enterprise infrastructure looks more complex (self-hosted runners, custom secrets management, policy gates that block merges on specific severities, findings routed to a dashboard instead of inline comments), but the conceptual shape stays the same. Fix patterns from Part 3 do not care which CI system you use, because they are edits to the skill markdown, not the runner.
What is next
The skills described so far are the what. They describe the shape of a review system ready to be iterated on seriously. What they do not yet address is how reviewers behave once they see hundreds of real pull requests, and specifically the first problem every team encounters: false positive rates that erode developer trust.
Part 2 tackles that diagnosis head-on, covering why trust collapses in discrete steps rather than linearly, a four-step cluster analysis methodology for finding the root causes of false positives across hundreds of audited findings, three failure modes that appear independently across nearly every review system, and what the data looks like when you audit honestly for the first time.
Part 3 covers the fixes: four structural patterns that map directly onto the named sections of a review skill, the meta-problem of using AI to audit AI, and the two external bodies of work (multi-review aggregation and independent grading) that make the whole methodology defensible in production.
The series continues with Part 2: Diagnosing False Positives and Part 3: Four Fix Patterns.