Hero graphic: four fix patterns for AI code review, each targeting one named section of a Claude skill. Pattern 01, pre-flag verification checklist (detection rules): an active gate at the decision point. Pattern 02, structured exclusion categories (exclusion list): five reason-tagged groups instead of fifteen bullets. Pattern 03, evidence verification gate (evidence requirement): emit only if verifiable, else stay silent. Pattern 04, path-based early exit (scope filter): deterministic, zero model judgment. Before: 19%-56% accuracy; after: projected 75%-92%. Salience beats content; gates at the decision point; silence is valid output

Part 3: Four Fix Patterns for AI Code Review (and the AI-Auditing-AI Problem)

Part 3 of 3. Pre-flag checklists, structured exclusions, evidence gates, path-based early exits: the four structural fix patterns that move an AI reviewer from ignorable to trustworthy. Plus the meta problem of using AI to audit AI.

Tags: dev · ai · claude · workflow · leadership

Part 3 of a three-part series on designing AI-powered automated PR review. Part 1 covered the anatomy of a Claude review skill. Part 2 covered cluster analysis for diagnosing false positives.

Working code. Every fix pattern in this post appears in a working skill file in bdigital-public/samples/pr-review/.claude/skills/. Open any of the five skill files and you can see Pattern 01 in the detection-rules section, Pattern 02 in the exclusion-categories section, Pattern 03 in the evidence-requirement section, and Pattern 04 in the scope-filter section.


Part 2 established how to diagnose false positives systematically: cluster by root cause, not by symptom, and expect to find three recurring failure modes (ignored exclusion lists, hallucinated evidence, configuration treated as production code). This post covers the fixes.

Four structural patterns move a review skill from noisy to trustworthy. Each pattern targets a specific named section of the skill file described in Part 1, which is what makes them reusable, auditable, and easy to apply across a portfolio of reviewers. After the patterns, the post covers a meta-lesson about instruction salience that changes how to read prompt failures broadly, and the recursive trust problem that shows up whenever the audit itself is done by another LLM.

Four reusable fix patterns

Cluster analysis gives you a diagnosis. These four patterns are the treatments. Each addresses a specific structural weakness in how instructions are encoded into a review skill.

Four fix patterns in a 2x2 grid: pre-flag verification checklist (targets detection rules), structured exclusion categories (targets exclusion list), evidence verification gate (targets evidence requirement), path-based early exit (targets scope filter)

Pattern one: pre-flag verification checklists

Replace passive “do not flag” lists with an active checklist the model must evaluate before emitting any finding. Lives in the detection-rules section of the skill.

## Before Flagging: Verify ALL of These
Before raising any finding, confirm:
1. The flagged code has actual additions in the diff (not pre-existing code).
2. The code is production code (not test infra, scripts, build config, or mock data).
3. No existing test or handler already covers this concern.
4. Your factual claims are verified: the file, import, or function you reference
actually exists (or does not) as you claim.
5. You can point to a specific code path that triggers the concern, not a
hypothetical scenario.
If ANY check fails, do not flag.

Structural change: the checklist sits inline at the decision point, not in a separate section the model may have already forgotten by the time it is deciding whether to flag. This converts passive exclusions into active gates that activate at the right moment in the reasoning flow. The cluster analysis in Part 2 consistently shows this single pattern eliminating roughly 40% of false positives across a portfolio of reviewers.

Pattern two: structured exclusion categories

Flat fifteen-to-twenty-item bullet lists do not help a model generalize. Named, visually distinct categories do. Lives in the exclusion-categories section.

  • Not production code: test infrastructure, scripts, build configs, mock data, fixtures.
  • No logic to test: constants, data declarations, framework wiring, trivial setters.
  • Already covered: private helpers tested indirectly, concerns handled by a caller or framework.
  • Not new in this PR: pre-existing code, file moves, visibility-only changes.
  • Out of scope: style suggestions, lint-level issues, concerns better handled by security scanners.

Grouping by reason for exclusion rather than listing individual cases helps the model handle cases it has not seen before. When it encounters a new situation, it can ask “is this test infrastructure?” rather than scanning a list for an exact match.

Pattern three: evidence verification gates

For any reviewer that makes factual claims about its target system, add an explicit verification gate. Lives in the evidence-requirement section.

## Evidence Requirement
Every factual claim in a finding MUST be verified:
- If you cite a commit SHA, confirm it exists in the target repo's git log.
- If you claim a file is missing, confirm by checking the directory.
- If you claim an import is absent, check the top of the file.
- If you cannot verify a claim, do not make it.
When evidence is unavailable, emit zero findings.
Silence is correct when verification is impossible.

Silence-as-default does most of the work. LLMs have a strong completion bias. Without explicit permission to produce zero findings, a model will confabulate evidence to fill the void. Telling it that silence is the correct default when evidence is weak dramatically reduces hallucination-based false positives, which are the rare-but-trust-destroying category from Part 2.

Pattern four: path-based early exits

For the configuration and infrastructure problem, add a deterministic pre-filter based on file paths. Lives in the scope-filter section, and runs before any detection rule fires.

## Pre-Analysis: Scope Check
If ALL changed files match these patterns, this is a configuration-only PR.
Skip code-quality rules entirely; check only for syntax errors:
- Environment config directories (env/, config/, values.yaml, tfvars)
- Infrastructure-as-code (*.tf, Dockerfile, pipeline configs)
- Build tooling (Makefile, *.config.*, CI pipeline definitions)
If a file path contains test/, fit/, scripts/, example/, or tools/,
it is non-production code. Apply lower scrutiny.

Path-based filters are deterministic and require zero judgment from the model. They preempt the “this looks risky” instinct by removing a file from consideration before detection rules fire at all.
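Because the filter is pure path matching, it can run as ordinary code before any model call is made. A sketch under assumed glob patterns mirroring the scope-filter section above (adjust the lists to your repo's layout):

```python
import fnmatch

# Assumed patterns, illustrative rather than canonical.
CONFIG_PATTERNS = [
    "env/*", "config/*", "*.tfvars", "values.yaml",
    "*.tf", "Dockerfile", "Makefile", "*.config.*",
]
NON_PROD_DIRS = ("test/", "scripts/", "example/", "tools/")

def is_config(path: str) -> bool:
    return any(fnmatch.fnmatch(path, pat) for pat in CONFIG_PATTERNS)

def review_scope(changed_files: list[str]) -> str:
    """Deterministic pre-filter: decide review scope before invoking the model."""
    if all(is_config(f) for f in changed_files):
        return "syntax-only"      # configuration-only PR
    if all(f.startswith(NON_PROD_DIRS) for f in changed_files):
        return "low-scrutiny"     # non-production code
    return "full-review"
```

The scope decision then selects which detection rules are even loaded into the prompt, so excluded files never reach the model at all.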

Projected impact

Modeling projected accuracy as current true positives divided by (current findings minus the volume of each cluster a fix would eliminate) gives a fast way to rank structural changes by their expected payoff. Applied to a portfolio of seven review skills with accuracies ranging from 19% to 56%, projections move most skills into the 75% to 92% range on paper. Reviewers drowning in configuration false positives and missing-context hallucinations move from “unusable” to “worth a developer’s attention” with a single round of structural fixes.
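The arithmetic is trivial, which is the point: it can be run across an entire portfolio in minutes. A sketch with hypothetical numbers (the counts below are made up for illustration):

```python
def projected_accuracy(true_pos: int, findings: int, eliminated: int) -> float:
    """Projected accuracy if a fix removes an entire false-positive cluster:
    true positives stay, the cluster's findings disappear from the total."""
    return true_pos / (findings - eliminated)

# Hypothetical skill: 120 findings, 24 true positives (20% accuracy today).
# Path-based early exits plus a pre-flag checklist are projected to
# eliminate 90 configuration and already-covered false positives.
before = 24 / 120                         # 0.20
after = projected_accuracy(24, 120, 90)   # 24 / 30 = 0.80
```

Ranking skills by `after - before` surfaces which reviewer gets the next round of structural fixes.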

Projections are not guarantees. Each fix depends on the model following the new instructions, and novel false positive patterns emerge on new code. Even conservative estimates turn most reviewers from noise generators into net-positive contributors, and a second iteration of the same process typically closes the remaining gap.

The meta-lesson: salience beats content

The most surprising result across the methodology is not any individual cluster: review skills typically already contain the correct exclusions for almost every false positive observed. Models are not missing rules; they are failing to apply rules they already have.

Same exclusion information in two prompt structures. Buried at the bottom: signal drowns by the time the model decides. Inline at the decision point: passive rule becomes active filter

A direct implication follows for anyone doing prompt engineering across a portfolio of reviewers. Adding more instructions to a prompt does not guarantee the model will follow them. Instruction salience (where rules appear, how they are structured, and whether they activate at the right moment in the model’s reasoning) matters at least as much as instruction content.

Three principles follow:

Put verification gates at the decision point, not in a separate section. A “do not flag” list at the bottom of a prompt loses salience by the time the model decides whether to flag. An inline checklist at the moment of decision activates reliably.

Deterministic filters beat judgment calls. Path-based early exits are nearly 100% effective. “Apply lower scrutiny to non-production code” requires judgment the model exercises inconsistently. When you can replace a judgment call with a path check or a regex, do it.

Give explicit permission to produce zero output. Without this, LLMs confabulate to fill the void. One of the most valuable single sentences you can add to any reviewer prompt is: “Silence is correct when evidence is unavailable.”

AI reviewing AI: the recursive trust problem

A meta-problem sits inside the whole methodology, and it deserves its own section. Running cluster analysis requires auditing hundreds of findings with binary true-positive or false-positive verdicts. Most teams do that with another LLM. You now have an LLM judging whether another LLM’s findings are correct. If both models share a family, a prompt pattern, or a training distribution, they share biases, and the audit quietly validates the very failure modes it was supposed to catch.

Left: reviewer and grader run the same model family and share biases; correlated errors pass both gates. Right: N reviewers vote via multi-review aggregation; an independent grader with a different model evaluates the consensus output; a human spot-check validates a sample

Two lines of external work shaped how this methodology handles the recursive trust problem, and both are worth reading directly.

Multi-review aggregation. Zeng et al. introduced SWRBench, a benchmark of 1,000 manually verified pull requests for automated code review, and showed that running multiple LLM reviewers on the same PR and aggregating their findings boosts F1 scores by up to 43.67%. The intuition is direct: a finding is more likely real when independent reviewers converge on it, and noise tends not to correlate across runs. This is expensive (more tokens, more latency, more infrastructure), but it is the single most effective lever when false positives dominate. This approach maps cleanly onto the consensus-filtering idea that has become standard in other noisy ML settings.

Independent grading. Anthropic’s skill-creator toolkit separates the entity running a skill from the entity grading its output. In their architecture, an Executor runs the skill against eval prompts, an independent Grader evaluates those outputs against defined expectations, a Comparator does blind A/B comparisons between skill versions, and an Analyzer suggests targeted improvements. The key architectural move is the Executor/Grader split: without it, the grader inherits the executor’s biases, and “the skill passes its own tests” becomes a meaningless measure. When adopting cluster analysis in-house, point the audit LLM at a different model family, a different prompt style, and a different context window than your reviewer. That minimal separation catches more shared-bias errors than any prompt tuning will.
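The Executor/Grader split reduces, in code, to making sure the grading call never shares a model or a system prompt with the reviewing call. A minimal sketch with a model-agnostic `call_model` callable injected by the caller (the model names and prompt wording here are placeholders, not a real SDK):

```python
# Hypothetical model identifiers; the point is the structural split, not the SDK.
REVIEWER_MODEL = "model-family-a"   # runs the review skill
GRADER_MODEL = "model-family-b"     # audits findings with a different family

def grade(finding: str, diff: str, call_model) -> bool:
    """The grader sees only the reviewer's output, never its reasoning,
    and runs a different model under a different system prompt."""
    verdict = call_model(
        model=GRADER_MODEL,
        system="You are an independent auditor. Answer TP or FP only.",
        prompt=f"Finding:\n{finding}\n\nDiff:\n{diff}\n\nVerdict:",
    )
    return verdict.strip().upper().startswith("TP")
```

Injecting `call_model` also makes the grader testable with stubs, so the audit harness itself can be verified without spending tokens.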

Aggregation and independent grading both impose cost. Running N reviewers per PR multiplies inference spend. Using a different model family for auditing means maintaining two integrations. For tools where trust is the binding constraint, that cost is usually worth it. Otherwise you get an AI reviewer that confidently grades itself as working while users see it the way they would see any tool crying wolf.

Two practical disciplines follow:

  1. Never let the same model role-play both reviewer and grader. If both roles run the same model with similar instructions, shared failure modes go undetected. At minimum, change the system prompt. Better, change the model.
  2. Use consensus for high-stakes findings, not for every finding. Multi-review aggregation is expensive, but its cost scales with what is at stake. Reserve it for findings that would block a merge or surface to a human reviewer. For low-stakes linting-style checks, a single reviewer is fine.

Neither approach eliminates the recursive trust problem. What they do is give you enough independence between audit layers that the failure modes you catch are real, and the ones you miss are statistically bounded by sampling rather than structurally hidden by shared bias. That is the foundation cluster analysis actually stands on.

Applying this to your business

Any team shipping LLM-powered review tooling can reuse this methodology directly, whether the domain is code review, contract review, SOC 2 audit assistance, clinical note review, customer support ticket triage, or any other review-adjacent workflow where precision matters and developer-style trust-building applies.

Five takeaways across the series:

  1. Build reviewers as Claude skills, not as monolithic prompts or vendor black boxes. Section-oriented markdown files give you the hooks for everything else. See Part 1.
  2. Audit honestly on real inputs. Binary true-positive and false-positive verdicts with justifications, on real PRs from real users. Synthetic test cases will not surface the patterns that matter.
  3. Cluster by root cause, not symptom. The fix for “flagged test code” is different from the fix for “hallucinated evidence,” even when both produce findings that look similarly wrong. Name the cluster, count it, and rank your backlog by volume. See Part 2.
  4. Fix structure, not wording. When a model is ignoring an exclusion, making the exclusion longer will not help. Move it to where the model makes its decision. Convert passive rules into active gates.
  5. Keep audit layers independent. Never let the same model play both reviewer and grader. Use consensus for high-stakes findings. Budget for the meta-cost.

False positives in AI review are not an unsolvable problem. They are an engineering problem with systematic solutions. Teams that invest in the methodology early, rather than shipping and hoping, build the trust that lets AI-assisted review actually live up to the pitch on the landing page.

← Part 1: Designing review skills · ← Part 2: Diagnosing false positives