Infographic: the same 80 findings before and after cluster analysis (Part 2 of 3). Before: looks like noise. After: four root causes, each actionable: ignores its own exclusions (~40%), config treated as prod code (~30%), fabricates evidence (~15%), other patterns (~15%). False positives are not random. Find the clusters. Fix the structure.

Part 2: Diagnosing False Positives in AI Code Review

Part 2 of 3. The cry-wolf failure mode, the trust-threshold step function, and a four-step cluster analysis methodology that surfaces root causes instead of symptoms.

Tags: dev · ai · claude · workflow · leadership

Part 2 of a three-part series on designing AI-powered automated PR review. Part 1 covered the anatomy of a Claude review skill. Part 3 covers the four fix patterns and the AI-auditing-AI problem.

Working code. The cluster-analysis methodology in this post applies directly to the skills in bdigital-public/samples/pr-review/. Every exclusion category, scope filter, and evidence-verification gate in those skills traces back to a real false-positive cluster from a prior audit.


Your new AI code review agent flags a hundred issues on every pull request. Seventy of them are wrong. After two sprints, developers click “dismiss” reflexively, and the tool that was supposed to catch real problems across every PR queue has quietly become a source of noise nobody reads.

Cry-wolf behavior kills AI review tools before they get a chance to earn their keep. A reviewer with 80% accuracy sounds good on a slide. On real pull requests where the cost of a false negative is low (a senior engineer will catch the bug in human review anyway) and the cost of triaging every finding is high, 20% false positives is right around the point where habituation takes over and the tool loses credibility with the humans it was built to help.

Tolerance varies by stakes. Security tools with catastrophic false negatives sustain much higher rates: Ami et al. (IEEE S&P 2024) report practitioners accepting 80% false positives in some SAST settings because missing a real vulnerability is worse than reviewing four wrong flags for every correct one. Developer-facing code review has the opposite economics, and Ponemon’s 2019 alert fatigue survey (popularized in Bitdefender’s summary) puts a concrete number on the cost: security teams spend roughly 25% of their time chasing false positives already, and that is with a tool they cannot turn off.

Part 1 of this series framed the right way to build AI review tools: as composed Claude skills with named, section-oriented markdown files. This post is about what goes wrong once they are running in production, and a methodology for fixing it that any team building LLM-powered review can reuse directly. The techniques generalize well beyond code: legal documents, compliance findings, clinical notes, security alerts, any review-adjacent workflow where trust is the binding constraint.

The trust threshold is a step function

False positive rates do not degrade user trust linearly. They degrade it like a step function.

Step function showing user trust zones: investigate (0-10%), noisy but useful (10-30%), triage with suspicion (30-50%), dismiss by default (50%+)

In the audit this series draws on, developer behavior clustered into rough zones rather than sliding smoothly. Below roughly 10% false positives, developers treated every finding as real and investigated it. Between 10% and 30%, they investigated most findings but started labeling the tool “noisy” in retros and Slack channels. Above 30%, they triaged with suspicion. Above 50%, they dismissed by default, and a finding only got read if it blocked a merge.
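Those zones can be sketched as a simple classifier. The thresholds below are the rough boundaries observed in the audit, not universal constants, and the zone labels are paraphrases of the behaviors described above:

```python
def trust_zone(false_positive_rate: float) -> str:
    """Map a false-positive rate (0.0 to 1.0) to the observed developer behavior.

    Thresholds are the approximate step boundaries from the audit,
    not calibrated constants.
    """
    if false_positive_rate < 0.10:
        return "investigate every finding"
    if false_positive_rate < 0.30:
        return "noisy but useful"
    if false_positive_rate < 0.50:
        return "triage with suspicion"
    return "dismiss by default"

print(trust_zone(0.22))  # -> noisy but useful
```

The point of writing it as a step function rather than a linear penalty: a move from 55% to 45% false positives buys almost nothing, while a move from 35% to 25% changes daily behavior.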

For a sense of where current production tooling sits: SonarSource reports a 3.2% false positive rate across 137 million reviewed issues in 2025, after years of rule tuning. Untuned first-generation LLM reviewers typically start somewhere in the 40%–80% range. The methodology in this post is what closes that gap.

False positive rates across published tooling: SonarQube 3.2%, Semgrep Code roughly 12%, untuned LLM reviewers 40 to 80 percent, with the step-function trust zones overlaid

Once a tool crosses into the dismiss-by-default zone, accuracy improvements no longer recover trust on their own. Users have already learned the pattern and built habits around it. Climbing back into the “worth reading” zone requires both a large accuracy jump and a visible signal to the team that something changed. This matches the habituation literature on security warnings: Anderson et al. (MIS Quarterly 2018) tracked fMRI response to repeated warnings and found attention drops sharply after the first few exposures; visibly-changed, polymorphic warnings restored adherence from 55% to 80%.

The business implication: signaling the change matters as much as making it. Shipping a quiet 40% accuracy improvement to an already-ignored tool will not be enough. You need both the jump and a relaunch.

The methodology: false positive cluster analysis

False positives are not random noise. They cluster around a small number of root cause patterns. Find the clusters, fix the root causes, and accuracy improves dramatically. Four steps:

Four-step cluster analysis pipeline: collect audit data, extract every false positive, cluster by root cause, propose structural fixes

Step 1: Collect real audit data

Run your reviewer against a diverse corpus of real inputs. Synthetic test cases fail to surface the patterns that matter because they lack the messy context of production data. For a code review tool, that means real pull requests from real repositories across multiple teams. For a contract review tool, real contracts. Aim for dozens of inputs minimum, producing at least several hundred findings.

Every finding should be audited with a binary verdict, true positive or false positive, plus a short justification that explains precisely why a false positive is wrong. That justification is the data you actually need. Without it, you are clustering by symptom rather than cause, and your fixes will treat the wrong thing.

Automation helps here. A separate LLM evaluator with access to the full context of each finding can handle most of the audit volume, leaving humans to spot-check and validate the evaluator’s judgment against a sample. Budget for several hundred audited findings before you start analyzing. Part 3 returns to this audit step and examines the meta-problem it introduces: using an LLM to judge another LLM is itself a source of systematic error, and the techniques for mitigating that bias are the foundation the rest of this methodology stands on.
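A minimal record shape for audited findings might look like the sketch below. The field names are illustrative, not from the post; the one non-negotiable field is the justification, since that is what the clustering in Step 3 reads:

```python
from dataclasses import dataclass

@dataclass
class AuditedFinding:
    reviewer: str       # which review skill emitted the finding
    file_path: str      # file the finding points at
    summary: str        # what the model flagged (the symptom)
    verdict: str        # "TP" or "FP" -- binary, no "maybe"
    justification: str  # for FPs: precisely why the finding is wrong

# Hypothetical example in the spirit of the coverage-exclusion case
# discussed later in this post.
finding = AuditedFinding(
    reviewer="security-review",
    file_path="ci/coverage.yml",
    summary="coverage exclusion hides untested code",
    verdict="FP",
    justification="directory contains only YAML config, no application source",
)
```

Forcing the verdict to be binary is deliberate: a "maybe" column becomes a dumping ground, and findings parked there never get clustered.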

Step 2: Read every false positive

For each reviewer in your system, pull every false positive with its audit justification. Read them all. This is tedious when a single reviewer has produced over a hundred false positives. Shortcuts here produce shallow analysis and brittle fixes.

As you read, you are looking for the shape of each mistake. Not “the model flagged missing error handling” but “the model flagged missing error handling in test infrastructure where it does not apply.” Symptom versus pattern.

Step 3: Cluster by root cause, not by symptom

Group false positives by why the model got it wrong, not by what it flagged. Two findings might both flag “missing validation,” but one is wrong because the code path is already covered by an upstream handler, and the other is wrong because the file is a build script, not application code. Those are different root causes requiring different fixes.

Name each cluster, count its members, and pick two or three concrete examples that illustrate the pattern clearly. A good cluster name describes the failure mode in one short sentence. “Model flags test infrastructure as if it were production code” is a useful cluster. “Miscellaneous false positives” is not.
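Once each false positive carries a root-cause label from the read-through, the clustering itself is trivial. A sketch, with hypothetical labeled data echoing the "missing validation" example above:

```python
from collections import Counter

# Each FP has been labeled with a root-cause name during Step 2.
false_positives = [
    {"summary": "missing validation", "root_cause": "upstream handler already validates"},
    {"summary": "missing validation", "root_cause": "build script treated as app code"},
    {"summary": "missing error handling", "root_cause": "build script treated as app code"},
]

# Cluster by WHY the model was wrong, not by what it flagged.
clusters = Counter(fp["root_cause"] for fp in false_positives)
for root_cause, count in clusters.most_common():
    print(f"{count:3d}  {root_cause}")
```

Note that the two "missing validation" findings land in different clusters: same symptom, different root cause, different fix.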

Step 4: Propose structural fixes

For each cluster, diagnose why the model failed and propose a specific, testable change to the review skill. Not “make it better.” A concrete structural change targeting a specific named section of the skill markdown: adding a path-based early exit to the scope-filter section, moving a verification step from the exclusion list into the detection-rules section itself, inserting a mandatory checklist before findings can emit.

Project the accuracy improvement if the fix eliminates its cluster. Now you have a prioritized backlog ranked by false positive volume eliminated. Fix the cluster that represents 40% of your noise first. Then fix the next one.
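The projection is simple arithmetic, under the assumption that the fix suppresses the cluster's findings entirely (so the denominator shrinks rather than the findings converting to true positives):

```python
def projected_accuracy(total_findings: int, true_positives: int,
                       cluster_size: int) -> float:
    """Accuracy if a structural fix eliminates one FP cluster outright.

    Assumes eliminated findings are never emitted, so both the FP count
    and the total shrink by cluster_size.
    """
    remaining = total_findings - cluster_size
    return true_positives / remaining

# Hypothetical: 100 findings, 50 correct, one cluster of 30 false positives.
# Accuracy moves from 50% to about 71% by fixing a single root cause.
print(round(projected_accuracy(100, 50, 30), 2))  # -> 0.71
```

Ranking clusters by this projected gain is what turns the audit into a backlog.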

In the audit this series draws on, a single round across seven review skills and several hundred findings took roughly a day of focused work. Output: a ranked engineering backlog grounded in evidence, not a list of vibes. The four specific fix patterns that emerge from this process are the subject of Part 3.

Three failure modes you will almost certainly find

Running cluster analysis across multiple review systems surfaces a small set of recurring patterns that appear independently across different teams, domains, and model providers. Three of them are worth expecting before you start.

Percentages below come from one portfolio of seven Claude review skills audited against several hundred real pull requests. Treat them as indicative shape, not universal law. Your reviewers may cluster differently, and that is itself useful signal.

Three failure modes with typical prevalence: ignores its own exclusion list (~40%), fabricates evidence (~15%), treats config as production code (30-42%)

Failure mode one: the model ignores its own exclusion lists

Exclusion-list failures dominated the portfolio at close to 40% of false positives. Every skill has a “do not flag” section listing patterns to skip: test infrastructure, configuration files, cosmetic changes, framework wiring. A model will get these right when tested in isolation, then consistently fail to apply them during real reviews.

Root cause: exclusion lists are typically written as flat bullet lists with fifteen to twenty items, buried at the bottom of a longer prompt. A model’s strong instinct to “find something wrong” overwhelms the suppression rule. When a detection pattern matches, the positive signal from “this looks like a problem” drowns out a negative signal it encountered several paragraphs earlier.

A reviewer was explicitly told not to flag coverage exclusions for directories containing no application source code. It then flagged a coverage configuration that excluded a directory containing forty YAML files and zero application code. The reviewer recognized the rule existed and hedged its finding with “if this directory contains application source code,” but never actually checked whether the directory contained application source code before emitting the finding.
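The check the reviewer hedged around instead of performing is mechanical. A sketch of the deterministic pre-check (the extension list is illustrative; a real skill would take it from the repo's language config):

```python
from pathlib import Path

APP_SOURCE_EXTS = {".py", ".ts", ".go", ".java"}  # illustrative, not exhaustive

def contains_app_source(directory: str) -> bool:
    """Return True if the directory holds any application source files.

    A coverage-exclusion finding should only be emitted when this is True;
    'if this directory contains application source code' is a check,
    not a hedge to paste into the finding text.
    """
    return any(
        p.suffix in APP_SOURCE_EXTS
        for p in Path(directory).rglob("*")
        if p.is_file()
    )
```

The broader lesson, which Part 3 develops: when a rule can be enforced deterministically, enforce it outside the model rather than trusting the model to remember it.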

Failure mode two: hallucination and fabricated evidence

Smaller in volume but disproportionately damaging. Roughly 15% of false positives in the portfolio came from models asserting things that are provably false.

Four sub-patterns recur:

  • Fabricated identifiers. Citing a specific commit hash, ticket number, or line reference that does not exist in the target system.
  • Inflated metrics. Claiming a file was modified “32 times in three months” when the real count is 1.
  • Wrong system identity. Asserting that a file under review “belongs to a different project” because the model has conflated the review tool’s own source repository with the one being reviewed. Context contamination from the tool’s internals leaking into findings about external targets.
  • Nonexistent references. Claiming a file, import, or function is missing when it exists exactly where it should.

Hallucination failures are rare relative to exclusion-ignore errors, but they carry outsized damage. A developer who catches a reviewer citing a nonexistent commit will never trust that reviewer again, no matter how good its other findings are.

Failure mode three: configuration treated as production code

Review agents designed to catch issues in production application code tend to misfire badly when handed environment configuration, infrastructure-as-code, or developer tooling. A model sees patterns that look concerning (a security flag disabled, a rollout step skipped, missing error handling in a script) and emits findings without checking whether the file is actually production code or whether the change is an intentional operational decision documented in the commit message.

Across three reviewers in the portfolio, this pattern produced between 30% and 42% of total false positives. Configuration files and test infrastructure were the two biggest individual contributors.
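A crude path-based scope filter catches much of this cluster before the model sees the file at all. The patterns below are illustrative, not a complete taxonomy; findings from production-code reviewers would be suppressed for anything not classified as application code:

```python
import fnmatch

# Ordered scope patterns; first match wins. Illustrative, not exhaustive.
SCOPE_PATTERNS = [
    ("test",   ["tests/*", "*_test.py", "*.spec.ts"]),
    ("config", ["*.yml", "*.yaml", "*.toml", "Dockerfile", "terraform/*"]),
]

def classify(path: str) -> str:
    """Classify a file path as test, config, or application code."""
    for scope, patterns in SCOPE_PATTERNS:
        if any(fnmatch.fnmatch(path, pat) for pat in patterns):
            return scope
    return "application"
```

This does nothing for the harder half of the failure mode, intentional operational changes documented in the commit message, but it is a one-line early exit that removes the easy misfires.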

What the data actually looks like

Cluster analysis produces humbling numbers on the first honest run. Reviewers that felt “mostly okay” in manual spot checks turn out to sit at 19%–56% accuracy when audited against hundreds of real findings. Five hundred-plus false positives across a handful of reviewers is not an unusual starting point.

Humbling is not a reason to give up on the approach. It is the entry-point measurement that the rest of the methodology gets to work on. The cluster analysis produces a ranked engineering backlog. The fix patterns in Part 3 give you the toolkit. And the meta-problem of using AI to audit AI, which also gets covered in Part 3, explains why these early numbers should be taken as lower bounds for how much improvement is possible.

Between the diagnosis and the fix, the most important thing the methodology does is reframe the problem. “Our AI reviewer is noisy” is not actionable; “this reviewer has 48 false positives, 20 of them clustering into one root cause (ignored exclusions for infrastructure paths), with a named edit to the scope-filter section as the fix” is.

What is next

Part 3 covers the four structural fix patterns that emerged across hundreds of audited findings. Each pattern maps directly onto the named sections of a review skill described in Part 1: pre-flag verification checklists into detection rules, structured exclusion categories into the exclusion list, evidence verification gates, and path-based early exits in the scope filter. It also covers the meta-lesson that explains why almost every skill already contains the correct rules but fails to apply them (instruction salience), and the recursive trust problem of running audits with LLMs themselves, including the two external bodies of work (multi-review aggregation and independent grading) that make the whole methodology defensible.

← Back to Part 1 · Continue to Part 3 →