[Hero figure — same prompt, same model, five different answers. Input (×5, identical): claude -p "verify the technical claims in this PR" at temperature = 0. Output (×5, divergent): run 1 PASS, all claims verified; run 2 FAIL, claim 3 unsupported; run 3 UNSURE, cannot verify claim 2; run 4 PASS, all claims verified; run 5 FAIL, claim 2 contradicts source.]

When 'Working' Isn't Repeatable

dev ai claude evals stochasticity

Run the same Claude skill against the same pull request five times, and you can get five different verdicts on whether the technical claims are accurate, with the prompt, machine, and model held constant. Run a more complex review flow against the same diff, twice in a row, and the findings will not agree on what the issues are. Ask a frontier model in a headless claude -p session to confirm a specific implementation detail, run the prompt three times, and one of the three answers will flatly contradict the other two.

At which point you stop and ask: how do I trust any of this? If a system I am using as a quality gate cannot give me the same answer twice, in what sense is the gate working?

Frontier LLMs are not deterministic. Anthropic’s API reference says so in plain English: “Note that even with temperature of 0.0, the results will not be fully deterministic.” OpenAI’s seed parameter docs make a near-identical admission, with extra hedging about backend fingerprints. Engineers’ mental model from regular software, where the same input produces the same output every time, is wrong, and it is wrong in a way that breaks every assumption about what “tested,” “reviewed,” or “verified” mean.

This post covers why it happens, why it compounds when you stack LLMs into your tooling, and how to recover trust. Trust is recoverable. It is not free, and lowering temperature does not fix it.

Three layers of trust collapse, in escalating order

Those three failures look parallel at first. They are not. They sit on a ladder, and each rung weakens the ground beneath the next.

[Figure: a three-layer ladder of escalating trust collapse — Layer 1, the measurement instrument is noisy (the accuracy-check skill); Layer 2, the review tool is noisy (the PR review flow); Layer 3, the source of truth is noisy (the frontier model as oracle) — each layer compounding the variance of the layer below.]

Layer 1: Your measurement instrument is noisy

A skill that runs at PR time to evaluate technical accuracy is, by job description, a measurement device. You point it at a claim, it returns a verdict. When that device returns five different verdicts on the same claim across five identical runs, its measurements stop being trustworthy on a single read. You have a noisy instrument.

Noisy instruments are a known problem in regular engineering. Your CI flake rate is a noisy instrument. Load-test percentile measurements are noisy. Engineers have learned to handle them by averaging across multiple runs and reporting a confidence interval rather than a single number. Same discipline applies here, with one twist: in regular engineering you usually know roughly how noisy your instrument is. With LLMs above temperature zero, you often do not, until you measure.

Layer 2: Your review tool is noisy

That second failure sits one rung up. A review flow is not measuring a single claim, it is making decisions about a body of code, and those decisions feed into other decisions downstream. A reviewer that flags a critical issue on Tuesday and misses it on Wednesday is not just noisy, it is non-monotonic. Which issues you have been told about depends on which run you read.

Compare to a static analyzer. SonarQube on the same code, same rules, gives the same findings every time. That stability is what makes its output composable: you can grep its history, diff findings between commits, and treat the absence of an issue as evidence the issue is not there. An LLM reviewer breaks all three of those affordances at once.

Layer 3: Your source of truth is noisy

Layer three is the most insidious. When you run claude -p to confirm a technical detail of an implementation, you are using the model as an oracle. You are asking, in effect, “Is this correct?” and treating the answer as ground truth.

If the oracle is non-deterministic, ground truth is non-deterministic. Two engineers who consult the same model with the same prompt about the same implementation can walk away with contradictory beliefs about what the system does. That is not a measurement problem, it is an epistemology problem.

Each layer compounds the one below. A noisy instrument inside a noisy review flow whose findings are confirmed by a noisy oracle is not three independent random variables, it is a stacked variance amplifier. Trust does not just degrade. It collapses.

Why temperature=0 doesn’t save you

Most engineers who hit this for the first time set temperature=0 and assume the problem is gone. It is not gone. Vendor docs are blunt about this if you read them. Anthropic’s API reference, on the temperature parameter: “Note that even with temperature of 0.0, the results will not be fully deterministic.” OpenAI’s seed parameter docs say the system “will make a best effort to sample deterministically” and that “determinism is not guaranteed.”

Two vendors, two admissions in their own production documentation. Engineers’ mental model that “temperature controls randomness, so temperature=0 means no randomness” is incorrect. Temperature controls the sampling distribution at the token level. It does not control the underlying inference computation.
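What temperature=0 buys you, and what it does not, is easy to see in a toy sampler. This is a minimal numpy sketch with made-up logits, not any vendor’s implementation:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float) -> int:
    """Token-level sampling: temperature reshapes the distribution over tokens."""
    if temperature == 0:
        return int(np.argmax(logits))  # greedy: always pick the top logit
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# temperature=0 removes the randomness *here*, at the sampler. But the logits
# come out of the inference kernels, and if they drift by a hair between runs
# (covered below), the argmax flips with no sampling randomness involved.
logits_run_a = np.array([2.4000001, 2.4000000, -1.0])
logits_run_b = np.array([2.3999999, 2.4000000, -1.0])
print(sample_token(logits_run_a, 0.0), sample_token(logits_run_b, 0.0))  # 0 1
```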

So where does the residual non-determinism come from? Until recently the canonical public answer pointed at sparse mixture-of-experts architectures and batched inference. A 2023 explainer, “Non-Determinism in GPT-4 is caused by Sparse MoE” by 152334H, made the case empirically: GPT-4 produced 11.67 unique outputs per 30 attempts at temperature 0, far above other models tested in the same study. Here is the mechanism: when expert-routing decisions in a MoE model depend on what other tokens are in the same batch, your output depends on whose traffic happens to be sharing your GPU at that moment.

That explanation held for two years and shaped most engineering folklore on the subject. In September 2025, Horace He and Thinking Machines Lab published a sharper account. MoE routing is a contributor, but the dominant cause across both MoE and dense models is something simpler: batch-size variance. Inference servers batch incoming requests opportunistically based on load. Different batch sizes cause GPU kernels (matrix multiplication, attention, normalization) to use different reduction orders. Floating-point addition is not associative, so different reduction orders produce different numerical results. Same input, same model, different load on the cluster at the moment your request arrives, all yield different output.
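The raw ingredient is visible in a few lines of Python. The chunked sums below are a crude stand-in for different GPU reduction orders under different batch shapes, not a model of any real kernel:

```python
# Associativity fails in IEEE 754: grouping alone changes the last bits.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
assert a != b

# Same effect at reduction scale: summing identical values in different chunk
# sizes (standing in for different reduction orders) is typically not
# bit-identical, and those last bits can flip a downstream token choice.
import random
random.seed(0)
vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def chunked_sum(xs, chunk):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

print(chunked_sum(vals, 32) - chunked_sum(vals, 113))  # usually a tiny nonzero drift
```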

Read that again. Your output depends not just on your input, but on how busy the inference cluster was when your request landed.

Horace He’s team also showed this is fixable: their batch-invariant kernels eliminate the variance, at the cost of some throughput. As of writing, no major frontier-model API ships this fix in production. You are paying for stochastic outputs whether you want them or not.

What this rules out

A few common workarounds become less compelling once you understand the mechanism:

  • Setting temperature=0 does not address GPU-level reduction order. It addresses sampling, which is downstream of the variance source.
  • Setting a seed (where the API supports it) constrains the sampler, not the kernel. OpenAI explicitly warns that even matching system_fingerprint and seed leaves “a small chance that responses differ … due to the inherent non-determinism of our models.”
  • Self-hosting the model improves your control over the kernel stack and batching policy, but most people who self-host still run a stock vLLM or TGI build, which has the same batch-size sensitivity Horace He’s piece describes.

Determinism is achievable in 2025. It is just not the default any vendor ships, and it is not the default the open-source serving stacks ship either.

How to recover trust: three disciplines

Engineers know how to ship reliable systems on top of unreliable components. We do it for distributed systems, for network calls, for hardware. Same toolkit applies here, transposed onto LLMs. Three named disciplines, in the order they should land in your stack.

Evals are regression tests with N runs and confidence intervals

Most teams running evals on LLM-based skills run each eval once. One pass tells you nothing. The output you got is a single sample from a distribution, and a single sample cannot distinguish “this prompt is now better” from “this prompt is the same and you got a high-tail draw.”

Run every eval N times. Five at minimum, ten if you can afford it. Report the pass rate as mean ± stddev, not as a single percentage. Now the interesting question becomes: did the new prompt move the mean by more than two standard deviations of the baseline noise?

Worked example, names elided. A prompt change moved an eval pass rate from 78% to 86% on a single run. Looks like an 8-point improvement. Run it ten times each: baseline at 80% ± 7pp, candidate at 84% ± 6pp. Distributions overlap heavily. Your “improvement” is inside the noise floor; you have no evidence the change did anything.
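A minimal sketch of that discipline in Python. The pass rates are illustrative stand-ins roughly shaped like the worked example, not real eval output; swap in your own harness:

```python
import statistics

def summarize(pass_rates):
    """Mean ± sample stddev over repeated passes of the same eval."""
    return statistics.mean(pass_rates), statistics.stdev(pass_rates)

# Ten passes each, identical prompt and eval set every time.
baseline  = [0.70, 0.77, 0.87, 0.80, 0.72, 0.90, 0.78, 0.84, 0.75, 0.87]
candidate = [0.75, 0.82, 0.90, 0.84, 0.77, 0.93, 0.83, 0.87, 0.80, 0.89]

b_mean, b_sd = summarize(baseline)    # ~80% ± 7pp
c_mean, c_sd = summarize(candidate)   # ~84% ± 6pp
print(f"baseline  {b_mean:.0%} ± {b_sd:.0%}")
print(f"candidate {c_mean:.0%} ± {c_sd:.0%}")
# The distributions overlap heavily; the single-run "improvement" is noise.
```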

One pass tells you nothing. Ten passes tell you a distribution. The distribution is the eval; the single number is theater.

Benchmarking means baselining variance before measuring change

You cannot say a change improved a system if you do not know what the system’s variance looked like before. This is statistical process control applied to prompts and skills. Before tuning anything, run the existing skill against your eval set ten times and compute the standard deviation per metric. That is your noise floor.

Now you have a yardstick. Changes that move metrics by less than two standard deviations of the baseline are inside the noise; you cannot claim they are improvements. Changes that move metrics by more than two standard deviations are evidence of real signal, worth investigating further.
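In code, the noise floor and the two-standard-deviation gate look something like this, assuming a hypothetical harness that returns one value per metric per run:

```python
import statistics

def noise_floor(baseline_runs):
    """Per-metric noise floor: stddev across N identical runs of the unchanged skill.

    `baseline_runs` maps metric name -> list of values, one per run
    (hypothetical harness output, run ten times with nothing changed).
    """
    return {metric: statistics.stdev(values) for metric, values in baseline_runs.items()}

def is_signal(metric, baseline_mean, candidate_mean, floor):
    # Claim an improvement only when the delta clears two standard
    # deviations of the baseline's own run-to-run variance.
    return abs(candidate_mean - baseline_mean) > 2 * floor[metric]
```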

Most teams skip this step and go straight to A/B-style prompt tuning. Without a baseline, you are flipping coins and counting heads. Audit work in Part 2 of the AI code review series on this site only produced credible accuracy numbers because every reviewer was run against the same several-hundred-finding corpus before and after each change, with the variance band reported alongside the headline number.

Structured prompts are variance reduction

Free-text outputs have the largest distribution of possible answers. A prompt that asks “is this implementation correct?” returns a paragraph that could legitimately be phrased a hundred ways, half of them subtly contradictory. Output variance reflects input freedom.

Structured outputs collapse the distribution. A prompt that demands a JSON schema with {"correct": boolean, "reasons": string[]} cuts the answer space dramatically. Tighter schemas cut it further. Few-shot examples narrow it more. Explicit refusal paths (“if you cannot determine, return null”) prevent confabulation in the long tail of the distribution.
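A sketch of what enforcing that schema looks like, reusing the headless claude -p style of invocation shown earlier; the prompt wording and the check_claim helper are illustrative, not any particular published skill:

```python
import json
import subprocess

SCHEMA_INSTRUCTIONS = (
    "Respond with ONLY a JSON object matching this schema:\n"
    '{"correct": true | false | null, "reasons": ["..."]}\n'
    'If you cannot determine correctness, set "correct" to null.'
)

def check_claim(claim: str) -> dict:
    prompt = f"Evaluate the claim below.\n{SCHEMA_INSTRUCTIONS}\n\nClaim: {claim}"
    # Headless invocation, mirroring the claude -p usage above; any CLI or SDK
    # call that returns the model's text works the same way here.
    out = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # Enforce the schema instead of trusting free text: anything unparseable
    # or off-schema is treated as a refusal, never as a verdict.
    try:
        parsed = json.loads(out)
    except json.JSONDecodeError:
        return {"correct": None, "reasons": ["unparseable output"]}
    if not isinstance(parsed, dict) or parsed.get("correct") not in (True, False, None) \
            or not isinstance(parsed.get("reasons"), list):
        return {"correct": None, "reasons": ["off-schema output"]}
    return {"correct": parsed["correct"], "reasons": parsed["reasons"]}
```

Run that five times and the spread is over three possible verdicts plus a bounded list of reasons, not a hundred paragraph phrasings, which is exactly the point.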

You are not eliminating stochasticity at the kernel level; that fight is Horace He’s, not yours. You are shrinking the support of the output distribution. Same underlying noise produces a far smaller spread of human-meaningful answers when the answer space itself is constrained.

Combined: an eval suite that runs N times against structured outputs with a known baseline variance is what trust looks like in this regime. Each piece on its own is insufficient. Together they recover the affordances regular software gives you for free.

The shift

Stop asking “is the LLM right?” Start asking “what is the distribution of answers from this LLM, and is that distribution acceptable for this use case?”

A reviewer that catches 95% of real issues with a 10% false positive rate, drawn from a distribution that varies by 3 percentage points across runs, is a tool you can ship. That same reviewer described as “85% accurate” with no variance reported is a slide. One is engineering, the other is marketing.

Distributed systems engineers know this in their bones. Your service-level objective is not “the request will succeed.” Your SLO is “p99 latency stays under 200ms across a measurement window of one hour.” Same shift here. Stop reasoning about LLMs as if their outputs were values; reason about them as samples from a distribution, and design your gates, your tests, and your SLOs accordingly.

Trust is recoverable. It costs ten runs where you used to do one, structured outputs where you used to use freeform, and a baseline variance number where you used to use vibes. All in, that bill is small compared to the cost of a tool nobody trusts because they have already learned not to.