[Figure: The cost of accuracy. You price one pass; you pay for every pass it takes to be right. What you priced: generation, one pass. What accuracy actually costs: (1) generate, the first attempt and the priced pass; (2) verify, check the claim or the code; (3) re-prompt, fix what the check caught; (4) re-verify, confirm the fix actually holds; (5) human sign-off, someone reads it before ship. Footer: The verification tax · Anthropic + OpenAI + Gemini.]

The Cost of Accuracy

Tags: dev, ai, claude, cost, agents, verification, evals

You can look up what a token costs, but not what a correct answer costs, and that gap is the line item almost nobody budgets for. It only widens as models get cheaper.

When a team prices frontier-model work, it prices generation: tokens in, tokens out, times volume. That math assumes every task is one-shottable, that the first answer is the answer. Almost nothing worth paying for behaves that way. You do not buy answers from a frontier model; you buy attempts, plus the passes it takes to learn which attempt was right. Call it the one-shot illusion: pricing assumes one pass, the work takes N, and the cost of accuracy is whatever happens in passes two through N.

I have spent the last three months inside evals that run tens of thousands of dollars a month in tokens, and the failure mode is always the same. The first output looks finished. A frontier model writes a design doc that reads beautifully and code that looks nearly perfect, and it holds right up until you read one layer deeper. The cost of getting AI right at scale is not in that first pass; it is in every pass that confirms the first one was actually right, and that is the line tech leaders have not started budgeting.

Token cost dropped, review time did not

Sonar’s State of Code survey of more than 1,100 developers, released January 2026, found 96% do not fully trust that AI-generated code is functionally correct, yet only 48% always check it before committing. The same survey, run by a company that sells code-quality tooling and so has a stake in the gap it measures, also found 38% say reviewing AI-generated code takes more effort than reviewing a human colleague’s. Generation got cheap; the checking did not, and the cost scales with the number of passes, not the price of any one of them.

The cleanest measurement of that hidden cost comes from METR’s randomized trial, which put 16 experienced open-source developers through 246 real tasks on mature repositories they had contributed to for years. With AI tools allowed, they took 19% longer. Afterward they estimated the tools had sped them up by about 20%. The cost was real and the perception of it ran in the opposite direction, which is what a hidden cost looks like when you graph it. Rishi Baldawa calls the effect the verification tax: a developer “spent more time fixing AI-generated code than they saved generating it, but that cost is diffuse enough that it doesn’t register as ‘AI made me slower.’”

It is not only code, it is facts

Writing this post, I ran a research pass to pull recent sentiment on AI cost. The first tool returned nothing, so a second pass through web sources surfaced a confident secondary claim: that METR had walked back its 19% slowdown finding in a February 2026 update. That would have rewritten a paragraph. Reading the primary source, METR’s own February 2026 post, showed the reverse. They kept the 19% result intact and are rebuilding a later experiment that selection bias had already spoiled, because the developers who benefit most from AI kept refusing the no-AI arm.

One generated claim, one verification pass, one averted error, and the pass that caught it cost more than the pass that produced it.

That is the same property that makes a single passing run meaningless on its own: a frontier model can return five different answers to the same question, so one correct-looking answer tells you nothing until you have checked it against something. Verification is not overhead you can optimize away. It is the price of using a stochastic system to produce something that has to be right.

Where the money actually goes

The per-pass cost compounds hardest inside agent loops. Each step resends the accumulated context, so by step 20 you have paid for the system prompt twenty times and for each earlier step’s output on every step since. Tom’s Hardware reported on May 23, 2026 that agentic workflows can burn up to a thousand times the tokens of a single query in the worst case, depending on step count, and that the climbing bills have pushed some large engineering orgs to rein in usage. Not every pass is a human reading output. Many are automated: a test suite, a validator, a cheaper model checking the expensive one’s work. Running the verifier on a small model can drop the per-pass cost by roughly an order of magnitude. Prompt caching trims the cost of the repetition in a similar way, with cached input billed at a fraction of fresh input. Both lower the cost of each pass; neither lowers the number of passes, which is the term that dominates.
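To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes a loop where every step resends the system prompt plus the full history, and every step adds a fixed number of tokens to that history. The function name, the token counts, and the cached-input price ratio are all illustrative assumptions, not figures from any provider's price sheet.

```python
def agent_loop_input_cost(system_tokens, tokens_per_step, steps, cached_price_ratio=1.0):
    """Billed input for an agent loop, in fresh-input-token equivalents.

    Assumptions (hypothetical, for illustration only):
    - every step resends the system prompt plus all accumulated history
    - each step appends `tokens_per_step` tokens to the history
    - the prefix already sent on the previous step is billed at
      `cached_price_ratio` times the fresh price (1.0 means no caching)
    """
    total = 0.0
    prev_context = 0  # tokens that were already sent on the previous step
    for step in range(steps):
        context = system_tokens + step * tokens_per_step
        fresh = context - prev_context  # tokens the provider has not seen yet
        total += prev_context * cached_price_ratio + fresh
        prev_context = context
    return total


single_query = 2_000  # made-up system prompt size, also the one-shot baseline
no_cache = agent_loop_input_cost(2_000, 1_500, steps=20)
with_cache = agent_loop_input_cost(2_000, 1_500, steps=20, cached_price_ratio=0.1)

print(f"no caching:   {no_cache:,.0f} tokens, {no_cache / single_query:.1f}x a single query")
print(f"with caching: {with_cache:,.0f} token-equivalents")
```

With these made-up numbers, 20 steps bill 325,000 input tokens against 2,000 for a single query, and a 0.1 cached ratio brings that down to roughly 60,000 token-equivalents. The discount changes the price of each resend. It does not change the step count, which is the term that grows the bill.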

At the portfolio level, that unbudgeted pass count is what MIT’s NANDA research, reported by Fortune, measured when it found 95% of enterprise generative-AI pilots delivered no measurable impact on profit and loss. A pilot that demos in one pass and needs five in production did not fail because the model was weak. It failed because nobody priced passes two through five: the review time, the re-prompting, the human who reads every output before it ships.

The questions that actually set the budget are about counts, not rates. How many passes does it take for a model to validate its own success and get the facts completely right? How many models does a change run through before you trust it? How many rounds of human and AI review before you would bet the business on a major change? Those counts come into focus once you are past the honeymoon phase, when a beautiful-looking first output stops counting as done.

The honest unit cost of frontier-model work is generation plus every verification pass, divided by the outputs that cleared the bar you would hold a human’s work to. Most budgets quietly run that ratio at one pass, where the work looks cheap. Run it at the number a task actually needs, which is rarely one, and the figure changes enough to plan against. Generation price is not the variable you control; pass count is. And verification costs something even when the first answer happens to be right. Log the passes each task takes for a week before you price the next one; that number, not the per-token rate, is the bill.
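The ratio above can be sketched in a few lines. This is one possible shape for a pass log, not a prescribed one: the `TaskLog` structure, the helper names, and every dollar figure are hypothetical, and human review time is assumed to have been converted to dollars before logging.

```python
from dataclasses import dataclass


@dataclass
class TaskLog:
    """One logged task (hypothetical structure)."""
    pass_costs: list[float]  # dollars per pass: generation first, then each later pass
    accepted: bool           # did the final output clear the bar?


def unit_cost(logs):
    """Total spend across every pass, divided by the outputs that were accepted."""
    total = sum(sum(task.pass_costs) for task in logs)
    accepted = sum(1 for task in logs if task.accepted)
    if accepted == 0:
        raise ValueError("no accepted outputs; unit cost is undefined")
    return total / accepted


def mean_passes(logs):
    """Average number of passes per task, accepted or not."""
    return sum(len(task.pass_costs) for task in logs) / len(logs)


# A made-up week of four tasks. Generation is priced at $0.10 per attempt.
logs = [
    TaskLog([0.10], accepted=True),                          # one-shot, shipped unchecked
    TaskLog([0.10, 0.02, 0.10, 0.02], accepted=True),        # generate, verify, re-prompt, re-verify
    TaskLog([0.10, 0.02, 0.10, 0.02, 0.50], accepted=True),  # same, plus human sign-off
    TaskLog([0.10, 0.02], accepted=False),                   # failed verification, abandoned
]

print(f"priced per output: $0.10, actual: ${unit_cost(logs):.2f}")
print(f"mean passes per task: {mean_passes(logs):.1f}")
```

In this invented log the budgeted figure is $0.10 per output and the logged figure is $0.40, at an average of three passes per task. The abandoned task still counts in the numerator, because its passes were paid for even though nothing shipped.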