[Hero graphic: review.md, 4.6 era ("Review this PR. Be conservative, only flag high-severity issues. Think step by step before responding.") vs. 4.7 era ("Review this PR. Report all findings. Mark severity: critical, high, medium, low, nit. Human filters."). Caption: the model didn't get worse, the contract changed.]

The Harness Is the Bottleneck: Five Prompts That Broke on Opus 4.7

ai claude prompt-engineering evals

Anthropic’s prompting documentation ships a warning that almost nobody has read: code review prompts that say “be conservative” or “only flag high-severity issues” may cause Opus 4.7 to silently drop real bugs. The model investigates just as carefully as earlier models did. It identifies the bug. Then the report omits it, because we told the model to be conservative.

That is one of five prompt patterns that broke between Opus 4.6 and Opus 4.7, and Anthropic ships the warning for each one inside docs that almost no migration toolchain surfaces. None of them get caught by pip install -U anthropic. None of them produce a 400 error. Most produce silently worse output that looks the same as what you had before.

This post names the five, sources each to the canonical Anthropic doc that warns about it, and closes with a Monday-morning grep audit you can run against your own codebase. It is a workaround manual for a contract change that did not ship with migration tooling.

The Two Contracts

Prompting on Opus 4.7 is no longer a single instruction-following contract. It is two contracts:

  • A budget contract: the effort parameter decides how hard the model tries before answering. xhigh is the recommended default for coding. max is “the trap” (more on that in Mistake 5).
  • A literalness contract: the model executes what you said, not what you meant. Per Anthropic’s launch announcement, “Where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.”

Both contracts changed in April 2026. Half your prompt library is signed against the old terms.

Anthropomorphism is the wrong frame here. Opus 4.7’s post-training shifted instruction-following weights toward narrow execution. Where 4.6 generalized from one example to a class of similar items, 4.7 stops at the example. Where 4.6 inferred unstated requests from context, 4.7 asks. None of this is the model “being more careful” or “being more obedient.” Those are the wrong abstractions. A different post-training distribution scores narrow execution higher than loose interpretation, and the resulting policy is what you are now prompting against.

[Figure: 2x2 grid, instruction interpretation (loose↔literal) × thinking mode (forced↔adaptive), with Opus 4.5/4.6/4.7 placed along a diagonal toward "fully adaptive defaults."]

Karpathy’s “Software 2.0” essay (2017) traced the arc from “what code does” to “what code learns.” Adaptive thinking and the literalness shift are the next step on the same arc: not just what the model does, but how hard it tries and exactly what it executes. Your prompt is no longer instruction. It is a multi-dimensional contract, and migrating between models means re-signing it.

Boris Cherny, who leads Claude Code at Anthropic, posted on launch day that “it took a few days for me to learn how to work with it effectively” (X, 2026-04-16). If the lead engineer of Claude Code spent days recalibrating, the rest of us are not going to coast on 4.6 muscle memory. Your prompt library from 2025 deserves a closer look than the changelog suggests.

Mistake 1: The Conservative-Prompt Trap

This is the failure mode the post opens with, and it is the one that makes the literalness contract concrete. From Anthropic’s prompting best practices doc:

When a review prompt says things like “only report high-severity issues,” “be conservative,” or “don’t nitpick,” Claude Opus 4.7 may follow that instruction more faithfully than earlier models did — it may investigate the code just as thoroughly, identify the bugs, and then not report findings it judges to be below your stated bar.

Mechanism: the literalness contract from the previous section. Opus 4.5’s review pass would silently bump moderate-severity bugs into the “high-severity” bucket because the surrounding instruction said “find the important problems.” Opus 4.7’s pass partitions strictly: it finds everything, then applies your stated filter, then reports only what survives the filter. Investigation cost was paid. Output was discarded.

Catch-rate regressions like this never show up in your CI dashboard. Count of findings stays roughly constant. Latency stays roughly constant. Cost stays roughly constant. What drops is the catch rate on bugs sitting just below your stated severity bar, which is exactly where the bugs you actually want to find tend to live.

Eval methodology

Treat the headline as a falsifiable hypothesis, not a confirmed regression. Here is the eval to run on your own codebase:

  1. Sample 10-20 known-bug diffs. Use real PRs from open-source projects with CVE-tagged commits, or your own historical PRs where the post-merge fix landed within a week.
  2. Run identical review prompts against claude-opus-4-6 and claude-opus-4-7, both at effort: "xhigh" (the new recommended coding default).
  3. Run again with restraint language stripped. Remove every instance of “be conservative,” “only flag high-severity,” “skip nitpicks,” “don’t be pedantic.” Keep everything else identical.
  4. Score on three axes: catch rate (did the review name the actual bug), false positive rate, severity grading. For severity, name your grader explicitly. If you use a calibrated GPT-5 or Sonnet-4.6-as-judge, validate the grader against a held-out human-labeled set before trusting its grades. LLM-as-judge needs its own eval, and “I asked Claude to grade Claude” without that step is the same vibes problem the post is arguing against.

I have not yet run this eval at scale. The opening anecdote is what Anthropic’s docs warn about, not a controlled measurement I am asserting. The methodology above is what I am running next; results and a companion repo land in part 2 of this post by 2026-06-10. There is no companion repo URL yet, by design: I won’t link a repo that does not exist. Until then, treat the headline mistake as a falsifiable hypothesis: run the eval on your own codebase before changing your review prompts.
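For concreteness, here is a minimal sketch of steps 2-3 in Python against the Messages API. The effort parameter and the claude-opus-4-6 / claude-opus-4-7 model IDs follow this post's description of the 4.7 API; KNOWN_BUG_DIFFS, the two prompt variants, and the substring-match grader are placeholders you would swap for your own diffs and a validated grader.

import anthropic

client = anthropic.Anthropic()

# Placeholder corpus: each entry is a diff plus a short description of the known bug.
KNOWN_BUG_DIFFS = [
    {"diff": "...", "bug": "off-by-one in the pagination cursor"},
    # 10-20 entries sampled from CVE-tagged commits or quickly-reverted PRs
]

RESTRAINED_PROMPT = (
    "Review this PR. Be conservative, only flag high-severity issues.\n\n{diff}"
)
UNRESTRAINED_PROMPT = (
    "Review this PR. Report all findings. "
    "Mark severity: critical, high, medium, low, nit.\n\n{diff}"
)

def catch_rate(model: str, prompt_template: str) -> float:
    hits = 0
    for case in KNOWN_BUG_DIFFS:
        resp = client.messages.create(
            model=model,
            max_tokens=4096,
            effort="xhigh",  # effort parameter shape per this post; assumption
            messages=[{"role": "user",
                       "content": prompt_template.format(diff=case["diff"])}],
        )
        review = "".join(b.text for b in resp.content if b.type == "text")
        # Naive grader: substring match on the known bug. Replace with a
        # validated judge or human grading (step 4) before trusting the numbers.
        hits += case["bug"].lower() in review.lower()
    return hits / len(KNOWN_BUG_DIFFS)

for model in ("claude-opus-4-6", "claude-opus-4-7"):
    for label, tmpl in (("restrained", RESTRAINED_PROMPT),
                        ("unrestrained", UNRESTRAINED_PROMPT)):
        print(model, label, f"{catch_rate(model, tmpl):.0%}")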

Fix

Audit every review or triage prompt for restraint language. Rewrite as “Report all findings. Mark severity (critical, high, medium, low, nit). Let the human filter.” Post-trained policy now treats restraint instructions as a stricter filter on what to surface; move the filter from the prompt to the human reading the output.

Eval for your codebase: before adopting the rewrite, run 20 production review prompts pre/post and compare bugs-caught at moderate severity. If the rewrite catches things the conservative version missed, the trap was real for you.

Mistake 2: The Scaffolding Tax

Every prompt that says “think step by step” or “plan before acting” or “double-check your work” was scaffolding around a reasoning gap that earlier Claude models had. Adaptive thinking at the high/xhigh defaults does the planning, the chain-of-thought, and the verification natively. Anthropic’s docs say that “at the default effort level (high), Claude almost always thinks,” and that it keeps thinking on tasks it judges substantive. Your scaffolding pays for the same reasoning twice: once in compelled chain-of-thought tokens that satisfy your literal instruction, and once in the model’s own thinking blocks that did the actual reasoning. The scaffolding tax is real output-token cost on work the model would have done anyway.

A narrower variant of the same advice appears in Anthropic’s prompting best practices: “If you’ve added scaffolding to force interim status messages (‘After every 3 tool calls, summarize progress’), try removing it.” That doc only flags the interim-status case; extending it to chain-of-thought scaffolds is my generalization, not Anthropic’s. Mechanism is the same in both: adaptive thinking subsumes the work the scaffold was paying for.

| Era | Prompt | Output composition | Output-token cost |
| --- | --- | --- | --- |
| Opus 4.5 with "think step by step" | scaffolding present | compelled chain-of-thought + final answer | ~1.7× baseline |
| Opus 4.7 at xhigh effort | no scaffolding | native adaptive thinking blocks + final answer | ~1.0× baseline |

Same model output quality. Same workload. Different prompt, different bill. Difference is the tax.

Fix

Delete the scaffolding. Raise effort one notch (high to xhigh, or xhigh to max if the workload genuinely earns it). Re-baseline against your eval. Default state on Opus 4.7 is “the model does the planning”; you only re-add scaffolding if the eval shows it earned its place back.

Eval for your codebase: pick 10 representative tasks. Run identical at xhigh effort with and without “think step by step.” Measure output quality and total output_tokens. If quality holds and cost drops, scaffolding was the tax.
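A minimal sketch of that A/B, assuming the effort and adaptive-thinking parameter shapes described in this post; TASKS is a placeholder for your ten representative tasks, and quality grading stays a separate (human or validated-judge) pass.

import anthropic

client = anthropic.Anthropic()

SCAFFOLD = "Think step by step before responding.\n\n"
TASKS = [
    "Refactor utils/dates.py so all parsing goes through one helper. ...",  # placeholder
    # ...your 10 representative tasks
]

def output_tokens(prompt: str) -> int:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        effort="xhigh",                  # per this post's 4.7 parameter shape; assumption
        thinking={"type": "adaptive"},
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.output_tokens      # thinking tokens bill as output

for task in TASKS:
    scaffolded = output_tokens(SCAFFOLD + task)
    bare = output_tokens(task)
    print(f"{task[:40]!r}: scaffolded={scaffolded} bare={bare} output tokens")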

Mistake 3: Silent Omission

On Opus 4.7, thinking.display defaults to "omitted" instead of "summarized". Streaming UIs that relied on summarized thinking blocks for progress indication now hang on a long blank pause before text starts. Thinking blocks still emit, but their thinking field is empty. Your reasoning panel goes dark. Then text streams. From the user’s seat, the model “thought for ten seconds about nothing” and then produced an answer.

A live GitHub issue on the claude-code repo shows users tripping over this immediately after the migration: the SDK changes the default with no SDK warning, the docs note the change in a paragraph that is easy to miss, and the symptoms look like a streaming bug rather than a config change.

This is a defaults problem more than a user error. Anthropic chose the new default to optimize time-to-first-text-token; that choice was the right call for batch workloads where reasoning is internal, and a breaking change for chat workloads where reasoning is the UX. Fix lives in your config, but the root cause lives in the SDK defaults, and the right migration tooling would auto-detect streaming and pick summarized.

Either way, you are charged for the full thinking tokens whether you display them or not. display: "omitted" is a latency optimization, not a cost optimization.

Fix

If you stream reasoning to a user-facing UI, set thinking.display to "summarized" explicitly:

thinking = {
    "type": "adaptive",       # adaptive thinking; the only supported mode on Opus 4.7
    "display": "summarized",  # stream summarized thinking blocks instead of omitting them
}

If your application doesn’t surface reasoning to anyone, leave display at the new default; the faster time-to-first-text-token is real.

Eval for your codebase: stream a 1k-token reasoning task with the new default and with display: "summarized". Measure time-to-first-text-token and time-to-first-thinking-token. If your UX shows progress to users, the second config is what they will see.
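A minimal sketch of that measurement using the SDK's streaming events; the thinking.display shape follows this post, and the prompt is a stand-in for your own 1k-token reasoning task.

import time
import anthropic

client = anthropic.Anthropic()

PROMPT = "Explain the tradeoffs of optimistic vs. pessimistic locking for a hot orders table."

def first_token_timings(display: str) -> dict:
    t0 = time.monotonic()
    timings = {}
    with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=4096,
        thinking={"type": "adaptive", "display": display},  # shape per this post; assumption
        messages=[{"role": "user", "content": PROMPT}],
    ) as stream:
        for event in stream:
            if event.type == "content_block_delta":
                # delta.type is "thinking_delta" or "text_delta"; record the first of each
                timings.setdefault(event.delta.type, time.monotonic() - t0)
    return timings

print("omitted:   ", first_token_timings("omitted"))
print("summarized:", first_token_timings("summarized"))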

Mistake 4: Re-Engagement Cost

The chat pattern that worked on Claude 3.5 Sonnet is now the most expensive prompt pattern on Opus 4.7: “do X” → result → “now also Y” → result → “actually also Z.” Each new user turn invites adaptive thinking to re-engage and generate fresh thinking tokens, billed at the output rate. A 10-turn drip-feed conversation generates roughly 10x as much thinking output as one fully-specified single turn at the same effort level.

Get the mechanism right because the obvious story is wrong. Prior-turn thinking blocks are not re-executed. They are preserved in the input context (cached after the first occurrence, billed at the input cache-read rate). What costs money is new thinking the model generates for each new turn, because the literalness contract from the second section means the model treats each new user request as its own scoped task and re-derives the reasoning from scratch.

Anthropic’s best-practices post for Opus 4.7 with Claude Code puts the workflow shift directly:

Specify the task up front, in the first turn. Well-specified task descriptions that incorporate intent, constraints, acceptance criteria, and relevant file locations give Opus 4.7 the context it needs to deliver stronger outputs. Ambiguous prompts conveyed progressively across many turns tend to reduce both token efficiency and, sometimes, overall quality.

That advice does not survive contact with conversational UX. Real human-in-the-loop workflows cannot always specify everything upfront because the human is figuring out what they want as they go. Real agentic loops cannot always specify everything upfront because the next step depends on the result of the previous one. Honestly: when you can spec upfront, do; when you cannot, switch to an explicit turn-budget pattern. Reset the conversation between distinct phases of work and treat each phase as its own first-turn-with-full-spec.

| Pattern | Turns | Fresh thinking generated | Billed where |
| --- | --- | --- | --- |
| Drip-feed: “do X” → “now Y” → “now Z” → … | 10 | ~9,000 tokens | output rate, every turn |
| Full-spec single turn with intent + constraints + acceptance criteria | 1 | ~1,200 tokens | output rate, once |

Prior thinking is preserved as cached input either way. Cost difference is the fresh output thinking the model generates each turn, because the literalness contract makes it re-derive its scoping from scratch.

Fix

For workloads you can spec upfront: write the full task in turn 1, with intent, constraints, acceptance criteria, and file locations. For workloads you cannot: define explicit phases, reset the conversation between phases, and treat each phase’s first turn as the full spec for that phase.
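A minimal sketch of the phase-reset pattern, assuming the effort parameter shape described in this post; the phase specs are placeholders for your own intent, constraints, acceptance criteria, and file locations.

import anthropic

client = anthropic.Anthropic()

# Each phase gets its own fully specified first turn: intent, constraints,
# acceptance criteria, file locations. The specs below are placeholders.
PHASES = [
    "Add a retry wrapper around fetch_orders() in api/client.py. "
    "Constraints: no new dependencies. "
    "Acceptance: existing tests pass; new test covers 3 retries with backoff.",
    "Wire the retry wrapper into the nightly sync job. Constraints: ... Acceptance: ...",
]

for spec in PHASES:
    messages = [{"role": "user", "content": spec}]   # fresh conversation per phase
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        effort="xhigh",            # per this post's 4.7 parameter shape; assumption
        messages=messages,
    )
    # ...apply the result and run the acceptance checks before starting the next phase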

Eval for your codebase: pick a 5-step task you would normally drip-feed. Run as 5 separate turns and as one fully-spec’d turn. Compare total output_tokens (mostly thinking) and total wall-clock. That delta is your re-engagement cost.
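A minimal sketch of that comparison, again assuming the effort parameter shape from this post; STEPS and FULL_SPEC are placeholders for your own drip-feed and single-turn versions of the task.

import anthropic

client = anthropic.Anthropic()

STEPS = ["do X", "now also Y", "actually also Z", "...", "..."]   # your 5-step drip-feed
FULL_SPEC = "Do X, Y, and Z. Constraints: ... Acceptance criteria: ... Files: ..."

def run(messages):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8192,
        effort="xhigh",            # per this post's 4.7 parameter shape; assumption
        messages=messages,
    )
    reply = "".join(b.text for b in resp.content if b.type == "text")
    return reply, resp.usage.output_tokens

# Drip-feed: one call per turn, carrying the history forward.
history, drip_tokens = [], 0
for step in STEPS:
    history.append({"role": "user", "content": step})
    reply, toks = run(history)
    history.append({"role": "assistant", "content": reply})
    drip_tokens += toks

# Full-spec: one turn, everything up front.
_, single_tokens = run([{"role": "user", "content": FULL_SPEC}])

print("drip-feed total output tokens:", drip_tokens)
print("full-spec total output tokens:", single_tokens)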

Mistake 5: The Max Effort Trap

Two opposite mistakes hit the same workload from different directions. First, leaving the API at its implicit high default while running coding workloads, where Anthropic explicitly recommends xhigh. Second, jamming max on every call (overthinks, costs more, sometimes produces worse output). Towards AI’s empirical 12-coding-problem effort sweep calls max “a trap” because of diminishing returns past xhigh and overthinking risk on bounded problems where additional reasoning produces second-guessing rather than insight.

Anthropic’s effort docs state plainly: “The API default is high. To use xhigh, set effort explicitly; the value you pass overrides the default.” Recommended starting point for Opus 4.7 is xhigh for coding and agentic work, high minimum for intelligence-sensitive workloads, and max only for genuinely frontier problems where evals show measurable headroom. Boris Cherny has confirmed the same operational default in his launch-day Opus 4.7 thread: xhigh for most tasks, max reserved for the hardest.

Two operational details migration-day teams keep tripping over. First, effort settings persist within a session: change once, stays until next reset. Second, like Silent Omission, this is partly a defaults problem. Implicit high default is reasonable for chat-style workloads but underspends on agentic coding where Anthropic’s own guidance is one notch up. Fix lives in your config, but the cause is a default tuned without knowing which workload you are actually running.

| Workload | Effort | Why |
| --- | --- | --- |
| Chat / Q&A | medium | Drop from default high; skip thinking on simple lookups; cuts cost |
| Code generation | xhigh | Native planning, reasoning, verification; cheaper than scaffolding |
| Agentic loops | xhigh | Adaptive thinking enables interleaved reasoning between tool calls |
| Hard math / proofs | max | Earned only when evals show measurable headroom past xhigh |
| Routine triage | low-medium | Stricter scoping; faster; explicit step-down from default |

Fix

Default to xhigh for coding and agentic workloads. Escalate to max selectively, only on workloads where you have measured improvement.

Eval for your codebase: run your top-10 production prompts at high, xhigh, and max. Compare task-success grading and median cost. Where xhigh matches max on success and max costs 1.5-3x more, max is the trap for that workload.
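A minimal sketch of the sweep, assuming the effort values named in this post; PROMPTS and the grade() stub are placeholders for your top-10 production prompts and your own task-success grader.

import anthropic

client = anthropic.Anthropic()

PROMPTS = ["...", "..."]                     # your top-10 production prompts
EFFORT_LEVELS = ("high", "xhigh", "max")     # values per this post; assumption

def grade(resp) -> bool:
    # Placeholder: plug in your task-success grader (tests pass, human label,
    # or a judge you have validated against human grades).
    raise NotImplementedError

for effort in EFFORT_LEVELS:
    successes, total_output = 0, 0
    for prompt in PROMPTS:
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=8192,
            effort=effort,
            messages=[{"role": "user", "content": prompt}],
        )
        successes += grade(resp)
        total_output += resp.usage.output_tokens
    print(effort, f"success={successes}/{len(PROMPTS)}", f"output_tokens={total_output}")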

Mistake 6: You Shipped Without a Regression Eval

Five mistakes’ worth of behavioral shifts, and most teams migrated by changing one config line and shipping. The most common production failure on model migrations is not any single behavior change; it is the absence of the regression eval that would have caught whichever of mistakes 1-5 are real for your codebase.

A useful regression eval is small and cheap. Thirty to one hundred graded examples sampled from your production traffic, scored on three axes: task-success rate (did the model do the thing), median cost per call (in dollars, not tokens, so the tokenizer change discussed below becomes visible), and median latency. Run pre-migration and post-migration. If any of mistakes 1-5 are real for your prompt library, the eval will surface them as regressions on at least one of the three axes. If none are real, you have shipped the model upgrade with evidence.

Without the eval, you are trusting vibes, including the vibes in this post. Every recommendation here is testable on your codebase, and every team should test before adopting. Eugene Yan’s “Patterns for Building LLM-based Systems & Products” and Hamel Husain’s “Your AI Product Needs Evals” are the two canonical references on what this should look like in practice. Both predate Opus 4.7 by a year, and both apply unchanged.
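A minimal sketch of what that regression eval looks like as a script: 30-100 examples in a JSONL file, one run per model, three numbers out. The effort parameter follows this post; the JSONL shape, the grade() stub, and the dollar rates are placeholders for your own traffic sample, grader, and negotiated pricing.

import json
import time
import anthropic

client = anthropic.Anthropic()

# examples.jsonl: one {"prompt": ..., "expected": ...} object per line,
# sampled from production traffic. Shape is a placeholder.
EXAMPLES = [json.loads(line) for line in open("examples.jsonl")]

# Fill in your per-MTok rates so cost reports in dollars, not tokens.
INPUT_RATE_PER_MTOK, OUTPUT_RATE_PER_MTOK = 0.0, 0.0

def grade(example, resp) -> bool:
    # Placeholder: your task-success check (tests pass, human label, validated judge).
    raise NotImplementedError

def run_suite(model: str) -> dict:
    rows = []
    for ex in EXAMPLES:
        t0 = time.monotonic()
        resp = client.messages.create(
            model=model,
            max_tokens=8192,
            effort="xhigh",        # per this post's 4.7 parameter shape; assumption
            messages=[{"role": "user", "content": ex["prompt"]}],
        )
        cost = (resp.usage.input_tokens * INPUT_RATE_PER_MTOK +
                resp.usage.output_tokens * OUTPUT_RATE_PER_MTOK) / 1_000_000
        rows.append({"ok": grade(ex, resp), "cost": cost,
                     "latency": time.monotonic() - t0})
    n = len(rows)
    return {"success_rate": sum(r["ok"] for r in rows) / n,
            "median_cost": sorted(r["cost"] for r in rows)[n // 2],
            "median_latency": sorted(r["latency"] for r in rows)[n // 2]}

print("pre-migration :", run_suite("claude-opus-4-6"))
print("post-migration:", run_suite("claude-opus-4-7"))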

Opus 4.7 lists at the same per-MTok price as Opus 4.6. Three things behind the price tag changed:

  • Tokenizer change. Per Cloudzero’s pricing analysis, the new tokenizer can use up to 1.35x as many tokens per character on the same text. Code-heavy and structured-data workloads cluster at the high end. Per-MTok price is unchanged; per-task cost is up by whatever your tokenizer-change ratio turns out to be.
  • Image tokens. Per the migration guide, the per-image cap moved from roughly 1,600 tokens to roughly 4,784 tokens. Workloads that pass full-resolution screenshots or PDFs straight into the model see an immediate 2-3x cost increase per image with no client-side change.
  • Banned sampling parameters. temperature, top_p, and top_k now return a 400 error. Anthropic’s position is that temperature = 0 never actually guaranteed determinism on prior models, so the parameter is gone rather than deprecated. Migration scripts that still pass these values fail with a 400 on the first call.

Manual extended thinking’s budget_tokens parameter also returns a 400 error on Opus 4.7; adaptive thinking is the only supported mode and the migration guide covers the swap. Anyone who has migrated has hit this; covered as command #1 in the harness audit below.
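A minimal sketch of a request scrubber that applies both changes before the call reaches the API; the parameter shapes (banned sampling params, budget_tokens, adaptive thinking, effort) follow this post's description of 4.7.

BANNED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")   # now a 400 on Opus 4.7

def migrate_request(params: dict) -> dict:
    """Rewrite a 4.6-era request dict so it does not 400 on claude-opus-4-7."""
    params = dict(params)
    for key in BANNED_SAMPLING_PARAMS:
        params.pop(key, None)
    thinking = params.get("thinking") or {}
    if thinking.get("type") == "enabled" or "budget_tokens" in thinking:
        # Manual extended thinking is gone; adaptive is the only supported mode.
        params["thinking"] = {"type": "adaptive"}
    params.setdefault("effort", "xhigh")    # coding default per this post; assumption
    return params

legacy = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    "temperature": 0,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "..."}],
}
print(migrate_request(legacy))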

Budget-and-literalness as a contract shift is not unique to Anthropic. Gemini 2.5’s thinking_budget parameter and OpenAI’s reasoning_effort represent the same architectural shift toward adaptive reasoning across model families. Five mistakes in this post translate to either provider with different parameter names; the shape of the migration is the same. If your stack runs Claude alongside Gemini or GPT, you are running the same audit, and your prompt-library cleanup is the same cleanup.

The Harness Audit (Monday Morning)

Six grep commands to find every place in your codebase that hits one of the five mistakes:

# 1. Find every place still passing budget_tokens (now a 400)
git grep -nE 'budget_tokens|"enabled".*budget' --
# 2. Find every "be conservative" / "only high severity" restraint prompt
git grep -niE '(be conservative|only.*high|skip.*nitpick|don.{1,2}t.*nitpick)' --
# 3. Find reasoning scaffolding now redundant under adaptive thinking
git grep -niE '(think step by step|plan before|double[- ]check|chain of thought)' --
# 4. Find adaptive-thinking call sites missing display: summarized
git grep -lE 'thinking.*adaptive' | while read f; do
  if ! grep -qE '"summarized"|display.*summarized' "$f"; then
    echo "$f: adaptive thinking without explicit display:summarized"
  fi
done
# 5. Find banned sampling params still set
git grep -nE '(temperature|top_p|top_k)\s*[:=]' -- '*.py' '*.ts' '*.json'
# 6. Audit cache hit ratio before/after migration
#    (run against your usage logs, not the codebase)
jq -s '{cache_reads: (map(.cache_read_input_tokens) | add),
        cache_writes: (map(.cache_creation_input_tokens) | add)}' usage-logs/*.jsonl
# Compare 30 days pre-migration vs 30 days post.
# A drop in the read share, cache_reads / (cache_reads + cache_writes),
# means the tokenizer change shifted your cache breakpoints.

The New Defaults

Tight checklist for sane Opus 4.7 defaults across your prompt library:

  • Set thinking: {type: "adaptive"} explicitly (silent omission of the field gives no thinking at all)
  • Set effort: "xhigh" as your default for coding and agentic workloads
  • Set display: "summarized" if your UX surfaces reasoning to users
  • Specify the full task in turn 1 (intent, constraints, acceptance criteria, file locations)
  • Delete restraint language from review and triage prompts
  • Delete reasoning scaffolding (“think step by step,” “plan before acting,” “double-check”)
  • Build verification into the agentic loop instead of asking the model to verify itself
  • Drop temperature, top_p, top_k from your client (they 400 now)
  • Audit cache breakpoints in your usage logs before and after migration
  • Ship a regression eval before the migration; not after
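The checklist above, collapsed into one request template. Parameter names follow this post's description of the 4.7 API; the prompt content is a placeholder.

import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    effort="xhigh",                   # coding/agentic default; parameter per this post
    thinking={
        "type": "adaptive",           # set explicitly rather than relying on the default
        "display": "summarized",      # only if your UX streams reasoning to users
    },
    # No temperature / top_p / top_k: they 400 on 4.7.
    messages=[{
        "role": "user",
        # Full spec in turn 1: intent, constraints, acceptance criteria, file locations.
        "content": "Refactor ... Constraints: ... Acceptance criteria: ... Files: ...",
    }],
)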

What Anthropic Should Ship Next

What follows is one author’s wishlist, not panel consensus. Anthropic shipped Opus 4.7 in a defensible state. Hard-erroring on budget_tokens instead of warning may be the right call when the parameter has no semantic meaning under adaptive thinking, since a deprecation warning would have meant the call silently runs the wrong mode for a release while the user thinks they have time to migrate. Tradeoffs are not free. But the migration surface is large enough that some of these would land:

  • A claude migrate <path> command that reads a codebase, surfaces the five mistakes from this post, and suggests rewrites. Grep audit above is what that command would run; doing it once and shipping it as tooling is the difference between a workaround manual and a migration plan.
  • An SDK default for display tied to whether the request is a streaming request. Batch workloads keep omitted; streaming workloads pick summarized automatically. Current default optimizes for one workload while silently breaking the other.
  • A model-release regression-eval template. Every model release is a contract change. Shipping the model with a 50-prompt regression eval template that maps to common workload patterns (chat, coding, agentic) would do more for migration adoption than any blog post.
  • A cost_audit flag in the SDK that surfaces tokenizer-change deltas and cache-breakpoint shifts in usage logs without requiring custom jq pipelines.

If you push 4.7 the way you pushed 4.6, the model will tell you it disagrees, then do a modified version of what you asked, then re-argue when corrected. That is not a regression. That is the model holding up the part of the contract that says “follow instructions literally.” Fix is to write the contract more clearly.

Harness is now the bottleneck. That is this post. Part 2 ships the eval results and a companion repo by 2026-06-10 (30 days from publish). If the eval doesn’t replicate the warnings in Anthropic’s docs, Part 2 says so; a falsification is still a publication. Separate from either of mine, the thing Anthropic should ship is the migration tooling that makes the harness retune itself. If Anthropic does not ship claude migrate, the third-party tool that does will own the migration moment for every model release going forward, and that is the category opening this contract change creates.