Friction Is the Feature
Becker et al. (METR, 2025) ran a randomized controlled trial across 246 real issues in 16 experienced developers’ own open-source repositories and found AI assistance slowed completion time by 19%; the same developers, even after the data came in, still believed AI had sped them up by about 20%. That study is from July 2025 and remains the largest-N controlled comparison of AI-assisted versus unassisted development published to date. The 39-percentage-point gap between perceived speedup and measured slowdown is the cleanest summary I have read of what AI tooling does to people who use it without discipline.
Without deliberate friction, AI assistance compounds capability and compounds dependence. After eighteen months of heavy use, I shipped more than I have in any comparable stretch of my career, and I caught myself losing the muscle that made any of it possible. The four-practice discipline below is what I run now to keep the first half without the second. If you read nothing else, wire one test runner into CI on push before lunch Monday: it is the single best gate to start with, and the post explains why under “Gate every change.”
Two of the five studies cited below are first-authored by Anthropic-affiliated researchers; treat the direction as confirmed by the independent METR, Zhou et al., and Beck et al. work, and the magnitudes as provisional pending cross-lab replication.
The evaluation-substitution loop
The mechanism is straightforward: a model emits a candidate answer, and the developer’s job shifts from generating a solution to ratifying one. The same evaluative posture that would catch a junior teammate’s bug gets relaxed because the candidate looks fluent.
Shen and Tamkin (2026) randomly assigned 52 software engineers to learn an unfamiliar Python library with or without AI; the AI-assisted group averaged 50% on the follow-up comprehension quiz against 67% for the hand-coding group, a gap that widened on debugging questions and came with no measurable time savings. Seventeen percentage points is a lot of comprehension to give up for nothing. Inside the AI-assisted group, engineers who asked the assistant conceptual questions scored above 65% in the paper's breakdown; the ones who delegated code generation wholesale scored below 40%. The tool was not the problem; the interaction pattern was.
Zhou et al. (ICSE 2026) coded 2,013 development actions across 14 developers and found that 56.4% of LLM-assisted actions were cognitively biased versus 48.8% of actions overall. The paper labels “Instant Gratification” and “Suggester Preference” as the dominant LLM-specific bias categories (Table 3 in the PDF); the second one names the failure mode directly. Developers keep accepting the path the suggestion sets because the cognitive cost of generating an alternative is higher than the cost of ratifying what is already on screen.
Beck et al. (2025) randomized 2,784 participants on an AI-assisted annotation task and reported that pro-AI attitudes were the single strongest predictor of accepting incorrect suggestions, stronger than age, education, or task experience. The pattern under all three studies is the same: confidence rises, accuracy falls, and the people most enthusiastic about the tool are worst at catching it when it is wrong.
The reinforcement-schedule loop
The mechanism is a variable-ratio reward delivered in under a second. Each prompt may or may not produce a useful completion; the unpredictability is what makes the schedule sticky. Shen et al. (CHI 2026) coded 334 chatbot-addiction posts across 14 subreddits and named the pattern the “AI Genie”: minimal-effort on-demand fulfillment that produces self-reported salience, tolerance, conflict, and relapse, the same component signature behavioral psychologists use for gambling and gaming disorder.
The Shen et al. corpus is chat-companion users on Reddit, not Claude Code users; I am extrapolating that the schedule mechanics are similar because the sub-second loop and variable reward are the same shape, not because anyone has measured it on a coding assistant. Treat the conclusion as a hypothesis worth a study, not a finding worth a citation.
What the hypothesis predicts in practice: restless when the assistant is slow, irritated by tasks that require waiting, reaching for the prompt before the question is fully formed. Recognizing that pattern in yourself is the precondition for the third practice below.
The methodology in one line
Friction is the feature. The phrase first appeared in dumber-models-on-purpose, which argued for routing simple tasks to less capable models so you have to think first; this post extends the same heuristic into a four-practice discipline that engineers friction back into the workflow once you can no longer rely on the model being too dumb to help.
The four practices are the methodology. Read them as a list because the list is the discipline:
- Demand a citation.
- Gate every change.
- Break on the counter.
- Teach the catch.
Each one targets a different failure mode from the two loops above, and each one costs something measurable.
Demand a citation
The practice is simple: any factual claim a generated response includes gets a primary source attached before the claim is allowed into a draft, a commit, or a PR description. Not a confident assertion, not a paraphrase of what the documentation probably says, an actual link or a verified quote.
This is not paranoia. It is a response to specific incidents. My pre-publish fact-check rule exists because off-by-one errors kept slipping through fluent summaries. One generated summary asserted a script was 188 lines; it was 187. Another asserted a workflow had 31 scripts; the count was 32. Each error survived a paragraph of explanation that read fluently and was wrong. Technical readers and AI search engines both read specificity as a citability signal, so the cost of a single off-by-one is a citation that does not happen, a hiring manager who reads the post as careless, a dependent claim downstream that inherits the error.
The smallest version of the practice is a prompt template you paste at the start of any research-heavy session:
For every factual claim in your response, attach the primary source URL inline in the same sentence as the claim. If you cannot produce a primary source, mark the claim as “unverified” and stop. Do not infer percentages, dates, or counts you have not seen.
Run it. Watch the prompt template suppress unsourced output. The friction is the feature.
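The template gates generation time; the off-by-one class of error also yields to a mechanical check at review time. A minimal sketch, assuming a hand-maintained list of (path, claimed count) pairs collected while drafting; the entries and file names below are hypothetical examples, not the check that runs in my pipeline:

```python
#!/usr/bin/env python3
"""Verify numeric claims in a draft against the repository they describe.

Sketch under stated assumptions: the CLAIMS list is maintained by hand while
drafting, pairing a path with the count the draft asserts for it. The entries
below are hypothetical examples.
"""
from pathlib import Path

CLAIMS = [
    ("scripts/build_feed.py", 188),  # draft claims this script is 188 lines
    ("scripts", 31),                 # draft claims this directory holds 31 scripts
]

def actual_count(path: Path) -> int:
    """Line count for a file, file count for a directory."""
    if path.is_dir():
        return sum(1 for p in path.iterdir() if p.is_file())
    return len(path.read_text().splitlines())

failures = 0
for rel, claimed in CLAIMS:
    real = actual_count(Path(rel))
    if real != claimed:
        print(f"MISMATCH {rel}: draft says {claimed}, repo says {real}")
        failures += 1
raise SystemExit(1 if failures else 0)
```

Wire something like it into the pre-publish pass and the 188-versus-187 class of error stops surviving to the published post.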
Gate every change
Start here on Monday: one test runner wired into CI on push and PR, one lint or type-check that fails the build, one pre-commit hook that re-runs both before any commit lands. Three gates, three friction points, every commit, no exceptions. That is enough to catch the loud failures, because generated code that is confident-but-wrong does not survive a runtime exception.
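A minimal sketch of the third gate, assuming pytest and ruff are the runner and linter the repo already uses; saved as .git/hooks/pre-commit and made executable, it re-runs both before any commit lands:

```python
#!/usr/bin/env python3
"""Pre-commit gate: block the commit if the test runner or the linter fails.

Sketch assuming pytest and ruff; substitute whatever the repo already runs in CI.
"""
import subprocess
import sys

GATES = [
    ["pytest", "-q"],        # gate 1: test runner
    ["ruff", "check", "."],  # gate 2: lint / static check
]

for cmd in GATES:
    if subprocess.run(cmd).returncode != 0:
        print(f"gate failed: {' '.join(cmd)} -- commit blocked", file=sys.stderr)
        sys.exit(1)
```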
The bigger gate set is a factoring problem, not a count problem. In the bdigital.media monorepo, each check answers a specific class of failure: the test suite catches logic regressions, the lint catches style drift, the humanizer catches voice drift, the SEO check catches title and meta regression, the image-validation pass catches broken refs, the security audit catches new vulnerable dependencies. The point is not the inventory; it is the 1-to-1 mapping between a class of past failure and a check that catches the next one.
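The factoring stays honest when the mapping is literal rather than mental: one table the gate runner iterates. A sketch, with hypothetical script names standing in for the real checks:

```python
import subprocess
import sys

# One class of past failure, one check that catches the next instance of it.
# The script names are hypothetical stand-ins for the monorepo's real checks.
CHECKS = {
    "logic regression":        ["pytest", "-q"],
    "style drift":             ["ruff", "check", "."],
    "voice drift":             ["python", "scripts/humanizer.py"],
    "title/meta regression":   ["python", "scripts/seo_check.py"],
    "broken image refs":       ["python", "scripts/validate_images.py"],
    "vulnerable dependencies": ["pip-audit"],
}

failed = [name for name, cmd in CHECKS.items() if subprocess.run(cmd).returncode != 0]
if failed:
    print("failure classes caught:", ", ".join(failed), file=sys.stderr)
    sys.exit(1)
```

Adding a gate means naming the failure class first; a check that does not map to one is noise.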
The cluster-analysis methodology in diagnosing-ai-review-false-positives shows how to turn AI review noise into trustworthy signal once you start logging false positives by category, and ai-code-review-fix-patterns catalogs the recurring fix patterns that emerge once review feedback is gated through real CI rather than human attention alone. Both posts assume the gate exists; this post is about why the gate is non-negotiable.
Break on the counter
Leave the keyboard on a mechanical trigger, not a feeling. Feelings drift in the direction of the reinforcement-schedule loop; mechanical triggers do not. The trigger has to be observable and binary, something you cannot rationalize past.
Mine is “every five commits, walk for ten minutes.” The commit count is a count, not a vibe, and it fires whether I want it to or not. creatine-cognitive-performance-developers covers the embodied health practices developers tend to skip when the work is going well; the relevant operational fact here is that affective withdrawal from a heavy session resolves quickly on re-engagement, which means the easiest thing to do at minute 90 is open the prompt again, and the easiest thing is the wrong thing.
Pick one mechanical trigger and one rule. Yours might be a Pomodoro, a token threshold, or a chime. The trigger does not have to be elegant; it has to be something you cannot negotiate with in the moment. Even with a rule I still drift, which is why the discipline only sticks once the trigger is automated: a post-commit hook that counts and a notification that fires at five whether I am in flow or not. Mechanical rule, mechanical enforcement. Without the stop, the loop keeps closing.
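The automation is small. A sketch of the post-commit hook, assuming macOS's osascript for the notification and an arbitrary counter-file location; save it as .git/hooks/post-commit and make it executable:

```python
#!/usr/bin/env python3
"""Post-commit hook: count commits and fire a break notification every fifth one.

Sketch assuming macOS (osascript); the counter file location is arbitrary.
"""
import subprocess
from pathlib import Path

COUNTER = Path(".git/break_counter")
EVERY = 5  # commits per break

count = (int(COUNTER.read_text()) if COUNTER.exists() else 0) + 1
if count >= EVERY:
    subprocess.run([
        "osascript", "-e",
        'display notification "Five commits. Walk for ten minutes." with title "Break"',
    ])
    count = 0
COUNTER.write_text(str(count))
```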
Teach the catch
There are 36 tech posts in the sites/tech/src/content/blog/ directory as of this morning, written across the last six months alongside everything else. Eleven training customers use the Claude setup material I built. Both numbers exist because publishing forces a level of rigor that private notes do not: the moment a claim is written for someone else, the off-by-one errors surface, the unsupported intuitions get caught, the things I thought I understood reveal where I do not.
exponential-ai-snowball-case-study documented the capability compounding from a single 16-hour Claude Code session that touched 11 files and produced both a published post and CI scaffolding; it is the same pattern this practice runs at week scale, there captured at session scale. Every artifact teaches the next one. The corollary, the one most engineers skip, is that the teaching itself is the friction. Writing the post forces the fact-check, the fact-check forces the citation, the citation forces the actual reading, and the actual reading is what builds the muscle that the AI-assisted group in Shen and Tamkin lost.
The discipline that makes the rest of it stick: pick one thing learned this week and write it up before you let yourself learn the next thing. Not for the audience, for the catch. The audience is the gate that forces the catch to be honest, which is what makes any of the four practices worth the time they cost.
Measuring the discipline
A discipline that argues for evidence over vibes has to produce its own evidence, or it is vibes wearing a methodology costume. Three metrics, logged weekly, n=1, written to a file the assistant cannot edit:
- AI suggestions accepted vs reverted at PR review. Tag commits with an ai: prefix in the message and count reverts against that tag. The ratio is the falsifiable form of “the gates are catching what they should.” A counting sketch follows this list.
- CI rejections of AI-authored diffs. Pull from GitHub Actions failure logs filtered by commit author. A concrete recent instance from my own log: on 2026-05-08, the AI-pattern-check gate rejected a prompt-engineering post for a banned-word violation that I had not noticed in review. The gate caught it; I did not. That is the kind of incident the metric is supposed to surface.
- Fact-check catches per published post. Count from the pre-publish review pass against drafts, categorized by failure mode (off-by-one numeric claims, stale doc URLs, anthropomorphism creep, fabricated quantitative claims). The category breakdown is the regression-eval input.
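For the first metric, the tally is a short walk over git history. A sketch, assuming the ai: subject-prefix convention above and git's default revert message, which quotes the original subject line:

```python
#!/usr/bin/env python3
"""Weekly tally for metric one: ai:-tagged commits versus reverts of them.

Assumes the ai: subject-prefix convention and git's default "Revert ..." message.
"""
import subprocess

def subjects(since="1 week ago"):
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

lines = subjects()
ai_commits = [s for s in lines if s.startswith("ai:")]
ai_reverts = [s for s in lines if s.startswith('Revert "ai:')]
print(f"ai-tagged commits this week: {len(ai_commits)}")
print(f"reverts of ai-tagged commits: {len(ai_reverts)}")
```

Append the two numbers to the weekly log file and the revert-rate trend falls out for free.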
The point of the log is not any single number. It is the existence of a feedback loop that would notice if the discipline stopped working. If the revert rate climbs over time, the gates need re-factoring. If the fact-check catches drop to zero, either I got rigorous or I stopped checking, and the second hypothesis is the one to assume by default.
What I want the tool to ship
The four practices above are workarounds for missing defaults. I would prefer not to need any of them. The platform-level versions:
- A citation mode at the request layer that refuses to emit unsourced factual claims and surfaces the source URL inline next to each claim. Available as a per-session toggle, not a per-prompt template.
- A commit-counter break trigger as a Claude Code primitive that fires a notification at a user-defined commit count, without me having to wire a post-commit hook to do it.
- An explain-before-generate mode for code suggestions that requires a one-sentence rationale before the diff, so the developer sees the reasoning before the artifact and stays in the evaluative posture rather than slipping into ratification.
None of these are speculative. They are the natural shape of what the cited studies measured, expressed as product affordances. If a tool vendor reads this and thinks “but the responsibility is on the user,” that is exactly the response this post is arguing against.