
The Control Group Refused. That's METR's Most Interesting Result.


The most important result from METR's developer productivity research isn't the 19% slowdown from their first study. It's that they couldn't run the second one cleanly because developers refused to participate in the control condition.

In February 2026, METR published a post explaining why they're changing their experimental design. The original methodology was rigorous: developers agreed to work on tasks randomly assigned to either "AI allowed" or "no AI" conditions, and METR measured completion time for each. The kind of controlled trial you'd run for a drug study. The first round, published in mid-2025, produced the surprising finding that experienced developers were 19% slower with AI than without.

The second cohort fell apart. Thirty to fifty percent of the developers METR recruited declined to participate if it meant working without AI for half the study. Among those who agreed, one participant described a subtler problem: when the randomizer assigned a task to the "no AI" condition, they'd evaluate whether they could complete it without AI in a reasonable time. If the answer was no, they'd delay it and find a different task. The "no AI" bucket filled up with tasks where AI wouldn't have helped much anyway.

You can't run a placebo trial if the patients decide which arm they end up in.

What Refusing the Control Condition Actually Signals

The obvious reading is a methodology problem. The sample is biased; the experiment is compromised; METR needs a different design. All of that is true.

But the more interesting reading is that the breakdown is itself a finding.

When 30-50% of developers won't work without AI even for a paid research study, that tells you something about what AI has become in their workflow. Not an optional speedup tool. Not a shortcut they lean on for boilerplate. Load-bearing infrastructure that makes certain tasks feel impossible to attempt without it.

Tools that cross this threshold follow a recognizable pattern. IDEs crossed it at some point in the 1990s. Version control crossed it. GitHub crossed it more recently. At a certain adoption depth, "without the tool" stops being a meaningful experimental condition because no competent practitioner willingly works that way, and the small fraction who do are self-selected as people who haven't integrated it yet.

METR hit that threshold with AI coding tools in 2026. Not universally — 30-50% refusal is not 100%, and it varied by experience level and codebase maturity. But the direction is unambiguous.

The Cherry-Picking Problem Is Worse

The developers who agreed to participate in no-AI conditions didn't give METR clean data either. The mechanism that participant described is worth restating: when a task was assigned to the no-AI bucket, they'd check whether they could complete it without AI in a time that didn't feel painful. If the answer was no, they'd find a different issue to work on.

This introduces a bias in exactly the wrong direction. The hardest tasks — the ones where AI dependency is highest, where the gap between "2 hours with AI" and "20 hours without" is widest — get systematically removed from the no-AI sample. What remains is a set of tasks where skilled developers lose relatively little by skipping AI assistance.

When you compare "hard tasks with AI" against "easy tasks without AI," you get a number that understates AI's actual impact. METR's second cohort showed only a 4% slowdown, down from the first study's 19% and within the confidence interval for zero effect. Given the cherry-picking dynamic, even that number is almost certainly conservative: the tasks that most justify AI use aren't showing up in the no-AI column.
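
To see which way this bias pushes, here's a toy simulation of the selection dynamic. Every number in it is invented: the task-difficulty distribution, the assumption that AI helps more on harder tasks, the four-hour pain threshold. The point is the direction of the distortion, not its size.

```python
import random
import statistics

random.seed(0)

# Invented task pool: each task has an hours-without-AI cost, and AI
# is assumed to trim more time from harder tasks (up to 80%).
tasks = []
for _ in range(10_000):
    no_ai = random.lognormvariate(1.5, 1.0)       # hours without AI
    benefit = min(0.8, no_ai / 25)                # AI helps more when it's hard
    tasks.append((no_ai, no_ai * (1 - benefit)))  # (without-AI, with-AI)

# Ground truth: average per-task time ratio if we could observe both arms.
true_ratio = statistics.mean(with_ai / no_ai for no_ai, with_ai in tasks)

# The cherry-picked experiment: 50/50 randomization, but a "no AI"
# assignment quietly gets skipped whenever the task would take more
# than ~4 hours without AI (the "doesn't feel painful" filter).
ai_times, no_ai_times = [], []
for no_ai, with_ai in tasks:
    if random.random() < 0.5:
        ai_times.append(with_ai)
    elif no_ai <= 4:
        no_ai_times.append(no_ai)
    # else: the task quietly leaves the study

print(f"true with-AI / without-AI time ratio: {true_ratio:.2f}")
measured = statistics.mean(ai_times) / statistics.mean(no_ai_times)
print(f"measured ratio after cherry-picking:  {measured:.2f}")
```

With these made-up inputs, AI genuinely speeds up the average task, yet the cherry-picked comparison reports it as a slowdown. The filter does the work, not the tool.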

The Measurement Problem We're Left With

METR's problem is everyone's problem. There are two main ways to assess AI's productivity impact, and both have collapsed.

Self-reported data is cheap and fast and demonstrably wrong. The first METR study made this precise: developers felt 20% faster while taking 19% longer. Surveys, polls, "did AI help you this week?" questions — these measure the feeling of productivity, not productivity itself. The feeling is consistently positive and consistently unreliable.

Controlled experiments are rigorous but increasingly infeasible. METR discovered this firsthand. You can't randomize developers to no-AI conditions when they've integrated AI as working infrastructure. The control condition has become unethical to impose and practically impossible to maintain with willing participants.

What's left is observational data. Tracking real workflows as they actually occur, without intervention, and watching how productivity signals correlate with AI usage patterns over time. This is less precise than an RCT — you can't isolate variables the same way, confounders exist, individual variation is high. But it's what you do when the control condition stops being viable.
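
Here's a minimal sketch of what that looks like in practice, assuming you've already exported weekly aggregates from a time tracker and a Git host's API. The field names and numbers are hypothetical, and the Pearson correlation it computes is exactly the confounded kind the previous paragraph warns about: a starting signal, not a verdict.

```python
import statistics

# Hypothetical weekly observations: fraction of coding time spent in
# AI-assisted sessions, and merged PRs that week. Placeholder values.
weeks = [
    {"ai_share": 0.10, "merged_prs": 4},
    {"ai_share": 0.35, "merged_prs": 6},
    {"ai_share": 0.50, "merged_prs": 5},
    {"ai_share": 0.70, "merged_prs": 9},
    {"ai_share": 0.20, "merged_prs": 3},
    {"ai_share": 0.65, "merged_prs": 8},
]

x = [w["ai_share"] for w in weeks]
y = [w["merged_prs"] for w in weeks]

# Pearson correlation (statistics.correlation requires Python 3.10+).
# Correlation is not causation: meeting load, task mix, and codebase
# familiarity all confound a series like this.
r = statistics.correlation(x, y)
print(f"AI-usage share vs merged PRs: r = {r:.2f}")
```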

This is not unprecedented. Pharmacoepidemiology moved from RCTs to observational methods for chronic disease treatments when widespread adoption made it unethical to withhold standard-of-care therapy from a control group. The statistical methods for handling observational data exist. The field adapted. The same adaptation is starting to happen in developer productivity research.

The Experiment You're Already Running

There's one place where the measurement problem nearly solves itself: your own historical data.

You don't need a control group. You have a before-and-after. You started using Cursor heavily eight months ago, or switched from inline completions to agentic tools, or started a new project on an unfamiliar stack where AI assistance lands differently. You have weeks where your AI workflow was running well and weeks where you were in meetings all day or debugging something the models kept hallucinating about.

If you track your time automatically and correlate it with output signals — commits, merged PRs, closed issues, PR cycle time — you have a continuous natural experiment with a sample size of one. That one is you, which is the sample that actually governs your tooling decisions.
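
A minimal sketch of that natural experiment, assuming you can export weekly commit counts around the date you adopted a new tool. The counts below are placeholders for your own data, and the permutation test is there because one person's weeks are noisy: a before-and-after gap can easily be luck.

```python
import random
import statistics

# Hypothetical weekly commit counts: the weeks before adopting an
# agentic tool, and the weeks after. Substitute your tracker's export.
before = [11, 9, 14, 10, 8, 12, 13, 9]
after  = [15, 12, 17, 11, 16, 14, 18, 13]

observed = statistics.mean(after) - statistics.mean(before)

# Permutation test: shuffle the before/after labels and count how
# often a gap at least this large appears by chance.
pooled = before + after
random.seed(0)
trials, hits = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    fake = (statistics.mean(pooled[len(before):])
            - statistics.mean(pooled[:len(before)]))
    if abs(fake) >= abs(observed):
        hits += 1

print(f"mean weekly commits: {statistics.mean(before):.1f} -> {statistics.mean(after):.1f}")
print(f"permutation p-value: {hits / trials:.3f}")
```

Commits are a crude proxy, and a project's phase changes over months regardless of tooling; treat the output as one input to a judgment, not a measurement of you.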

The caveat is the same one METR documented: self-assessment is systematically biased in the positive direction. Your gut says AI is helping. The data will either confirm it or show you something more interesting. The developers in METR's study were experienced engineers using current tools on real work, and they were wrong about the direction of the effect, off by 39 percentage points: they felt 20% faster while being 19% slower.

METR's researchers applied rigor and were surprised by the result. The developers they studied trusted their intuitions and were wrong. Both experiences are instructive: the rigor was necessary, and the intuitions alone weren't sufficient.

What the Abandoned RCT Tells Us About Where We Are

There's a version of this story where METR's design problem is a temporary inconvenience — better study design, better participant selection, and the numbers will come back. Maybe. METR is exploring cohort designs where participants either use AI for all tasks or none, which avoids the task-level cherry-picking.
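
The design change is small enough to state in code. This is a sketch of my reading of the two designs, not METR's published protocol, and the function names are mine.

```python
import random

def task_level_assignment(task_ids):
    # Original within-subject design: every task gets its own coin
    # flip. A developer who draws "no AI" on a hard task can defer it
    # and pick up something easier, so selection happens after
    # randomization.
    return {t: random.choice(["ai", "no_ai"]) for t in task_ids}

def cohort_assignment(developer_ids):
    # The cohort design METR describes: one coin flip per developer
    # for the whole study. No per-task choice is left to game, at the
    # cost of noisier between-person comparisons.
    return {d: random.choice(["ai", "no_ai"]) for d in developer_ids}
```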

But I think the February 2026 update is pointing at something more durable. We are past the point where "AI as optional productivity tool" is a meaningful frame for experienced developers. It's not optional for a significant fraction of them. The refusal rate is the signal.

This means that aggregate RCT numbers — "AI makes developers X% faster or slower" — are going to get harder to produce and easier to challenge. The population of developers willing to participate in "no AI" conditions is shrinking and increasingly unrepresentative of the developers we care most about measuring: the ones who've integrated these tools deeply into their practice.

The measurement gap doesn't go away. It gets filled by observational data at the population level and by personal tracking at the individual level. Neither is as clean as a controlled trial. Both are more honest about the actual state of developer workflows in 2026.

METR couldn't get a control group. That tells you more about where AI sits in the development stack than any speedup percentage they could have published.

Written by Kevin — builder of xeve
