Developers Feel 3x Faster. They Estimate 1.4x More Valuable.

May 31, 2026 · 6 min read

METR's May 2026 survey asked 349 technical workers how much AI has changed their productivity. The median self-reported speed gain was 3x. The median self-reported value gain was 1.4-2x. And the group that gave the lowest estimates of all — the most skeptical about their own performance — was METR's own staff.

METR's explanation, offered directly in the paper: their employees are probably familiar with METR's previous findings on the gaps between perceived and actual AI-driven productivity. They've seen the controlled trials. They know that developers in those trials felt 20% faster while measuring 19% slower. So when METR researchers ask themselves "how much more valuable is my work with AI?", they apply a discount that nearly everyone else skips.

That's not modesty. It's calibration.

Speed and Value Are Different Measurements

The May 2026 survey is METR's most methodologically careful self-report study to date. One thing they deliberately built in: separate questions for speed and value.

Speed means time-to-completion: how fast tasks feel and how quickly you generate output. Value means holistic contribution: the actual impact of the work you shipped, independent of how quickly you typed it.

The difference is stark. Median self-reported speed: 3x. Median self-reported value: 1.4-2x (depending on which of three complementary value questions you use). Developers feel like they're moving three times faster. They believe they're producing at most about twice the impact. Put another way, the speed number is 50% higher than the 2x value answer and roughly 114% higher than the 1.4x one. That's the gap between what you feel and what you think you're contributing, and both numbers are self-reports from the same people in the same session.
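For anyone who wants to check that gap, here's a small arithmetic sketch using only the medians quoted above. The function name `relative_gap` is my own label for illustration, not something from the METR survey.

```python
def relative_gap(speed: float, value: float) -> float:
    """Percent by which the speed multiplier exceeds the value multiplier."""
    return (speed / value - 1) * 100


SPEED = 3.0                       # median self-reported speed gain
VALUE_LOW, VALUE_HIGH = 1.4, 2.0  # range of median self-reported value gain

gap_vs_high = relative_gap(SPEED, VALUE_HIGH)  # speed vs. the 2x value answer
gap_vs_low = relative_gap(SPEED, VALUE_LOW)    # speed vs. the 1.4x value answer

print(f"Speed exceeds value by {gap_vs_high:.0f}% to {gap_vs_low:.0f}%")
```

Which value question you pick moves the gap a lot, which is one more reason to treat the speed number as the outlier.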

Why is speed so much higher than value? Speed is synchronous feedback. When Claude generates a function in four seconds that would have taken you twenty minutes to write, you notice that. The generation is visible. The time saved registers in real time and gets stored as evidence.

Value is asynchronous and hard to see. Did that function hold up in production? Did it need rework two weeks later? Did it solve the right problem, or did it address the thing you asked for while missing the thing you needed? Speed answers in the moment. Value answers weeks later, often attributed to something else entirely by the time it arrives.

You're not wrong that AI is fast. You're measuring the only thing with an immediate signal.

Why the Most Research-Aware Developers Claim the Least

Here's the interesting implication of the METR employee finding. The survey covered software engineers, researchers, academics, PhD students, founders, and managers. Startup employees gave some of the highest estimates. METR staff gave the lowest.

The organization that designed and ran the controlled trial, the one that showed developers were 19% slower while believing they were 20% faster, is now home to the people reporting the smallest gains in a self-report study. The likely reason isn't modesty or pessimism. It's that they've seen what the gap between speed-feeling and actual output looks like in hard data, and it changed how they interpret their own intuitions.

METR's hypothesis is stated precisely: familiarity with evidence about the perception-versus-reality gap causes downward calibration of self-reports. When you know that experienced developers systematically overstated AI's impact by 39 percentage points in controlled conditions, you apply a mental correction factor to your own felt sense of productivity. The 3x speed feeling doesn't go away. You just trust it less.
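The 39-point figure is just the distance between the two trial numbers quoted earlier. A minimal sketch, with `perception_gap_points` as a hypothetical helper name:

```python
def perception_gap_points(perceived_pct: float, measured_pct: float) -> float:
    """Gap in percentage points between felt and measured change in speed."""
    return perceived_pct - measured_pct


# Figures from the controlled trial as cited in this post:
# developers felt 20% faster and measured 19% slower.
gap = perception_gap_points(perceived_pct=20, measured_pct=-19)
print(f"Overstatement: {gap} percentage points")
```

Note that this is a difference in percentage points, not a 39% relative error.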

That's exactly what rational updating should look like. The problem is that almost nobody outside of AI safety research orgs has internalized the evidence at the level that shifts their behavior. Startup founders who are heavy AI users and haven't read a 40-page controlled trial give higher estimates. Academics and researchers, who tend to be more comfortable with the idea that self-assessment is unreliable, cluster closer to the middle. METR employees cluster lowest.

The pattern tells you something: the more you understand about the reliability of productivity intuitions, the less you trust yours.

The Forecast Problem

One more number from the May survey worth sitting with: respondents project AI will provide 2.5x value by March 2027. That's up from 2x in March 2026 and 1.3x in retrospective estimates for March 2025.

The trend line is consistent, and part of it makes sense — models are improving, tooling is maturing, and developers are getting better at using both. But it also follows a pattern visible since at least 2023: AI productivity forecasts consistently outpace measured outcomes. Every year, developers expect next year to be dramatically better, then report moderate gains in the present.

METR employees presumably gave lower 2027 projections too, for the same reasons they gave lower current estimates. The mechanism is the same: if you're extrapolating from speed feelings (high) rather than measured value (lower), your forecasts will skew optimistic in ways that controlled data later contradicts.

None of this means the 1.4x value number is fake. A genuine 40% increase in contribution per unit of time is real and meaningful. It's worth having. The problem isn't the 1.4x. It's using the 3x to make decisions — about tooling investment, hiring, workload, what you commit to shipping — when the underlying reality is 1.4x.
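Here's what that decision error looks like as arithmetic. This is a sketch under loud assumptions: the 20-point baseline is a made-up number, and treating the value multiplier as a straight throughput multiplier is my simplification. Only the 3x and 1.4x figures come from the survey.

```python
SPEED_FEELING = 3.0   # median self-reported speed gain
VALUE_ESTIMATE = 1.4  # low end of the median self-reported value gain


def planning_shortfall(baseline: float, assumed: float, realized: float) -> dict:
    """Compare what you commit to under `assumed` with what `realized` delivers."""
    committed = baseline * assumed
    delivered = baseline * realized
    return {
        "committed": committed,
        "delivered": delivered,
        "delivered_share": delivered / committed,
    }


# Hypothetical team that shipped 20 points per sprint before AI tooling.
plan = planning_shortfall(baseline=20, assumed=SPEED_FEELING, realized=VALUE_ESTIMATE)
print(
    f"committed {plan['committed']:.0f}, "
    f"delivered {plan['delivered']:.0f}, "
    f"share {plan['delivered_share']:.0%}"
)
```

Plan on the feeling and you deliver a bit under half of what you promised, even though the team really is more productive than before.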

Getting Data That Doesn't Lie to You the Same Way

The practical question is: how do you get signal on your own AI productivity that's harder to self-deceive with than "does this feel fast?"

The answer is to track the downstream, not the immediate. Commit counts and lines-generated are speed proxies — the same category as the 3x self-report. They measure generation, not outcome. The signals that are harder to fool systematically:

PR cycle time. If you're generating code faster but PRs are sitting in review longer, the speed gain isn't converting into delivery velocity. That gap is information.

Post-merge stability. How often does AI-assisted code require follow-up fixes within two weeks of merge? Speed to write and stability after ship are different axes, and AI-generated code may fail in different ways than code that was written and reviewed slowly.

Useful work window. If a morning of running parallel agents leaves you too drained for deep work in the afternoon, your daily peak throughput is not your daily useful output. The full-day number is what actually governs what you ship in a sprint.
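The first two signals are easy to compute once you have timestamps. Below is a minimal sketch, not a finished tool: the `PullRequest` record shape and its field names are hypothetical, the sample data is invented, and you would fill the records from whatever your Git host exports. The two-week window matches the one suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    # Hypothetical record shape, not tied to any particular Git host's API.
    opened_at: datetime
    merged_at: datetime
    first_fix_at: Optional[datetime] = None  # first follow-up fix, if any


def median_cycle_hours(prs: list[PullRequest]) -> float:
    """Median hours from PR opened to PR merged."""
    return median(
        (pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs
    )


def followup_fix_rate(
    prs: list[PullRequest], window: timedelta = timedelta(days=14)
) -> float:
    """Share of merged PRs that needed a fix within `window` of merging."""
    needed_fix = [
        pr
        for pr in prs
        if pr.first_fix_at is not None and pr.first_fix_at - pr.merged_at <= window
    ]
    return len(needed_fix) / len(prs)


# Invented sample data for illustration only.
prs = [
    # Merged in 6 hours, never needed a fix.
    PullRequest(datetime(2026, 5, 4, 9), datetime(2026, 5, 4, 15)),
    # Merged in 24 hours, fixed 4 days after merge (inside the window).
    PullRequest(
        datetime(2026, 5, 5, 9), datetime(2026, 5, 6, 9), datetime(2026, 5, 10, 9)
    ),
    # Merged in 48 hours, fixed 24 days after merge (outside the window).
    PullRequest(
        datetime(2026, 5, 6, 9), datetime(2026, 5, 8, 9), datetime(2026, 6, 1, 9)
    ),
]

print(f"median cycle time: {median_cycle_hours(prs):.1f} h")
print(f"follow-up fix rate: {followup_fix_rate(prs):.0%}")
```

Run it separately over your AI-heavy weeks and your lighter ones and compare. Useful work window is harder to automate, so a plain daily note on when you stopped doing deep work is probably enough.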

None of those metrics are perfect either. But they're harder to systematically overestimate in the same direction as speed feelings. They push back when the intuition is wrong. The METR employees aren't running personal RCTs on themselves — they're just applying more skepticism to a data category they have good reason to distrust.

The Calibration Available to Everyone

There's a version of the METR employee move that's available to any developer willing to track the right signals. You won't get controlled-trial precision. But you can watch whether your AI-heavy weeks produce code that merges faster or slower, holds longer or shorter, and actually moves the project forward — or just moves your commit count.

The 3x speed feeling is real in the sense that generation is visibly faster. The 1.4x value estimate is real in the sense that experienced, self-aware people are assigning it. The gap between them is where most AI productivity decisions are currently being made on incomplete information.

The most research-aware developers in this space are also the most skeptical of their own productivity intuitions. That's a meaningful prior. The question is whether you're running on intuition that hasn't been stress-tested, or on data that can push back.

Written by Kevin — builder of xeve
