
The Metric AI Productivity Research Keeps Ignoring


Two years of IDE telemetry, 800 developers, 151.9 million logged events. The JetBrains Human-AI Experience team presented findings at ICSE 2026 in April, and the number that got the least coverage is the most important one: AI users delete code at 13 times the rate of developers who don't use AI.

Not "produce more code and delete proportionally more." The deletion rate grew nearly twice as fast as the generation rate. AI users typed 7.8 times more characters per month than non-users. They deleted at 13 times the rate. The gap between those two multipliers is where the hidden work lives.

What Deletion Actually Means

When a developer deletes code, they are not undoing an error. They are completing a review cycle: write, read, evaluate, reject. The generation phase — the AI producing output — looks fast. The deletion phase is what it costs to realize that output was wrong, incomplete, or subtly misaligned with the actual codebase.

That cycle is cognitive work. It requires loading the generated code into working memory, checking it against constraints the model didn't fully understand, mentally simulating edge cases, and making a judgment call. Doing this at 13 times the baseline rate means a developer is running that review cycle continuously — not as a break from the "real" work, but as the dominant activity.

This is the write-review-discard loop. It doesn't show up in commit history. It doesn't show up in pull request counts. It doesn't show up in token spend. It shows up in deletion rates, which almost nobody tracks.

The In-Flow Contradiction

Every major AI coding tool markets itself around one idea: it keeps you in flow. Stay in your editor. No more switching to documentation. No more context breaks. The assistant comes to you.

The JetBrains telemetry measured this claim directly. IDE window activations increased by 6.4 per month for AI users. For non-users over the same period, activations fell by 7.6 per month. That is a 14-switch monthly divergence, moving in the opposite direction from what the marketing promised.

The most plausible explanation is straightforward: AI assistance generates more review tasks inside the IDE, not fewer external ones. A suggestion arrives, gets evaluated, raises a question, leads to a clarifying follow-up prompt, and generates another suggestion. The interaction surface area grew, and context switching grew with it.

Fifty percent of developers in the JetBrains survey reported no change in their context switching behavior. The logs tell a different story. This is the same pattern METR documented in their speedup study: the perception doesn't update even when the evidence is right in front of you. Developers using AI tools report that they're staying more focused. The behavioral data says their IDE is becoming more active, not less.

The 56% Paradox

The sharpest contradiction in the dataset is this: 56.5% of developers surveyed said their coding time had decreased with AI assistance. Telemetry showed them typing 7.8 times more characters per month than their non-AI counterparts.

Both things can be true if "coding time" in the survey means something different from what the logs measured. Self-reported coding time is how long coding felt. Logged character output is how much the fingers actually moved. AI assistance can make development feel lighter — less mental searching, less blank page anxiety — while the actual behavioral throughput increases substantially.

What the 56.5% number captures is the subjective experience of reduced friction. The assistance is real. But reduced friction in the moment is not the same as reduced effort overall. The deletion data shows the effort didn't disappear. It moved. It became review-and-discard instead of search-and-recall. The cognitive mode shifted; the cognitive load didn't.

This is what the JetBrains researchers mean when they describe AI as redistributing and reshaping work rather than reducing it. The work went somewhere. The logs show where.

What Standard Metrics Miss

Productivity research in the AI coding era has converged on a handful of proxies: commit velocity, PR throughput, ticket closure rate, token consumption. These are all counts of things that happened. They share a common blind spot: they measure accepted output, not rejected output.

Jellyfish published data in Q1 2026 showing that developers with the highest token budgets produced twice the PR volume at ten times the token cost, which works out to five times the token spend per PR. The deletion-rate finding makes that result easier to interpret: high token consumption means more generation, which means more review cycles, which means more deletion. The volume is real. The throughput gain is real. But the denominator, total active developer time including the write-review-discard cycles, is not in the calculation.

Imagine two developers working the same hours. One uses AI heavily, generates substantial output, deletes 13 times as much as their baseline, and commits the remainder. The other writes more slowly but deletes less. Both produce the same final commit. Standard metrics see two equal outputs. The first developer ran significantly more cognitive cycles to get there.

The per-hour efficiency question — how much durable output per unit of actual developer effort — requires knowing both what shipped and how much invisible work preceded it. Nobody has that second number unless they're tracking at the system level across the full session, not just the committed artifacts.
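To make that concrete, here is a minimal sketch of the two-developer example above, in Python. The Session fields are hypothetical stand-ins for whatever a system-level tracker actually records; the point is that the per-hour number alone cannot distinguish the two workflows, while a yield number that accounts for discarded work can.

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Hypothetical fields, not any real tracker's schema.
    active_hours: float    # editor + terminal + AI-chat time
    chars_written: int     # typed plus accepted AI output
    chars_deleted: int     # removed in write-review-discard cycles
    chars_committed: int   # what survived into the final commit

def per_hour(s: Session) -> float:
    """What standard metrics approximate: shipped output over time."""
    return s.chars_committed / s.active_hours

def yield_rate(s: Session) -> float:
    """What they miss: the share of written output that survived."""
    return s.chars_committed / s.chars_written

ai_heavy = Session(8.0, 40_000, 31_000, 9_000)
manual   = Session(8.0, 12_000,  3_000, 9_000)

print(per_hour(ai_heavy), per_hour(manual))      # 1125.0 1125.0 -- identical
print(yield_rate(ai_heavy), yield_rate(manual))  # 0.225 0.75 -- very different
```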

The Work That Needs Measuring

The deletion-rate finding suggests three things worth tracking if you're running AI coding tools seriously.

Active time in AI chat and agent interfaces, separate from editor time. The review-and-discard loop happens partly in the editor and partly in prompt sessions. Most coding time trackers count editor activity. They miss the time spent in back-and-forth with the model before useful output arrives.
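As a sketch of that split, assume a system-level tracker that emits foreground-app events as (app identifier, active seconds) pairs. The app names and the AI_APPS set below are illustrative assumptions, not any real tool's schema:

```python
from collections import defaultdict

# Hypothetical foreground-app events: (app identifier, active seconds).
events = [
    ("ide", 1800),
    ("ai_chat", 600),
    ("browser", 300),
    ("ai_chat", 420),
    ("agent_cli", 240),
]

# Which apps count as AI interfaces is a local judgment call.
AI_APPS = {"ai_chat", "agent_cli"}

totals: dict[str, int] = defaultdict(int)
for app, seconds in events:
    bucket = "ai_interface" if app in AI_APPS else "other"
    totals[bucket] += seconds

for bucket, seconds in sorted(totals.items()):
    print(f"{bucket}: {seconds / 60:.1f} min")
# ai_interface: 21.0 min
# other: 35.0 min
```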

Deletion volume relative to commit volume. If your IDE or system-level tracker logs keystrokes or file events, the ratio of what gets written to what gets deleted is a meaningful efficiency signal. A developer whose deletion rate is spiking isn't less capable — they may be in a heavy AI-assisted workflow where the review overhead has grown faster than the output. The number tells you when to look more carefully.
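A minimal sketch of that ratio, assuming daily character counters from whatever tracker you have (the field names are made up). The signal is the trend: a sustained jump is the cue to look closer.

```python
# Hypothetical daily counters from an IDE or system-level tracker.
days = [
    {"written": 18_000, "deleted": 4_000},   # pre-AI baseline
    {"written": 42_000, "deleted": 29_000},  # heavy AI-assisted day
]

for d in days:
    print(f"deleted per char written: {d['deleted'] / d['written']:.2f}")
# 0.22 -- baseline
# 0.69 -- review overhead growing faster than output
```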

Post-commit revision rate within 14 days. The JetBrains study measured deletion within sessions. The downstream equivalent is code that passes review and then gets revised. Jellyfish found real-world acceptance rates of 10-30% when measured two weeks out, versus nominal rates of 80-90%. Combining that with session-level deletion data gives you a more complete picture of the actual cost of AI-generated output.
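There is no standard way to compute this. One rough approximation with plain git, sketched below, counts later commits that touched the same files within 14 days. It is a file-level proxy under a loud assumption (line-level churn would need git blame), not how Jellyfish or JetBrains measured anything:

```python
import subprocess
from datetime import datetime, timedelta, timezone

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def revisions_within(commit: str, days: int = 14) -> int:
    """File-level proxy: count later commits touching the same
    files as `commit` within `days` of it landing."""
    ts = int(git("show", "-s", "--format=%ct", commit).strip())
    start = datetime.fromtimestamp(ts, tz=timezone.utc)
    end = start + timedelta(days=days)
    files = [f for f in git("show", "--name-only", "--format=", commit).splitlines() if f]
    if not files:
        return 0
    later = git("log", f"--since={start.isoformat()}",
                f"--until={end.isoformat()}", "--format=%H", "--", *files)
    return sum(1 for h in later.splitlines() if h and h != commit)

# Example: revisions_within("HEAD~20") on a repo with 14 days of history.
```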

We track app sessions and coding activity at the system level with xeve partly because editor-based metrics miss the full session. The time between opening a terminal and committing a change includes browsing, agent interaction, review, and the deletion work the JetBrains study quantified. That full window is what you need to compute genuine output-per-hour ratios.

What the Logs Know That You Don't

The JetBrains study's most important methodological contribution is that it used logs instead of surveys. Surveys produce the consensus view: AI is making me faster, I'm staying more focused, my coding time has decreased. Logs produce the behavioral view: you deleted 13 times more, you switched context more often, you typed more than you ever have and discarded more of it than ever.

These are not contradictory. The consensus view is a real experience. The friction reduction is real. The feeling of working faster is real. What it misses is the deletion rate — the quiet background tax that shows up in the logs and nowhere else.

The gap between what developers perceive and what the telemetry records has now been documented in two years of continuous data. That is long enough to be confident this is not an adjustment period. The write-review-discard loop is the new workflow. The question is whether you are measuring it.

Most teams aren't. The deletion rate is the metric everyone is generating and nobody is looking at.

Written by Kevin — builder of xeve
