
Two-Thirds of AI Code Never Reaches Production

The LinearB 2026 Software Engineering Benchmarks report dropped this month with a finding that should reframe how teams measure AI adoption: AI-assisted pull requests merge at 32.7%. Human-authored PRs merge at 84.4%.

Two-thirds of AI-generated code never reaches production.

LinearB analyzed 8.1 million pull requests from 4,800 engineering teams across 42 countries. That sample is large enough that this isn't noise from a few teams doing something unusual. It's a consistent pattern in how reviewer behavior and code quality interact when AI throughput scales up without a corresponding change in the review process.

The Queue Behavior Is Revealing

Before a PR gets reviewed, it has to get picked up. AI PRs wait 4.6x longer before review begins. Once a reviewer starts, they move through AI code 2x faster than human code.

The sequence: long wait, fast scan, high rejection.

The 4.6x longer wait probably isn't a conscious decision. Reviewers aren't opening their queue and flagging AI PRs to deprioritize. But something is happening. Teams generating more AI PRs are also stretching their reviewers thinner — more volume, same headcount — so lower-stakes items drift. And reviewers may have learned, implicitly, that AI PRs have a higher fail rate and are putting off the work that more often turns into a conversation.

The 2x faster review seems like a win. But if reviewers are moving quickly to say no, fast review is efficient rejection, not a positive outcome. The speed goes hand in hand with the low acceptance rate: reviewers have gotten good at triage for AI-generated code.

The Arithmetic of Throughput

If you need N production-reaching merges and your human PRs merge at 84.4%, you need about 1.18 human PRs per merge (1 / 0.844). At 32.7%, you need about 3.06 AI PRs per merge (1 / 0.327). You're generating roughly 2.6x as many PRs to reach the same production output.
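The arithmetic is easy to check yourself. Here is a minimal sketch in Python; the function name `prs_per_merge` is made up for illustration, and the only inputs are the two merge rates quoted from the LinearB report.

```python
def prs_per_merge(merge_rate: float) -> float:
    """Expected number of PRs opened for each PR that merges."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return 1 / merge_rate


# Merge rates quoted above from the LinearB report.
human = prs_per_merge(0.844)
ai = prs_per_merge(0.327)
ratio = ai / human  # how many more PRs the AI path needs per merge

print(f"human: {human:.2f}, ai: {ai:.2f}, ratio: {ratio:.2f}")
# prints "human: 1.18, ai: 3.06, ratio: 2.58"
```

The ratio is simply 0.844 / 0.327, so it moves with either rate. Plugging in your own team's numbers is a one-line change.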

Now layer in the review time cost. The LinearB data found that teams generating 25-35% more code with AI experience 91% longer PR review times. The extra volume isn't neutral overhead that scales linearly — it degrades the review process for everything in the queue. Your reviewers are spending more time reading code that mostly doesn't ship, which leaves less time for the PRs that do.

This is a system operating under more stress while reporting improved velocity. The metric doing the reporting is PR creation rate. The metric that would tell a different story is merge rate per reviewer-hour.

Teams that are honest with themselves about what AI is doing to their shipping cadence should track both numbers. Most don't.
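To make the two numbers concrete, here is a rough sketch of tracking them side by side. Everything in it is an assumption for illustration: the `WeekStats` fields, the function name, and especially the sample values, which are placeholders and not data from LinearB or any real team.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    """One week of PR activity for a team (hypothetical schema)."""
    prs_opened: int
    prs_merged: int
    reviewer_hours: float


def merges_per_reviewer_hour(week: WeekStats) -> float:
    """Merged PRs divided by the hours reviewers spent that week."""
    if week.reviewer_hours <= 0:
        raise ValueError("reviewer_hours must be positive")
    return week.prs_merged / week.reviewer_hours


# Placeholder values chosen only to show the two metrics diverging.
before = WeekStats(prs_opened=40, prs_merged=34, reviewer_hours=20.0)
after = WeekStats(prs_opened=70, prs_merged=36, reviewer_hours=38.0)

for label, week in (("before", before), ("after", after)):
    print(
        f"{label}: opened={week.prs_opened}, "
        f"merges/reviewer-hour={merges_per_reviewer_hour(week):.2f}"
    )
```

In this made-up example, PR creation rises from 40 to 70 while merges per reviewer-hour falls from 1.70 to about 0.95. A dashboard that shows only the first number would report a win.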

What Happens After Merge

The story doesn't end at the review gate. GitClear analyzed 211 million lines of code and found that code churn — code that gets significantly altered or discarded within two weeks of being written — rose from a pre-AI baseline of 3.3% in 2021 to 5.7-7.1% by 2025. For teams with the highest AI adoption, churn tracks toward the high end.

This is a different measurement from LinearB's acceptance rate, but it's measuring the same phenomenon from the post-merge side. Even the code that survives review has a meaningfully higher probability of being rewritten before it becomes stable.

The math compounds: AI PRs merge at 32.7% (about two-thirds never clear the gate), and the code that does merge churns at roughly twice the pre-AI rate (3.3% in 2021 versus 5.7-7.1% by 2025). By the two-week mark, a larger share of what you shipped is already being revised, and that effort comes out of the time available for new capability.
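The "twice the rate" figure can be checked against the churn numbers quoted above. This is a small sketch, not GitClear's method; the variable names are invented here, and the inputs are just the 3.3% baseline and the 5.7-7.1% range.

```python
# Churn figures quoted above from GitClear (share of code altered or
# discarded within two weeks of being written).
baseline_2021 = 0.033
churn_2025_low = 0.057
churn_2025_high = 0.071

low_multiple = churn_2025_low / baseline_2021
high_multiple = churn_2025_high / baseline_2021

print(f"churn multiple vs 2021 baseline: {low_multiple:.2f}x to {high_multiple:.2f}x")
# prints "churn multiple vs 2021 baseline: 1.73x to 2.15x"
```

So "twice" is the upper end of the range, which is where the highest-adoption teams are reported to sit.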

GitClear's data also flagged an 8x increase in code duplication correlated with AI adoption, and a 39.9% drop in refactoring. AI tools favor copy-paste over reuse. They generate code that passes a surface read and reveals structural problems the further in you look. About 66% of developers report that AI outputs are "almost correct but still flawed" — close enough to merge if you're moving fast, broken enough to require a second pass.

What Reviewers Are Actually Encountering

The 32.7% acceptance rate is the number that results from reviewers reading AI-generated code and deciding, repeatedly, that it doesn't meet the bar. Not what they expected. What they found.

Reviewers processing AI PRs have developed a new informal skill: quickly identifying which AI-generated code has fundamental problems versus which is actually clean. They're doing this triage unconsciously, across thousands of PRs, with no tooling built for it. The 2x review speed is evidence of this skill. It's also evidence that the skill is mostly being used to identify and reject bad code rather than to ship more code.

This isn't a criticism of AI tools. It's a description of where they are in the development cycle. The tools got fast at generating output before the ecosystem got good at filtering output. Teams that treat high PR volume as the productivity signal haven't reckoned with the filter.

The Metrics That Actually Predict Quality

LinearB's 2026 benchmarks found that teams with elite cycle times have half the change failure rate of average teams. Rework rate is a leading indicator for change failure rate. Both of these metrics penalize the AI throughput pattern.

Short cycle time requires PRs that move cleanly from review to merge without revision cycles. Low rework rate means merged code holds. The data suggests AI-assisted code is failing on both dimensions: longer queue time, lower acceptance rate, higher churn post-merge.

This doesn't show up in deployment frequency or PR count. Those metrics went up for teams that adopted AI aggressively. The metrics that correlate with system stability (the ones LinearB found to matter) moved in the other direction.

The Part That Doesn't Get Measured

Most developer productivity tracking stops at the commit or PR creation stage. The dashboards that AI vendors built to justify enterprise licensing measure acceptance rate for completions (acceptance as in "did the developer hit tab"), not acceptance rate for PRs (acceptance as in "did this code actually merge"). These are different numbers and serve different interests.

GitHub Copilot has reported completion acceptance rates of 80-90% in its dashboards. The completion acceptance rate measures whether you accepted an inline suggestion. It says nothing about whether the code that suggestion became made it to production.

The LinearB data measures at the PR level, the most meaningful quality gate in a real team workflow. At that gate, AI-assisted code passes 32.7% of the time. Most of AI's apparent productivity gain sits in the gap between the 80-90% completion acceptance rate and the 32.7% PR acceptance rate.

If you want to know what AI is doing to your own output quality, you need to track where your PRs come from and what happens to them. Not because the number will be exactly 32.7% — it varies by team, codebase, and how you use AI tools — but because without it, you're navigating with the wrong instrument.
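As a starting point, here is a minimal sketch of that tracking. It assumes you already have one record per PR with an `origin` label and a `merged` flag. Both field names are hypothetical, and how you decide that a PR counts as "ai" (a label, a commit trailer, a tool integration) is up to your team. The sample records are placeholders, not real data.

```python
from collections import defaultdict


def merge_rate_by_origin(prs):
    """Return {origin: merged / opened} for a list of PR records."""
    opened = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        opened[pr["origin"]] += 1
        if pr["merged"]:
            merged[pr["origin"]] += 1
    return {origin: merged[origin] / opened[origin] for origin in opened}


# Placeholder records; in practice these would come from your Git host.
sample_prs = [
    {"origin": "ai", "merged": False},
    {"origin": "ai", "merged": True},
    {"origin": "ai", "merged": False},
    {"origin": "human", "merged": True},
    {"origin": "human", "merged": True},
]

rates = merge_rate_by_origin(sample_prs)
for origin, rate in sorted(rates.items()):
    print(f"{origin}: {rate:.1%}")
```

The grouping key is the part worth arguing about. If "ai" means "any PR where an assistant was open in the editor", the number will look very different from "PRs opened by an agent", so pick one definition and keep it stable over time.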

At xeve, we track coding time correlated with commits and merged PRs. The correlation between uninterrupted focus blocks and merged PRs is strong and consistent. The correlation between AI session time and merges is more interesting — there are sessions that produce a lot of PRs and few merges, and sessions where one PR merges cleanly. That variance is information. It's the kind of information that starts to answer where in your workflow AI is actually helping versus where it's generating overhead you're paying for in review time and churn.

The LinearB number — 32.7% — is an industry aggregate. Your number is what matters for your decisions. The question is whether you're measuring it.

Written by Kevin — builder of xeve
