
AI Generates the Code. Nobody Reviews It.

May 29, 2026 · 6 min read

A paper published this month analyzed review history across open-source GitHub repositories and found that 61.38% of AI-generated pull requests receive no review at all — no human, no automated agent, no recorded feedback of any kind. On the same teams generating this code, developer velocity metrics are up.

That's not a contradiction. It's a definition problem. Velocity measures output. Unreviewed code accumulates as risk, which velocity doesn't measure.

What the Paper Actually Found

The study, "These Aren't the Reviews You're Looking For," published on arXiv in May 2026, traced review patterns on AI-generated versus human-authored pull requests in the same repositories. The split is stark.

Of the 38.62% of AI PRs that do receive any review, 58.77% are reviewed exclusively by automated agents — not humans. Only 8.08% of all AI-generated PRs in the study received human-only review, compared to 25.21% for human-authored PRs in the same codebases.
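Those percentages use two different denominators (all AI PRs versus reviewed AI PRs), so it helps to put them on one scale. Here's a minimal sketch that does the arithmetic. The "mixed" category is my assumption: the paper's figures as quoted here leave a remainder, and I'm treating it as PRs reviewed by both humans and agents.

```python
# Shares quoted in this article. Note the mixed denominators.
NO_REVIEW = 0.6138             # share of all AI PRs with no review
REVIEWED = 0.3862              # share of all AI PRs with any review
BOT_ONLY_OF_REVIEWED = 0.5877  # share of *reviewed* AI PRs seen only by agents
HUMAN_ONLY = 0.0808            # share of *all* AI PRs with human-only review


def review_breakdown() -> dict[str, float]:
    """Return each review category as a percentage of all AI PRs."""
    bot_only = REVIEWED * BOT_ONLY_OF_REVIEWED
    # Assumption: reviewed PRs that are neither bot-only nor human-only
    # were reviewed by both humans and agents.
    mixed = REVIEWED - bot_only - HUMAN_ONLY
    return {
        "no_review": round(NO_REVIEW * 100, 2),
        "bot_only": round(bot_only * 100, 2),
        "human_only": round(HUMAN_ONLY * 100, 2),
        "mixed": round(mixed * 100, 2),
    }


if __name__ == "__main__":
    for category, pct in review_breakdown().items():
        print(f"{category:>10}: {pct:5.2f}%")
```

On one scale, bot-only review covers roughly 22.7% of all AI PRs, and under the mixed-review assumption a human was involved in about 16% of them.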

The comment-level data is equally telling. When humans do leave comments on AI-generated PRs, 25.92% of those comments are "agent-steering commands" — instructions directing the generating agent to make changes, rather than evaluating the code itself. On human-authored PRs, that fraction is 1.63%.

So when humans do review AI code, about a quarter of what they write is managing the agent, not evaluating the diff. In those comments, the framing has shifted from "is this code correct?" to "change this to what I actually want."

What 8 Million Pull Requests Confirm

LinearB's 2026 Software Engineering Benchmarks report covered 8.1 million pull requests from 4,800 engineering teams across 42 countries. Their data adds a commercial-team dimension to what the arXiv paper found in open source.

Pull requests merged without any review — human or automated — are up 31.3% year over year. AI-assisted PRs at the 75th percentile run 408 lines of code, 2.6 times longer than the 157-line median for unassisted PRs. Agentic AI PRs sit idle 5.3 times longer before anyone picks them up (1,055 minutes versus 201 for human PRs). When reviewers do engage, they accept 32.7% of AI-generated PRs, versus 84.4% for human-written code.
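A quick check of those ratios from the quoted figures, as a sketch. The constant names are mine, and the inputs are the rounded numbers cited above, not LinearB's raw data.

```python
# Figures as quoted in this article from the LinearB report.
AI_P75_LINES = 408             # AI-assisted PR size, 75th percentile
UNASSISTED_MEDIAN_LINES = 157  # unassisted PR size, median
AGENTIC_IDLE_MIN = 1055        # minutes before pickup, agentic AI PRs
HUMAN_IDLE_MIN = 201           # minutes before pickup, human PRs

size_ratio = AI_P75_LINES / UNASSISTED_MEDIAN_LINES
idle_ratio = AGENTIC_IDLE_MIN / HUMAN_IDLE_MIN
idle_hours = AGENTIC_IDLE_MIN / 60

print(f"size ratio: {size_ratio:.1f}x")
print(f"idle ratio: {idle_ratio:.1f}x")
print(f"agentic idle time: {idle_hours:.1f} hours")
```

The size ratio comes out at 2.6x and the idle time at about 17.6 hours. The rounded minute counts give an idle ratio of about 5.2x, slightly under the quoted 5.3x, which presumably comes from unrounded data. Also worth keeping in mind: the size comparison sets a 75th percentile against a median, so it is not a like-for-like comparison of typical PRs.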

The 32.7% acceptance rate tends to get treated as a quality finding. I think it's better read as a selection effect. It is the acceptance rate on reviewed PRs only, so it tells you how AI code performs under scrutiny. It says nothing about the PRs that never face scrutiny at all, which the arXiv paper put at 61% in open source.
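To see how little the reviewed-only number pins down, here's a rough sketch that bounds the share of all AI PRs that would pass review. It pairs the arXiv review rate with LinearB's acceptance rate. That pairing is my assumption: the two figures come from different datasets, so read this as an illustration, not a measurement. The function name is hypothetical.

```python
def pass_rate_bounds(review_rate: float, accept_rate_reviewed: float) -> tuple[float, float]:
    """Bound the share of all PRs that would pass review.

    Only reviewed PRs have an observed outcome. Unreviewed PRs could all be
    fine or all be broken, so they add anywhere from 0 to their full share.
    """
    observed = review_rate * accept_rate_reviewed
    unreviewed = 1.0 - review_rate
    return observed, observed + unreviewed


# Assumption: combining figures from two different studies.
low, high = pass_rate_bounds(review_rate=0.3862, accept_rate_reviewed=0.327)
print(f"share of AI PRs that would pass review: {low:.1%} to {high:.1%}")
```

Under these assumptions the answer lands anywhere between about 12.6% and 74.0%. The width of that range is the unreviewed share.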

The Seniority Inversion

Here's the part that doesn't fit the usual narrative.

When AI PRs do get reviewed, LinearB found that more senior developers are slower and more skeptical. Junior developers spend 15 minutes reviewing AI-generated PRs and accept 31.9% of them. Senior developers spend 38 minutes and accept only 23.7%.

This is the opposite of how code review normally works. Experienced reviewers usually go faster because they have pattern recognition: they've internalized the architecture, they recognize the idioms, and they can evaluate a change against the broader system without laboriously tracing through it. They also tend to accept more, not less, because they have the context to judge tradeoffs.

With AI PRs, that pattern recognition is running in reverse. Senior developers have built up a model of what AI reliably gets wrong: the logic that looks sound at first read but misunderstands a precondition, the security assumption that's correct for some deployments but wrong for this one, the edge case that generated tests don't cover because the model didn't know to look for it. The 38-minute session is what it looks like when someone has enough experience to know what to distrust.

Junior developers reviewing AI PRs in 15 minutes aren't necessarily approving code with obvious problems. But they're not catching what experienced reviewers find in their longer sessions.

Two Modes With Nothing in Between

This is the actual shape of the problem. AI code review has bifurcated.

One bucket: intense scrutiny. Senior developers reviewing AI PRs carefully, 38 minutes at a time, rejecting 76% of what they see. The process works, just slowly and expensively in terms of senior engineering time.
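"Expensively" can be made concrete with a back-of-the-envelope calculation: reviewer minutes spent per PR that ends up accepted. This is a sketch with two assumptions of mine: that a review takes the same time whether it ends in acceptance or rejection, and that every PR not accepted counts as rejected. The function name is hypothetical.

```python
def minutes_per_accepted_pr(review_minutes: float, accept_rate: float) -> float:
    """Average reviewer minutes spent for each PR that ends up accepted.

    Assumes review time does not depend on the outcome, which the
    source data does not confirm.
    """
    return review_minutes / accept_rate


junior = minutes_per_accepted_pr(15, 0.319)
senior = minutes_per_accepted_pr(38, 0.237)
print(f"junior: {junior:.0f} min per accepted PR")
print(f"senior: {senior:.0f} min per accepted PR")
print(f"senior rejection rate: {1 - 0.237:.1%}")
```

On those assumptions, each accepted AI PR costs roughly 47 minutes of junior review time or 160 minutes of senior review time.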

The other bucket: no review at all. In the arXiv sample, 61% of AI PRs drew no recorded feedback of any kind. When one of those merges, whatever the agent generated is what ships.

The middle, the ordinary range of review effort that most engineering processes assume, has collapsed. A mid-level developer spending 20 minutes on a competent but non-exhaustive review of AI code is apparently not the modal outcome. You get either an intensive senior session or nothing.

This matters because most engineering team processes are designed around the middle. Code review policies assume you're getting a reasonable signal from each review, not that 61% bypass it and the other 39% get either bot-only review or a senior-level audit. If your review policy says "one approval before merge," that policy is producing very different outcomes on AI PRs versus human PRs, and your tooling probably isn't distinguishing between them.
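One way to make a "one approval before merge" policy distinguish the two cases is to count only human approvals on AI PRs. Here's a minimal sketch of that rule. Everything in it is hypothetical: the `PullRequest` and `Approval` types, the `ai_generated` flag, and the `may_merge` function are illustrations, not any platform's API, and it assumes your tooling can already label which PRs were AI-generated.

```python
from dataclasses import dataclass, field


@dataclass
class Approval:
    reviewer: str
    is_human: bool


@dataclass
class PullRequest:
    number: int
    ai_generated: bool  # assumption: tooling labels this reliably
    approvals: list[Approval] = field(default_factory=list)


def may_merge(pr: PullRequest) -> bool:
    """Any approval for human PRs; at least one human approval for AI PRs."""
    if not pr.approvals:
        return False
    if pr.ai_generated:
        return any(a.is_human for a in pr.approvals)
    return True


# Usage: a bot-only approval is not enough for an AI-generated PR.
bot_only = PullRequest(101, ai_generated=True,
                       approvals=[Approval("review-bot", is_human=False)])
with_human = PullRequest(102, ai_generated=True,
                         approvals=[Approval("review-bot", is_human=False),
                                    Approval("dana", is_human=True)])
print(may_merge(bot_only), may_merge(with_human))
```

A rule like this only checks that a human approved. It can't tell a careful read from a rubber stamp.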

Where the Velocity Numbers Stop

JetBrains published data this month showing a 42% increase in PR closure time despite a high rate of automated review comment resolution. DORA's 2024 report estimated a 7.2% reduction in delivery stability for every 25% increase in AI adoption. These exist alongside rising velocity scores.

The reconciliation: velocity measures commits pushed, tasks completed, PRs opened. Those numbers are genuinely up. The risk isn't visible in those metrics. It's in the code that merged without human eyes. It's in the agentic PR that sat idle for 17 hours before a junior developer approved it in 15 minutes. And it's in the 45.1% of merged agent PRs that, in a separate MSR 2026 study, required post-merge human revision for correctness, documentation, or code style issues that reviewers missed.

None of that shows up in daily velocity reporting. It shows up in bugs, in security incidents, in code that future developers have to understand without knowing which parts were carefully designed and which were generated and merged silently.

The Question Worth Asking

If you're tracking developer productivity using coding time, commit frequency, or PR throughput, you're measuring what AI is generating. You're probably not measuring what fraction of that output had meaningful human review before it hit production.

The 61% figure comes from open-source repositories. LinearB's commercial team data suggests a similar pattern at scale. The specific percentages will vary — team culture, tooling, PR size norms, how much review practice has adapted to AI generation all matter.

But the question is answerable for your team right now: of the code your AI agents produced this month, what fraction had a human reviewer doing independent evaluation rather than steering the agent toward preferred output?
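Here's a rough sketch of that calculation. It assumes you have already classified each agent PR's review into one of four kinds, which is the hard part and is not shown. The `ReviewKind` categories, the `AgentPR` type, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewKind(Enum):
    NONE = "none"                      # no review recorded
    BOT_ONLY = "bot_only"              # only automated agents reviewed
    HUMAN_STEERING = "steering"        # human comments only redirect the agent
    HUMAN_INDEPENDENT = "independent"  # a human evaluated the diff itself


@dataclass
class AgentPR:
    number: int
    review: ReviewKind


def independent_review_rate(prs: list[AgentPR]) -> float:
    """Fraction of agent PRs that got an independent human evaluation."""
    if not prs:
        return 0.0
    independent = sum(1 for pr in prs if pr.review is ReviewKind.HUMAN_INDEPENDENT)
    return independent / len(prs)


# Made-up sample month, for illustration only.
month = [
    AgentPR(1, ReviewKind.NONE),
    AgentPR(2, ReviewKind.NONE),
    AgentPR(3, ReviewKind.BOT_ONLY),
    AgentPR(4, ReviewKind.HUMAN_STEERING),
    AgentPR(5, ReviewKind.HUMAN_INDEPENDENT),
]
print(f"independent human review rate: {independent_review_rate(month):.0%}")
```

The design choice that matters is keeping steering and independent review as separate categories. Lump them together and the number looks healthier than it is.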

If you don't know that number, it's probably because you're measuring the generation rate. The review rate is the one that tells you what you shipped versus what you generated. Those are becoming different things.

Written by Kevin — builder of xeve
