developer productivity

AI Doubled Your PR Count. Review Didn't Scale.

6 min read

Teams adopting AI coding tools are merging 98% more pull requests. They are not shipping twice as fast. That gap is not a measurement error — it's where AI's productivity gains go to die.

Faros.ai spent two years tracking telemetry from 22,000 developers across organizations with varying levels of AI adoption. The individual-level numbers look the way you'd expect. High-AI teams complete 21% more tasks, merge nearly double the PRs, and touch 47% more pull requests per day than their low-AI counterparts. Viewed in isolation, those numbers make AI look like an unambiguous win.

Then they tracked what happens downstream. PR review time on high-AI teams went up 91%. Average PR size grew 154%. Engineers were generating code at a rate that human review couldn't absorb, and the system-level metric — actual delivery velocity, features shipped per sprint — barely moved.

The bottleneck didn't disappear. It relocated.

What a 154% Bigger PR Means for the Person Reviewing It

The obvious version of AI's code review problem is volume: more PRs means more review work. That's real, but it undersells the actual difficulty.

CodeRabbit analyzed 470 open-source pull requests, splitting them between human-written and AI-coauthored code. AI-coauthored PRs contained 1.7 times more issues overall than human PRs. More specifically: 75% more logic and correctness errors, 3 times more readability issues, 2.74 times more security vulnerabilities, and nearly 2 times more gaps in error handling.

So reviewers aren't just seeing more PRs — they're seeing harder PRs. AI writes code confidently and quickly. It doesn't have a model of what the reviewer needs to understand to approve it safely. It doesn't know what the architectural constraint is two layers up that makes this particular approach fragile. It doesn't know to flag that this security assumption is wrong for this team's deployment context.

A human writing code for a human reviewer has absorbed, over years, an implicit model of what reviewers look for. They structure the PR to make that job easier. AI has no such model. The output is optimized for the prompter, not the approver.

The result is a PR queue that is simultaneously longer, larger, and denser with issues than before. Review time up 91% is not surprising given those inputs. The surprise is that anyone thought the output side could scale without thinking about the review side.

Amdahl's Law, Applied to Your Dev Cycle

There's a principle from computer architecture that applies here almost exactly. Amdahl's Law states that the speedup from optimizing one component of a system is limited by the fraction of total time actually spent in that component. Optimize a phase that represents 20% of total time by 10x, and total time drops by only 18%, an overall speedup of roughly 1.22x. Everything else is still constrained by the remaining 80%.
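A minimal sketch of that arithmetic, using the same 20% and 10x figures from the example above:

```python
def amdahl_speedup(fraction: float, speedup: float) -> float:
    """Overall speedup when `fraction` of total time is accelerated by `speedup`x."""
    return 1 / ((1 - fraction) + fraction / speedup)

# The example above: a phase that is 20% of cycle time, sped up 10x.
print(amdahl_speedup(0.20, 10))    # ~1.22x overall; total time drops ~18%

# Even an infinite speedup of that one phase caps out at 1.25x overall.
print(amdahl_speedup(0.20, 1e12))  # ~1.25x
```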

For most engineering teams before AI, writing code was genuinely the bottleneck. Developers spent significant fractions of their workday actually producing code. Anything that sped up that phase produced real system-level gains.

That condition no longer holds the same way on high-AI teams. Writing code got faster. The fraction of total cycle time spent writing shrank. The fraction spent waiting in review, getting review, and fixing what review catches grew — and became the new dominant constraint.

When you measure individual developer output (commits per week, PRs opened, tasks closed), you're measuring the phase you just optimized. Of course those numbers improved. But delivery velocity is a system metric, not an individual one. A PR that is written in two hours and sits in review for three days is not a productivity win. It's a productivity illusion, visible in the individual data and invisible in the system data.

This is why Faros.ai's aggregate numbers look paradoxical. The individual numerators went up. The system denominator — the time from "this is a feature" to "this is in production" — barely moved. Most of the AI efficiency went into building a larger queue.

The Metrics That Disappear

Most developers and most teams are not tracking review cycle time. That is not a criticism — it's genuinely hard to track without dedicated engineering analytics infrastructure. The natural way to measure developer productivity is at the individual level: what did you ship, how many commits, how many hours in your editor. Aggregate that across a team and you have a proxy for team output.

But that proxy fails when the bottleneck is a handoff step. If your reviewers can't keep up with your authors, adding more AI to the authors' workflows makes the problem strictly worse. You are compounding the queue.
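A toy queue model makes the compounding concrete. The rates below are hypothetical, not from the Faros.ai data; the point is only that once PRs arrive faster than reviewers can clear them, the backlog grows every day and never recovers on its own:

```python
def review_backlog(arrivals_per_day: int, reviews_per_day: int, days: int):
    """Size of the review queue at the end of each day, starting from empty."""
    backlog, history = 0, []
    for _ in range(days):
        backlog = max(0, backlog + arrivals_per_day - reviews_per_day)
        history.append(backlog)
    return history

# Hypothetical team whose reviewers clear 6 PRs/day:
print(review_backlog(5, 6, 7))   # pre-AI, 5 PRs/day:   [0, 0, 0, 0, 0, 0, 0]
print(review_backlog(10, 6, 7))  # post-AI, 10 PRs/day: [4, 8, 12, 16, 20, 24, 28]
```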

The metric that makes this visible is PR cycle time — the elapsed time from PR opened to PR merged. When AI adoption goes up, you expect this number to get shorter (faster development) or stay flat (gains offset by quality issues). In the Faros.ai data, across teams with high AI adoption, it's going in the other direction. More PRs, larger PRs, more issues per PR, longer wait for approval.
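PR cycle time is also one of the easier system metrics to get at. Here's a minimal sketch against the GitHub REST API; the org and repo names are placeholders, and it samples only the most recent 100 closed PRs:

```python
import statistics
from datetime import datetime

import requests  # third-party: pip install requests

def median_pr_cycle_hours(owner: str, repo: str, token: str = ""):
    """Median hours from PR opened to PR merged, over recently closed PRs."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()

    hours = []
    for pr in resp.json():
        if not pr["merged_at"]:
            continue  # closed without merging, so not a cycle-time sample
        opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        hours.append((merged - opened).total_seconds() / 3600)
    return statistics.median(hours) if hours else None

# Placeholder names; substitute your own org and repo:
print(median_pr_cycle_hours("your-org", "your-repo"))
```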

The other invisible number is bugs that escape review. The 9% bug-rate increase that shows up consistently across AI adoption studies is a post-merge signal. It means the review didn't catch everything — which, given that PRs are 154% bigger and contain 1.7 times more issues, is not surprising. Reviewers are humans with finite attention. When you give them more surface area per review, they catch a lower fraction of the total issues. The math isn't complicated.
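A back-of-envelope version of that math, with hypothetical catch rates (only the 1.7x issue multiplier comes from the CodeRabbit data above):

```python
# Hypothetical baseline: 10 issues per human-written PR, 80% of them
# caught by an attentive review of a normally sized diff.
issues_human, catch_small = 10, 0.80
escaped_human = issues_human * (1 - catch_small)  # 2.0 bugs escape

# AI-coauthored PR: 1.7x the issues (the CodeRabbit figure), reviewed
# at an assumed lower catch rate because the diff is much larger.
issues_ai, catch_large = issues_human * 1.7, 0.60
escaped_ai = issues_ai * (1 - catch_large)        # 6.8 bugs escape

print(escaped_human, escaped_ai)  # more than 3x the escaped bugs
```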

Where the Leverage Actually Is

If the bottleneck is review, that's where the leverage is. Not in tools that generate more code faster — you've already optimized that phase. The question is how to make the review phase proportionally faster.

Part of the answer is AI-assisted code review. Tools like CodeRabbit and Qodo are doing exactly this: automated first-pass review that catches the obvious issues before a human has to touch the PR. This is a meaningful category precisely because AI-generated code has a specific, analyzable issue signature: logic errors in generated functions, security assumptions that don't hold in context, readability problems in code that no human read before submission.

Part of the answer is PR discipline. AI makes it easy to generate large, comprehensive changes in a single session. Human review doesn't scale with that. Smaller, more focused PRs get reviewed faster, surface a higher fraction of their issues, and move through the queue without blocking other work. This is not a new principle; senior engineers have known it for decades. But AI creates a new pressure in the wrong direction: you can now generate a 600-line PR in the time it used to take to write 100 lines, and every incentive during generation pushes toward including more.

The harder part of the answer is tracking the system, not just the individual. If you only measure commits and coding hours, you will continue to optimize the phase that is no longer the constraint. Review cycle time, PR size trend over time, bug rate per PR — these are the signals that tell you whether AI's individual gains are making it to the system level.
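A sketch of what tracking those trends could look like, assuming per-PR records have already been collected (for example via the cycle-time sketch above; the GitHub PR detail endpoint reports additions and deletions per pull request):

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Illustrative placeholder records; `size` means additions + deletions.
prs = [
    {"merged_at": "2025-01-06T12:00:00+00:00", "cycle_hours": 20, "size": 180},
    {"merged_at": "2025-01-14T09:00:00+00:00", "cycle_hours": 70, "size": 610},
]

weekly = defaultdict(lambda: {"cycle": [], "size": []})
for pr in prs:
    week = datetime.fromisoformat(pr["merged_at"]).strftime("%G-W%V")  # ISO week
    weekly[week]["cycle"].append(pr["cycle_hours"])
    weekly[week]["size"].append(pr["size"])

for week in sorted(weekly):
    stats = weekly[week]
    print(week, "median cycle:", median(stats["cycle"]),
          "h, median size:", median(stats["size"]), "lines")
```

If median PR size and median cycle time climb week over week while individual output metrics look great, that is the productivity illusion showing up in your own data.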

The Productivity Gain That Went Somewhere

AI coding tools are producing real output gains at the individual level. That is not in dispute. The Faros.ai numbers show it clearly: 21% more tasks, 98% more PRs, 47% more daily touches. Engineers are doing more individual work.

The question is whether that individual output is translating into team delivery. On the current evidence, for many teams, most of it isn't. It's sitting in a review pipeline where review time grew 91% without a proportional increase in review capacity.

The optimization happened. It just happened in the wrong place. The code writes itself now. The question is who reviews it.

Written by Kevin — builder of xeve
