developer productivity

Reviewers Like AI Code More. That's the Problem.

The MSR 2026 Mining Challenge published findings this month from the AIDev dataset — the first large-scale public collection of agent-authored pull requests from real repositories. 116,211 repos, 72,189 developers, five AI agents: Claude Code, Cursor, Devin, GitHub Copilot, OpenAI Codex. Enough signal to start asking questions that vendor-run studies can't.

One of them: how do human reviewers actually respond to code written by AI agents?

The answer is uncomfortable. Reviewers are more positive toward AI-generated pull requests than toward comparable human-authored ones. More neutral-to-positive emotional responses, fewer critical comments, warmer overall tone. This happens while the same code has higher redundancy, more structural issues, and measurably lower maintainability.

The review process is producing the wrong signal. And it's doing it at scale.

What the "More Code, Less Reuse" Finding Actually Says

The specific finding comes from a paper in the Mining Challenge examining code quality and reviewer sentiment across agent-authored PRs in the AIDev dataset. It has a precise description of the problem: LLM agents "frequently disregard code reuse opportunities, resulting in higher levels of redundancy compared to human developers."

This isn't about bugs. The code isn't wrong in the traditional sense. Functions return the right values. Tests pass. The logic does what the PR description says it does.

What AI agents consistently miss is architectural reuse: the human knowledge that says we already have a function that does 80% of this, three files over, written six months ago, so build on top of that instead of writing a new one from scratch.

Agents don't have that knowledge unless you explicitly give it to them. They see the prompt and the local context. They don't see the repo's history of decisions, the patterns established over years of iteration, the abstractions that encode constraints that aren't written down anywhere. So they write something that works, looks clean, and adds structural weight the codebase didn't need.

The reviewer sees well-named functions, consistent style, idiomatic patterns. They don't see that this is the fourth implementation of the same data transformation across the repo, each written by a different agent session in a different week.

Why Reviewers Don't Catch This

Code review evolved to catch human mistakes: logic errors, off-by-one bugs, incorrect API calls, security assumptions that are wrong for the deployment context. Reviewers developed pattern recognition for the kinds of things human developers tend to get wrong.

AI-generated code has a different failure profile. It's not weak on logic, the area where experienced reviewers have trained themselves to spot failures. It's weak on architecture: reuse opportunities missed, abstractions duplicated, complexity accumulated quietly across files that no single review touches at once.

The AIDev research captures the sentiment asymmetry precisely: reviewers tend to express "more neutral or positive emotions toward AI-generated contributions than human-authored code." The paper calls it a disconnect. Current evaluation approaches, it notes, "solely measure pass rates, failing to reflect impacts on long-term maintainability and readability."

Part of why reviewers respond more warmly is surface quality. AI-generated code is often genuinely more consistent than developer-written code in the same codebase: cleaner naming, fewer style violations, more thorough inline documentation and PR descriptions. These are signals reviewers have learned to associate with quality. On AI-authored PRs, they're present without implying the architectural awareness those signals usually correlate with.

The reviewer approves a PR that looks right. They feel good about the review. The codebase accumulated a debt they can't see yet.

The Complexity Numbers

The sentiment paradox has a measurable downstream consequence. A Carnegie Mellon study in the same Mining Challenge used staggered difference-in-differences analysis on longitudinal data from open-source repositories to measure what happens when teams adopt coding agents.

Static-analysis warnings rose by 18%. Cognitive complexity rose by 39%. Both increases were sustained across the observation period — not a spike from the initial transition, but a persistent elevation.
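For readers who haven't met difference-in-differences: the basic two-group, two-period version compares the change in adopting repositories with the change in non-adopting ones over the same period. The study's staggered variant handles repositories that adopt at different times and is more involved than this. The sketch below shows only the basic idea, and its function name and numbers are invented for illustration, not taken from the study.

```python
def did_estimate(treated_before: float, treated_after: float,
                 control_before: float, control_after: float) -> float:
    """Basic 2x2 difference-in-differences on group means."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    # Subtracting the control group's change removes the trend that
    # would have happened without adopting the agent.
    return treated_change - control_change


# Invented numbers: mean complexity score per repository group.
effect = did_estimate(treated_before=10.0, treated_after=14.0,
                      control_before=10.0, control_after=11.0)
print(effect)  # 3.0
```

The adopting group rose by 4 and the comparison group by 1, so only 3 of the rise is attributed to adoption.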

Cognitive complexity is not the same as bug density. It's a measure of how hard code is to read and reason about: the number of branching paths, nesting levels, and structural decisions a reader has to hold in their head to understand a function. Higher cognitive complexity doesn't make code wrong. It makes it harder to debug, extend, and review safely next time.
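To make the metric concrete, here is a minimal sketch of a nesting-weighted score in Python. It is a deliberate simplification, not SonarQube's actual algorithm: every if, for, while, or except adds one point plus one more for each level of nesting it sits under, and an elif is treated as a nested if. The function names and the sample code are made up for illustration.

```python
import ast

# Control-flow nodes that count as a "structural decision" in this sketch.
# Assumption: a simplified rule set, not the full cognitive complexity spec.
BRANCHING = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def _score(node: ast.AST, depth: int) -> int:
    """Add 1 + nesting depth for every branching node beneath `node`."""
    total = 0
    for child in ast.iter_child_nodes(node):
        if isinstance(child, BRANCHING):
            total += 1 + depth
            total += _score(child, depth + 1)
        else:
            total += _score(child, depth)
    return total


def nesting_weighted_complexity(source: str) -> dict[str, int]:
    """Return a score for each function defined in `source`."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            scores[node.name] = _score(node, depth=0)
    return scores


SAMPLE = '''
def flat(x):
    if x > 0:
        return 1
    for i in range(3):
        pass
    return 0

def nested(xs):
    for x in xs:
        if x > 0:
            while x:
                x -= 1
    return xs
'''

print(nesting_weighted_complexity(SAMPLE))  # {'flat': 2, 'nested': 6}
```

The nested function has only one more branch than the flat one, but it scores three times higher because each extra level a reader has to hold in mind costs more than the last.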

The 39% rise happened across repositories regardless of when agents were adopted. Teams that started with IDE-based AI tools (Copilot, Cursor) and then added autonomous agents saw the same pattern as teams that went agentic first. The quality degradation wasn't about order of operations. It was about what agents build when they build without full architectural context.

The same study found something that complicates the usual argument for stacking tools: teams that had already been using IDE-based AI saw "minimal or short-lived throughput increases" when they added autonomous coding agents on top. The velocity gains from the second tool were largely absent. The quality costs arrived anyway.

The Structural Mismatch

There's a reasonable explanation for why this happens that goes beyond "AI isn't good enough yet."

Code review is inherently local. A reviewer looks at the diff. They understand the functions being changed, the tests that accompany them, the PR description's framing of the intent. They are well-positioned to catch: this function has a bug, this security assumption is wrong, this edge case is unhandled.

They are poorly positioned to catch: this entire feature duplicates something that already exists in a different part of the codebase, and the duplication will cause maintenance problems when the underlying behavior needs to change. That finding requires knowledge that extends beyond the diff: file history, architectural context, awareness of patterns established over time. It's the kind of thing a senior developer who's worked the codebase for years holds in their head, and applies when designing the implementation rather than reviewing it.

AI agents write the implementation without that context. Human reviewers review the diff without the cross-repo visibility that would reveal the pattern. Both are operating correctly within their scope. The failure mode lives between those two scopes, which is why neither catches it.

The MSR paper's conclusion is direct: "current evaluation approaches solely measure pass rates, failing to reflect impacts on long-term maintainability." Pass rate is what review produces. Maintainability impact is what review misses.

What Would Tell You If It's Happening

If your AI-assisted code is accumulating the kind of structural debt the AIDev data describes, the signal won't show up in PR review sentiment. Reviewers will continue to feel good about what they're approving.

The signal shows up in time spent: how often do you return to recently-merged files not to extend them but to understand why they were written that way? How often does a teammate ask you to explain what an agent session produced? How many files, when you open them to extend a feature, turn out to have near-duplicate implementations of something you've already built elsewhere?

Cognitive complexity and closely related metrics are measurable with static analysis tools: SonarQube's cognitive complexity score, Clippy's cognitive_complexity lint for Rust, or ESLint's complexity rule (which measures cyclomatic complexity, a related but distinct metric). Running these against a snapshot of the repo from a month ago and again against the current state will tell you whether the trend is heading in the direction the MSR data predicts. The number to watch isn't the score on any single file; it's whether the mean is drifting up over time as agent sessions accumulate.
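One way to watch the mean rather than any single file is to export per-file scores from whichever tool you use at two points in time and compare them. The sketch below assumes you already have those scores as plain path-to-score mappings; the function name, file paths, and numbers are hypothetical, invented for illustration.

```python
from statistics import mean


def complexity_drift(before: dict[str, float], after: dict[str, float]) -> dict:
    """Compare mean per-file complexity between two snapshots."""
    # Only files present in both snapshots give a like-for-like mean.
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two snapshots have no files in common")
    mean_before = mean(before[p] for p in shared)
    mean_after = mean(after[p] for p in shared)
    percent = (mean_after - mean_before) / mean_before * 100 if mean_before else None
    return {
        "mean_before": mean_before,
        "mean_after": mean_after,
        "percent_change": percent,
        # Files added since the first snapshot are reported separately.
        "new_files": sorted(after.keys() - before.keys()),
        # Up to three shared files, largest increase first.
        "biggest_risers": sorted(shared, key=lambda p: before[p] - after[p])[:3],
    }


# Invented scores, standing in for a tool's per-file output.
month_ago = {"a.py": 10, "b.py": 20}
today = {"a.py": 12, "b.py": 24, "c.py": 30}
print(complexity_drift(month_ago, today))
```

Keeping new files out of the mean is a design choice: it separates "existing code got harder to read" from "we added more code", which are different problems.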

The cross-file reuse visibility is harder. That requires something closer to a codebase-level understanding of which files contain conceptually similar implementations — semantic search across the repo, or the kind of architectural review that happens in staff-level PR discussions, not individual reviews. Most teams don't do this systematically because most teams don't have a forcing function for it.
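Short of real semantic search, a crude first pass is to compare the structural shape of functions across files. The sketch below is that cruder stand-in, not a semantic search: it reduces each Python function to its sequence of AST node types, so renamed copies look identical, and flags pairs above a similarity threshold. The file names, sample functions, and the 0.9 threshold are assumptions for illustration.

```python
import ast
from difflib import SequenceMatcher
from itertools import combinations


def function_shapes(files: dict[str, str]) -> dict[str, list[str]]:
    """Map 'path::function' to the function's sequence of AST node types."""
    shapes = {}
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Node type names ignore identifiers and literal values.
                shapes[f"{path}::{node.name}"] = [
                    type(n).__name__ for n in ast.walk(node)
                ]
    return shapes


def near_duplicates(files: dict[str, str], threshold: float = 0.9):
    """Return (function_a, function_b, similarity) for look-alike pairs."""
    pairs = []
    for (a, shape_a), (b, shape_b) in combinations(function_shapes(files).items(), 2):
        ratio = SequenceMatcher(None, shape_a, shape_b, autojunk=False).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical repo contents: two renamed copies and one unrelated helper.
FILES = {
    "orders.py": (
        "def normalize_order(rec):\n"
        "    out = {}\n"
        "    for k, v in rec.items():\n"
        "        out[k.lower()] = v\n"
        "    return out\n"
    ),
    "users.py": (
        "def clean_user(row):\n"
        "    result = {}\n"
        "    for key, val in row.items():\n"
        "        result[key.lower()] = val\n"
        "    return result\n"
    ),
    "math_utils.py": "def add(a, b):\n    return a + b\n",
}

for a, b, ratio in near_duplicates(FILES):
    print(a, b, ratio)
```

This compares every pair of functions, so it suits a periodic audit on a modest codebase rather than a per-PR check, and it will miss duplicates that solve the same problem with a different structure.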

Tracking where your time actually goes is one proxy. Time spent in older files that recently grew with AI-assisted code, doing orientation work rather than extension work, is the cost of duplicated abstractions showing up in your schedule. It doesn't show up in code review. It shows up in session logs, if you're keeping them — the difference between "I opened this file to add a feature" and "I opened this file to figure out what was already there."

The Review Process Wasn't Designed for This

The MSR 2026 findings don't suggest code review is failing. They suggest code review is doing exactly what it was designed to do, against a different problem than the one actually being introduced.

Review was optimized over decades to catch incorrect code. AI agents mostly don't introduce incorrect code at the function level. They introduce structural redundancy and accumulated complexity — problems that are invisible in any individual diff and only visible across the full repository over time.

The reviewer who approves AI-authored code with warm sentiment isn't making a mistake. They're making an accurate judgment about what's in the diff. The problem is that what's in the diff is no longer the full picture of what the code is doing to the codebase.

That gap is where the 39% complexity rise comes from. And it's why review getting warmer at exactly the moment code gets structurally worse is not a coincidence. It's the shape of the mismatch.

Written by Kevin — builder of xeve
