The More You Refine AI Code, the Less Secure It Gets

June 16, 2026 · 6 min read

A paper posted to arXiv in June 2025 ran 400 code samples through 40 rounds of AI-assisted "improvement" and found that critical security vulnerabilities increased by 37.6% after just five iterations. The code passed more tests after each round. It was less secure after each round. Those two things happened simultaneously.

This is the iterative AI refinement paradox: the workflow that feels like quality control is the one accumulating the security debt.

What the Study Found

The paper, "Security Degradation in Iterative AI Code Generation," published on arXiv in June 2025 (2506.11022), tested four distinct prompting strategies across 400 code samples. None of the four improved security with iteration. All four degraded it, at different rates.

The researchers call the mechanism "feedback loop security degradation." When you prompt an LLM to fix a visible problem in its previous output — a test failure, an edge case, a logic error — the model optimizes for the thing you made explicit. It maintains no memory of the security properties established in earlier iterations. It treats each improvement prompt as a fresh constraint to satisfy. The result is code that is progressively better at solving the visible problem and progressively worse at the invariants you never mentioned.

The 37.6% figure is at five iterations. The paper tracked 40 rounds. The curve doesn't flatten.

This Wasn't a Theoretical Risk

Georgia Tech's Vibe Security Radar project has been tracing confirmed CVEs back to their AI-authored commits since May 2025. In January 2026, the project tracked six CVEs directly attributable to AI-generated code reaching production. February: fifteen. March: thirty-five.

By March, the project confirmed 74 vulnerabilities across roughly 50 AI coding tools. Hanqing Zhao, the researcher running the radar, estimates the actual number is five to ten times higher — most AI-assisted commits lack the metadata signatures needed to trace them definitively back to their origin.

The month-over-month trajectory reflects more than a growing volume of AI code in production. More AI code also means more code that has been through iterated refinement, and each round of "make this better" is another chance to introduce a vulnerability. The count is accelerating because the generating process is multiplicative, not additive: vulnerabilities scale with the amount of code times the number of refinement rounds it went through.
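The trajectory can be checked with simple arithmetic on the three monthly counts quoted above. This is a minimal sketch, and the function name `month_over_month` is made up for illustration. Three data points are far too few to fit a growth model, so it only compares the absolute increase with the growth ratio.

```python
def month_over_month(counts):
    """Return (previous, current, absolute increase, growth ratio) per month pair."""
    months = list(counts)
    return [
        (prev, curr, counts[curr] - counts[prev], counts[curr] / counts[prev])
        for prev, curr in zip(months, months[1:])
    ]


# Monthly CVE counts quoted above (January to March 2026).
monthly_cves = {"January": 6, "February": 15, "March": 35}

for prev, curr, added, ratio in month_over_month(monthly_cves):
    print(f"{prev} -> {curr}: +{added} CVEs, {ratio:.2f}x the previous month")
```

The absolute increase more than doubles (9, then 20) while the ratio stays above 2 in both steps. That is consistent with a multiplicative reading, though three months of data cannot confirm it.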

Why It Stays Invisible

The review data doesn't capture iteration history. When a developer submits a pull request, reviewers see the final output. They don't see that the function in front of them is the product of twelve rounds of AI refinement rather than a single generation. Security scanners run on final artifacts, not intermediate states. The iteration depth that determines how many vulnerabilities were introduced is invisible at review time.

This is different from the no-review problem. The 61% of AI-generated PRs that receive no human review are a known exposure. What's harder to see is that in the 39% that do get reviewed, the reviewer is often looking at code whose security properties degraded through a process they have no visibility into at all.

Iteration depth also correlates with task complexity. Simple generations reviewed quickly have low iteration depth. Complex features built across a long agentic session with dozens of correction prompts have high iteration depth. Those are the PRs where the vulnerability profile is worst and where reviewers have the least leverage, because the scope is large and the context is difficult to reconstruct from the diff alone.

The Refinement Intuition Is Wrong

Most developers using AI tools have internalized a workflow: the initial generation is rough, and the value comes from iterating toward something better. Write a draft, identify what's wrong, prompt the model to fix it, repeat. This mirrors how you'd work with a junior developer you're mentoring toward better output.

The problem is that a junior developer maintains a mental model of the code across iterations. They know that the authentication check added in step two still needs to hold when the edge case added in step five changes the control flow. They carry the invariants forward. The model doesn't. Each iteration is a fresh context, and the invariants that have to be maintained across iterations aren't preserved unless you explicitly restate them.

Nobody restates invariants. The prompt is "also handle the case where the user ID is null." Not "also handle the case where the user ID is null while ensuring the authentication check from three revisions ago remains valid for this new code path."
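One way to make restating invariants cheap is to keep them in a list and append them to every refinement prompt automatically. The sketch below is a hypothetical helper, not something the paper or any AI tool provides: the function name, the wording of the appended text, and the example invariant are all assumptions. Restating an invariant also does not guarantee the model honors it, so this complements review rather than replacing it.

```python
def build_refinement_prompt(request: str, invariants: list[str]) -> str:
    """Append every recorded invariant to a refinement request."""
    if not invariants:
        return request
    lines = [request, "", "While making this change, keep these properties intact:"]
    lines += [f"- {invariant}" for invariant in invariants]
    return "\n".join(lines)


# Hypothetical invariant recorded when the authentication check was first added.
invariants = [
    "Every code path that reads user data runs the authentication check first.",
]

prompt = build_refinement_prompt(
    "Also handle the case where the user ID is null.", invariants
)
print(prompt)
```

The list grows as the session goes on, so the invariant from revision two is still in the prompt at revision five without anyone having to remember it.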

ProjectDiscovery's 2026 AI Coding Impact Report, released last month, found that 49% of increased engineering delivery over the past year came from AI-assisted coding tools and that security teams are struggling to keep pace. The report framed the problem as volume — more code means more to review. The arXiv paper reframes it: each additional iteration in the generation process introduces additional vulnerabilities. Security teams aren't just running behind on quantity. The quality signal is degrading in ways they can't see from the outside.

What Changes

The most direct mitigation is treating iteration count as a security signal. Code that took forty AI prompts to produce carries different risk than code that took two. Neither current tooling nor current review practice distinguishes between them.
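As a rough illustration of what "iteration count as a security signal" could look like in practice, the sketch below maps a prompt count to a review tier. The tier names and thresholds are placeholders chosen for the example. Neither the paper nor any tool mentioned here prescribes them, and it assumes the iteration count is available at all.

```python
def review_tier(iteration_count: int, standard_max: int = 2, elevated_max: int = 10) -> str:
    """Map the number of AI refinement prompts behind a change to a review tier.

    The thresholds are illustrative defaults, not values from the paper.
    """
    if iteration_count < 0:
        raise ValueError("iteration_count cannot be negative")
    if iteration_count <= standard_max:
        return "standard"
    if iteration_count <= elevated_max:
        return "elevated"
    return "security-review"


for count in (2, 5, 40):
    print(count, review_tier(count))
```

With these defaults, the two-prompt change gets a standard review and the forty-prompt change is routed to a dedicated security review.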

The arXiv paper's recommendation is explicit: "robust human validation between LLM iterations." Not reviewing every prompt exchange, but periodic checkpoints where a human evaluates whether security properties are intact before the next round of refinement proceeds. That's structurally different from reviewing a final artifact. It requires being present in the iteration loop, not just at the end of it.
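A periodic checkpoint can be sketched as a loop that refuses to continue past every Nth round until a human signs off. Everything here is an assumption made for illustration: the function names, the checkpoint interval of three, and the stand-in `fake_refine` and `always_approve` callables. The paper recommends validation between iterations but does not specify an interval or an implementation.

```python
from typing import Callable, Iterable


def refine_with_checkpoints(
    code: str,
    prompts: Iterable[str],
    refine: Callable[[str, str], str],
    human_approves: Callable[[str, int], bool],
    checkpoint_every: int = 3,
) -> str:
    """Apply refinement prompts in order, pausing for human sign-off every few rounds."""
    if checkpoint_every < 1:
        raise ValueError("checkpoint_every must be at least 1")
    for round_number, prompt in enumerate(prompts, start=1):
        code = refine(code, prompt)
        # Stop the session as soon as a reviewer rejects the current state.
        if round_number % checkpoint_every == 0 and not human_approves(code, round_number):
            raise RuntimeError(f"security checkpoint failed after round {round_number}")
    return code


# Stand-ins for demonstration: a real `refine` would call an AI tool, and a real
# `human_approves` would block until a reviewer has checked the security properties.
def fake_refine(code: str, prompt: str) -> str:
    return code + f"\n# applied: {prompt}"


def always_approve(code: str, round_number: int) -> bool:
    print(f"checkpoint after round {round_number}")
    return True


result = refine_with_checkpoints(
    "def handler(): ...", ["a", "b", "c", "d"], fake_refine, always_approve
)
```

The reviewer sees intermediate states rather than only the final artifact, which is the structural difference the recommendation is pointing at.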

None of this is tracked in current developer productivity metrics. Token consumption, commit frequency, editor session time — none of these surfaces iteration depth or security property degradation across a session. You'd need session-level data connecting AI tool usage patterns to post-merge security outcomes to see the signal. The iteration count is knowable in principle. It lives in the session logs of whatever AI tool you're using, which are currently either not exposed or not connected to any downstream quality signal.
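If session logs were exposed, connecting them to post-merge findings would be a simple join on commit. The sketch below assumes a record shape that no current tool is known to export (`commit` and `iterations` per session, `commit` per finding). The bucket boundaries are arbitrary and the sample records are invented purely to show the mechanics.

```python
from collections import Counter


def depth_bucket(iterations: int) -> str:
    """Group iteration counts into coarse, illustrative buckets."""
    if iterations <= 2:
        return "1-2"
    if iterations <= 10:
        return "3-10"
    return "11+"


def findings_per_bucket(sessions, findings):
    """Return {bucket: (security findings, commits)} for commits with a session record."""
    depth_by_commit = {s["commit"]: s["iterations"] for s in sessions}
    commits = Counter(depth_bucket(d) for d in depth_by_commit.values())
    hits = Counter(
        depth_bucket(depth_by_commit[f["commit"]])
        for f in findings
        if f["commit"] in depth_by_commit  # skip findings with no traceable session
    )
    return {bucket: (hits[bucket], commits[bucket]) for bucket in commits}


# Invented records, for illustration only.
sessions = [
    {"commit": "a1", "iterations": 2},
    {"commit": "b2", "iterations": 12},
    {"commit": "c3", "iterations": 40},
]
findings = [
    {"commit": "b2", "id": "F-1"},
    {"commit": "c3", "id": "F-2"},
    {"commit": "c3", "id": "F-3"},
    {"commit": "zz", "id": "F-4"},  # no session record, so it cannot be attributed
]

print(findings_per_bucket(sessions, findings))
```

The unattributable finding is dropped from the result, which mirrors the tracing problem described earlier: without session metadata, a vulnerability cannot be tied back to the process that produced it.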

Iteration depth goes unrecorded while the security degradation the arXiv paper documents accumulates in production. The gap between those two facts is where the 74 confirmed CVEs, and the five to ten times as many untraced ones, are coming from.

The code passed its tests. It passed code review. It went through a workflow designed to catch problems with the output.

The workflow just didn't look at the process that produced the output.

Written by Kevin — builder of xeve
