Deployment Frequency Is Up. So Is Your Change Failure Rate.

Your deployment frequency is up since you started using AI coding tools. So, almost certainly, is your change failure rate. You are probably reporting the first number to leadership. You are probably not reporting the second.

Cortex's 2026 Engineering in the Age of AI benchmark tracked engineering teams across a broad range of AI adoption levels. Pull requests per author were up 20% year-over-year. Deployment frequency was up across the board. And incidents per PR jumped 23.5%. Change failure rate rose 30%.

That last number is not getting the same attention as the first two. It should be.

DORA's Core Bet

DORA (DevOps Research and Assessment) spent years studying software delivery teams and landed on four metrics as reliable proxies for engineering health: deployment frequency, lead time for changes, time to restore service, and change failure rate. The key insight behind the framework was that speed and stability moved together in healthy teams. Elite teams deployed frequently and had low change failure rates. Low performers had both low frequency and high failure rates.
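
To make the stability half of that picture concrete: change failure rate is just failed deployments over total deployments in a window, computed from the same deployment log that gives you frequency. Here is a minimal sketch, using a hypothetical record shape rather than any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical deployment record; in practice this comes from your CI/CD
# and incident tooling, with failure attribution however your team defines it.
@dataclass
class Deployment:
    deployed_at: datetime
    caused_failure: bool  # rollback, hotfix, or incident traced to this change

def dora_snapshot(deployments: list[Deployment], window_days: int = 30) -> dict[str, float]:
    """Deployment frequency and change failure rate over a trailing window."""
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = [d for d in deployments if d.deployed_at >= cutoff]
    if not recent:
        return {"deploys_per_day": 0.0, "change_failure_rate": 0.0}
    return {
        # The number most AI dashboards report...
        "deploys_per_day": len(recent) / window_days,
        # ...and the one that says whether that speed was borrowed against reliability.
        "change_failure_rate": sum(d.caused_failure for d in recent) / len(recent),
    }
```

Lead time and time to restore come from the same kinds of timestamps, which is part of why the framework expects all four to be watched together.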

This was counterintuitive. The conventional wisdom before DORA was that deploying more meant breaking more. DORA showed that the causality ran through engineering discipline, not velocity. Small batches, automated testing, strong CI, careful review — these practices simultaneously enabled frequent deployment and kept failure rates low. The correlation wasn't a coincidence. It was the same underlying discipline producing both metrics.

That's why engineering organizations got comfortable tracking deployment frequency as the primary health signal. If deployment frequency was improving, the reasoning went, the underlying discipline must be improving too — and failure rate would follow.

AI broke this reasoning.

What AI Does to the Coupling

AI coding tools make deployment frequency easy to hit. Generate more code, open more PRs, merge more, deploy more. The velocity side of DORA moves in the right direction without requiring the team to improve anything about its testing coverage, its review culture, or its CI discipline. In fact, because AI generates bigger PRs and more of them, it actively pressures those practices downward.

Faros AI's Acceleration Whiplash report put specific numbers on the mechanism: PR size is up 51.3% year-over-year. Median time a PR spends in review is up 441%. Thirty-one percent more PRs are merging with no substantive review at all.

When PRs are 50% bigger and review queues are backed up by a factor of five, reviewers are either drowning or rubber-stamping. The code gets merged. The deployment frequency metric looks excellent. The failures accumulate in production.

Among teams moving from low to high AI adoption, the incidents-to-PR ratio increased 242.7%, according to QA Source's analysis of the same period. This is not noise. It is a systematic pattern: the same AI adoption that improves the easy DORA metrics is degrading the one that actually reflects user experience.

Why the Correlation Broke

DORA's correlation between frequency and stability held because both were expressions of the same underlying thing: engineering maturity. You couldn't hit high deployment frequency in 2019 without the testing infrastructure, review processes, and small-batch discipline that produced low failure rates. The process bottleneck forced the discipline. The discipline produced both metrics.

AI removes the process bottleneck for the velocity metrics without removing it for quality. You can now generate and merge code faster than your review process can absorb, faster than your test coverage can validate, faster than your team can build genuine understanding of what is being deployed. The frequency goes up. The maturity doesn't.

This is the mechanical story behind the 30% failure rate increase. It is not that AI-generated code is randomly buggy. It is that the review and validation steps that used to be the bottleneck — and thus were non-negotiable — have become optional in a world where generation is cheap and the pressure to ship is constant.

CodeRabbit analyzed AI-coauthored pull requests against human-written ones and found AI-coauthored PRs contained 75% more logic and correctness errors, 2.74 times more security vulnerabilities, and nearly twice as many error-handling gaps. This code needs more review, not less. The incentive structure under AI adoption is pushing in the opposite direction.

What the Measurement Frameworks Are Tracking

Harness published its State of Engineering Excellence 2026 report on May 13th. The finding that stands out: 89% of engineering leaders say their current metrics accurately reflect AI's impact on their team. Ninety-four percent of those same leaders also say key factors are missing from those metrics, including tech debt, validation time, and developer burnout. Only 6% believe their current frameworks can fix the measurement gap.

That is not a contradiction they experience as a problem. They trust the metrics and simultaneously know the metrics are incomplete. Eighty-one percent of leaders in the survey said code review time has gone up significantly since deploying AI. That increase is largely invisible to the dashboards they report from.

The same pattern holds for change failure rate specifically. Deployment frequency dominates engineering AI dashboards because it is easy to instrument, moves quickly with AI adoption, and produces numbers that look good in leadership reviews. Change failure rate is slower, harder to attribute, and tends to show up after the quarter in which the AI deployment happened. It is the metric that tells you whether the deployment frequency gain was real or borrowed against future reliability.

Most organizations are borrowing and not accounting for it.

Which Teams Are Getting Hit

Cortex's benchmark identified a clear pattern in which organizations were seeing the steepest failure rate increases: those with the least formal AI governance. Only 32% of teams in the survey had formal AI policies with enforcement mechanisms. Forty-one percent relied on informal guidelines. Twenty-seven percent had no governance at all.

The teams with no governance had the sharpest jump in incidents per PR. Without structured review requirements, the path from AI generation to merge to production has the least friction. Deployment frequency looks best in exactly these teams. Failure rate looks worst.

The irony is that the governance gap is created by the same speed pressure that AI is supposed to relieve. Teams adopt AI to ship faster. Review processes slow things down. So informal "just merge it" norms emerge, PRs move through with minimal scrutiny, and the failure rate climbs while the velocity metrics look excellent. By the time incidents start piling up, the causal chain is hard to trace back to the review shortcuts taken six weeks earlier.

What to Actually Track

DORA works when you watch all four metrics together, not just the ones that improve easily. If your deployment frequency is rising because of AI adoption, change failure rate and time to restore service (MTTR) need to be in the same dashboard, reviewed on the same cadence.

A few signals worth instrumenting now (a rough sketch of how to compute them follows the list):

PR merge rate without review: What fraction of PRs merge with fewer than two substantive review comments? The Faros finding on no-review merges is the leading indicator for the failure rate spike that follows. If this number is rising, your failure rate will follow within weeks.

Incidents per deployment: Not aggregate change failure rate, but failures per deployment event. If this ratio is rising while deployment frequency rises, you are getting faster and more fragile simultaneously, not faster and more reliable.

PR size over time: Bigger PRs mean less review coverage per line of code. The 23.5% jump in incidents per PR in the Cortex benchmark is mechanically connected to review quality degrading as PRs get harder to process. If your average PR has grown since AI adoption, watch what happens to your incident rate on the same timeline.

Post-merge bug rate by PR origin: Teams that tag AI-assisted PRs separately consistently find they carry a different defect pattern — logic errors, security assumption failures, error-handling gaps — that can be caught with targeted review focus. Knowing which PRs carry which risk lets you allocate review depth where it matters rather than treating all PRs as equivalent.
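
Here is a rough sketch of how these four signals might be pulled from PR, deployment, and incident data. The record shapes, field names, and the two-comment threshold are illustrative assumptions, not any specific platform's API.

```python
from dataclasses import dataclass
from datetime import datetime
from collections import defaultdict

# Illustrative record shapes; in practice these would be joined from your
# source control, CI/CD, and incident tracking systems.
@dataclass
class PullRequest:
    merged_at: datetime
    lines_changed: int
    substantive_review_comments: int
    ai_assisted: bool          # however your team tags AI-coauthored PRs
    post_merge_bugs: int       # defects later attributed back to this PR

@dataclass
class Deployment:
    deployed_at: datetime
    caused_incident: bool

def review_bypass_rate(prs: list[PullRequest], min_comments: int = 2) -> float:
    """Fraction of merged PRs that landed with fewer than min_comments substantive comments."""
    if not prs:
        return 0.0
    bypassed = sum(pr.substantive_review_comments < min_comments for pr in prs)
    return bypassed / len(prs)

def incidents_per_deployment(deployments: list[Deployment]) -> float:
    """Failures per deployment event, rather than an aggregate failure count."""
    if not deployments:
        return 0.0
    return sum(d.caused_incident for d in deployments) / len(deployments)

def avg_pr_size_by_week(prs: list[PullRequest]) -> dict[str, float]:
    """Average lines changed per PR, bucketed by ISO week, to watch size drift over time."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for pr in prs:
        year, week, _ = pr.merged_at.isocalendar()
        buckets[f"{year}-W{week:02d}"].append(pr.lines_changed)
    return {week: sum(sizes) / len(sizes) for week, sizes in sorted(buckets.items())}

def bug_rate_by_origin(prs: list[PullRequest]) -> dict[str, float]:
    """Post-merge bugs per PR, split by whether the PR was tagged as AI-assisted."""
    groups: dict[str, list[int]] = {"ai_assisted": [], "human_only": []}
    for pr in prs:
        groups["ai_assisted" if pr.ai_assisted else "human_only"].append(pr.post_merge_bugs)
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in groups.items()}
```

All four are simple aggregations over data most teams already collect; the harder part is putting them on the same dashboard and review cadence as deployment frequency.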

The Assumption Worth Examining

DORA's central contribution was showing that stability and speed don't trade off when the right engineering discipline is in place. That was true and valuable for a decade.

AI has introduced a regime change in what generates deployment frequency. The discipline is no longer required to hit the velocity metrics. You can generate more code, merge it faster, and deploy more often without the testing infrastructure, review culture, or small-batch thinking that used to come along for the ride.

The frequency is real. The discipline is optional. The failure rate is the difference.

If your dashboard is green on deployment frequency and you haven't checked your change failure rate since AI adoption started, you are looking at the half of the picture that got easier. The half that matters more — the half your users experience directly — is probably telling a different story.

Written by Kevin — builder of xeve
