Engineering Leaders Know Their Metrics Are Broken. They're Using Them Anyway.

A survey cannot produce two contradictory answers from the same respondents unless something has gone badly wrong with how people think about their own dashboards.

Harness published the State of Engineering Excellence 2026 report on May 13th, based on 700 engineering practitioners and managers across five countries surveyed in April. Two of its findings sit side by side and cannot both be true: 89% of engineering leaders say their current metrics accurately reflect AI's productivity impact. Ninety-four percent of engineering leaders say key factors — tech debt accumulation, validation time, developer burnout — are missing from those same metrics.

You cannot have an accurate picture of something while simultaneously missing the most important parts of it. But that is what 700 engineering leaders told Harness they believe.

What the Contradiction Actually Means

The obvious read is that respondents answered carelessly, or that the survey questions were framed in ways that pulled respondents toward different answers. That's possible. Survey methodology is imperfect.

The more uncomfortable read is that this is exactly what organizational measurement dysfunction looks like from the inside. Leaders have dashboards. The dashboards show numbers. The numbers go up or stay flat. That is what is actually being evaluated when 89% say their metrics are accurate: do the instruments work, do they produce consistent output, do they report something? On that reading, the metrics are accurate. They're just measuring the wrong things.

The 94% response answers a different question: do the metrics capture the full cost of AI-assisted development? Here, the same leaders admitted they do not. Tech debt isn't in the dashboard. Validation time, the hours developers spend reviewing AI-generated code before committing it, isn't tracked. Burnout indicators aren't in the cycle time report.

So engineering organizations are running a system where the instruments work and the measurements are incomplete. Calling that "accurate" is technically defensible and practically dangerous.

The 31% That Doesn't Exist on Any Dashboard

The Harness report puts a number on the gap. Approximately 31% of developer time is now consumed by what it calls invisible work: reviewing AI-generated code, fixing bugs that slipped past that review, and context switching between an increasingly fragmented set of tools. Nearly a third of the working day, generating no signal in any standard engineering metric.

This isn't a small rounding error. If a developer works eight hours and roughly two and a half of them don't appear in any system that leadership looks at, the productivity picture leadership has is systematically off. Not by noise, but by a structural gap that grows larger as AI adoption deepens.
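
To make the distortion concrete, here is the arithmetic as a minimal sketch; the 31% share is the report's, while the eight-hour day and the per-hour framing are assumptions.

```python
# Back-of-envelope: how a 31% invisible-work share distorts per-hour figures.
# The 31% is Harness's number; the eight-hour day is an assumption.
workday_hours = 8.0
invisible_share = 0.31

invisible_hours = workday_hours * invisible_share   # ~2.5 h/day unseen
visible_hours = workday_hours - invisible_hours     # ~5.5 h/day measured

# Any output-per-hour figure computed against visible time alone
# overstates true throughput by total time / visible time.
inflation = workday_hours / visible_hours           # ~1.45x
print(f"{invisible_hours:.1f}h invisible, per-hour figures inflated ~{inflation:.2f}x")
```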

What makes it invisible isn't that it's hard to track. It's that the tracking infrastructure was designed before the work existed. DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) come out of the DevOps Research and Assessment program, later acquired by Google, and were built to measure continuous delivery pipelines. They're excellent at what they measure. They were not designed for a world where a significant fraction of developer time is spent evaluating whether the code an AI assistant produced is actually correct.
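
To see why, look at what the DORA four actually consume as inputs. A minimal sketch over an invented event log (nothing here is from the report or from DORA's own tooling): every input is a pipeline event, and validation work emits none.

```python
from datetime import datetime, timedelta

# The four DORA metrics over a toy event log. Every field and value is an
# invented illustration; real pipelines would emit these events themselves.
deploys = [
    {"at": datetime(2026, 5, 1), "lead_time_h": 20, "failed": False},
    {"at": datetime(2026, 5, 3), "lead_time_h": 44, "failed": True},
    {"at": datetime(2026, 5, 6), "lead_time_h": 16, "failed": False},
]
restore_times = [timedelta(hours=3)]  # one restore per failed deploy

span_days = (deploys[-1]["at"] - deploys[0]["at"]).days or 1
deploy_frequency = len(deploys) / span_days                       # deploys/day
lead_time_h = sum(d["lead_time_h"] for d in deploys) / len(deploys)
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
time_to_restore = sum(restore_times, timedelta()) / len(restore_times)

# The blind spot: hours spent reading AI-generated code before it is ever
# committed emit no deploy, no failure, no restore. No event, no signal.
```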

The same goes for the standard individual metrics: commits per day, PRs merged, story points closed. These count accepted output. They don't count the write-review-discard cycle that precedes the accepted output, or the downstream revisions when accepted code turns out to be subtly wrong.
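
A sketch of the same blind spot at the individual level, using an invented session log; the field names and numbers are illustrative only.

```python
# Invented session log: what a commits-per-day counter sees vs. what happened.
attempts = [
    {"source": "ai",    "review_min": 25, "outcome": "discarded"},
    {"source": "ai",    "review_min": 40, "outcome": "accepted"},
    {"source": "human", "review_min": 10, "outcome": "accepted"},
]

commits = sum(1 for a in attempts if a["outcome"] == "accepted")
unseen_minutes = sum(a["review_min"] for a in attempts)

# The dashboard records 2 commits. The 75 minutes of review, 25 of which
# produced nothing at all, never enter any velocity calculation.
print(commits, unseen_minutes)
```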

Eighty-one percent of leaders in the Harness survey said developers spend more time in code review since adopting AI. That time is not in most teams' productivity calculations. It's a known cost that exists outside the measurement boundary.

The Only Six Percent

The most revealing number in the report isn't 89% or 94%. It's 6%. That's the fraction of engineering leaders who believe their current frameworks can actually fix the measurement problem.

Ninety-four percent know the metrics are incomplete. Six percent think the solution is within reach using existing tools. The 88-percentage-point gap between those numbers is the share of leaders who have resigned themselves to operating with a known-incomplete picture and no clear path to a better one.

That's a specific kind of organizational failure, and it isn't ignorance. The problem is widely acknowledged: 26% of leaders named "measuring true productivity impact" as their single biggest challenge, the top answer ahead of code quality and ROI justification. The failure is the belief that the problem cannot be solved, which then justifies not solving it.

The practical result is that hiring decisions, team sizing, performance reviews, and AI tool budget allocations are being made from a dashboard that the decision-makers already know is missing a third of the relevant data. Not because they couldn't fix the dashboard. Because they've stopped expecting to.

Why This Matters More Than Individual Perception

There's already substantial documentation of the individual-level perception gap: developers who feel faster while moving slower, who report that AI keeps them in flow while their IDE logs show more context switching. The JetBrains telemetry study, the METR controlled trials, and the Faros.ai pipeline analysis all point at this in different ways.

The Harness finding is about the organizational layer above that. Individual developers misperceiving their own productivity is one problem. Engineering leadership making resourcing and tooling decisions from known-incomplete data is a different problem with wider consequences.

A developer who misjudges AI's impact on their own work suffers the cost personally. An engineering director who misjudges it for a team of 40 makes decisions that affect 40 people's working conditions, workload, and burnout risk. The Harness survey found 88% of leaders say developer satisfaction has improved since AI adoption — but burnout indicators are missing from their dashboards. There's no way to know whether that 88% reflects genuine improvement or a perception gap one level up from the one METR documented.

What Actually Needs Tracking

The Harness report's recommendations — measure tech debt, validation time, cognitive load, burnout — are correct but vague. Here's what "validation time" actually means in practice, because it's not obvious from the label.

When an AI coding agent produces a block of code, a developer reads it, reverse-engineers the logic, checks it against constraints the agent didn't know about, simulates edge cases, and decides to accept, modify, or discard it. That entire cycle is validation time. For accepted code, it often takes longer than writing the equivalent code from scratch would have. For discarded code, it's pure overhead with no output.

This time doesn't show up as "coding time" in most editor-based trackers, because it doesn't involve editing. It shows up as time spent in a coding session that produced nothing committable. System-level tracking, which measures the time from when you open your editor to when you commit, regardless of what happens in between, captures it. Editor-only tracking doesn't.
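
A minimal sketch of the difference, with invented timestamps and an assumed idle-gap heuristic for the editor-only tracker; real trackers vary, but the failure mode is the same.

```python
from datetime import datetime

# Invented event log for one session. The 50-minute stretch of reading
# AI output between 9:10 and 10:00 contains no edit events at all.
events = [
    ("open",   datetime(2026, 5, 13, 9, 0)),
    ("edit",   datetime(2026, 5, 13, 9, 5)),
    ("edit",   datetime(2026, 5, 13, 9, 10)),
    ("edit",   datetime(2026, 5, 13, 10, 0)),
    ("commit", datetime(2026, 5, 13, 10, 5)),
]

opened = next(t for kind, t in events if kind == "open")
committed = next(t for kind, t in events if kind == "commit")
system_min = (committed - opened).total_seconds() / 60         # 65 min

# Editor-only trackers typically sum gaps between edits, capped at an idle
# threshold, so the 50-minute validation gap collapses to the cap.
IDLE_CAP_MIN = 5
edits = [t for kind, t in events if kind == "edit"]
editor_min = sum(min((b - a).total_seconds() / 60, IDLE_CAP_MIN)
                 for a, b in zip(edits, edits[1:]))            # 10 min

print(f"system-level: {system_min:.0f} min, editor-only: {editor_min:.0f} min")
```

In this toy session, 55 of the 65 minutes are validation time that editor-only tracking never sees.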

PR cycle time is the downstream version of the same signal. If review time has gone up 91% on high-AI teams (Faros.ai's number from 22,000 developers) and that's not in the productivity calculation, organizations are treating a lengthening constraint as a rounding error.
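
One way to make that constraint visible is to compute wait-to-first-review and review-to-merge as their own series. A minimal sketch over invented PR records; a real version would pull these timestamps from your Git host's API.

```python
from datetime import datetime

# Invented PR records. Review time here means first review to merge.
prs = [
    {"opened": datetime(2026, 4, 1, 9),  "first_review": datetime(2026, 4, 1, 15),
     "merged": datetime(2026, 4, 2, 11)},
    {"opened": datetime(2026, 4, 3, 10), "first_review": datetime(2026, 4, 4, 16),
     "merged": datetime(2026, 4, 6, 9)},
]

def hours(a, b):
    return (b - a).total_seconds() / 3600

wait_h = sum(hours(p["opened"], p["first_review"]) for p in prs) / len(prs)
review_h = sum(hours(p["first_review"], p["merged"]) for p in prs) / len(prs)

# Tracked as first-class series, a climbing review_h is a lengthening
# constraint, not noise.
print(f"avg wait-to-first-review: {wait_h:.1f}h, avg review-to-merge: {review_h:.1f}h")
```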

Post-merge revision rate within two weeks is the lagging indicator. Jellyfish's Q1 2026 data showed nominal AI code acceptance rates of 80-90%, collapsing to 10-30% once post-commit revisions are counted. The code passed review. It didn't hold. That rework is invisible in sprint velocity and visible only if you track what happens to a commit in the 14 days after it merges.
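
A sketch of that lagging indicator over invented commit records. File overlap stands in for line-level rework detection, which is a coarse assumption; real data would come from `git log` or your Git host.

```python
from datetime import datetime, timedelta

# Invented commit records. A merge counts as revised if a later commit
# touches the same files within the window.
merged = [
    {"sha": "a1c3", "at": datetime(2026, 1, 5), "files": {"auth.py"}},
    {"sha": "b2d4", "at": datetime(2026, 1, 8), "files": {"billing.py"}},
]
later = [
    {"at": datetime(2026, 1, 12), "files": {"auth.py"}},     # rework, in window
    {"at": datetime(2026, 2, 20), "files": {"billing.py"}},  # outside window
]

WINDOW = timedelta(days=14)

def revised_within_window(m):
    return any(m["at"] < c["at"] <= m["at"] + WINDOW and m["files"] & c["files"]
               for c in later)

rate = sum(revised_within_window(m) for m in merged) / len(merged)
print(f"post-merge revision rate (14 days): {rate:.0%}")  # 50% in this toy data
```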

The Dashboard You Already Know Is Wrong

The Harness survey's core problem isn't a lack of awareness. Engineering leaders know what they're missing. They named it: tech debt, validation time, burnout. They just haven't built the infrastructure to measure it, and most don't believe they will.

That is a choice. Unmeasured costs don't disappear; they get absorbed somewhere. Sometimes it's developer burnout, one of the very factors 94% of leaders admit is missing from their metrics. Sometimes it's accumulated tech debt from AI-generated code that passed review but wasn't quite right. Sometimes it's the silent slowdown of teams where review bottlenecks compound over months until delivery velocity collapses.

The 89%/94% contradiction won't resolve itself. Leaders will continue to trust dashboards they already know are incomplete for as long as the alternative, rebuilding measurement infrastructure to capture invisible work, looks out of reach to the 94% who don't believe their frameworks can get there.

The Harness report's real finding isn't that AI has outpaced measurement frameworks. It's that the people running engineering organizations have accepted that their view of reality is materially incomplete, and are shipping product decisions anyway.

That's worth naming. A third of the work is invisible. The dashboards say otherwise. Both of those things are true, and 94% of engineering leaders already know it.

Written by Kevin — builder of xeve
