Trusting AI Code Is the Wrong Goal

The 2025 Stack Overflow Developer Survey landed its most quoted number in years: 84% of developers use or plan to use AI coding tools, and only 3% "highly trust" their output. The coverage framed this as a paradox. It isn't one. Trust is the wrong word.

You don't trust your linter. You don't trust git. You understand what these tools do, you've built intuitions about when they produce output you can act on and when they produce output you need to investigate, and you've designed workflows that account for their specific failure modes. That's not trust. It's operational knowledge accumulated through measurement.

The 66% of developers who cite "AI solutions that are almost right, but not quite" as their top frustration aren't struggling with a trust deficit. They're struggling with a measurement deficit. They don't know enough about when AI fails for their specific workload to design around it.

"Almost Right" Is a Measurement Problem

Wrong code fails obviously. The compiler rejects it, the tests catch it, the runtime throws. You know immediately that you have a problem.

Almost right code is different. It passes the type checker. It looks plausible in review. It fails under a specific condition you didn't know to test for, or it solves the problem you described rather than the problem you had. 45% of developers report that debugging AI code takes longer than debugging code they wrote themselves — not because AI-generated code is harder to read in isolation, but because it's harder to know where to look when the surface looks fine.

This failure mode is specific to a class of AI output: code generated for an underspecified prompt, or for a prompt that accurately describes a task but omits a constraint the developer held implicitly. The model produces something that satisfies the spec and misses the intent.

The frequency of this failure mode isn't constant. It varies by task type, by prompt quality, by codebase complexity, and by how well the model has been trained on the specific framework or pattern you're using. It also varies by developer. Someone who has been working in the same codebase with the same tools for eighteen months has calibrated intuitions about which tasks produce reliable output and which require extra scrutiny. Someone in month two is still collecting failure data.

The problem is that almost nobody collects that failure data systematically.

What the Acceptance Rate Actually Measures

When GitHub launched the Copilot app on June 2nd, the live dashboard showed every agent session in flight: active agents, open pull requests, tasks in progress. It is a technically impressive view of what the agents are doing. What it doesn't show: how many of those pull requests come back for revision once they merge, what the average time between suggestion and shipment looks like when you include the correction loop, and whether the tasks where AI helps most are the ones where developers actually spend most of their working day.

Acceptance rate — the metric most AI coding vendors surface prominently — measures how often a developer accepted a suggestion. A suggestion accepted and then revised is counted the same as a suggestion accepted and shipped. The post-acceptance delta, the time spent closing the gap between "almost right" and "correct," is invisible to the dashboard.
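To make the gap concrete, here is a minimal sketch in Python. The `Suggestion` record and its fields are hypothetical, not the schema of any vendor dashboard, and the sample log is invented. The only point is that two metrics computed from the same events can diverge sharply.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One AI suggestion. All fields are hypothetical, for illustration only."""
    accepted: bool
    revised_after_accept: bool  # developer had to fix it before it shipped
    correction_minutes: float   # time spent closing the "almost right" gap


def acceptance_rate(suggestions):
    """Share of suggestions the developer accepted (what dashboards surface)."""
    if not suggestions:
        return 0.0
    return sum(s.accepted for s in suggestions) / len(suggestions)


def shipped_unchanged_rate(suggestions):
    """Share of suggestions that were accepted and shipped with no revision."""
    if not suggestions:
        return 0.0
    clean = sum(s.accepted and not s.revised_after_accept for s in suggestions)
    return clean / len(suggestions)


def post_acceptance_minutes(suggestions):
    """Total correction time that the acceptance rate never sees."""
    return sum(s.correction_minutes for s in suggestions if s.accepted)


# Invented sample: five suggestions, four accepted, two of those revised.
log = [
    Suggestion(True, False, 0.0),
    Suggestion(True, True, 25.0),
    Suggestion(True, True, 40.0),
    Suggestion(True, False, 0.0),
    Suggestion(False, False, 0.0),
]

print(acceptance_rate(log))          # 0.8
print(shipped_unchanged_rate(log))   # 0.4
print(post_acceptance_minutes(log))  # 65.0
```

In this made-up log the acceptance rate is 80%, yet only 40% of suggestions shipped untouched and 65 minutes of correction work sits outside the headline number.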

This isn't accidental. The dashboard is built to justify the tool. Acceptance rate at 80% looks strong in a renewal conversation. What doesn't look strong: acceptance-to-ship time that's 40% longer than code written from scratch because each accepted suggestion generated a correction that wasn't counted.

McKinsey's February 2026 estimate puts AI time savings at 3.6 hours per week on average across developers. In the Stack Overflow survey, 45% say debugging AI output takes longer than debugging their own code. Both numbers can be accurate simultaneously if the distribution is wide enough: some developers save substantially more than 3.6 hours, some lose time, and the average sits positive while the variance is enormous. The average is real. Its relevance to any individual developer is close to zero without knowing which side of the distribution they're on.
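A quick worked example of how that can happen. The ten values below are invented so that the mean lands on 3.6 hours; they are not survey data, just one distribution that is consistent with the headline figure.

```python
from statistics import mean, pstdev

# Hypothetical weekly hours saved (negative = time lost) for ten developers.
# Invented so the mean equals 3.6; not taken from any survey.
hours_saved = [12.0, 9.0, 8.0, 6.0, 5.0, 3.0, 1.0, -1.0, -3.0, -4.0]

average = mean(hours_saved)
spread = pstdev(hours_saved)  # population standard deviation
losing_time = sum(1 for h in hours_saved if h < 0)

print(f"mean saved:  {average:.1f} h/week")
print(f"std dev:     {spread:.1f} h/week")
print(f"losing time: {losing_time} of {len(hours_saved)} developers")
```

Here the mean is a positive 3.6 hours, the standard deviation is larger than the mean, and three of the ten developers are net losers. Nothing about the average tells you which group you belong to.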

What "Highly Trust" Actually Tells You

The 3% who "highly trust" AI-generated code are not necessarily the developers extracting the most value from it. High trust without calibration data is overconfidence. The same survey found that developers who report consistent productivity gains from AI tools tend to describe the same pattern: they've accumulated enough failure data to know which tasks benefit from AI assistance and which require them to write the code themselves.

They know their AI tool reliably produces good output for CRUD endpoints and struggles with complex stateful logic. Or the reverse. Or something entirely specific to their stack and their prompting habits. They've stopped asking "do I trust this output" and started asking "is this the kind of task where this tool fails for me."

That's operational knowledge. It comes from measurement. And it doesn't transfer — it doesn't carry over to a new tool, a new codebase, or a new class of task without starting the data collection over.

The developers stuck in the 66% frustration bucket aren't there because the tools are bad. They're there because "almost right" is what you get when you use an AI tool without enough calibration data to route tasks appropriately. The tool behaved like a tool. The workflow didn't account for its failure modes.

What to Measure Instead

If you're tracking your development sessions at the system level — how long tasks actually take, when you're in editor versus terminal, where your focused work blocks happen to fall — the shape of the "almost right" problem becomes visible without requiring any instrumentation of the AI tool itself.

Sessions that involve extended AI generation followed by sustained terminal and browser time are a footprint of the correction loop. Sessions where AI output flows directly to a commit are a footprint of tasks where the tool works for you. Over enough sessions, the task types that produce each pattern separate out.

You end up with something more useful than a trust level: a task-type map. The categories where AI assistance compresses your work cycle, and the categories where it adds a verification step that costs more than it saves. That map is personal. It's yours specifically, for your codebase and your prompting habits and your error-detection reflexes. An industry survey can't produce it for you.
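One way such a map could be assembled, sketched in Python. The `Session` fields, the task labels, and the ratio threshold are all assumptions for illustration; you would tune the threshold against your own history rather than treat 1.0 as meaningful.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    """A hypothetical session summary; field names are illustrative."""
    task_type: str
    ai_minutes: float        # time spent generating code with the AI tool
    followup_minutes: float  # terminal + browser time before the commit


def is_correction_loop(session, ratio_threshold=1.0):
    """Flag sessions where follow-up time outweighs AI generation time.

    The threshold is an assumption, not an established cutoff.
    """
    if session.ai_minutes <= 0:
        return False
    return session.followup_minutes / session.ai_minutes > ratio_threshold


def task_type_map(sessions, ratio_threshold=1.0):
    """Return {task_type: share of sessions that hit the correction loop}."""
    totals = defaultdict(int)
    loops = defaultdict(int)
    for s in sessions:
        totals[s.task_type] += 1
        if is_correction_loop(s, ratio_threshold):
            loops[s.task_type] += 1
    return {task: loops[task] / totals[task] for task in totals}


# Invented history: two task types with different footprints.
history = [
    Session("crud_endpoint", 10, 3),
    Session("crud_endpoint", 12, 4),
    Session("crud_endpoint", 8, 20),
    Session("state_machine", 15, 45),
    Session("state_machine", 10, 30),
]

for task, share in sorted(task_type_map(history).items()):
    print(f"{task}: {share:.0%} of sessions hit the correction loop")
```

With this invented history, CRUD work hits the correction loop in one session out of three and the state-machine work hits it every time. A real map needs far more sessions per category before the shares mean anything.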

At xeve, we track development sessions across your full environment — editor time, context switches, debugging patterns, commit cadence. One of the cleaner signals that surfaces is the ratio of editor-focused time to terminal-and-browser time within a session. Sessions that tip heavily toward terminal time after AI-heavy coding blocks are the "almost right" tax in observable form. We can't see what the AI produced, but we can see what the session looked like afterward.
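As a rough illustration of that ratio, and not a description of how xeve computes it, here is a minimal Python sketch. The `(category, minutes)` event format and the category names are hypothetical.

```python
def focus_ratio(events):
    """Ratio of editor time to terminal-plus-browser time for one session.

    `events` is a list of (app_category, minutes) tuples. The category
    names are hypothetical labels, not the output of any particular tracker.
    Returns None when there is no terminal or browser time to compare against.
    """
    editor = sum(minutes for cat, minutes in events if cat == "editor")
    other = sum(minutes for cat, minutes in events if cat in ("terminal", "browser"))
    if other == 0:
        return None
    return editor / other


# Invented session: 40 editor minutes against 60 terminal and browser minutes.
session = [
    ("editor", 30),
    ("terminal", 25),
    ("browser", 15),
    ("editor", 10),
    ("terminal", 20),
]

ratio = focus_ratio(session)
print(f"editor : terminal+browser = {ratio:.2f}")
```

A ratio well below 1 after an AI-heavy block suggests most of the session went to checking and fixing rather than writing. What counts as "well below" depends on your own baseline.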

That signal won't tell you whether to trust your AI tool. But it'll tell you something more useful: whether the tasks you're using it for are the ones where it saves time or costs it.

Trust is a feeling. That's a measurement.

Written by Kevin — builder of xeve
