
Coding Agents Are Twice as Effective. Developers Still Want the Copilot.

June 8, 2026 · 5 min read

A controlled study from Carnegie Mellon and All Hands AI, presented at CHI 2026, ran developers through equivalent tasks with either a copilot or a coding agent and measured the outcomes. Agents completed tasks correctly 60% of the time. Copilots: 25%. Agents also cut the average time to completion roughly in half — 12.5 minutes versus 25.1 minutes. Then the researchers asked which tool participants would prefer to keep using. 60% chose the copilot.

That isn't a flaw in the study. It's the finding.

What the Study Measured

"Code with Me or for Me?" is the first controlled experiment comparing developer interactions with copilots versus coding agents — not benchmarks, not synthetic tasks, but actual developers working through tasks while researchers measured success rate, time on task, cognitive load, satisfaction, and comprehension of the output.

The productivity numbers for agents are not subtle. A 35-percentage-point improvement in task success. Task time cut in half. 70% of participants completed tasks they said would have been impossible with just a copilot. 75% reported lower cognitive load during the session.
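For readers who want to check the arithmetic, here is a small sketch using only the figures quoted above. It separates the percentage-point gap from the relative improvement, which are easy to conflate; the variable names are illustrative and not taken from the study.

```python
# Figures as quoted in this article (success rates in percent, times in minutes).
agent_success_pct = 60
copilot_success_pct = 25
agent_minutes = 12.5
copilot_minutes = 25.1

# Percentage-point gap: a simple difference between the two rates.
point_gap = agent_success_pct - copilot_success_pct

# Relative improvement: how many times more often agents succeeded.
relative_gain = agent_success_pct / copilot_success_pct

# Speedup: copilot time divided by agent time.
speedup = copilot_minutes / agent_minutes

print(f"gap: {point_gap} points")
print(f"relative success: {relative_gain:.1f}x")
print(f"speedup: {speedup:.2f}x")
```

The 35-point gap and the roughly 2x speedup are the numbers the rest of this post refers to; the relative success figure is just another way to read the same two rates.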

By any reasonable productivity definition, agents won. Then the preference numbers came back the other direction.

The Word Developers Used

The researchers captured why. 55% of participants said they better understood what copilots produced. Agents felt "hard to control." Their outputs were described as "kind of hidden." One word that kept coming up was "autopilot" — used as a descriptor, not a compliment.

"Like autopilot" is the tell. Autopilot is genuinely better at many phases of flying: smoother, more accurate, less fatiguing during cruise. But a pilot who delegates every phase to autopilot and then has to hand-fly a crosswind landing is in a worse position than someone who stayed engaged. The handoff back to manual control is harder than if they'd never disengaged.

The comprehension problem showed up in satisfaction scores too. Although agents completed tasks at a 60% success rate against the copilots' 25%, the two tools produced roughly comparable satisfaction ratings. Developers finishing a copilot-assisted task they understood scored the experience similarly to developers finishing an agent-completed task they didn't fully follow. Success and comprehension are different outcomes, and the CHI data shows developers weight both of them.

Code Is Not a Task

A task has a completion state. Code doesn't. A function you ship on Tuesday gets debugged on a Thursday two months later. A component you merge gets extended by someone who wasn't there when it was written. A service you deploy gets reviewed in a PR where a teammate needs to understand the design decisions embedded in it.

When a copilot helps you write code, you write the code. When an agent completes a task, you receive code. The distinction determines your relationship to that artifact for its entire lifespan in the codebase.

Receiving agent-generated code is closer to inheriting a PR from a contractor who moved on. You can accept it, deploy it, ship it, and it will often work fine. But you now own something you didn't fully build. The next time something breaks in that area, or a colleague asks why a specific decision was made, or you need to extend it in a non-obvious direction, you're working from partial understanding. You can debug the surface. You can't debug the reasoning.

The 60% who preferred copilots aren't making an irrational choice. They're expressing a preference for code they can own at the level the job actually requires — not just today, but next month.

Where Agents Clearly Win

None of this is an argument against agents. The CHI findings make clear there are tasks where agent output is the right answer and copilot output is simply slower and less reliable.

Environment setup is the clearest case. It becomes fully automated with agents, requires extensive manual work with copilots, and nobody needs to understand each step afterward. Test generation for coverage is similar: you care that tests pass and catch regressions, not that you understand every assertion. Boilerplate that follows a well-defined pattern (CRUD endpoints, migration files, serializers) is the same, as are scripts that run once.

These have real completion states. You don't need to maintain the reasoning behind them. The "kind of hidden" quality costs almost nothing, and the 2x speed with a 35-point success-rate improvement is captured cleanly.

The decisive factor is whether the code has a life after the task ends. Disposable or single-purpose code: agents win clearly. Code that gets debugged, extended, reviewed, and explained to teammates: the tradeoff is real, and the CHI data suggests most developers are correctly intuiting it.

What Changes in the Debugging Loop

The study documented something specific about how workflows shifted with agents. Debugging moved from developer-led to agent-led. In the session, this looked like efficiency — less time running test suites and reading stack traces. Task completion time dropped.

The question is what happens to debugging competence over time if the loop consistently belongs to the agent. The developers in the study who found agents "hard to control" weren't complaining about capability. They were reporting a felt loss of legibility — not knowing enough about what happened to take over confidently if they needed to.

This is different from what the METR studies document. METR's finding is that developers overestimate what they ship. The CHI finding is that developers correctly value something productivity metrics don't measure: the capacity to understand, maintain, and explain what they produced. Those aren't the same gap.

What to Actually Track

If you're running coding agents in your workflow, the meaningful signal isn't session completion rate or time saved during the task. Those measure the copilot-style metric — immediate output — applied to agent-produced artifacts.

The signal worth watching shows up later. How often do agent-completed files come back for edits within two weeks of merge? How long does debugging take in areas where an agent handled the original task versus areas where you wrote it yourself? When a teammate tries to onboard to a component, can they read it — or do they need you to explain what the agent did?
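As a rough illustration of the first question, here is a minimal sketch of a two-week revisit rate. It assumes you already record, per merged file, who produced it and when it was edited afterward; the `MergedFile` record, the `"agent"`/`"manual"` labels, and the sample data are all hypothetical, not something the study or any particular tool provides.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class MergedFile:
    path: str
    origin: str       # "agent" or "manual": hypothetical labels you assign yourself
    merged_on: date
    edited_on: list[date] = field(default_factory=list)  # edits made after the merge


def revisit_rate(files: list[MergedFile], origin: str,
                 window_days: int = 14) -> Optional[float]:
    """Share of files from `origin` that were edited again within the window."""
    window = timedelta(days=window_days)
    group = [f for f in files if f.origin == origin]
    if not group:
        return None  # nothing recorded for this origin, so no rate to report
    revisited = sum(
        1 for f in group
        if any(f.merged_on < d <= f.merged_on + window for d in f.edited_on)
    )
    return revisited / len(group)


# Made-up history, purely for illustration.
history = [
    MergedFile("api/users.py", "agent", date(2026, 5, 1), [date(2026, 5, 6)]),
    MergedFile("api/orders.py", "agent", date(2026, 5, 2), [date(2026, 6, 20)]),
    MergedFile("core/billing.py", "manual", date(2026, 5, 3)),
    MergedFile("core/auth.py", "manual", date(2026, 5, 4), [date(2026, 5, 30)]),
]

agent_rate = revisit_rate(history, "agent")
manual_rate = revisit_rate(history, "manual")
print(f"agent revisit rate: {agent_rate}, manual revisit rate: {manual_rate}")
```

A single rate says little on its own. The comparison between the two groups over several months is the part that would indicate whether agent-completed code generates more follow-on work.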

These are harder to capture without tracking where your time actually goes across sessions and files. But if you're recording which types of work generate revisit cycles, you can start to see whether agent-completed code is holding up or creating follow-on work that doesn't show up in the initial productivity numbers.

The 60% who prefer copilots might be wrong for their specific use case. They might also be accurately reading something the task-success metric misses: that a lower completion rate can be a price worth paying for better long-term ownership. The CHI study measured both productivity and satisfaction, and they pointed in different directions. Most developer tooling only watches one side of that.

Written by Kevin — builder of xeve
