Satisfied With Copilot. Saving Less Than an Hour a Week.

86% of developers are satisfied with GitHub Copilot. 60% are saving less than one hour of work per week. The correlation between satisfaction and time saved is 0.34.

This comes from "Beyond the Commit," a study presented at ICSE 2026 in April by researchers at BNY Mellon. Survey responses from 2,989 developers. The headline statistic is striking enough: high satisfaction, low time savings, weak correlation. But the real finding is in what those developers are actually measuring when they report satisfaction — and it is not productivity.

What Satisfaction Actually Captures

The study identified six factors developers use to evaluate AI coding assistants, grouped into short-term and long-term dimensions.

Short-term: self-sufficiency (less context-switching to docs and Stack Overflow), reduced cognitive load, task completion rate, peer review ease. Long-term: development of technical expertise over time, and sense of ownership over the code.

When developers report high satisfaction, they are mostly reacting to the first four. Copilot keeps them in their editor. It reduces the friction of looking things up. It makes the cursor move. These are real improvements to the texture of work, and they are genuinely worth valuing.

They are not the same as shipping more code or shipping it faster. And they say nothing about the long-term factors — the ones that only become visible over months.

The 86% satisfaction number is not wrong. It is measuring a real thing. It is just not measuring what most organizations assume when they cite it in a business case.

The 400-Developer Puzzle

The study's most interesting data point is a specific subset. More than 400 developers reported high satisfaction with AI tools while also reporting that they saved less than one hour per week from using them. These developers were not confused or being polite. They were accurately describing two separate things: they liked how the tool felt, and the tool did not significantly change their output volume.

The inverse is equally striking: 100-plus developers who reported saving two or more hours per week remained neutral or dissatisfied. By their own account, they were getting real time back. Still not satisfied.

What are the dissatisfied time-savers reacting to? The study's interview data points at the long-term factors. One participant described the mechanism directly: "If the code just works, then you just accept it." They were getting output faster and building less understanding in the process. The time savings were real. What they were losing — the sense of ownership, the accumulation of expertise from struggling through a problem — was also real, and their satisfaction rating picked it up even when the time-savings number looked fine.

This is not a niche concern. Anthropic's February study found the same pattern at the behavioral level: developers who generated and accepted code finished faster and scored 25 percentage points lower on comprehension tests than developers who stopped to interrogate the output. The BNY Mellon paper gives that dynamic a name: the ownership and expertise factors, which sit in the long-term dimension of the framework and do not show up in the metrics teams currently track.
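
If you run your own survey, the two discordant groups described above are easy to pull out. Here is a minimal sketch. The `Response` fields, the 1-to-5 satisfaction scale, and the "4 or higher counts as satisfied" cutoff are all my assumptions, not the study's schema, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One survey answer. Hypothetical fields, not the study's schema."""
    satisfaction: int            # 1 (very dissatisfied) to 5 (very satisfied)
    hours_saved_per_week: float  # self-reported


def classify(r: Response) -> str:
    """Label a response by whether satisfaction and time savings disagree."""
    satisfied = r.satisfaction >= 4  # assumed cutoff: 4 or 5 counts as satisfied
    if satisfied and r.hours_saved_per_week < 1:
        return "satisfied, saves under 1h"
    if not satisfied and r.hours_saved_per_week >= 2:
        return "not satisfied, saves 2h or more"
    return "other"


def discordant_counts(responses: list[Response]) -> Counter:
    """Count how many responses fall into each group."""
    return Counter(classify(r) for r in responses)


# Made-up responses for illustration only.
sample = [
    Response(5, 0.5),
    Response(4, 0.0),
    Response(3, 2.5),
    Response(2, 3.0),
    Response(5, 4.0),
    Response(1, 0.5),
]
counts = discordant_counts(sample)
print(counts)
```

The useful part is not the counts themselves but who lands in each group. Those are the people worth interviewing.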

What Organizations Are Actually Measuring

GitHub publishes satisfaction data for Copilot. Microsoft's AI productivity reports lead with satisfaction figures. Every major AI coding tool's case study uses satisfaction as primary evidence. A 0.34 correlation suggests satisfaction is nearly worthless as a productivity proxy.

This matters most for engineering leaders making tooling decisions at scale. If your evaluation of AI ROI is based primarily on developer happiness surveys, you have a loosely correlated signal for the thing you are trying to measure (output), and complete silence on the long-term factors that determine whether your engineers can still handle complex debugging in two years.

The existing frameworks do not help. DORA measures deployment frequency, change failure rate, recovery time, and lead time. SPACE measures satisfaction, performance, activity, collaboration, and efficiency. Both capture some short-term factors. The BNY Mellon paper explicitly notes that neither captures the long-term technical expertise dimension. They were designed before AI coding assistants were a significant part of developer workflows. They do not have a column for whether a developer's understanding of their codebase is getting deeper or shallower over time.

This is not a knock on DORA or SPACE — they do what they were designed to do. It is a structural measurement gap for a problem those frameworks predate.

The Small Team Version

For solo developers and small teams, the satisfaction/productivity disconnect is almost irrelevant — you find out quickly whether you can debug your own code. You feel the ownership question directly because you are the person on call when something breaks at 2 AM.

The problem is more insidious in larger organizations where the person using the tool and the person responsible for production incidents are different, and where satisfaction surveys are a practical shortcut because direct output measurement is hard.

What would actually close this gap is not a better survey. It is tracking time spent in code review separately from time spent in generation. It is looking at how long debugging sessions run on AI-generated code versus developer-written code. It is watching whether senior engineers are spending more or less time on architecture work, the kind that requires deep codebase ownership and that satisfaction metrics do not capture.

None of these are easy to instrument. They require time tracking across all development activities — not just what ends up in a commit, but the full session from editor-open to merge. They require connecting tool usage to output signals with enough resolution to show change over months, not weeks.
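
To make the debugging comparison concrete, here is a rough sketch of what the analysis could look like once you have the data. It assumes the hard part is already solved: that each debugging session is tagged with where the code came from. The `DebugSession` record, the `"ai"` / `"human"` labels, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class DebugSession:
    """Hypothetical record; assumes the origin of the debugged code is known."""
    origin: str     # "ai" or "human" (assumed labels)
    minutes: float  # wall-clock time from starting the investigation to the fix


def median_minutes_by_origin(sessions):
    """Return {origin: median session length in minutes}."""
    by_origin = {}
    for s in sessions:
        by_origin.setdefault(s.origin, []).append(s.minutes)
    return {origin: median(values) for origin, values in by_origin.items()}


# Made-up sessions for illustration only.
sessions = [
    DebugSession("ai", 30), DebugSession("ai", 50), DebugSession("ai", 70),
    DebugSession("human", 20), DebugSession("human", 40),
]
result = median_minutes_by_origin(sessions)
print(result)
```

I use the median rather than the mean because one all-night incident would otherwise dominate a small sample. The comparison only means something if the two groups of sessions cover similar kinds of bugs.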

What to Actually Ask Yourself

The six-factor framework from the BNY Mellon paper gives you better questions than a satisfaction survey.

Does this tool reduce context switching — does it actually keep me in my editor longer? That is measurable if you track the full session, not just active coding time.

Is it moving my output? Commit and PR data over four to eight weeks will tell you more than a single week's impression.

Is my understanding of the code I own holding steady? Debugging speed on familiar systems is a ground-truth signal. If complex incidents in code you work on every day are taking longer to resolve than they used to, something is slipping.

When something breaks unexpectedly, does it feel like my system or a foreign one? That feeling is information. The developers in the BNY Mellon data who were saving two hours a week but still dissatisfied were detecting something real: the sense that the code was produced by a process they did not fully participate in.
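
For the output question, a multi-week comparison can be as simple as counting merged PRs per week before and after you started using the tool. A minimal sketch, assuming you can export merge dates from your Git host. The dates and the two four-week windows below are invented, and PR count is a crude proxy for output at best.

```python
from collections import Counter
from datetime import date


def weekly_counts(merge_dates):
    """Count merged PRs per ISO week, keyed by (iso_year, iso_week)."""
    counts = Counter()
    for d in merge_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def mean_per_week(counts, weeks):
    """Average over a fixed list of weeks, counting empty weeks as zero."""
    return sum(counts.get(w, 0) for w in weeks) / len(weeks)


# Made-up merge dates for illustration only.
merges = [
    date(2025, 1, 7), date(2025, 1, 9), date(2025, 1, 14),
    date(2025, 1, 21), date(2025, 1, 23), date(2025, 1, 28),
    date(2025, 2, 4), date(2025, 2, 5), date(2025, 2, 11),
    date(2025, 2, 13), date(2025, 2, 18), date(2025, 2, 25),
    date(2025, 2, 27),
]
counts = weekly_counts(merges)

before_weeks = [(2025, w) for w in range(2, 6)]  # assumed "before" window: ISO weeks 2-5
after_weeks = [(2025, w) for w in range(6, 10)]  # assumed "after" window: ISO weeks 6-9

before = mean_per_week(counts, before_weeks)
after = mean_per_week(counts, after_weeks)
print(before, after)
```

Passing the week list explicitly matters: a week with zero merges has to count as zero, not silently drop out of the average.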

The Correlation Says Something Specific

The 0.34 number is not just "weak." Squared, it means satisfaction and time savings share roughly 12% of their variance (0.34² ≈ 0.12). The other 88% of the variation in one is not explained by the other. These are largely different measurements of different things.
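
The arithmetic behind that 12% is just the correlation squared. The sketch below assumes the 0.34 is a Pearson-style coefficient, where r² reads as shared variance. If the study used a rank correlation, the same reading applies to the ranks instead.

```python
def shared_variance(r: float) -> float:
    """Fraction of variance two variables share under a linear correlation r."""
    return r ** 2


r = 0.34                     # correlation quoted in this post
shared = shared_variance(r)  # 0.34 * 0.34 = 0.1156
unshared = 1 - shared

print(f"shared: {shared:.1%}, not shared: {unshared:.1%}")
# prints: shared: 11.6%, not shared: 88.4%
```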

The implication is not that satisfaction is useless. Reduced cognitive load is worth having on its own terms. Staying in your editor rather than bouncing to twelve browser tabs has compounding benefits. These are legitimate reasons to be satisfied with a tool.

The implication is that satisfaction surveys should not be the evidence you use to evaluate whether a tool is improving your output, and they definitely should not be the evidence organizations use to mandate tool adoption, justify seat costs, or assess long-term team capability.

86% of developers being satisfied with Copilot is real. It is just measuring the part of the framework that feels good in the moment. The study's participants who were building less expertise and losing their sense of ownership — they were satisfied too.

The 0.34 is the gap between those things.

Written by Kevin — builder of xeve
