On June 1, GitHub Copilot switches to usage-based billing. The new admin dashboard will surface what each developer consumed: tokens burned, AI credits spent, acceptance rate per engineer. Two weeks ago, Linear launched Code Intelligence, which logs codebase query counts by month. Harness's State of Engineering Excellence report found that 89% of engineering leaders trust their AI productivity metrics.
Every major dev tool is racing to ship an AI usage dashboard. They all measure the same thing. None of them tell you whether your team got more work done.
What Acceptance Rate Actually Measures
GitHub defines acceptance rate as the proportion of AI suggestions a developer keeps in the moment. An engineer with an 85% acceptance rate accepted most of what Copilot wrote. The implication is that a high acceptance rate means high value from the tool.
This logic has a specific empirical problem.
Jellyfish tracked 7,548 engineers through Q1 2026. Nominal acceptance rates in their dataset ran 80-90%. The actual acceptance rate — once post-merge revisions in the two weeks after commit were counted — collapsed to 10-30%. The code was accepted at commit time. It didn't hold. Revisions kept arriving, not because developers were careless, but because AI-generated code surfaces deeper problems after it's been running in context with the rest of the system for a few days.
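To make the gap between the two numbers concrete, here is a minimal sketch of how a nominal rate and a "did it hold" rate could be computed from the same records. This is not Jellyfish's methodology, which the figures above don't describe in detail. The `Suggestion` record, its field names, and the sample data are all hypothetical, chosen only so the output lands in the ranges quoted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Suggestion:
    """Hypothetical record for one AI suggestion (not a real vendor schema)."""
    accepted: bool
    merged_at: Optional[datetime] = None
    revised_at: Optional[datetime] = None  # first post-merge revision, if any


def nominal_rate(suggestions):
    """Share of suggestions the developer accepted when they arrived."""
    if not suggestions:
        return 0.0
    return sum(s.accepted for s in suggestions) / len(suggestions)


def durable_rate(suggestions, window=timedelta(days=14)):
    """Share of suggestions accepted, merged, and not revised within `window`."""
    if not suggestions:
        return 0.0

    def held(s):
        if not s.accepted or s.merged_at is None:
            return False  # assumption: unmerged code never counts as holding
        if s.revised_at is None:
            return True
        return s.revised_at - s.merged_at > window

    return sum(held(s) for s in suggestions) / len(suggestions)


# Illustrative data only: 9 of 10 accepted, 7 of those revised three days later.
merged = datetime(2026, 3, 2)
sample = (
    [Suggestion(True, merged, merged + timedelta(days=3))] * 7
    + [Suggestion(True, merged, None)] * 2
    + [Suggestion(False)]
)

print(nominal_rate(sample))  # 0.9
print(durable_rate(sample))  # 0.2
```

Both functions divide by the same denominator, so the only thing that changes between the two numbers is the definition of "accepted."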
Acceptance rate at commit measures how much the developer liked the suggestion when it arrived. That is a useful signal about tool engagement. It is not a useful signal about whether the work was right.
A developer with a 95% acceptance rate who ships one feature per week is less productive than a developer with a 40% acceptance rate who ships three. The dashboards don't know that. The billing system sees the first developer as an AI power user and the second as someone not getting full value from the subscription.
Why The Industry Is Measuring This
There's a structural reason AI usage metrics have become the default productivity signal. The data is already in the billing system.
Once Copilot moves to usage-based billing, Anthropic, Microsoft, and GitHub will all have precise records of what every developer consumed at the session level. Surfacing that as a productivity dashboard is a short engineering step from data that already exists for invoicing. It's the path of least resistance: the numbers are there, organizations want to justify the spend, and "your acceptance rate improved 12% this quarter" is a compelling story to tell in a budget review.
That convenience is worth naming, because it shapes what gets measured. Organizations don't measure AI usage because it's the best signal for developer productivity. They measure it because it arrived attached to their billing data.
Linear's codebase query counts follow the same logic. The feature exists; logging usage is trivial; growth in queries looks like adoption success. When Linear tells you your agent made 5,200 codebase queries in May — up from 4,267 in April — that's a product adoption metric dressed as a productivity metric. More queries could mean more useful work done. It could also mean the queries aren't landing and people are asking the same question multiple ways.
The Harness survey is where this gets uncomfortable. 89% of engineering leaders say their AI productivity metrics accurately reflect what's happening. 94% of the same group say tech debt, validation time, and burnout are missing from their measurements. You cannot hold both views coherently unless "accurate" means "accurately reflects AI tool engagement" rather than "accurately reflects whether work is getting done." That reframing is exactly what most AI productivity dashboards are doing, often without saying so.
What We Built Instead
When I was designing xeve, I chose not to give AI tools a special category.
In your xeve data, Claude Code and Cursor appear as applications — the same way your browser, terminal, Slack, and music player appear. They're not a separate productivity tier with their own scoring. They're apps that consume your time during a working session. Whether you're running Claude Code, reading documentation, debugging in the terminal, or on a Zoom call, xeve tracks it as time spent in whatever you're actually doing.
This is less impressive to demo than a dashboard with an "AI productivity score." But it reflects something true about how work actually functions.
The ratio that matters is not "AI suggestions accepted per hour." It's active working time per unit of committed output — and both halves of that ratio need honest measurement. If you spent four hours with Claude Code open and committed nothing, that shows up as a four-hour session with no output. If you spent 90 minutes and shipped a feature, that shows up as a 90-minute session with a commit at the end. The AI was one tool among many in both cases. The session quality is what differs.
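As a rough sketch of that ratio, assuming a session can be reduced to active minutes plus a commit count: the `Session` type and `minutes_per_commit` function below are hypothetical names for illustration, not xeve's data model.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical summary of one working session."""
    active_minutes: float
    commits: int


def minutes_per_commit(sessions):
    """Active minutes per commit across sessions; None if nothing was committed."""
    total_minutes = sum(s.active_minutes for s in sessions)
    total_commits = sum(s.commits for s in sessions)
    if total_commits == 0:
        return None  # time was spent, but there is no output to divide by
    return total_minutes / total_commits


long_session = Session(active_minutes=240, commits=0)  # four hours, nothing committed
short_session = Session(active_minutes=90, commits=1)  # 90 minutes, one commit

print(minutes_per_commit([long_session]))                 # None
print(minutes_per_commit([short_session]))                # 90.0
print(minutes_per_commit([long_session, short_session]))  # 330.0
```

Returning `None` for the zero-commit case is a deliberate choice in this sketch: a session with no output is reported as such instead of being folded into an average that hides it.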
xeve's correlation engine computes the relationship between your session patterns and your output signals: commit frequency, session length before a commit, time distribution across tools during a productive session versus an unproductive one. Those correlations are personal — they reflect how you specifically work, which varies considerably from the aggregate numbers in industry reports. But they're more actionable than an acceptance rate, because they're about your work, not about your engagement with a tool.
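The statistic behind those correlations isn't specified here, so treat the following as one plausible illustration, not a description of xeve's engine. It computes a plain Pearson correlation between a per-session feature (context switches) and an output signal (commits). The numbers are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


# Illustrative per-session values for one developer (hypothetical data).
context_switches = [2, 5, 9, 14]
commits = [3, 2, 1, 0]

r = pearson(context_switches, commits)
print(round(r, 3))  # strongly negative for this made-up sample
```

A handful of sessions is far too few to draw conclusions from, and a correlation like this says nothing about cause. It is a prompt for a closer look at one person's own sessions, not a score.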
What The Pattern Actually Shows
When you track at the system level instead of the AI-usage level, the picture of "high AI productivity" developers looks different from what acceptance-rate dashboards suggest.
Many developers who score well on AI engagement metrics show fragmented session patterns at the system level: more total active time per commit, more context switching between coding agents and browser tabs and Slack, longer elapsed time from task start to commit. The AI accelerated the writing phase while making the surrounding workflow more fragmented. The acceptance rate went up; the session quality went sideways.
Developers who score worse on AI engagement often show cleaner patterns: shorter sessions per commit, less context switching within a focused block, a tighter loop between writing, reviewing, and shipping. They're accepting fewer suggestions, but the suggestions they accept seem to need less cleanup.
None of this is visible in a billing dashboard. It's not even visible in most time-tracking tools, because most track only editor activity. The session context — what you were doing in the 20 minutes before you started typing, how many times you switched to a browser tab mid-session, how long the gap between writing and committing was — doesn't exist in IDE-based metrics. It exists in system-level tracking.
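A small sketch of what that session context could look like as data, assuming a system-level tracker emits a log of `(timestamp, app)` focus events. The event format and function names are hypothetical and only meant to show that switch counts and pre-typing time fall out of such a log directly.

```python
from collections import defaultdict
from datetime import datetime


def count_switches(events):
    """Number of times focus moved from one app to a different one."""
    apps = [app for _, app in sorted(events, key=lambda e: e[0])]
    return sum(1 for prev, cur in zip(apps, apps[1:]) if prev != cur)


def minutes_by_app(events, session_end):
    """Minutes of focus per app; each event lasts until the next one starts."""
    ordered = sorted(events, key=lambda e: e[0])
    totals = defaultdict(float)
    for (start, app), (end, _) in zip(ordered, ordered[1:] + [(session_end, None)]):
        totals[app] += (end - start).total_seconds() / 60
    return dict(totals)


# Hypothetical focus log for one morning session.
events = [
    (datetime(2026, 5, 4, 9, 0), "browser"),    # 20 minutes of reading first
    (datetime(2026, 5, 4, 9, 20), "editor"),
    (datetime(2026, 5, 4, 9, 35), "slack"),
    (datetime(2026, 5, 4, 9, 40), "editor"),
    (datetime(2026, 5, 4, 10, 0), "terminal"),  # commit happens here
]
session_end = datetime(2026, 5, 4, 10, 10)

print(count_switches(events))               # 4
print(minutes_by_app(events, session_end))  # browser 20, editor 35, slack 5, terminal 10
```

An editor-only tracker would see the 35 minutes in the editor and nothing else from this session.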
You Optimize What You Measure
The more important reason to care about which metrics get built into dashboards: measurement changes behavior.
When engineering leaders receive acceptance-rate reports, developers eventually learn what generates good numbers. They accept more suggestions. They run more agent sessions. The dashboard improves. Whether the work improves is a separate question that no one is tracking.
This is the same dynamic that drove the DORA metric gaming problem in CI/CD — organizations that started tracking deployment frequency as a productivity signal found that teams learned to deploy more often in ways that didn't correlate with shipping more value. Frequency went up; impact stayed flat. The metric moved; the underlying thing didn't.
Acceptance rate, token spend, and query count are all gameable in the same way. They'll go up when organizations start reporting on them. The code review cascades, the post-merge revisions, the sessions that consumed hours and produced nothing — those don't appear in the metric. They appear in your actual working time, if anything is measuring it.
The industry needed a way to justify AI tool spend in budget reviews, and usage dashboards provide that. They're accurate as far as they go. The problem is where they stop. They stop at the boundary of the tool. Everything that happened to you as a developer — the context switching, the review burden, the session quality, the actual output — is outside that boundary.
That's the thing worth measuring. It just doesn't come bundled with the billing data.