Tokenmaxxing Is the New Commit Count

Large AI token budgets are becoming a status signal among Silicon Valley engineers. Massive token consumption means you're using the tools aggressively, running long agent sessions, letting Claude or Codex rip through your backlog. It's the new "I pushed 50 commits this sprint."

The Q1 2026 data is harder to be excited about.

Jellyfish tracked 7,548 engineers across the first quarter and published the numbers in April. Engineers with the largest token budgets produced the most pull requests — 2x the throughput of their lower-consumption peers. That part got picked up everywhere. The rest of the finding got less coverage: those PRs came at 10x the token cost. "In other words," Jellyfish wrote, "the tools are generating volume, not value."

That ratio matters more than either number alone.

The Acceptance Rate Illusion

There's a second finding that explains how the gap opens. Engineering managers are seeing AI code acceptance rates of 80-90% in their developer tools. That sounds like high-fidelity output — the agents are producing code that almost always lands. But Jellyfish found that the real-world acceptance rate, once you account for revisions needed in the weeks after the initial commit, falls to 10-30%.

The 80-90% figure is measuring whether the code compiled and passed the immediate review. It is not measuring whether the code was right. Developers are merging output that looks correct, then returning to revise it at high rates. The metric looked good. The work continued underneath it.
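The gap between the two rates is really a bookkeeping choice: whether a merge still counts as "accepted" once it gets revised shortly after landing. A minimal sketch of that distinction, with hypothetical data shapes and an assumed two-week revision window (the article only says "weeks"):

```python
from datetime import datetime, timedelta

REVISION_WINDOW = timedelta(weeks=2)  # assumed cutoff, not Jellyfish's exact method

def acceptance_rates(suggestions):
    """suggestions: dicts with 'merged' (bool), 'merged_at' (datetime or
    None), and 'revised_at' (datetime of first post-merge revision, or None)."""
    merged = [s for s in suggestions if s["merged"]]
    # Nominal rate: the suggestion landed. This is what the dashboard shows.
    nominal = len(merged) / len(suggestions)
    # Real rate: the suggestion landed AND wasn't revised inside the window.
    held = [
        s for s in merged
        if s["revised_at"] is None
        or s["revised_at"] - s["merged_at"] > REVISION_WINDOW
    ]
    real = len(held) / len(suggestions)
    return nominal, real

t0 = datetime(2026, 1, 5)
batch = (
    [{"merged": True, "merged_at": t0, "revised_at": t0 + timedelta(days=4)}] * 6
    + [{"merged": True, "merged_at": t0, "revised_at": None}] * 2
    + [{"merged": False, "merged_at": None, "revised_at": None}] * 2
)
nominal, real = acceptance_rates(batch)
# nominal = 0.8 (8 of 10 merged); real = 0.2 (only 2 of 10 held)
```

Same batch of suggestions, two very different numbers, and only the second one predicts how much rework is coming.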

This is the same problem commit count created a decade ago. Commit count is easy to optimize: commit smaller chunks, more often. It looks like more output. It can be more output. It can also be the same output with higher overhead, fragmented into smaller pieces that each pass review. Commit count as a productivity proxy collapsed because it measured activity, not progress.

Token consumption has the same structure. High token spend means a lot of AI generation happened. It says nothing about whether the generation was useful, whether the code shipped, or whether the shipped code held. Treating it as a productivity indicator — which teams are openly doing, which is why "tokenmaxxing" needed a name — is the commit count error applied to a more expensive input.

Why the Metric Spread

Reid Hoffman got asked about tokenmaxxing in mid-April and gave the honest answer for why it spread: it makes sense as an adoption metric. If you're trying to encourage developers to actually use AI tools, token consumption is a leading indicator. Are they running agents? Are they prompting? Are they in the workflow at all?

That was a legitimate question in 2024, when most developers weren't touching the tools. It's a less useful question in 2026, when 93% of developers use AI at least monthly. Adoption is not the bottleneck anymore. Effectiveness is.

When you shift from "are they using it" to "is it working," token spend stops being useful. It measures the same thing commit count did: that a person was active. Activity and output are positively correlated in most cases, but they can diverge badly when the activity is review overhead, rework, and agent sessions that ran long and produced nothing worth merging.

The Rework Math

The acceptance rate finding has a cost structure worth making explicit.

A developer with an 80-90% nominal acceptance rate on AI suggestions, but a 10-30% real-world rate after revisions, is spending the time that would have produced one accepted, stable block of code on producing several accepted, unstable ones instead. The initial acceptance event is fast — review, merge, done. The rework arrives later, during debugging or when the code fails under a condition nobody tested for.

That rework is invisible to most productivity metrics. It shows up as a new ticket, a bug fix PR, a debugging session that takes longer than expected. Nobody attributes it back to the original AI generation. The session looks like it produced code efficiently. The downstream cost lives in a different column.

Over a quarter across 7,548 engineers, that adds up to something. The 2x throughput at 10x cost isn't a curiosity. It's a signal that the efficiency gain is being partially consumed by rework, revision, and the overhead of managing volume that doesn't hold.
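To make the arithmetic concrete, here's a back-of-envelope sketch. The absolute numbers are invented; only the 2x throughput and 10x cost ratios come from the Jellyfish finding, and the assumption that the 10-30% hold rate applies uniformly to the cohort's output is mine:

```python
# Hypothetical cohorts consistent with the article's 2x / 10x ratios.
low_spend = {"prs": 50, "tokens": 1_000_000}
high_spend = {"prs": 100, "tokens": 10_000_000}  # 2x the PRs, 10x the tokens

per_pr_low = low_spend["tokens"] / low_spend["prs"]     # 20,000 tokens per PR
per_pr_high = high_spend["tokens"] / high_spend["prs"]  # 100,000 tokens per PR
cost_ratio = per_pr_high / per_pr_low                   # 5x per PR, before rework

# Fold in the real-world acceptance range: if only 10-30% of merges hold
# without revision, the token cost per *durable* PR rises accordingly.
for hold_rate in (0.10, 0.30):
    durable_prs = high_spend["prs"] * hold_rate
    per_durable = high_spend["tokens"] / durable_prs
    print(f"{hold_rate:.0%} hold rate: {per_durable:,.0f} tokens per durable PR")
```

Even before counting rework time, the high-spend cohort pays 5x per PR; counting only the PRs that hold, the per-unit cost multiplies again.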

What to Measure Instead

The question tokenmaxxing raises is really about what to use instead of token spend as a signal. The answer isn't one number — it's a ratio.

Output per unit of input. Specifically: committed, stable code relative to the developer hours that produced it. Not PRs merged (the Jellyfish finding shows PRs can be high while value is low). Not token consumption. Not lines of code, which has been known to be a bad proxy since the 1970s. The ratio of durable output to invested time.

Getting to that ratio requires data on both sides. Output comes from version control history — commits, PR cycle time, bug rate, post-merge revision rate. The time side requires more granularity than most teams have: not just hours logged to a project, but actual active coding time, broken down by tool and session, so you can see the review and rework overhead that doesn't show up as "coding time" in your editor metrics.
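As a sketch of what that ratio could look like once both sides are in hand — the field names, the two-week stability window, and the sample data are all hypothetical, not a real schema:

```python
from datetime import date, timedelta

STABILITY_WINDOW = timedelta(weeks=2)  # assumed "revised within two weeks" cutoff

def durable_output_per_hour(prs, active_hours):
    """prs: dicts with 'merged_on' (date) and 'first_revision_on' (date or
    None). active_hours: total hands-on time, including review, rework, and
    AI-chat sessions, not just keystrokes recorded by the editor."""
    durable = sum(
        1 for pr in prs
        if pr["first_revision_on"] is None
        or pr["first_revision_on"] - pr["merged_on"] > STABILITY_WINDOW
    )
    return durable / active_hours

prs = [
    {"merged_on": date(2026, 1, 5), "first_revision_on": None},
    {"merged_on": date(2026, 1, 7), "first_revision_on": date(2026, 1, 12)},
    {"merged_on": date(2026, 1, 9), "first_revision_on": date(2026, 2, 20)},
]
rate = durable_output_per_hour(prs, active_hours=40)
# 2 of the 3 PRs held past the window -> 0.05 durable PRs per active hour
```

The interesting part isn't the division; it's the denominator. Underreport active hours and every developer looks efficient; count only stable merges in the numerator and the 80-90% illusion disappears.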

That second piece is what most teams are missing. They know their token spend. They know their PR count. They don't know the ratio of those PRs to the active developer hours that went into them, or what fraction of the accepted code needed revision within two weeks of merge. We track app usage and coding sessions at the system level with xeve partly because editor-based metrics don't capture that full picture — the review work, the context-checking, the AI chat sessions that cost time without producing a commit.

Without the full picture, you're measuring the same thing teams measured when commit count was the proxy: inputs and activity, not output and stability.

The Pattern Is Old

Every major shift in developer tooling has produced a new activity metric that looked like a productivity metric until someone ran the numbers. Commit count. Tickets closed. Story points delivered. Lines of code. Each one captured something real about developer activity. None of them captured whether the work was durably good.

Token consumption is the 2026 version. A developer burning more tokens is using the tools. But that doesn't tell you whether the usage produced code that shipped, held, and required minimal rework. The Jellyfish data makes the distinction visible — and it's what separates volume from value.

The developers who will get the most from AI tooling over the coming years are not the ones with the largest token budgets. They're the ones whose token spend produces output that holds. Measuring that is harder than checking a dashboard. It requires stitching together version control data, session data, and post-merge stability. That's more work than tokenmaxxing. It's also the only way to know whether you're actually shipping more.

Written by Kevin — builder of xeve
