The Agentic IDE Comparisons Are Measuring the Wrong Thing

May 30, 2026 · 5 min read

Every "Cursor vs. Claude Code vs. Antigravity" post published since Google I/O evaluated the wrong variable.

Google launched Antigravity 2.0 on May 19th, expanding what had been a single IDE into a full agentic development suite: a desktop app, a terminal CLI invoked as agy, an SDK, and a Managed Agents tier in the Gemini API, all running on Gemini 3.5 Flash. The response was immediate. Within days, developers were posting month-long trial reports declaring clear winners. The typical methodology: try each tool for one to four weeks, note which tasks felt smoother, pick the one that clicks.

That methodology has a documented failure mode. We know exactly what it produces.

What Subjective Tool Comparison Measures

The METR study of experienced open-source developers, published in mid-2025, found that participants took 19% longer on tasks where AI tools were allowed than on comparable tasks done without AI. The same developers estimated, after completing the work, that AI had made them 20% faster. The gap between perception and reality was nearly 40 percentage points. They could not feel a slowdown of that magnitude while it was happening. They could not feel it afterward either, reflecting on completed sessions.
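The "nearly 40 percentage points" figure is plain arithmetic on the two numbers quoted above. A minimal sketch, using only those two figures; the function and variable names are mine, for illustration:

```python
# Figures as quoted in the paragraph above. Negative means slower.
measured_change_pct = -19.0   # measured: 19% slower with AI
perceived_change_pct = 20.0   # self-reported afterward: 20% faster


def perception_gap(measured_pct: float, perceived_pct: float) -> float:
    """Distance, in percentage points, between perceived and measured change."""
    return perceived_pct - measured_pct


gap = perception_gap(measured_change_pct, perceived_change_pct)
print(f"perception gap: {gap:.0f} percentage points")
```

The sign convention matters here: the two numbers point in opposite directions, so the gap is their sum in magnitude, not their difference.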

A developer who spends a month with Cursor, then a month with Claude Code, then a month with Antigravity, is running an informal version of the same experiment. The perception they form during each period is real. It reflects the texture of the work: how quickly the editor responds, how often completions feel useful, how naturally the tool fits into existing habits. These are genuine experiences. They are not reliable measures of actual output per hour.

If a 19% slowdown is imperceptible in a controlled study where output is measured directly, a five or ten percent difference between two tools is effectively undetectable by feel. The comparison post that declares Antigravity the winner because "it just clicked" or Claude Code the winner because "it stays out of the way" is reporting on something real. That something is not which tool produced more committed, stable code per hour of your time.

Why the Competition Matters Anyway

The race between these three platforms is real and it's moving fast. Cursor hit a reported $2 billion in annualized revenue after some of the fastest SaaS growth on record. Anthropic doubled Claude Code's rate limits in early May. Google entered directly with Antigravity 2.0 at $100-200 per month for high-usage tiers, which signals how seriously they're treating the category.

The platforms are genuinely different in design philosophy. Cursor stays inside a visual IDE with human-in-the-loop editing. Claude Code works from the terminal with longer reasoning chains. Antigravity's parallel sub-agent architecture dispatches multiple agents to work simultaneously, one writing components while another runs visual regression tests in a headless browser. For specific tasks, these differences matter.

But the internet's evaluation methodology cannot detect those differences reliably. You need your own numbers.

What an Actual Comparison Requires

To know which tool belongs in your workflow, you need data on two things: how much active time each tool consumed, and what it produced.

The active time side is harder than it sounds. A Claude Code session that runs three hours and ships a complete feature looks identical in your tracking data to a session that runs three hours, generates a lot of intermediate output, requires significant correction, and produces 15 working lines. Both show as three hours of development work. Neither shows up as three hours of Claude Code sessions in your IDE's activity tracker, because the tool's usage lives mostly in the terminal. The time you spent reading and validating output in your editor logs as editor activity. The time you spent re-prompting after a failed approach is not recorded anywhere.

This is where system-level tracking matters in a way editor-based tracking does not. If you run xeve or any other tracker that logs which applications are active and when, you have a real record. You can correlate Antigravity CLI activity with what was committed afterward. You can do the same for Cursor sessions, for Claude Code terminal sessions. The comparison becomes: active time in tool X, commits produced in that window, net lines changed, and whether those changes held for two weeks without correction commits.
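The "held for two weeks without correction commits" test can be approximated from git history alone. The sketch below is one possible heuristic, not an established method: it assumes a later commit counts as a correction if its message starts with "fix", "revert", or "hotfix" and it touches at least one of the same files. The `Commit` record, the prefix list, and the 14-day window are all illustrative assumptions you would tune to your own commit conventions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Commit:
    """Hypothetical record; in practice these fields would come from git log."""
    sha: str
    timestamp: datetime
    message: str
    files: frozenset


# Assumption: correction commits are recognizable by message prefix.
CORRECTION_PREFIXES = ("fix", "revert", "hotfix")


def is_correction(commit: Commit) -> bool:
    return commit.message.lower().startswith(CORRECTION_PREFIXES)


def held(commit: Commit, history: list, window: timedelta = timedelta(days=14)) -> bool:
    """True if no correction commit touched the same files within the window."""
    deadline = commit.timestamp + window
    for later in history:
        if later.sha == commit.sha:
            continue
        if not (commit.timestamp < later.timestamp <= deadline):
            continue
        if is_correction(later) and commit.files & later.files:
            return False
    return True


feature = Commit("a1", datetime(2026, 5, 4, 10), "Add export endpoint",
                 frozenset({"api/export.py"}))
other = Commit("c3", datetime(2026, 5, 6, 9), "Add settings page",
               frozenset({"ui/settings.tsx"}))
fix = Commit("b2", datetime(2026, 5, 9, 15), "Fix export pagination",
             frozenset({"api/export.py", "tests/test_export.py"}))
history = [feature, other, fix]

for c in (feature, other):
    print(c.sha, "held" if held(c, history) else "needed correction")
```

File overlap is a blunt signal. It will flag unrelated fixes to a busy file and miss corrections made elsewhere, so treat the result as a rough filter rather than a verdict.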

That comparison is possible. It requires weeks of each tool, not vibes after a few days. It requires tracking at the session level. And it requires connecting time data to output data by looking at commit timestamps and linking them back to when each tool was active.
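Linking commit timestamps back to tool sessions can be sketched in a few lines. This is an illustration under stated assumptions, not how xeve or any particular tracker works: the `Session` and `Commit` records are hypothetical, the 30-minute grace period after a session ends is an arbitrary choice, and net lines changed is only one of several possible output measures.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Session:
    """Hypothetical export row from a system-level tracker."""
    tool: str
    start: datetime
    end: datetime


@dataclass
class Commit:
    """Hypothetical record built from git history."""
    timestamp: datetime
    net_lines: int


def output_per_active_hour(sessions, commits, grace=timedelta(minutes=30)):
    """Return per-tool active hours, commits, net lines, and lines per hour."""
    stats = defaultdict(lambda: {"hours": 0.0, "commits": 0, "net_lines": 0})
    for s in sessions:
        stats[s.tool]["hours"] += (s.end - s.start).total_seconds() / 3600

    for c in commits:
        # A commit belongs to a session if it lands inside the session
        # window or within the grace period after the session ends.
        candidates = [s for s in sessions
                      if s.start <= c.timestamp <= s.end + grace]
        if not candidates:
            continue  # no tool was active; leave the commit unattributed
        # If several sessions qualify, pick the most recently started one.
        owner = max(candidates, key=lambda s: s.start)
        stats[owner.tool]["commits"] += 1
        stats[owner.tool]["net_lines"] += c.net_lines

    for tool_stats in stats.values():
        hours = tool_stats["hours"]
        tool_stats["lines_per_hour"] = (
            tool_stats["net_lines"] / hours if hours else 0.0
        )
    return dict(stats)


sessions = [
    Session("claude-code", datetime(2026, 5, 4, 9, 0), datetime(2026, 5, 4, 11, 0)),
    Session("cursor", datetime(2026, 5, 5, 9, 0), datetime(2026, 5, 5, 10, 0)),
]
commits = [
    Commit(datetime(2026, 5, 4, 10, 30), 80),   # inside the claude-code session
    Commit(datetime(2026, 5, 4, 11, 10), 40),   # within the grace period
    Commit(datetime(2026, 5, 5, 9, 45), 30),    # inside the cursor session
    Commit(datetime(2026, 5, 6, 12, 0), 500),   # no session nearby
]
report = output_per_active_hour(sessions, commits)
for tool, s in report.items():
    print(tool, s)
```

The grace period is the main judgment call. Too short and you drop commits made right after closing the tool; too long and you credit a tool for work it had nothing to do with.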

Most developers comparing tools are not doing this. They're doing the subjective version, then reporting the result.

The Benchmarks Have the Same Problem

SWE-bench Verified gives a rough capability baseline. As of mid-May 2026, Claude Opus 4.6 holds a narrow lead. Some analyses of real tasks found Claude Code using 5.5 times fewer tokens than Cursor on equivalent work, which matters for cost as much as capability. Antigravity's parallel architecture performs well on tasks that genuinely decompose into independent workstreams.

Those benchmark differences do not map cleanly onto "which tool makes me more productive on my codebase." Benchmarks measure capability on tasks with defined success criteria. Your work does not have those. The tool's performance on your specific stack, your specific error patterns, and your specific task mix is what matters, and benchmarks cannot tell you what that looks like.

The comparison posts tell you what one developer experienced on their specific work for a short period. That information has value. It is not a substitute for your own output data.

The Switch Cost Most Posts Ignore

Switching agentic IDEs has a real cost beyond the subscription price. The context your current tool has built about your codebase, the prompting habits you have developed, the interaction patterns that feel natural — all of that resets on day one with a new tool. You pay that cost in the first two or three weeks of any switch, exactly when your subjective evaluation is forming. The tool you switched to looks worse than it is. The tool you switched from looks better than it is.

If the switch cost is real and subjective evaluation is unreliable, the only principled basis for switching is your own data. Not "Antigravity's parallel agents look impressive" but "I tracked four weeks of Claude Code sessions and my committed output per active hour was X. I tracked four weeks on Antigravity and the same ratio was Y."
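The X-versus-Y comparison reduces to two ratios and the relative difference between them. A minimal sketch with made-up numbers; the totals below are placeholders, not measurements, and the function names are mine:

```python
def output_per_hour(net_lines: int, active_hours: float) -> float:
    """Committed output per active hour over one tracking period."""
    if active_hours <= 0:
        raise ValueError("active_hours must be positive")
    return net_lines / active_hours


def relative_change_pct(baseline: float, candidate: float) -> float:
    """How far the candidate ratio sits above or below the baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Placeholder four-week totals, purely illustrative.
x = output_per_hour(net_lines=2400, active_hours=60.0)   # current tool
y = output_per_hour(net_lines=2310, active_hours=55.0)   # tool under trial

print(f"X = {x:.1f} lines/hour, Y = {y:.1f} lines/hour")
print(f"relative change: {relative_change_pct(x, y):+.1f}%")
```

A difference of this size is exactly the kind that feel cannot resolve, and it is small enough that a few unusual weeks could produce it by chance, so longer tracking periods make the comparison more trustworthy.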

Most developers do not have that data because they did not set up tracking before the comparison articles started arriving. Setting it up now, before switching, is still useful. It gives you a baseline from your current tool that any future tool needs to beat to justify the switch cost.

What the XDA-Developers Verdict Means

XDA-Developers published "I tried Cursor, Claude Code, and Google Antigravity for a month and I have a clear winner for you." It will get a lot of traffic this week because the headline is the question everyone is searching for. The winner they picked was arrived at through exactly the kind of perception-based process that, per METR's findings, cannot detect the actual effect size.

I am not criticizing that piece specifically. It reflects how tools get evaluated in the absence of personal output data. Without a baseline, without session-level tracking, without connecting your activity to your git history, feel is the only available signal. Feel is real. It just does not answer the question.

The question is which tool ships more of your code per hour of your time. That question has a personal answer that no comparison article can supply. It requires your data, from your sessions, on your codebase.

Google Antigravity 2.0 is worth evaluating. So is Claude Code. So is Cursor. The evaluation that matters is not the internet's.

Written by Kevin — builder of xeve
