In April, METR published results from MirrorCode — a benchmark that asks AI agents to reimplement real open-source codebases from scratch using only tests and documentation. Claude Opus 4.6 reimplemented gotree, a 16,905-line bioinformatics toolkit, with a 99.95% test pass rate. METR's estimate for how long the same task would take a skilled human engineer: 2 to 17 weeks.
One agent session. Weeks of skilled work.
The result matters on its own terms, but what it's actually exposing is a measurement gap nobody has fixed. When an AI agent ships 16,000 lines of working code in a session you supervised, every productivity tool you use sees something that looks a lot like you barely worked.
The Doubling Problem
METR's Time Horizons benchmark tracks a specific number: the longest task an AI agent can complete successfully 50% of the time. In 2024, frontier models sat at 10-15 minutes. By early 2026, Claude Opus 4.5 reached 320 minutes — just over five hours. Claude Mythos Preview, tested in May, hit a 50th percentile time horizon of at least 16 hours, with the confidence interval reaching 55 hours. The benchmark hit its ceiling: METR ran out of tasks long enough to measure the model properly.
Time horizons have been doubling every 89 days since 2024. At that rate, agents working 24-hour autonomous sessions will be standard practice before the end of the year. Multi-day sessions are coming quickly behind.
This is not a future trend. Developers running agentic workflows today regularly delegate sessions that run for hours while they sleep, review the output in the morning, and iterate. The MirrorCode result — weeks of human effort, one autonomous session — is where the trajectory points. The measurement infrastructure has not kept up.
What the Analytics Actually See
Take a realistic agentic week: you run a 12-hour session on Tuesday. The agent generates the first implementation. You review it Wednesday morning, reject the database approach, write a corrected spec, and run a second session. Thursday it's mostly working; you spend three hours reviewing the diff, fixing edge cases, writing tests for the parts the agent missed. The feature is merged by Friday.
Here's what your productivity tools record:
- Coding time: maybe 5-6 hours across the week
- WakaTime: you were in your editor for a fraction of your normal hours
- Commits: one large one from the agent session, a few cleanup commits
- GitHub: one PR, merged without much back-and-forth because you reviewed it before it was opened
The picture looks like a light week. Nothing in the data captures the 15-minute decision on Wednesday morning that determined whether the feature architecture would hold at scale. Nothing shows the two spec revision cycles that got the agent to mergeable output. Nothing measures what fraction of the agent's first-pass output survived your review versus what you rewrote.
You delivered a week of engineering. The measurement says you were barely there.
The Work That Actually Happened
The gap isn't new — it's just newly visible. Managers who coordinate technical work without writing much code have faced versions of this constantly. What changed is that individual developers are now regularly doing both: delivering substantial engineering through delegation while their direct keyboard output shrinks.
What happened in that week:
Specification work. You wrote the brief that constrained the agent's behavior. Every decision — which tables to use, which edge cases to handle, which existing patterns to follow — was a consequential judgment call. Agents fail or succeed based on spec quality, and spec quality is entirely a function of the developer's understanding of the system.
Review work. You evaluated the output against what you know about the codebase and the architecture. The review decision — ship it, fix it, or redo the approach — determines whether the session output is an asset or a liability.
Steering work. The second spec cycle that got the agent to a working database approach wasn't debugging. It was a judgment call about architecture, communicated in a form the agent could act on. That's engineering, and it's not represented anywhere in your analytics.
None of these activities generate the artifacts that productivity tools are built to measure. No lines of code. Limited commits. Limited time in your editor.
The Metrics That Would Tell You Something
Time-based coding metrics were already unreliable for measuring developer impact before agents arrived. Agents make them close to useless for the developers who've integrated agentic workflows deeply. What would actually give you a real picture:
Spec-to-success rate. How many revision cycles does it take before an agent produces mergeable output? A developer who gets there in one cycle is specifying more precisely — or picking tasks where agents succeed more reliably. Either is signal. The typical AI acceptance rate dashboard doesn't come close to capturing this.
Review yield rate. Of the agent's output that enters code review, what fraction ships without substantial rework? This is the real analog to acceptance rate: not in-session clicks, but actual downstream survival of what the agent produced.
Time from spec to first success. Spec-writing is often the highest-leverage work in an agentic flow. Tracking the time from first file open on a feature to first successful test run gives a rough approximation of spec investment — and it correlates with how many rework cycles follow.
These numbers don't come from the agent's dashboard. They come from system-level tracking of your development sessions and downstream version history. The data is there if you're instrumenting broadly enough.
What This Means for Reading Your Own Work
If you use xeve or any other developer analytics tool, you're probably looking at a picture that undersells your contribution on high-agentic weeks and may oversell it on weeks you're grinding through tedious work your keyboard is producing directly. The correlation between "hours in editor" and "engineering impact" has been weakening for two years. MirrorCode results suggest it's about to break more visibly.
The session categories xeve uses — coding, writing, system, communication — weren't built for a workflow that includes "directing an autonomous agent for 12 hours." We track time in your terminal and editor. We can see when Claude Code or Cursor is active. What we don't yet distinguish is whether that active session is you writing code or you reviewing and steering output the agent produced.
This is the instrumentation gap that matters in 2026. Not which model is fastest on benchmarks — that's increasingly resolved. The open problem is: when a developer's most valuable output is the judgment they exercise over what an autonomous agent produces, how do you measure a week in a way that doesn't make them look like they barely worked?
The first step is probably surfacing agentic session time as its own category — visible separately from direct coding time, so you can see the split between building and directing in your own data. That ratio tells you something about how your workflow has shifted, and whether the shift is producing the output you'd expect from a week's work.
METR's time horizons are doubling every 89 days. The measurement problem compounds with each doubling. The developers who understand their own contribution clearly in late 2026 won't be the ones with the most keystrokes logged. They'll be the ones who can account for the work that happens when they're not at the keyboard.