Uber Burned Its 2026 AI Budget in Four Months

The Fortune headline from May 22nd is blunt: "Using the tech is more expensive than paying human employees." It is not describing a hypothetical. It is describing what Uber's CTO told The Information in April, and what Microsoft discovered when it canceled most of its Claude Code licenses roughly six months after expanding access to thousands of employees.

Uber burned through its entire 2026 AI coding tools budget in four months. Not because engineers were being reckless. Because the tools worked, engineers used them constantly, and nobody had modeled what "constant agentic use across 5,000 engineers" would cost.

This is the budget problem that follows from the adoption problem nobody expected to have.

How Token Economics Break at Scale

The pricing model for agentic AI tools is structurally different from every enterprise software contract that came before it. Per-seat SaaS pricing scales linearly: 5,000 seats cost 5,000 times as much as a single seat. Agentic AI pricing scales with usage intensity, which varies by a factor of 10 or more across individual engineers, and which grows nonlinearly as agents chain tool calls.

Here is the mechanism. A Claude Code session does not consume tokens the way a chat conversation does. Each step of an agent loop (reading a file, running a test, editing code, checking the output) sends the entire accumulated context to the model. By step 20 of a debugging session, you have paid for the same system prompt, the same early conversation history, and the same codebase context 20 times. Because every step resends everything before it, total input tokens grow roughly with the square of the step count rather than linearly, which is not obvious from the per-token price.
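To make the compounding concrete, here is a minimal Python sketch of the loop described above. The function name and every number in it (a 20,000-token starting context, 2,000 tokens added per step) are illustrative assumptions, not measurements from Claude Code or from Uber, and the sketch ignores any caching discount a provider may apply.

```python
def session_input_tokens(base_context: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed when every step resends the full context."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the whole accumulated context is sent again
        context += tokens_per_step  # tool output and model reply grow the context
    return total


# Hypothetical figures, chosen only to show the shape of the curve.
BASE_CONTEXT = 20_000   # system prompt plus initial codebase context
TOKENS_PER_STEP = 2_000  # growth added by each tool call

for steps in (5, 10, 20, 40):
    tokens = session_input_tokens(BASE_CONTEXT, TOKENS_PER_STEP, steps)
    print(f"{steps:>2} steps -> {tokens:,} input tokens")
```

Under these assumptions, going from 20 to 40 steps roughly triples the input tokens billed (780,000 to 2,360,000) even though the session only doubled in length.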

Uber's engineers were averaging $150 to $250 per month in API costs. Heavy users climbed into the $500 to $2,000 range. With 84% of a 5,000-person engineering org classified as agentic users by March, the arithmetic is not complicated. The budget ran out. In April, the CTO said they were "back to the drawing board" on AI budgeting.
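The arithmetic can be sketched in a few lines of Python. This is a rough order-of-magnitude estimate, not Uber's actual bill: it assumes the $150 to $250 average applies to every agentic user, that the 84% share held for all four months (it was only reached by March, so this leans high), and it leaves out the heavy-user tail entirely.

```python
ENGINEERS = 5_000
AGENTIC_SHARE = 0.84
AVG_MONTHLY_LOW, AVG_MONTHLY_HIGH = 150, 250  # USD per agentic user per month
MONTHS = 4

agentic_users = round(ENGINEERS * AGENTIC_SHARE)

monthly_low = agentic_users * AVG_MONTHLY_LOW
monthly_high = agentic_users * AVG_MONTHLY_HIGH

print(f"agentic users: {agentic_users:,}")
print(f"monthly spend: ${monthly_low:,} to ${monthly_high:,}")
print(f"{MONTHS}-month spend: ${monthly_low * MONTHS:,} to ${monthly_high * MONTHS:,}")
```

On those assumptions the run rate is about $630,000 to $1.05 million per month, or $2.52 million to $4.2 million over four months, before a single heavy user is counted.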

Microsoft's response to the same problem was different. They ended their Claude Code pilot and shifted engineers back to GitHub Copilot CLI. Bryan Catanzaro, VP of applied deep learning at Nvidia, articulated the broader version of what both companies ran into: "The cost of compute is far beyond the costs of the employees."

The Indispensability Trap

The harder problem inside the budget crisis is that both companies hit it because the tools had become genuinely necessary to engineering workflows — not just preferred.

We wrote about this in the context of METR's research in February: when 30 to 50 percent of developers decline to work without AI tools even in a paid study, you have crossed a threshold. The tool is no longer optional. It is load-bearing infrastructure. At that point, limiting access does not save money cleanly; it degrades output in ways that are real and immediate.

Uber's CTO made exactly this point when explaining why simply cutting access was not a clean answer. Engineers had restructured their workflows around Claude Code. The productivity gain was real enough that walking it back was not neutral. You do not just get the budget back; you also lose whatever the tool was producing.

Microsoft chose to cut and redirect rather than cut and absorb the loss. Moving engineers to Copilot CLI is a cost control measure — Copilot's billing is more predictable even after the June 1 usage-based changes. Whether the engineering output holds at the same level is a question Microsoft is now discovering the answer to in real time.

This is the indispensability trap: you cannot afford to keep the tool at full usage, and you cannot afford to remove it either. The budget math does not close in either direction.

What Nobody Was Tracking

The specific failure mode in both cases is the same. Organizations enabled broad access to agentic tools, engineers used them intensively, and nobody had visibility into per-engineer consumption in real time until the billing came in.

This is not negligence. It reflects how these tools were designed. When you run Claude Code, there is no dashboard telling you how many tokens you spent on the last session, how that compares to your team average, or which sessions had the worst output-per-token ratio. The information exists at the API level, but it is not surfaced where engineers work.

The consequence is that engineers have no feedback signal on cost efficiency. A four-hour agentic debugging session that burned 2 million tokens and produced a three-line fix looks exactly like a four-hour session that burned 400,000 tokens and produced a working feature. The developer cannot tell the difference from inside the tool. The CFO can tell the difference only after the invoice arrives.

The same visibility gap applies at the session level. One developer left Claude Code running overnight and woke up to a $6,000 charge. Another hit $4,200 in API fees over a single weekend during an autonomous refactoring run. These are extreme cases, but they are structurally the same case as Uber's aggregate experience — consumption happened without any real-time awareness, and the cost was discovered after the fact.
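What was missing in each of these cases is a running total. A minimal sketch of one, in Python, might look like the following. The `SessionMeter` class is hypothetical, not part of any real tool, and the per-million-token prices and the $5 soft limit are placeholder values chosen for illustration.

```python
class SessionMeter:
    """Running cost total for one agent session, with a soft spending limit."""

    def __init__(self, usd_per_million_input: float,
                 usd_per_million_output: float, soft_limit_usd: float) -> None:
        self.usd_per_million_input = usd_per_million_input
        self.usd_per_million_output = usd_per_million_output
        self.soft_limit_usd = soft_limit_usd
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one model call; return True once the soft limit is reached."""
        self.spent_usd += (
            input_tokens * self.usd_per_million_input
            + output_tokens * self.usd_per_million_output
        ) / 1_000_000
        return self.spent_usd >= self.soft_limit_usd


# Placeholder prices and limit; substitute your provider's real rates.
meter = SessionMeter(usd_per_million_input=3.0,
                     usd_per_million_output=15.0,
                     soft_limit_usd=5.0)

calls = 0
limit_reached = False
while not limit_reached:
    # Each simulated call sends 100,000 input tokens and gets 2,000 back.
    limit_reached = meter.record(input_tokens=100_000, output_tokens=2_000)
    calls += 1

print(f"soft limit reached after {calls} calls, ${meter.spent_usd:.2f} spent")
```

The point of the sketch is where the check happens: after every model call, while the session is still running, rather than when the invoice arrives.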

What Measurement Actually Changes

The exit from the indispensability trap is not "use less AI." Engineers who have restructured workflows around agentic tools will not go back to coding without them, and asking them to do so produces resentment and real productivity loss.

The exit is making consumption visible before the invoice, and making it visible at the session level where engineers can actually act on it.

This means knowing which workflows burn the most tokens and produce the worst output ratios. In practice, it is usually a specific category of session: long autonomous runs on open-ended or ambiguous problems, where the agent loops through many tool calls, generates code that requires significant human revision, and often produces less than a shorter directed session would have. These sessions are expensive precisely because they generate a lot of activity that does not convert to shipped output.

The measurement you want is not "how many tokens did my team spend this month." It is "which sessions had the worst ratio of tokens consumed to committed output produced." That ratio identifies the category of work where agentic tools are consuming budget without returning proportional value — not because the tools are bad, but because the specific task type does not fit the agentic pattern well.

Tracking time at the system level gives you half of this picture: how long the session ran, what the developer did alongside the agent, when they finally committed. The API billing data gives you the cost side. Together they produce a ratio that is actionable. A session that ran for three hours, consumed $40 in API costs, and produced a 200-line commit is a different productivity story than a session that ran for four hours, consumed $180, and produced a 12-line fix with three subsequent correction commits.
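The two sessions above can be compared with a small Python sketch. The `Session` type and its field names are hypothetical, and cost per committed line is only one crude proxy for output; the idea is the ranking, not the specific metric.

```python
from dataclasses import dataclass


@dataclass
class Session:
    name: str
    hours: float
    api_cost_usd: float
    committed_lines: int
    correction_commits: int = 0

    @property
    def cost_per_line(self) -> float:
        # Guard against sessions that committed nothing at all.
        return self.api_cost_usd / max(self.committed_lines, 1)


# The two sessions described in the paragraph above.
sessions = [
    Session("feature", hours=3, api_cost_usd=40.0, committed_lines=200),
    Session("fix", hours=4, api_cost_usd=180.0, committed_lines=12,
            correction_commits=3),
]

# Worst ratio first: these are the sessions worth a closer look.
worst_first = sorted(sessions, key=lambda s: s.cost_per_line, reverse=True)

for s in worst_first:
    print(f"{s.name}: ${s.cost_per_line:.2f} per committed line, "
          f"{s.correction_commits} correction commits")
```

With the figures from the text, the feature session comes out at $0.20 per committed line and the fix at $15.00, a 75-fold gap that a monthly total would never show.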

Uber's engineering org had neither number in April. The budget ran out, and the CTO had no specific answer about which workflows were responsible. "Back to the drawing board" is what you say when the data was not there to answer the question before the money was gone.

The Broader Shape of This Problem

Uber and Microsoft are early cases of a problem that will spread as agentic adoption deepens. Flat-rate AI tool costs enabled broad adoption without forcing ROI measurement. Usage-based billing and API-token economics are now forcing the measurement question on organizations that did not build the infrastructure to answer it.

The companies that get through this phase without either cutting productivity or blowing budgets will be the ones that built consumption visibility before the bills arrived. Not because they used fewer tokens, but because they knew where the tokens were going and which sessions were worth the spend.

That information exists. It just is not in the tools that generate the cost. It requires system-level tracking of what developers actually did during a session, correlated with what the session produced and what it cost. That correlation is hard to build after the fact. It is easy to build while the session is happening, if you are already tracking.

Uber's CTO is back at the drawing board. The drawing board needs session-level consumption data, not just monthly totals.

Written by Kevin — builder of xeve
