
Copilot Had an 11-Hour Outage. Your Productivity Model Didn't Include That.

The productivity research on AI coding tools makes one assumption so obvious it never gets stated: the tools work.

GitHub Copilot logged twelve major incidents between November 2025 and April 2026, compared with two in the same period the prior year. In March, a backend storage misconfiguration in GitHub's West US data center caused Copilot to reject authentication tokens for nearly eleven hours. On April 9, 84% of new agent sessions were delayed, with queue wait times hitting 54 minutes against a normal baseline of 15 to 40 seconds. A CNBC investigation published this week described the service as "plagued by frequent outages that have left developers stranded during critical work hours."

Every productivity study that found AI makes developers faster or slower was conducted assuming the AI showed up for work. None of them modeled what happens over a six-month stretch with twelve major incidents.

From Optional to Infrastructure

METR discovered the dependency shift in February 2026 when they tried to run a controlled productivity study and found that 30 to 50% of recruited developers refused to participate in the "no AI" condition — even for a paid study. Among those who agreed, developers quietly routed hard tasks away from no-AI sessions. METR read this as a measurement problem. The more important reading is that AI had become load-bearing infrastructure for a significant fraction of professional developers.

Load-bearing infrastructure follows different rules than optional productivity tools. When a tool is optional, an outage is inconvenient: you work slightly differently for a few hours and move on. When it's load-bearing, when a large share of your team won't attempt certain tasks without it (METR's 30 to 50% refusal rate suggests how large that share can be), an outage becomes a production incident for your team's output.

Twelve major incidents in six months is not random bad luck. The February 2026 availability report flagged 37 incidents in a single month. The March 11-hour authentication failure wasn't a brief hiccup; it was a full working session where the service couldn't authenticate users at all.

The Outage Math Nobody Is Running

Faros.ai tracked 22,000 developers over two years and found high-AI teams complete 21% more tasks and merge nearly double the pull requests compared to low-AI teams. That's the headline number. What it doesn't account for: the hours in the study period where Copilot was returning 5xx errors or timing out on agent sessions.

The productivity impact of an outage scales with how deeply you've integrated the tool. A developer using AI for inline completions loses some speed when Copilot goes down — they write code more slowly, but they can work. A developer running parallel agentic sessions loses something harder to recover: the parallel workstreams collapse, the context threaded across multiple sessions is gone, and the cognitive model for managing those sessions can't be easily switched to manual mode.

The 21% productivity gain is an average over the whole study period, outage hours included, so it blends reliable days with degraded ones. A team that gets meaningful extra throughput on reliable days but loses much of its output during multi-hour incidents twice a month will not necessarily land on the average used in its ROI calculation. Variance matters, and nobody is running the variance-adjusted math.
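To make the variance point concrete, here is a minimal Python sketch of the blended math. Every number in it is an illustrative assumption, not a measurement: it treats 1.21x as the throughput multiplier on reliable days, guesses that a deeply integrated team runs at 0.60x of its no-AI baseline on an outage day, and assumes two outage days in a 21-day working month. The function name monthly_profile is hypothetical.

```python
from statistics import mean, pstdev


def monthly_profile(working_days, outage_days, up_multiplier, down_multiplier):
    """Return (average, spread) of daily throughput multipliers for one month.

    Multipliers are relative to a no-AI baseline of 1.0. All inputs are
    assumptions supplied by the caller, not measured values.
    """
    if not 0 <= outage_days <= working_days:
        raise ValueError("outage_days must be between 0 and working_days")
    days = ([up_multiplier] * (working_days - outage_days)
            + [down_multiplier] * outage_days)
    # pstdev is the population standard deviation: how uneven the month was.
    return mean(days), pstdev(days)


# Hypothetical month: 21 working days, 2 lost to multi-hour incidents.
average, spread = monthly_profile(21, 2, up_multiplier=1.21, down_multiplier=0.60)
print(f"blended multiplier: {average:.3f}, day-to-day spread: {spread:.3f}")
```

Under these assumed inputs the blended multiplier lands at roughly 1.15 rather than 1.21. The exact figures matter less than the shape: the deeper the integration, the lower the outage-day multiplier, and the further the real average drifts from the headline.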

What Your Productivity Data Actually Shows

Most developers tracking coding hours, commits, or PR cycle time see outage weeks as "bad weeks." The data doesn't tag the cause. An 11-hour Copilot outage on a Thursday looks, in retrospect, like a day with too many meetings or a debugging session that went long. If you're not correlating output metrics against service status history, you're attributing the lost productivity to yourself.

This creates a systematic bias in retrospective analysis. Teams running post-hoc comparisons of AI adoption — did productivity improve after we rolled out Copilot? — are measuring weeks that include outage periods alongside fully functional periods. Outage weeks pull the average down. If incidents cluster in certain months, as they did in February 2026, they can make an entire quarter look worse than it would have been at full service reliability. The conclusion drawn might be "AI isn't helping as much as we expected" when the correct diagnosis is "the AI was down repeatedly and we didn't account for it."

The Vendor Concentration Problem

Microsoft directed its entire Experiences + Devices organization onto GitHub Copilot by June 30. The rationale — toolchain unification, fiscal year timing, competitive positioning — doesn't address what a 12-incident reliability record looks like once a team has no alternative.

A team with access to both Claude Code and Copilot can adapt when one service degrades. They shift workflows, lose some velocity, and recover. A team that's been directed to use Copilot exclusively has one option: wait. The single-vendor dependency that makes enterprise management easier concentrates the downside risk.

This is the piece the enterprise mandate analysis typically skips. The productivity research that justifies consolidation mandates assumes reliable availability. The incident record for the tool being mandated includes an authentication failure that lasted longer than a standard working session.

Infrastructure Criteria for Infrastructure Dependency

GitHub publishes monthly availability reports with specific incident timelines. The April 2026 report documented the April 9 delay incident with precise timestamps and the percentage of agent sessions affected. This data is public and goes back years.

If you're tracking your own coding hours, PR output, or session volume, correlating against Copilot's incident history takes an afternoon and reveals whether your productivity variance is behavioral or systemic. The correlation won't be perfect — incidents don't hit every developer simultaneously — but over a six-month window, the pattern becomes legible.
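That afternoon of correlation work can start as small as the sketch below. It assumes you have already exported your own daily output (PRs merged, commits, hours, whatever you track) and copied incident dates by hand from the public availability reports; it does not call any GitHub API. The function name split_by_incident and the output numbers are hypothetical. April 9 is used as the incident day because that is the delay incident described above.

```python
from datetime import date
from statistics import mean


def split_by_incident(daily_output, incident_days):
    """Compare average daily output on incident days vs. clean days.

    daily_output: dict mapping date -> numeric output for that day.
    incident_days: set of dates on which the service had an incident.
    """
    hit = [v for d, v in daily_output.items() if d in incident_days]
    clean = [v for d, v in daily_output.items() if d not in incident_days]
    return {
        "incident_day_mean": mean(hit) if hit else None,
        "clean_day_mean": mean(clean) if clean else None,
        "incident_days_observed": len(hit),
    }


# Hypothetical week of merged PRs; replace with your own export.
output = {
    date(2026, 4, 6): 5.0,
    date(2026, 4, 7): 6.0,
    date(2026, 4, 8): 4.0,
    date(2026, 4, 9): 1.0,
    date(2026, 4, 10): 5.0,
}
incidents = {date(2026, 4, 9)}  # copied by hand from the incident history

print(split_by_incident(output, incidents))
```

One week proves nothing, and a gap between the two means is a correlation, not a cause. Run it over a six-month window and check whether the gap persists before drawing conclusions.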

The more strategic question is whether your team is evaluating AI coding tools against infrastructure criteria. If developers won't work without the tool, and METR's refusal data says a significant fraction won't, then the tool needs to be held to infrastructure standards: uptime, incident frequency, mean time to recovery, and fallback availability. These are the same criteria you'd apply to your database or your CI system.
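Three of those criteria can be computed from nothing more than a list of incident start and end times. The sketch below is one way to do it, assuming incidents do not overlap and that you have entered the timestamps yourself. The Incident class, the reliability_summary function, and the example timestamps are all hypothetical. Fallback availability is a property of your own setup and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    start: datetime
    end: datetime


def reliability_summary(incidents, window_start, window_end):
    """Uptime, incident frequency, and MTTR over one reporting window.

    Assumes incidents do not overlap each other. Incidents are clipped
    to the window; those entirely outside it are ignored.
    """
    window = window_end - window_start
    if window <= timedelta(0):
        raise ValueError("window_end must be after window_start")
    durations = []
    for inc in incidents:
        clipped = min(inc.end, window_end) - max(inc.start, window_start)
        if clipped > timedelta(0):
            durations.append(clipped)
    downtime = sum(durations, timedelta(0))
    return {
        "uptime_fraction": 1 - downtime / window,
        "incidents_per_30_days": len(durations) / (window / timedelta(days=30)),
        "mean_time_to_recovery": downtime / len(durations) if durations else None,
    }


# Hypothetical 30-day window: one 11-hour incident and one 1-hour incident.
example = [
    Incident(datetime(2026, 3, 10, 8), datetime(2026, 3, 10, 19)),
    Incident(datetime(2026, 3, 20, 9), datetime(2026, 3, 20, 10)),
]
print(reliability_summary(example, datetime(2026, 3, 1), datetime(2026, 3, 31)))
```

Note that this measures wall-clock uptime. An 11-hour incident that lands inside working hours costs a team far more than the same 11 hours overnight, so a working-hours-only window may be the more honest denominator.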

Most teams aren't doing this because AI coding tools entered as optional productivity enhancements and quietly became something else. The evaluation framework didn't update when the dependency did.

The productivity research will keep producing averages. Your actual average is computed over all weeks, including the ones with 54-minute agent session delays and the one where authentication failed for eleven hours. Knowing which weeks those were — and how they show up in your own output data — is the difference between measuring how you work and measuring your tools.

Written by Kevin — builder of xeve
