quantified self

The Problem With Watching Yourself Work

July 1, 2026 · 6 min read

Amazon built a leaderboard to get developers using AI. It worked. Employees started ranking themselves by how many tokens they burned through Kiro, the company's internal agentic coding tool. They also started feeding the system fake tasks to climb higher, until the company shut the whole thing down in late May.

KiroRank is the cleanest recent example of Goodhart's Law applied to a tech company: when a measure becomes a target, it ceases to be a good measure. Amazon replaced token consumption with "normalized deployments" — a metric that at least checks whether the code actually got committed and used.

But KiroRank has a less-noticed sibling in a completely different field, and putting the two together tells you something important about personal data.

The Sleep Researcher Version

In 2017, researchers at Rush University Medical College and Northwestern University published a paper in the Journal of Clinical Sleep Medicine about a pattern they kept seeing in their clinic. Patients were coming in with insomnia — but the insomnia was being caused, in part, by their sleep trackers.

The pattern: a patient starts tracking sleep. The data shows less deep sleep than average. They become preoccupied with improving the number. The preoccupation causes anxiety before bed. The anxiety disrupts sleep. The tracker records worse sleep. The patient looks at the number in the morning and the anxiety ratchets up.
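The loop is easier to see as a toy simulation. Everything here is an assumption made for illustration: the function name, the coefficients, and the linear relationship between anxiety and sleep are invented, and none of it comes from the 2017 paper. The only point is the shape of the curve when checking the score feeds back into the next night.

```python
def simulate_tracker_loop(nights, baseline=80.0, target=85.0,
                          sensitivity=0.3, anxiety_cost=0.5,
                          checks_score=True):
    """Toy model of the tracker loop. All numbers are made up.

    baseline      sleep score with no anxiety at all
    target        the score the person believes they should hit
    sensitivity   anxiety gained per point of shortfall seen in the morning
    anxiety_cost  score points lost per unit of accumulated anxiety
    """
    scores = []
    anxiety = 0.0
    for _ in range(nights):
        # Tonight's sleep is the baseline minus whatever anxiety costs.
        score = max(0.0, baseline - anxiety_cost * anxiety)
        scores.append(score)
        if checks_score:
            # Morning check: any gap to the target adds to the anxiety.
            shortfall = max(0.0, target - score)
            anxiety += sensitivity * shortfall
    return scores


watching = simulate_tracker_loop(7)
not_watching = simulate_tracker_loop(7, checks_score=False)
```

In this sketch `watching` drifts down every night while `not_watching` stays at the baseline. Real sleep is obviously not this tidy; the model only shows how a gap between score and target can compound once it is looked at daily.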

They called it orthosomnia, combining "ortho" (straight or correct) with "somnia" (sleep): the obsessive pursuit of tracker-defined perfect rest. It is not a formal DSM diagnosis, but it is taken seriously by sleep medicine professionals and now has its own measurement scale, the Bergen Orthosomnia Scale, published in 2025.

Orthosomnia is what happens when you give someone a number about their sleep and they try to optimize it the way they'd optimize code.

Same Problem, Different Direction

What Amazon's tokenmaxxers and the orthosomnia patient share is that the measurement changed the behavior. Where they differ is the source of the pressure: one had an audience, the other had only themselves.

KiroRank created a public ranking. When you make a metric visible to a competitive group with career incentives, people play it. The gaming was rational: if the company signals that token consumption matters, consuming more tokens is the right strategy for an employee who wants to be seen as a serious AI adopter. The metric was a proxy for productivity. The proxy became the goal.

Orthosomnia is private. It's just you and your Oura ring data. But it creates a similar perverse loop — the act of watching a number about yourself changes the underlying process the number is supposed to measure. Unlike KiroRank, nobody is watching you optimize. The distortion is entirely internal: anxiety produces worse sleep, worse sleep produces a lower score, a lower score produces more anxiety.

Two failure modes. One public, one private. Both caused by the same root dynamic: attaching meaning to a number changes how you relate to the thing the number describes.

What This Means for Productivity Tools

Most developer productivity dashboards don't think about either failure mode explicitly. They ship charts and assume that visibility into your own patterns is neutral — that knowing you coded for 2.4 hours today is information that helps you plan.

It isn't neutral. How you present a measurement determines how people use it, and how they use it determines whether it makes them better or worse at the thing you're measuring.

The KiroRank pattern appears across team tooling: public dashboards showing commit counts, PR velocities, cycle times by developer. Companies deploy these to create visibility. What they sometimes create is a performance for the dashboard. Developers optimize the proxy — smaller PRs for higher merge frequency, closing easy tickets before hard ones at sprint end. The thing you actually want, which is whether the software is getting better, stays invisible.

The orthosomnia pattern shows up in personal tools: an app that surfaces a "focus score" or "productive hours" count every morning. You wake up, check the number, feel behind, start your day anxious. The anxiety fragments your focus. Your actual work degrades. The score the next day confirms the pattern.

Where Personal Analytics Can Go Wrong

Personal analytics sits in an interesting position relative to both failure modes. Unlike a public leaderboard, it's just you looking at your data — so the gaming incentive is mostly absent. But because it's tracking things you actually care about (your work, your health, your time), the orthosomnia dynamic is entirely possible.

If xeve showed you a single productivity score every day, you'd start to feel behind when the score dropped. You'd start making decisions aimed at improving the number: logging more hours when you felt the score was low, avoiding complex tasks that might not produce visible output. The score would change your behavior in ways that have nothing to do with actually getting better at your work.

The difference between a useful personal data tool and an anxiety engine usually comes down to one design decision: do you present numbers as scores, or as patterns?

Scores suggest "this is where you stand today." Patterns suggest "here is what your data looks like over time, and here is something worth noticing." Scores invite comparison and optimization. Patterns invite curiosity.

A sleep score of 71% invites the question: is 71% good? A chart of your sleep variability over three months, overlaid with your calendar, invites a different question: why was late April rough, and what was different in March?

The first question is about the number. The second is about your life.

How We Approach It

The correlation view in xeve is designed around the second type of question. It pairs data streams that wouldn't naturally be in the same dashboard — coding output with HRV, focus time with calendar density, app usage with sleep quality — and looks for relationships over 90-day windows. Not because any single day's reading tells you something actionable, but because patterns across months reveal something about yourself that no single data point could.
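Here is one way to read the 90-day idea in code. This is a sketch, not xeve's implementation: it assumes plain Pearson correlation, trailing windows, and two daily series already aligned by day. The names `pearson` and `windowed_correlation` are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns None when it is undefined (fewer than two points,
    mismatched lengths, or a series with no variation).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def windowed_correlation(stream_a, stream_b, window=90):
    """One correlation value per trailing `window`-day span."""
    if len(stream_a) != len(stream_b):
        raise ValueError("streams must cover the same days")
    results = []
    for end in range(window, len(stream_a) + 1):
        start = end - window
        results.append(pearson(stream_a[start:end], stream_b[start:end]))
    return results
```

A call like `windowed_correlation(focus_minutes, meeting_count)` (both hypothetical daily lists) returns nothing at all until 90 days of data exist. That matches the point above: a single day gets no verdict, and a correlation still says nothing about which stream drives the other.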

The specific failure mode we try to avoid is the daily score trap: showing you a number every morning that invites the question "did I do well or badly?" That's orthosomnia applied to productivity. The number cannot tell you whether you had a good day. The pattern, over enough time, can tell you what your good days look like and what tends to precede them.

Amazon's replacement for KiroRank points in the same direction. "Normalized deployments" measures an outcome — code that was committed and used — rather than an input like tokens consumed. Outcomes are harder to game and less likely to generate anxiety, because you can't directly optimize them. You can only do good work and let the outcome follow.
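A small sketch makes the input-versus-outcome difference concrete. To be clear about assumptions: this is not Amazon's formula for "normalized deployments", which isn't spelled out here. The `Task` record, the `rank` helper, and the sample numbers are all invented, and "deployed" stands in for whatever "committed and used" means in practice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Task:
    developer: str
    tokens: int      # input: how much AI usage the task burned
    deployed: bool   # outcome: did the resulting code ship (hypothetical flag)


def rank(tasks, value):
    """Order developers by the summed `value(task)`, highest first."""
    totals = defaultdict(int)
    for task in tasks:
        totals[task.developer] += value(task)
    # Ties are broken alphabetically so the ordering is stable.
    return sorted(totals, key=lambda dev: (-totals[dev], dev))


# Invented data: "ana" ships three real changes, "bo" runs ten fake tasks.
tasks = (
    [Task("ana", 1_000, True) for _ in range(3)]
    + [Task("bo", 5_000, False) for _ in range(10)]
)

by_tokens = rank(tasks, lambda t: t.tokens)              # ["bo", "ana"]
by_deployments = rank(tasks, lambda t: int(t.deployed))  # ["ana", "bo"]
```

Fake tasks move the first ranking and do nothing to the second. An outcome metric can still be gamed in other ways, so this shows only why it is harder, not that it is impossible.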

The irony of personal analytics is that the data you collect is most useful when you're not trying to hit a number with it. Once you're optimizing the dashboard, you've recreated KiroRank for an audience of one. The measurement has become the goal, and the thing you actually wanted — to understand your patterns, to find what helps you focus, to see whether your health affects your output — is somewhere underneath the performance, unexamined.

Orthosomnia patients often sleep better after they stop looking at their sleep scores nightly. That's not an argument against tracking. It's an argument for knowing what the tracking is actually for.

Written by Kevin — builder of xeve
