The Rules That Went Viral Are Just Code Review Advice

A repository built around a single 65-line text file is currently one of the most-starred on GitHub. The four rules inside that file are things engineering managers have been writing in code review comments for decades.

On January 26, 2026, Andrej Karpathy posted a description of what happened after he shifted from 80% manual coding to 80% agent-driven development. He didn't focus on the gains — those were obvious. He described four specific failure patterns he kept hitting: agents that made silent assumptions instead of asking, generated far more code than necessary, made unauthorized edits to adjacent files, and started executing without any shared definition of what "done" looked like.

Developer Forrest Chang distilled that description into a 65-line CLAUDE.md the next day. By May 2026, it had accumulated over 220,000 combined GitHub stars. Last week, Karpathy joined Anthropic to work on pre-training — meaning the person who wrote the most viral complaint about Claude Code's default behavior now works on building that behavior into the model from the ground up.

Where the Recognition Comes From

The 220,000 stars aren't curiosity. They're recognition. Every developer who starred that repository has seen the failure patterns Karpathy described, usually in code review. The PR that touched six files nobody asked it to touch. The service that grew an abstraction layer the original ticket didn't mention. The change that started executing before anyone agreed on what success looked like.

These failure modes aren't algorithmic. They're judgment failures. AI agents are demonstrably good at producing syntactically correct, locally coherent code. They are poor — by default — at the professional behaviors developers accumulate over years: scope discipline, explicit uncertainty when confused, proportionality between problem size and solution complexity. The four Karpathy rules map directly onto the feedback engineers receive in code review until they stop needing it:

"Think before coding" is "don't just start typing; understand the requirement first."

"Simplicity first" is "why is this abstraction here?"

"Surgical changes" is "you changed three files that weren't in scope."

"Goal-driven execution" is "agree on what done looks like before you start."

Every senior engineer has left at least two of these comments. Some have made them into explicit team norms. The novelty is having to write them down for a tool.

The Accountability Loop AI Agents Skip

When a junior developer makes these mistakes, they get code review feedback. They read a comment that says "why is this abstraction here?" or "you changed three files you weren't asked to touch." In a functional team, the pattern surfaces in retrospectives too. Over two or three years of repeated feedback, they build a mental model of what reviewers care about and what going beyond scope actually costs.

AI agents compress the generation cycle. They do not participate in the accountability cycle.

An agent that makes a silent wrong assumption doesn't receive a comment on the assumption. It receives a comment on the code that resulted from the assumption. The feedback lands on the artifact, not the process. And the next session, the same agent has no memory of the pattern. It will make the same class of mistake again unless you tell it not to in advance.

The CLAUDE.md is a workaround for this. You can't give an agent years of accumulated code review feedback. You can write down the four behaviors that feedback would have eventually produced and put them in a configuration file. It's not elegant. It works. The 220,000 stars suggest the list is the right one — which implies that the right list is exactly what senior engineers spent years internalizing, now written down for the thing that skipped the internalization process.

What the Default Behavior Actually Costs

The agents Karpathy was using before writing those rules were not broken. They were generating real code at real speed. The problem was the shape of the waste they produced.

Silent assumptions generate code that is correct for the assumption the agent made, not the task you intended. You catch this in review — but only if you read the diff carefully enough to notice the assumption embedded in it. Agents generate large diffs quickly, which works against careful reading.

Over-engineering is subtler. An agent that adds an abstraction layer you didn't ask for has made the codebase larger and the next session's context more complex. At the individual PR level it can look like thoroughness. The cost shows up later, when the abstraction turns out to be wrong for where the system actually goes, and you spend a session pulling it out.

Collateral changes — touching files adjacent to the task — create review surface area that shouldn't exist. The reviewer now has to determine whether those edits are intentional improvements or accidental modifications. Both possibilities require time to verify.
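One way to make that review burden visible is to compare the files a diff touched against the paths the task was meant to cover. The following is a minimal sketch, not a description of any existing tool: the function name, the glob-style scope patterns, and the file paths are all hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatchcase


def collateral_files(changed_files, in_scope_patterns):
    """Return the changed files that match none of the in-scope patterns.

    Patterns use shell-style wildcards. Note that with fnmatch, "*" also
    matches "/" characters, so "billing/*" covers nested directories too.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in in_scope_patterns)
    ]


# Hypothetical example: the task was scoped to the billing package only.
changed = [
    "billing/invoice.py",
    "billing/tests/test_invoice.py",
    "auth/session.py",
    "README.md",
]
scope = ["billing/*"]

print(collateral_files(changed, scope))
# → ['auth/session.py', 'README.md']
```

Anything the function returns is a file the reviewer has to ask about, which is the extra verification time described above.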

None of these are catastrophic in isolation. They compound. A codebase maintained with an unconstrained agent for a few months has more code than it should, more abstraction than it needs, and a commit history full of diffs that touched more than they should have. Review gets harder. Churn goes up. The signal-to-noise ratio in each diff degrades.

What You Can Measure

If you track your coding sessions, the difference between constrained and unconstrained agent behavior is visible in a specific signal: code churn. The percentage of lines written in a session that get reverted or significantly modified in the next few sessions goes down when the agent has explicit scope constraints. Smaller, correctly scoped diffs review faster and survive longer.
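The churn definition above reduces to a simple ratio. Here is a rough sketch of how it could be computed, assuming you already have per-session counts of lines written and lines later reverted or rewritten. How those counts are collected is left open, and the `Session` type, its field names, and the numbers in the example are made up for illustration rather than taken from any real measurement.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical record of one coding session."""
    lines_written: int          # lines added during the session
    lines_changed_later: int    # of those, lines reverted or rewritten in later sessions


def churn_rate(session: Session) -> float:
    """Percentage of a session's new lines that did not survive later sessions."""
    if session.lines_written == 0:
        return 0.0  # nothing written, so nothing could churn
    return 100.0 * session.lines_changed_later / session.lines_written


# Placeholder numbers, not real data.
example = Session(lines_written=200, lines_changed_later=30)
print(churn_rate(example))
# → 15.0
```

Comparing this number across sessions run with and without scope constraints is one way to check whether the constraints are doing anything in your own codebase.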

Developers who adopted Karpathy's four rules and tracked their sessions reported lower error rates; one analysis cited a drop from roughly 40% to near 3% after implementing structured CLAUDE.md constraints. That specific number may not generalize to your codebase, but the direction is consistent: constraints reduce the unauthorized-edit and over-engineering failure modes, and those failures are expensive to catch in review.

PR cycle time reflects the same thing. A PR that the agent scoped correctly, with no collateral changes and no abstractions nobody asked for, moves through review faster than one with 600 lines of changes across files the reviewer didn't expect to see.

What you can't see without tracking is how much of your current churn comes from this category of failure versus the model getting the logic wrong on hard problems. If your agentic productivity numbers are below expectations and the model seems capable, the issue may not be intelligence. It may be default behavior on easy decisions.

What the Hire Signals

Karpathy's role at Anthropic is pre-training — building fundamental behaviors into Claude before it reaches any user. That's the right level for this problem. The four CLAUDE.md rules are a post-hoc patch. The actual fix is a model that doesn't need the file because it already treats scope as a constraint, ambiguity as a reason to ask, and complexity as a cost rather than a feature.

That's a pre-training problem, not a prompting problem. Writing "think before coding" in a configuration file tells the model to behave differently during that session. Training against the failure modes Karpathy documented would produce a model that behaves that way by default, without needing the file.

Until that lands, the 65-line file is the most effective tool most developers have for getting AI agents to behave like developers who have been around long enough to know better. The engineering managers who have been writing these same comments in code review for twenty years are not surprised it went viral. They wrote it too. They just called it "feedback."

Written by Kevin — builder of xeve
