RESEARCH · JUNE 14, 2026 · 5 MIN READ

Code Writes Itself. Review Doesn't. The MIT/Wharton Numbers.

A May 2026 NBER study of 100,000+ developers found AI coding agents produced 741% more code but only 20% more releases, confirming review as the binding constraint.


A May 2026 NBER working paper by Mert Demirer, Leon Musolff, and Liyuan Yang studied more than 100,000 GitHub developers across three generations of AI coding tools. The central finding: autonomous agents produced 741% more lines of code, while actual software releases rose 20%. The bottleneck is no longer writing code. It is everything that happens after.

What the Study Measured

The paper, NBER Working Paper 35275, traced developer output at every stage of the production chain, from lines written to commits, pull requests, projects, and finally releases. Each generation of tooling showed the same compression pattern. Autocomplete raised lines of code by 228%, but releases by only 10%. Sync agents raised lines of code by 741%, pull requests by 65%, and releases by 20%. Async agents raised lines of code by roughly 1,700%, but releases by only 30%.

The gain shrinks at every downstream step. The authors call this the "weak-link hypothesis": accelerating one stage moves the constraint to the next human-dependent stage rather than removing it.
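The compression can be made concrete with a little arithmetic. The sketch below uses only the percentages quoted above; the "pass-through" ratio (release gain divided by lines-of-code gain) is an illustrative metric introduced here, not one the paper is known to report.

```python
# Percentage increases quoted in the article, by tool generation.
# "lines" and "releases" are the two ends of the production chain.
REPORTED_GAINS = {
    "autocomplete": {"lines": 228.0, "releases": 10.0},
    "sync agents": {"lines": 741.0, "releases": 20.0},
    "async agents": {"lines": 1700.0, "releases": 30.0},  # "roughly 1,700%"
}


def pass_through(lines_gain_pct: float, releases_gain_pct: float) -> float:
    """Share of the lines-of-code gain that shows up as a release gain.

    Hypothetical helper for illustration only.
    """
    return releases_gain_pct / lines_gain_pct


for generation, gains in REPORTED_GAINS.items():
    ratio = pass_through(gains["lines"], gains["releases"])
    print(f"{generation}: {ratio:.1%} of the code gain reached releases")
```

On these figures the ratio falls with each generation of tooling, from about 4.4% for autocomplete to under 2% for async agents: the more code the tools produce, the smaller the fraction that survives to release.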

The researchers also checked four major app marketplaces. New app creation rose sharply. Total user downloads across those same cohorts: flat. The share of new apps failing to reach a minimal audience climbed from about 79% to 86%. More supply, no demand response.

The Elasticity Number That Matters

The authors estimated an elasticity of substitution of 0.25 between AI-generated output and human review effort. A value below 1.0 means the inputs are complements. A value of 0.25 means they are strongly so. Doubling AI code output does not halve the review burden. It increases demand on review while the supply of human reviewer hours stays constant.
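To see what an elasticity of 0.25 implies, here is a minimal sketch using a textbook two-input CES aggregate. The functional form, the equal input weights, and the normalization to 1.0 are all assumptions made for illustration; only the 0.25 value comes from the article, and the paper's own model may differ.

```python
def ces_output(ai: float, review: float, sigma: float = 0.25, share: float = 0.5) -> float:
    """Textbook CES aggregate of two inputs.

    sigma is the elasticity of substitution (0.25 per the article).
    share = 0.5 is an assumed weight, not a figure from the paper.
    """
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    return (share * ai**rho + (1.0 - share) * review**rho) ** (1.0 / rho)


baseline = ces_output(1.0, 1.0)  # both inputs normalized to 1.0

# Hold review effort fixed and scale only the AI input.
for ai_multiple in (2.0, 8.41):  # 8.41x corresponds to "741% more"
    gain = ces_output(ai_multiple, 1.0) / baseline - 1.0
    print(f"AI output x{ai_multiple}: aggregate output up {gain:.0%}")
```

Under these assumed weights, aggregate output cannot rise by more than a factor of 2^(1/3), about 26%, however far the AI input grows while review stays fixed. The exact ceiling depends on the assumed weights, but the shape is the point: with strong complements, the scarce input sets the limit.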

That number is the reason 65% more pull requests did not produce 65% more releases. The queue grew faster than it could be cleared.
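The queue dynamic is simple to sketch. The example below assumes a hypothetical team whose weekly review capacity exactly matched its pull request volume before AI tooling; the 100-per-week figures and the 12-week horizon are invented for illustration, and only the 65% increase comes from the study.

```python
def backlog_after(weeks: int, arrivals_per_week: float, reviews_per_week: float) -> float:
    """Open pull requests after `weeks`, starting from an empty queue."""
    backlog = 0.0
    for _ in range(weeks):
        # Unreviewed PRs carry over; the backlog cannot go below zero.
        backlog = max(0.0, backlog + arrivals_per_week - reviews_per_week)
    return backlog


BASELINE_PRS = 100.0  # hypothetical weekly PR volume before AI tooling
CAPACITY = 100.0      # hypothetical weekly review capacity, held constant

before = backlog_after(12, BASELINE_PRS, CAPACITY)
after = backlog_after(12, BASELINE_PRS * 1.65, CAPACITY)
print(f"Backlog after 12 weeks, baseline volume: {before:.0f}")
print(f"Backlog after 12 weeks, 65% more PRs:   {after:.0f}")
```

With capacity fixed, the backlog in this toy model grows by the full weekly surplus and never stabilizes, which is why extra pull requests turn into waiting time rather than releases.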

An earlier METR trial found developers using AI coding tools believed they finished work 20% faster when actual completion times ran 19% longer. That perception gap is part of why teams undercount how much of their productivity gain has been absorbed by review, integration, and release overhead rather than captured as shipped software.

Where the Cost Moved

For roughly a decade, the binding constraint in software development was writing code. AI removed that constraint quickly and at scale. The constraint did not disappear. It moved.

Integrating changes across a codebase that was modified faster than any human reviewed it, managing 65% more pull requests with the same reviewer headcount, handling security patching and dependency hygiene on a volume of code that previously did not exist: none of that got faster because the diff arrived sooner. The expensive part of software development relocated downstream, and most tooling investment has not followed it there.

Anthropic disclosed in June 2026 that more than 80% of its production code merged in May was authored by Claude, producing an 8x increase in code volume per engineer per quarter compared to its 2021–2025 baseline. That is the direction the numbers are moving industrywide.

The Review Problem Is Mostly Mechanical

The Demirer, Musolff, and Yang paper does not prescribe a solution. It does identify where the attenuation happens: review, integration, security, and release management are the stages where human judgment currently sets the ceiling on throughput.

Not all of that judgment is irreplaceable. Style consistency, known vulnerability patterns, dependency hygiene, API contract compliance, regression risk on changed paths: these are deterministic enough to be checked without a human in the loop. They consume a substantial portion of review time. Automating them does not replace the judgment that matters. It clears the queue so judgment can be applied to the 20% of review that actually requires it: architecture decisions, cross-system effects, and correctness in novel territory.
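One way to picture that split is a simple router over review findings. This is a sketch under stated assumptions, not a description of any particular tool: the `Finding` type, the category names, and the `route` function are hypothetical, with the mechanical categories taken from the list above.

```python
from dataclasses import dataclass

# Categories the article treats as deterministic enough to check automatically.
# The identifiers themselves are hypothetical labels.
MECHANICAL = {
    "style",
    "known_vulnerability",
    "dependency_hygiene",
    "api_contract",
    "regression_risk",
}


@dataclass(frozen=True)
class Finding:
    category: str
    summary: str


def route(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (automated, needs_human).

    Anything outside the mechanical set, including unknown categories,
    is routed to a human reviewer.
    """
    automated = [f for f in findings if f.category in MECHANICAL]
    needs_human = [f for f in findings if f.category not in MECHANICAL]
    return automated, needs_human


findings = [
    Finding("style", "inconsistent import order"),
    Finding("dependency_hygiene", "unpinned transitive dependency"),
    Finding("architecture", "new service bypasses the event bus"),
]
automated, needs_human = route(findings)
print(f"{len(automated)} automated, {len(needs_human)} for human review")
```

Defaulting unknown categories to the human queue is the conservative choice: the automated path only handles what has been explicitly classified as mechanical.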

That is the division of labor the NBER data points toward. The paper's own framing (complements, not substitutes) describes it precisely. More AI-generated code upstream requires more review capacity downstream, and the only path to closing the 741%-to-20% gap is making review faster without making it shallower.

What Autonomous Review Changes

Hyrax runs across six domains (security, code quality, reliability, API and data contracts, ops, and UX) and executes 13 verification steps before submitting a PR with fixes. It handles the mechanical fraction of review at machine speed: the findings that do not require judgment, the corrections that follow deterministic rules, the patterns that repeat across codebases. The human reviewer sees a smaller queue, a verified fix already attached to each finding, and their attention reserved for decisions that actually warrant it.

The study's weak-link framing is accurate. The answer is not more human reviewers. The review chain has a mechanical segment and a judgment segment. Mechanizing the first one is the only way to stop the second one from becoming a permanent ceiling on what ships.

Hyrax is live at hyrax.dev.


Sources

  1. NBER Working Paper 35275 (via Yahoo Finance)
  2. Elest.io summary
  3. Edward Conard / NBER abstract
  4. EverydayCPE analysis