RESEARCH · JUNE 11, 2026 · 5 MIN READ
59.4% of Agent Tokens Go to Code Review, Not Code Gen
Concordia research measured token consumption across six SDLC stages and found code review consumes 59.4% of all tokens, while initial generation uses just 8.6%.
Most engineering teams buying into agentic development assume the expensive part is generation: the model writing thousands of lines, reasoning through architecture, producing output. New research from Concordia University says that assumption is wrong by a factor of nearly seven. Code review is where the money actually goes, and the reason is architectural, not incidental.
What the Concordia Study Found
The paper, "Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering," analyzed execution traces from 30 software development tasks run through ChatDev, a multi-agent framework with agents assigned roles including programmer, tester, and code reviewer. Researchers led by Emad Shihab at Concordia's Data-driven Analysis of Software lab mapped token consumption across six stages: design, coding, code completion, code review, testing, and documentation.
Code review consumed 59.4% of all tokens. Initial code generation: 8.6%.
Across all stages, input tokens (the context fed into the model) made up 53.9% of total consumption. Output tokens were 24.4%. The coding stage is the only outlier: it runs output-heavy at 58% output versus 6.9% input, which makes sense, since a single instruction can produce hundreds of lines. Every other stage, including testing and documentation, is dominated by input.
The study used a single framework and a single model, GPT-5, across 30 tasks. ChatDev is a research framework, not a production tool, so the specific percentages may shift in other environments. The underlying dynamic, that iterative refinement and verification loops are structurally more expensive than generation, is likely to hold across conversational multi-agent architectures.
Why the Number Is So Large
The Concordia researchers call it a "communication tax." In a conversational multi-agent system, agents engaged in code review pass the full codebase back and forth on every turn. The model is stateless. Each turn requires re-sending everything that came before: the codebase, prior review comments, every tool result. Context accumulates. A single Claude Code session measured at tokenscope.pages.dev (one real session, n=1) reached a peak context of 998,000 tokens across 1,270 model turns, with 66% of a $1,278 bill going to re-sent cached context. Output was 14%.
That session was not primarily doing code review. A review-heavy session, re-reading a full codebase on every feedback turn, would push the input dominance further.
The Concordia report frames it directly: "The primary cost of agentic software engineering lies not in initial code generation but in the iterative, conversational process of refinement and verification."
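The re-sending dynamic described above can be made concrete with a toy model. The sketch below is not taken from the Concordia paper or from any real billing data; the function names and every number in it are illustrative assumptions. It compares a loop that re-sends the whole accumulated context on each turn with a hypothetical baseline that sends the base context once and then only each turn's new material.

```python
def cumulative_input_tokens(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens when every turn re-sends the full accumulated context.

    Toy model (assumption): each turn appends a fixed amount of new material
    (review comments, tool results) to the context.
    """
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # the whole context is sent again on this turn
        context += tokens_per_turn  # this turn's new material joins the history
    return total


def incremental_input_tokens(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Hypothetical baseline: the base context is sent once, new material once per turn."""
    return base_context + tokens_per_turn * turns


if __name__ == "__main__":
    # Made-up figures: a 10,000-token codebase, 500 new tokens per turn, 20 turns.
    resend = cumulative_input_tokens(10_000, 500, 20)
    once = incremental_input_tokens(10_000, 500, 20)
    print(resend, once, round(resend / once, 2))
```

With these made-up inputs, the re-send loop consumes 295,000 input tokens against 20,000 for the baseline. Because the re-sent total grows roughly with the square of the turn count, the gap widens as the conversation gets longer.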
The Trust Problem Compounds the Cost Problem
Cost is one half of the problem. The other is that review quality degrades as context grows. Kilo, the VS Code-based coding agent, processed more than 40 trillion tokens across three million downloads and drew a pointed conclusion from its own usage data: task size should be bounded by what a human can review in a single sitting. If the output cannot be reviewed, it will not be merged.
A sprawling diff produced by a poorly scoped agent is not a cost problem alone; it is a trust problem. Sourcegraph's analysis of 1,281 agent runs found that the gap between success and failure had almost nothing to do with model capability. It had to do with context access and infrastructure. The same task took an agent 96 tool calls over 84 minutes without proper retrieval tooling, and five calls in under five minutes with it.
More context passed to a model does not produce more accurate review. At some point it produces noise, and humans rubber-stamp the result because the diff is too large to read carefully.
Where This Leaves Engineering Teams Today
GitHub moved Copilot coding agents to token-based billing. Anthropic has shifted toward consumption-based API pricing. Microsoft canceled Claude Code licenses across its Windows, M365, and Surface engineering organizations. Uber burned its entire 2026 AI budget in four months. The billing model is no longer flat. Every review turn is a line item.
The Concordia researchers suggest inserting a human checkpoint before the iterative code review loop begins as one mitigation. That helps with cost. It does not fix the structural problem, which is that conversational agents re-read the entire codebase on each turn regardless of how much of it is relevant to the current diff.
Context engineering (loading only what is needed, compressing history, applying strict retrieval thresholds) is the practical discipline. But even aggressive context management does not change the fact that a conversational review loop is the wrong architecture for deterministic checks.
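As a rough illustration of "loading only what is needed" with a retrieval threshold, the sketch below filters candidate context chunks by a relevance score and a token budget. The `Chunk` type, the `select_context` function, and the assumption that relevance scores are precomputed are all hypothetical; real tools will score and pack context differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    path: str
    tokens: int
    score: float  # relevance to the current diff, 0.0 to 1.0 (assumed precomputed)


def select_context(chunks: list[Chunk], min_score: float, token_budget: int) -> list[Chunk]:
    """Keep the most relevant chunks that clear the threshold and fit the budget."""
    eligible = sorted(
        (c for c in chunks if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    selected: list[Chunk] = []
    used = 0
    for chunk in eligible:
        if used + chunk.tokens > token_budget:
            continue  # would overflow the budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected


if __name__ == "__main__":
    candidates = [
        Chunk("a.py", 400, 0.9),
        Chunk("b.py", 700, 0.8),
        Chunk("c.py", 300, 0.6),
        Chunk("d.py", 100, 0.2),
    ]
    for chunk in select_context(candidates, min_score=0.5, token_budget=800):
        print(chunk.path, chunk.tokens)
```

In this example `d.py` falls below the threshold and `b.py` is skipped because it would exceed the budget, so only `a.py` and `c.py` are loaded. The greedy packing is a deliberate simplification; it trades optimal budget use for predictable behavior.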
Why Code Review Is the Right Place to Specialize
Security findings, hardcoded secrets, dead code, correctness errors on changed functions: these are not ambiguous. They do not require reading the entire codebase on every turn. They require reading the diff and the relevant context once, running structured checks, and producing a result that can be verified independently.
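To show what a diff-scoped, deterministic check can look like, here is a minimal sketch that reads a unified diff once and flags hardcoded secrets on added lines only. It is not Hyrax's implementation or any product's actual rule set: `scan_diff`, `SECRET_PATTERNS`, and the two regular expressions are illustrative assumptions, and a real scanner would use a far larger, tuned set of rules.

```python
from __future__ import annotations

import re

# Illustrative patterns only (assumption); not a complete secret-detection rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_credential": re.compile(
        r"(?i)\b(password|secret|api_key|token)\b\s*=\s*['\"][^'\"]{8,}['\"]"
    ),
}

# Unified diff hunk header, e.g. "@@ -1,3 +1,4 @@"; counts default to 1 when omitted.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def scan_diff(diff_text: str) -> list[tuple[str, int, str]]:
    """Return (file, new_line_number, rule) for each finding on an added line."""
    findings: list[tuple[str, int, str]] = []
    current_file = "<unknown>"
    old_left = new_left = 0  # lines still expected in the current hunk
    new_line = 0             # line number in the new version of the file
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: only file headers and hunk headers matter.
            if line.startswith("+++ "):
                current_file = line[4:].removeprefix("b/")
            else:
                match = HUNK_HEADER.match(line)
                if match:
                    old_left = int(match.group(1) or 1)
                    new_line = int(match.group(2))
                    new_left = int(match.group(3) or 1)
            continue
        if line.startswith("+"):
            for rule, pattern in SECRET_PATTERNS.items():
                if pattern.search(line[1:]):
                    findings.append((current_file, new_line, rule))
            new_line += 1
            new_left -= 1
        elif line.startswith("-"):
            old_left -= 1  # removed lines are not scanned
        elif line.startswith("\\"):
            continue       # "\ No newline at end of file" marker
        else:
            new_line += 1  # unchanged context line
            old_left -= 1
            new_left -= 1
    return findings
```

Given the same diff, this check returns the same findings every time, and each finding names a file and line that a human or another tool can verify independently. Its cost scales with the size of the diff rather than with the size of the repository or the number of review turns.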
Hyrax covers six agent domains (security, code quality, reliability, API and data, ops, and UX) on every diff, using a 13-step verification process before submitting a PR. The scope is the changed code and its dependencies, not the full repository re-read on every review comment. The Concordia 59.4% number describes what happens when review is done conversationally by a general-purpose agent. Deterministic, diff-scoped review is a different architecture, and it is what the verification gate post describes in detail.
The cost argument and the trust argument converge at the same point: review is the bottleneck, and the bottleneck is expensive because the architecture is wrong. Specialization is not a premature optimization at this stage. It is the correct response to a known structural failure.
Hyrax is live at hyrax.dev.