RESEARCH · JUNE 24, 2026 · 6 MIN READ

FrontierCode: 86.6% of AI PRs Fail the Merge Bar

Cognition's FrontierCode benchmark scores whether maintainers would merge an AI PR. Claude Opus 4.8 leads at 13.4%. The other 86.6% land somewhere.

Cognition published FrontierCode on June 8, 2026, and the headline result is plain: Claude Opus 4.8, the best-performing model, scores 13.4% on Diamond, the benchmark's hardest subset. That means 86.6% of its pull requests would be rejected under the criteria written by the senior OSS maintainers who built the evaluation. The benchmark does not ask whether the code compiles. It asks whether a real maintainer would merge it.

What FrontierCode Actually Measures

SWE-bench and its variants score whether generated code passes tests or matches a reference solution. FrontierCode scores something harder: whether the PR-shaped output would survive a genuine review. Cognition recruited 20-plus OSS maintainers from 36 repositories, each spending more than 40 hours per task to distill their judgment into concrete evaluation criteria. The definition of "mergeable" came from people who actually approve commits.

The benchmark has three nested difficulty subsets: Extended with 150 tasks, Main with the 100 hardest, and Diamond with the 50 hardest. Scores on Diamond are where the real signal lives. On Extended, Claude Opus 4.8 reaches 51.8%, which sounds tolerable until you notice that roughly half its output (48.2%) still fails a maintainer's bar even on the broadest set, which includes the easier tasks.

The Four Dimensions That Kill AI PRs

FrontierCode does not just check behavior. Four dimensions separate a passing-tests patch from a mergeable one.

Scope discipline. The benchmark uses automatic checks for diff size, file boundaries, and semantic locality. Models that refactor adjacent code while fixing a bug fail this criterion, even when the refactor is correct in isolation. A maintainer reviewing a targeted bug fix does not want to evaluate an unrequested cleanup at the same time.
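A file-boundary and diff-size check of this kind can be sketched in a few lines. This is a minimal illustration, not Cognition's implementation: the function name `check_scope`, the `ScopeReport` type, the glob patterns, and the line budget are all assumptions made for the example, and semantic locality is not covered. The input is the text produced by `git diff --numstat`, which prints added lines, deleted lines, and path separated by tabs.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class ScopeReport:
    out_of_scope: list[str]   # changed paths outside the allowed boundary
    changed_lines: int        # added + deleted lines across the diff
    ok: bool                  # True when both checks pass


def check_scope(numstat: str, allowed_globs: list[str],
                max_changed_lines: int) -> ScopeReport:
    """Check `git diff --numstat` output against a file boundary and a size budget.

    Hypothetical helper: the boundary and budget are chosen by the reviewer.
    Renamed files, which numstat prints as "old => new", are not handled here.
    """
    out_of_scope: list[str] = []
    changed = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported as "-"; count them as zero changed lines.
        changed += int(added) if added != "-" else 0
        changed += int(deleted) if deleted != "-" else 0
        if not any(fnmatch(path, pattern) for pattern in allowed_globs):
            out_of_scope.append(path)
    ok = not out_of_scope and changed <= max_changed_lines
    return ScopeReport(out_of_scope, changed, ok)


# Example: a logging fix that also rewrote an unrelated networking file.
numstat = "12\t3\tsrc/log/writer.cpp\n40\t38\tsrc/net/socket.cpp\n"
report = check_scope(numstat, allowed_globs=["src/log/*"], max_changed_lines=50)
print(report.out_of_scope, report.changed_lines, report.ok)
```

In this example the patch is flagged twice: it touches a file outside `src/log/`, and its 93 changed lines exceed the assumed budget of 50.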

Test quality. FrontierCode's "reverse-classical" method runs the agent's own test suite against the original broken codebase. If the tests do not fail on the unfixed code, the agent did not understand the problem well enough to test for it. This catches a specific failure: tests written to pass, not tests written to verify.

Style adherence. Cognition's example involves a C++ task where Claude Opus 4.8 mixed LOG_WARNING() and std::cerr calls across multi-line messages. Both produce identical runtime behavior. Only one follows codebase convention. FrontierCode flags it; SWE-bench-style evaluation would not. The distinction matters because the non-conforming version bakes in an assumption about LOG_WARNING() that may not hold if the logging layer changes later.

Regression safety. Adaptive classical grading uses an LLM to adjust reference tests for valid alternative implementations, avoiding false failures from superficial differences in function names or error strings. This keeps the signal clean while still catching actual regressions.

The Benchmark Comparison That Matters

Cognition reports that FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro, validated through METR's prior finding that high-scoring models on existing benchmarks frequently generate patches real maintainers reject. The misclassification errors run both directions: false positives where incomplete test suites reward wrong solutions, and false negatives where overly specific tests penalize valid alternatives.
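To make the two error directions concrete, here is a small sketch of how misclassifications could be counted when each patch has both a benchmark verdict and a maintainer verdict. The function name and the sample data are illustrative assumptions; the source does not describe how Cognition computed its 81% figure.

```python
def misclassification(pairs):
    """Count benchmark errors against maintainer judgment.

    `pairs` is an iterable of (benchmark_passed, maintainer_would_merge) booleans.
    Returns (false_positives, false_negatives, error_rate).
    """
    pairs = list(pairs)
    # False positive: the benchmark rewards a patch a maintainer would reject.
    false_pos = sum(1 for passed, merge in pairs if passed and not merge)
    # False negative: the benchmark penalizes a patch a maintainer would merge.
    false_neg = sum(1 for passed, merge in pairs if not passed and merge)
    rate = (false_pos + false_neg) / len(pairs) if pairs else 0.0
    return false_pos, false_neg, rate


# Made-up verdicts for four patches, for illustration only.
sample = [(True, True), (True, False), (False, True), (False, False)]
print(misclassification(sample))  # (1, 1, 0.5)
```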

The Diamond scores across all tested models tell a consistent story. Claude Opus 4.8 at 13.4%. GPT-5.5 at 6.3%. Gemini 3.1 Pro at 4.7%. Kimi K2.6, the leading open-source model, at 3.8%. The benchmark is not saturated. There is no ceiling effect distorting the rankings. The scores reflect genuine current capability on the dimension that matters for production code.

Why This Is Specifically an Autonomous Code Governance Problem

The gap FrontierCode quantifies is not a generation problem. The models can generate code. The gap is a verification problem: confirming that what was generated meets the four dimensions above before anyone reviews it.

That is where the bottleneck moved. Code volume is no longer the constraint. The constraint is the cost of validating each PR against scope, test integrity, style, and regression safety before a human has to evaluate it. At current agent output rates, that validation cannot be manual at scale.

Hyrax runs 13 verification steps across six agent domains before submitting a PR. The domains map directly onto what FrontierCode penalizes: code quality agents enforce style consistency and idiomatic patterns; reliability agents check regression safety; the 13-step process includes scope checks that confirm fixes stay within the intended surface area. The PR Hyrax submits has already been evaluated against the dimensions that FrontierCode shows current models fail 86.6% of the time. The user merges. Hyrax does not auto-merge.

What Teams Should Do With These Numbers

FrontierCode Diamond is unsaturated at 13.4%. That number will improve. But the path from 13.4% to production-trustworthy requires progress on context utilization, reliable scope discipline, and requirements integration, none of which is imminent. Teams building review workflows around current AI PR output are absorbing risk that does not show up in test-pass rates.

Two adjustments are defensible today. First, treat scope as a separate, explicit check. FrontierCode's scope criterion is cheap to implement: define which files a PR should touch and verify the diff stays inside that boundary. Second, require that new tests fail on the unfixed baseline before they count as coverage. Both checks are mechanical and do not depend on model capability improving.

The 86.6% rejection rate is not a forecast of future performance. It is a measurement of current state. Building governance workflows around that measurement, rather than around SWE-bench numbers that do not predict merge outcomes, is the practical response.

Hyrax is live at hyrax.dev.
