RESEARCH · JUNE 15, 2026 · 5 MIN READ
Tests Pass, Code Is Vulnerable: The SusVibes Finding
The SusVibes benchmark ran 200 real OSS vulnerability tasks and found that AI agents ship working but vulnerable code, and the tests still pass. Here is what defenders should do.
The SusVibes benchmark ran 200 real-world vulnerability-fixing tasks against the leading AI coding agents and found a consistent pattern: agents pass functional tests at rates approaching 85%, but pass security tests at rates below 25%. The gap is not noise. It is structural, and it has direct consequences for any team whose merge gate is "CI is green."
What SusVibes Actually Measures
Most agent security benchmarks measure offensive capability: can the model reproduce a crash, generate an exploit, or identify a CVE. SusVibes measures something different. Each of the 200 tasks is reconstructed from a real OSS commit where a human developer previously introduced a security vulnerability. The agent gets the vulnerable codebase and is asked to fix it. Functional tests check whether the code still works. Security tests check whether the vulnerability is actually gone.
Endor Labs ran Claude Fable 5 against this benchmark on June 10, 2026, and published the results. Fable 5 reached 59.8% on functional solves and 19.0% on security solves. The best-performing combination across the entire leaderboard, Cursor with GPT-5.5, reached 84.9% functional and 24.0% security. The gap between those two numbers is what matters: a model can make code work while leaving the vulnerability intact, and standard tests will not catch it.
A Carnegie Mellon study cited by Symbiotic Security put harder numbers on the underlying dynamic: AI agents produce a functionally correct implementation 61% of the time, but a solution that is both functional and secure only 10.5% of the time. That works out to roughly eight out of ten working patches being vulnerable.
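The "eight out of ten" figure follows from the quoted percentages if the 10.5% is read as a share of all tasks rather than of working implementations. That reading is an assumption here, as is the premise that every secure solution is also counted as a functional one. Under those assumptions, a minimal sketch of the arithmetic, applied to the other figures quoted above as well (the function name is hypothetical):

```python
def vulnerable_share_of_working(functional_pct, secure_pct):
    """Fraction of functionally correct solutions that are still vulnerable.

    Assumes both percentages are measured over the same set of tasks and
    that every secure solution is also counted as functional.
    """
    if functional_pct <= 0:
        raise ValueError("functional_pct must be positive")
    return 1 - secure_pct / functional_pct


# (functional %, secure %) pairs as quoted in the article.
results = {
    "CMU study": (61.0, 10.5),
    "Claude Fable 5": (59.8, 19.0),
    "Cursor + GPT-5.5": (84.9, 24.0),
}

for name, (functional, secure) in results.items():
    share = vulnerable_share_of_working(functional, secure)
    print(f"{name}: {share:.1%} of working solutions remain vulnerable")
```

Under these assumptions the three rows come out at about 83%, 68%, and 72% respectively, so the "most working patches are vulnerable" pattern holds across all of the quoted results, not only the CMU figures.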
The Cheating Problem Compounds the Risk
Endor Labs also documented a cheating pattern that makes the security numbers worse than they appear. Of 200 SusVibes instances, agents were confirmed to have cheated on 38, with training recall accounting for 33 of those cases. In training recall, the model reproduces the upstream fix from training data rather than deriving it from the vulnerable codebase.
This matters for production code because the inverse holds too. An agent writing net-new code cannot cheat its way to safety, because there is no upstream fix to recall. The 19.0% security pass rate on known-CVE tasks is therefore a ceiling, not a floor, for novel vulnerability classes. The benchmark is generous.
The observable artifact in a real PR is the same: tests pass, code looks plausible, the reviewer approves. The difference between a genuine fix and a training-recalled one is invisible to the human eye and to any test suite that was not specifically designed to detect the vulnerability class.
When Steganographic Payloads Hide in Plain Sight
Benchmark results describe failure rates. The shape of the underlying risk shows up clearly when steganographic payloads are involved.
A payload can be hidden as a single line in a config file like postcss.config.mjs, padded with thousands of spaces before the malicious content, making it invisible in standard git diff output and in many editors that wrap or truncate long lines. Code like this can pull command-and-control instructions from arbitrary remote sources. Nothing in the standard review pipeline catches it. No test fails. No linter flags it. The diff looks clean.
This is not a novel attack class. Steganographic payloads using whitespace, Unicode homoglyphs, and zero-width characters have appeared in supply chain incidents before. What is new is the combination: AI coding agents ship at volume, reviewers trust green CI, and adversaries have had time to study both assumptions.
Why "Tests Pass" Is the New LGTM
The old signal for a safe merge was "LGTM" from a senior reviewer. That signal degraded as PRs got larger and review time stayed flat. The replacement signal teams have adopted is "CI is green." The SusVibes benchmark demonstrates that this replacement is equally unreliable for security.
Functional tests are designed to verify behavior, not to detect vulnerability patterns. A SQL injection in a new query path does not cause a test to fail unless the test suite includes a specific injection test for that path. An XSS sink in an error response, like the Streamlit CVE-2023-27494 that appeared in the SusVibes hall-of-fame solves, does not fail any functional test; it only fails a test that was written to probe that exact reflection vector. Most codebases do not have those tests. Most PRs do not add them.
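The SQL injection case can be made concrete with a small, self-contained sketch. Everything in it is hypothetical and written for illustration (the `users` table, the function names, the payload); it is not taken from SusVibes. It shows a query helper that passes a functional test while failing a test written specifically to probe injection.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_vulnerable(conn, name):
    # Builds the query by string interpolation: injectable.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Uses a bound parameter, so the input is never parsed as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


def functional_test(find_user):
    """Checks behavior only: looking up a known user returns that user."""
    return find_user(make_db(), "alice") == [("alice", "a-secret")]


def security_test(find_user):
    """Probes injection: a tautology payload must not return any rows."""
    return find_user(make_db(), "nobody' OR '1'='1") == []


for fn in (find_user_vulnerable, find_user_safe):
    print(fn.__name__, "functional:", functional_test(fn),
          "security:", security_test(fn))
```

Both helpers pass the functional test. Only the parameterized one passes the security test, and a suite that lacks that second test cannot tell the two apart.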
Whitespace-padding attacks follow the same logic applied to review. The payload does not break anything. It has no observable effect until the C2 channel activates. Tests pass because nothing the tests check has changed.
The Controls That Actually Cover This Surface
Green CI should stay in the pipeline, but it needs supplementation at the diff level. Four controls address the specific failure modes the SusVibes data and whitespace-padding incidents illustrate.
SAST on every PR diff, not just on schedule. Scheduled scans find vulnerabilities weeks after they merge. Diff-level SAST runs at PR creation and blocks the merge on high-severity findings before they enter the default branch.
Whitespace and encoding anomaly detection in CI. A lint rule that flags lines exceeding a configurable character threshold, or that detects non-printable Unicode in source files, catches space-padded payloads. This is a one-hour CI addition, and its false-positive rate on hand-written code is near zero once minified and generated files are excluded.
Security test coverage as a merge requirement. If the PR introduces a new authentication path, a new query, or a new file-handling operation, the merge gate should require at least one security-relevant test for that surface. Enforcing this automatically requires static analysis of what the diff touches, which is achievable with existing tooling.
Autonomous review that reads the full diff for known-bad patterns, not just the summary. The gap between functional and security pass rates in SusVibes exists precisely because most review tooling checks whether code works, not whether it matches a known vulnerability class. An agent that runs against the full diff with a security mandate surfaces a different class of finding than one that checks style and coverage.
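The whitespace and encoding check from the second control above can be sketched in a few lines. This is a minimal sketch, not a vetted linter: the thresholds, rule names, and the `scan_text` function are assumptions made for illustration. It flags overlong lines, long runs of spaces, and invisible control or format characters such as zero-width spaces. It does not detect homoglyphs, which would need a confusables table.

```python
import unicodedata

# Assumed thresholds; tune per repository.
MAX_LINE_LENGTH = 500
MAX_SPACE_RUN = 200


def scan_text(text, max_line_length=MAX_LINE_LENGTH, max_space_run=MAX_SPACE_RUN):
    """Return a list of (line_number, rule) findings for one file's text."""
    findings = []
    # Split on "\n" only: str.splitlines() would also split on some of the
    # control characters this check is meant to report.
    for lineno, raw in enumerate(text.split("\n"), start=1):
        line = raw.rstrip("\r")
        if len(line) > max_line_length:
            findings.append((lineno, "line-too-long"))
        if " " * max_space_run in line:
            findings.append((lineno, "long-space-run"))
        # Unicode categories Cc (control) and Cf (format) cover zero-width
        # and bidirectional-override characters. Tabs are allowed.
        hidden = sorted(
            {
                f"U+{ord(ch):04X}"
                for ch in line
                if ch != "\t" and unicodedata.category(ch) in ("Cc", "Cf")
            }
        )
        if hidden:
            findings.append((lineno, "invisible-char:" + ",".join(hidden)))
    return findings


sample = (
    "export default {};\n"
    "plugins: []," + " " * 3000 + "fetch('x')\n"
    "let b\u200b = 2;\n"
)
for lineno, rule in scan_text(sample):
    print(f"line {lineno}: {rule}")
```

In CI, a wrapper would run this over the files changed in the PR and fail the job on any finding. A byte-order mark at the start of a file falls in category Cf and would be reported, so a real rule would likely need an allowlist for that case and for generated or minified files.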
Hyrax runs autonomous review across security and code quality, submits findings as PRs with verified fixes, and verifies its own work before the PR reaches a human. The user merges. Nothing auto-merges. That separation exists because the SusVibes data is right: autonomous agents make mistakes on security, and a human gate at merge is the correct final control, not the only one.
The broader implication of SusVibes is that teams need to stop asking "did CI pass?" and start asking "did anything scan the security surface of this diff before it merged?" Those are different questions with different answers, and the benchmark now has 200 data points on the cost of conflating them.
Hyrax is live at hyrax.dev.