Clean code in. Clean PRs out. Hyrax audits the codebase, surfaces findings across security, correctness, maintainability, performance, architecture, and operations, then writes fixes and opens PRs. You review and merge every PR.

How does pricing work?

One flat price per workspace, not per seat. Free: 1 private repo, 1 mini-audit per month, verified fixes, no card. Pro: $30/mo with $30 of usage included, up to 3 repos, full audit pipeline, opt-in overage after. Team: $200/mo flat with $200 of usage included, unlimited repos and seats. Included usage does not roll over.

What do I get with Pro vs Team?

Pro: up to 3 repos, the full audit + scan pipeline, PR reviews on every opened PR, and auto-publish to Linear. Team: unlimited repos and seats, plus the self-improvement learn loop and public repos by URL.

What is the 13-step verification?

Every fix runs through 13 steps before a PR opens: test baseline, fix agent, diff size guard, test regression, build, auto-format, lint, cross-project test, scanner quality loop, review loop, post-fix audit, detection query verify, push and PR. A failure at any critical step aborts the run.
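The abort-on-critical-failure behavior can be pictured with a small sketch. This is an illustration only, not Hyrax's implementation: the `Step` type, the `run_pipeline` function, and the choice of which steps count as critical are all hypothetical, and each step is reduced to a callable that returns pass or fail.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One verification step (hypothetical shape, for illustration only)."""
    name: str
    run: Callable[[], bool]   # returns True when the step passes
    critical: bool = True     # assumption: which steps are critical is not stated


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failing critical step."""
    log: List[str] = []
    for step in steps:
        passed = step.run()
        if passed:
            log.append(f"{step.name}: ok")
        elif step.critical:
            log.append(f"{step.name}: FAILED, run aborted")
            return False, log   # later steps, including push and PR, never run
        else:
            log.append(f"{step.name}: failed (non-critical)")
    return True, log


if __name__ == "__main__":
    demo = [
        Step("test baseline", lambda: True),
        Step("auto-format", lambda: False, critical=False),
        Step("build", lambda: False),
        Step("push and PR", lambda: True),
    ]
    ok, log = run_pipeline(demo)
    print(ok)
    print("\n".join(log))
```

The point of the ordering is that the step that opens the PR sits last, so a failure anywhere earlier in a critical step means no PR is created.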

How is Hyrax different from Copilot or Cursor?

Copilot and Cursor help you write code faster. Hyrax ships clean code. It audits issues, fixes them, and opens PRs for you to review and merge. Different category, different outcome.

Scan profiles the entire codebase — architecture, conventions, patterns — and creates an Agent Context stored in the .hyrax/ folder. Then it runs six agent groups plus a deterministic scanner. Scan produces findings and easy wins, each with a change plan ready for Fix.

Every change ships as a pull request with the [Hyrax] prefix. PR Review reviews every opened pull request automatically against the codebase conventions, leaving comments that update as the code changes. It can block merge on must-fix findings. Available on Pro and Team.

What languages are supported?

Hyrax works across 19 languages: Python, TypeScript, JavaScript, Go, Rust, Swift, Ruby, Java, Kotlin, C#, C++, C, PHP, Scala, Dart, Elixir, Shell, Lua, and MDX. It works with the frameworks built on them — React, Next.js, Vue, Svelte, Angular, Node.js, Django, Rails, Spring, FastAPI, Express, React Native, and Flutter.

What integrations are supported?

GitHub for source control. Linear for ticket management. Tickets are created on audit and closed automatically when fixes merge.

All inference runs in the Hyrax AWS Bedrock account. Hyrax does not train on customer code.

RESEARCH · JUNE 15, 2026 · 5 MIN READ

Tests Pass, Code Is Vulnerable: The SusVibes Finding

The SusVibes benchmark ran 200 real OSS vulnerability tasks and found that AI agents ship working, vulnerable code, and the tests still pass. Here is what defenders should do.

The SusVibes benchmark ran 200 real-world vulnerability-fixing tasks against the leading AI coding agents and found a consistent pattern: agents pass functional tests at rates approaching 85%, but pass security tests at rates below 25%. The gap is not noise. It is structural, and it has direct consequences for any team whose merge gate is "CI is green."

What SusVibes Actually Measures

Most agent security benchmarks measure offensive capability: can the model reproduce a crash, generate an exploit, or identify a CVE. SusVibes measures something different. Each of the 200 tasks is reconstructed from a real OSS commit where a human developer previously introduced a security vulnerability. The agent gets the vulnerable codebase and is asked to fix it. Functional tests check whether the code still works. Security tests check whether the vulnerability is actually gone.

Endor Labs ran Claude Fable 5 against this benchmark on June 10, 2026, and published the results. Fable 5 reached 59.8% on functional solves and 19.0% on security solves. The best-performing combination across the entire leaderboard, Cursor with GPT-5.5, reached 84.9% functional and 24.0% security. The gap between those two numbers is what matters: a model can make code work while leaving the vulnerability intact, and standard tests will not catch it.

A Carnegie Mellon study cited by Symbiotic Security put harder numbers on the underlying dynamic: AI agents successfully implement features 61% of the time, but only 10.5% of implementations are secure. Read as shares of all tasks, those figures mean roughly eight out of ten working patches are vulnerable.
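The "eight out of ten" figure can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state outright: both percentages are shares of all tasks, and every secure implementation is also a working one. The function name is made up for illustration.

```python
def vulnerable_share_of_working(functional_rate: float, secure_rate: float) -> float:
    """Share of working implementations that are not secure.

    Assumes both rates are fractions of all tasks and that every secure
    implementation is also counted as working.
    """
    return (functional_rate - secure_rate) / functional_rate


if __name__ == "__main__":
    share = vulnerable_share_of_working(0.61, 0.105)
    print(f"{share:.1%}")  # 82.8%
```

If the 10.5% were instead a share of the working implementations only, the vulnerable share would be 89.5%, closer to nine in ten.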

The Cheating Problem Compounds the Risk

Endor Labs also documented a cheating pattern that makes the security numbers worse than they appear. Across the 200 SusVibes instances, agents were confirmed to have cheated on 38, with training recall accounting for 33 of those cases. In training recall, the model reproduces the upstream fix from its training data rather than deriving it from the vulnerable codebase.

This matters for production code because the inverse holds too. An agent writing net-new code cannot cheat its way to safety, because there is no upstream fix to recall. The 19.0% security pass rate on known-CVE tasks is therefore a ceiling, not a floor, for novel vulnerability classes. The benchmark is generous.

The observable artifact in a real PR is the same either way: tests pass, the code looks plausible, the reviewer approves. The difference between a genuine fix and a training-recalled one is invisible to the human eye and to any test suite that was not specifically designed to detect the vulnerability class.

When Steganographic Payloads Hide in Plain Sight

Benchmark results describe failure rates. The shape of the underlying risk shows up clearly when steganographic payloads are involved.

A payload can be hidden as a single line in a config file like postcss.config.mjs, padded with thousands of spaces before the malicious content, making it invisible in standard git diff output and in many editors that wrap or truncate long lines. Code like this can pull command-and-control instructions from arbitrary remote sources. Nothing in the standard review pipeline catches it. No test fails. No linter flags it. The diff looks clean.

This is not a novel attack class. Steganographic payloads using whitespace, Unicode homoglyphs, and zero-width characters have appeared in supply chain incidents before. What is new is the combination: AI coding agents ship at volume, reviewers trust green CI, and adversaries have had time to study both assumptions.

Why "Tests Pass" Is the New LGTM

The old signal for a safe merge was "LGTM" from a senior reviewer. That signal degraded as PRs got larger and review time stayed flat. The replacement signal teams have adopted is "CI is green." The SusVibes benchmark demonstrates that this replacement is equally unreliable for security.

Functional tests are designed to verify behavior, not to detect vulnerability patterns. A SQL injection in a new query path does not cause a test to fail unless the test suite includes a specific injection test for that path. An XSS sink in an error response, like the Streamlit CVE-2023-27494 that appeared in the SusVibes hall-of-fame solves, does not fail any functional test; it only fails a test that was written to probe that exact reflection vector. Most codebases do not have those tests. Most PRs do not add them.
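A minimal sketch of the SQL injection case, using Python's built-in sqlite3 module with an in-memory database. The table, the function names, and both tests are invented for illustration. The vulnerable lookup passes the functional test and fails only the test written to probe injection.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_vulnerable(conn, name):
    # String interpolation: fine for ordinary names, injectable for crafted ones.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_user_safe(conn, name):
    # Parameter binding: the input is always treated as data, never as SQL.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


def functional_test(find_user) -> bool:
    """Checks behavior only: known user found, unknown user not found."""
    conn = make_db()
    return find_user(conn, "alice") == ["alice"] and find_user(conn, "carol") == []


def security_test(find_user) -> bool:
    """Probes one injection vector: a tautology payload must match nothing."""
    conn = make_db()
    return find_user(conn, "x' OR '1'='1") == []


if __name__ == "__main__":
    for fn in (find_user_vulnerable, find_user_safe):
        print(fn.__name__, functional_test(fn), security_test(fn))
```

Both implementations are indistinguishable to the functional test. Only the second test separates them, and it exists only because someone wrote it for that specific path.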

Whitespace-padding attacks follow the same logic applied to review. The payload does not break anything. It has no observable effect until the C2 channel activates. Tests pass because nothing the tests check has changed.

The Controls That Actually Cover This Surface

Green CI should stay in the pipeline, but it needs supplementation at the diff level. Four controls address the specific failure modes the SusVibes data and whitespace-padding incidents illustrate.

SAST on every PR diff, not just on schedule. Scheduled scans find vulnerabilities weeks after they merge. Diff-level SAST runs at PR creation and blocks the merge on high-severity findings before they enter the default branch.
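One way to picture diff-level scanning is to extract only the lines a unified diff adds and run rules against those. The sketch below is not a real SAST engine: the two regex rules and all names are hypothetical stand-ins for whatever analyzer a team actually runs, and the parser handles only the common unified diff layout.

```python
import re

# Matches hunk headers such as "@@ -10,3 +10,4 @@" and captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

# Toy rules for illustration only; a real scanner uses far richer analysis.
RULES = {
    "sql-string-building": re.compile(r"execute\(\s*f[\"']"),
    "shell-true": re.compile(r"shell\s*=\s*True"),
}


def added_lines(diff_text):
    """Yield (path, new_line_number, text) for each line the diff adds.

    Limitation: an added line whose content itself begins with "++ " would be
    mistaken for a file header; acceptable for a sketch.
    """
    path, new_lineno = None, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].removeprefix("b/")
            continue
        match = HUNK_HEADER.match(line)
        if match:
            new_lineno = int(match.group(1))
        elif line.startswith("+") and path:
            yield path, new_lineno, line[1:]
            new_lineno += 1
        elif line.startswith(" "):
            new_lineno += 1   # context line: present in the new file
        # removed lines ("-") do not advance the new-file line number


def scan_diff(diff_text):
    """Return (path, line_number, rule_name) for every rule hit on an added line."""
    return [
        (path, lineno, name)
        for path, lineno, text in added_lines(diff_text)
        for name, pattern in RULES.items()
        if pattern.search(text)
    ]
```

A CI job could feed this the PR diff and fail when the result is non-empty, which keeps findings tied to the lines the PR introduced rather than to the whole repository.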

Whitespace and encoding anomaly detection in CI. A lint rule that flags lines exceeding a configurable character threshold, or that detects non-printable Unicode in source files, catches space-padded payloads. This is a one-hour CI addition with a near-zero false-positive rate on normal code.
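A minimal sketch of such a check, assuming thresholds of 500 characters per line and 100 consecutive spaces or tabs. Both numbers and the function name are illustrative choices, not recommendations from the benchmark. Invisible characters are detected through Unicode general categories: `Cf` covers zero-width and bidirectional control characters, `Cc` covers other control characters.

```python
import re
import unicodedata

MAX_LINE_LENGTH = 500   # assumed threshold; tune per repository
MAX_SPACE_RUN = 100     # assumed threshold for consecutive spaces or tabs


def scan_text(text, max_len=MAX_LINE_LENGTH, max_run=MAX_SPACE_RUN):
    """Return (line_number, finding) pairs for suspicious lines in a source file."""
    run_pattern = re.compile(r"[ \t]{%d,}" % max_run)
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            findings.append((lineno, "long-line"))
        if run_pattern.search(line):
            findings.append((lineno, "whitespace-run"))
        for ch in line:
            category = unicodedata.category(ch)
            # Cf: format characters (zero-width, bidi overrides); Cc: control characters.
            if category == "Cf" or (category == "Cc" and ch != "\t"):
                findings.append((lineno, "invisible-char U+%04X" % ord(ch)))
                break   # one report per line is enough
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "export default {};",
        "const plugins = [];" + " " * 2000 + "fetch('https://example.invalid')",
        "const na\u200bme = 1;",
    ])
    for finding in scan_text(sample):
        print(finding)
```

This sketch does not detect homoglyphs, which need a separate confusables check. Minified bundles and lockfiles will trip the length rule, so a real CI job would exclude generated files.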

Security test coverage as a merge requirement. If the PR introduces a new authentication path, a new query, or a new file-handling operation, the merge gate should require at least one security-relevant test for that surface. Enforcing this automatically requires static analysis of what the diff touches, which is achievable with existing tooling.
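A rough sketch of such a gate, using file paths as a crude proxy for what the diff touches. Real enforcement would rely on static analysis rather than naming conventions. The surface names, the glob patterns, and the `tests/security/` location are all hypothetical conventions chosen for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical path conventions mapping security-relevant surfaces to file globs.
SENSITIVE_SURFACES = {
    "authentication": ["*auth*", "*login*"],
    "query": ["*queries*", "*repository*"],
    "file-handling": ["*upload*", "*storage*"],
}
SECURITY_TESTS = "tests/security/*"   # assumed location of security tests


def uncovered_surfaces(changed_paths):
    """Return surfaces touched by the diff that have no matching security test in it."""
    tests = [p for p in changed_paths if fnmatchcase(p, SECURITY_TESTS)]
    sources = [p for p in changed_paths if p not in tests]
    uncovered = []
    for surface, globs in SENSITIVE_SURFACES.items():
        touched = any(fnmatchcase(p, g) for p in sources for g in globs)
        covered = any(fnmatchcase(t, g) for t in tests for g in globs)
        if touched and not covered:
            uncovered.append(surface)
    return uncovered


if __name__ == "__main__":
    changed = [
        "src/auth/reset.py",
        "src/upload/handler.py",
        "tests/security/test_auth_reset.py",
    ]
    print(uncovered_surfaces(changed))  # ['file-handling']
```

A merge gate built on this would block the PR while the returned list is non-empty, prompting the author to add a security test for each listed surface.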

Autonomous review that reads the full diff for known-bad patterns, not just the summary. The gap between functional and security pass rates in SusVibes exists precisely because most review tooling checks whether code works, not whether it matches a known vulnerability class. An agent that runs against the full diff with a security mandate surfaces a different class of finding than one that checks style and coverage.

Hyrax runs autonomous review across security and code quality, submits findings as PRs with verified fixes, and verifies its own work before the PR reaches a human. The user merges. Nothing auto-merges. That separation exists because the SusVibes data is right: autonomous agents make mistakes on security, and a human gate at merge is the correct final control, not the only one.

The broader implication of SusVibes is that teams need to stop asking "did CI pass?" and start asking "did anything scan the security surface of this diff before it merged?" Those are different questions with different answers, and the benchmark now has 200 data points on the cost of conflating them.

Hyrax is live at hyrax.dev.
