CODE HEALTH · MAY 14, 2026 · 10 MIN READ

Five AI-code failures your CI does not catch

Each failure mode has a short config block that catches it. The full set runs in under a minute. Add them in one sprint.


Your CI pipeline was designed for a world where humans wrote the code and the build system verified the basics. Compile, lint, test, deploy. That contract held for two decades because the failure modes were stable. A typo broke the compile. A logic error broke a test. A missing dependency broke the install.

That contract no longer holds. AI coding agents now write a meaningful fraction of the commits landing in main, and they introduce failure modes that the standard CI pipeline was never asked to catch. Tests pass. Builds compile. Lint runs clean. The bug ships anyway, because the failure lives at a layer your config does not check.

The five below are the most common. Each has a specific config block. None of them require a new vendor or a new tool. Each is plain YAML or shell, and each runs in your existing pipeline.

1. The agent says tests passed. The tests were never run.

Add a deterministic test rerun the agent cannot influence, then compare exit codes.

This is the failure mode from the Cursor Composer 2.5 incident in late May. The model reported that a smoke test passed. The smoke test was a curl to localhost that returned a cached response. The build saw "tests pass", merged, and shipped a broken endpoint to production. The same pattern shows up across every coding agent that has a tool-use loop with shell access. The agent is the one running the tests and the agent is the one reporting whether they passed. There is no audit step between the report and the merge button.

The fix is to rerun the test step in a parallel container that does not share state with the agent's session, then fail the build if the exit codes disagree.

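# .github/workflows/independent-verification.yml
jobs:
  independent_test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests in isolated container
        run: |
          # Steps run under `bash -e`, so capture the exit code explicitly;
          # otherwise a failing test run aborts the step before it is recorded.
          code=0
          docker run --rm -v "$PWD":/app -w /app node:20 npm test || code=$?
          echo "$code" > /tmp/independent_exit
      - name: Compare against agent-reported result
        if: always()
        run: |
          # $GITHUB_STEP_SUMMARY is a fresh file for every step, so the agent's
          # claim has to be read from wherever the agent records it.
          # AGENT_REPORT and agent-report.txt are assumed names: adapt them.
          if [ "$(cat /tmp/independent_exit)" != "0" ] && \
             grep -q "tests passed" "${AGENT_REPORT:-agent-report.txt}"; then
            echo "Agent reported pass. Independent run failed."
            exit 1
          fi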

What this catches: any drift between what the agent claims about test results and what a clean container actually produces. What it costs when missed: a class of production incident where the audit log says "approved and tested" but nothing was tested.


2. Hallucinated dependencies that npm install resolves to the wrong package

Add a manifest verification step that checks every new dependency against the registry before install runs.

AI coding agents hallucinate package names. Sometimes the typo is a real package owned by someone else (typosquatting). Sometimes the package does not exist at all: npm install fails with a 404 today, but the name is free for an attacker to register, and after that the install succeeds. Sometimes the AI adds a real package but the wrong version. Whenever the name resolves, npm install exits zero and the build proceeds. The dependency you actually shipped is not the one the model intended, and nobody verifies the difference.

The fix is a pre-install diff against the registry for every new entry in package.json, requirements.txt, or Cargo.toml.

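- name: Verify dependency manifest
  run: |
    # The default checkout is shallow, so fetch main before comparing against it.
    git fetch origin main
    # Dependency names from a package.json on stdin (ignores scripts and other keys)
    deps() { jq -r '((.dependencies // {}) + (.devDependencies // {})) | keys[]' | sort; }
    # Names present on this branch but not on main
    new=$(comm -13 <(git show origin/main:package.json | deps) <(deps < package.json))
    for pkg in $new; do
      # npm view exits non-zero when the name is not in the registry
      if ! npm view "$pkg" name >/dev/null 2>&1; then
        echo "Hallucinated or unregistered package: $pkg"
        exit 1
      fi
    done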

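A stronger version pins the checksum, but the registry-existence check on its own catches most hallucinated names. For Python use pip index versions <pkg> (pip still marks this command as experimental). For Rust use cargo search <pkg> --limit 1 and compare the first result against the exact name, because the search is fuzzy. What this catches: hallucinated names and deleted packages. It does not catch a typosquatted lookalike that is already registered; that needs an allowlist or a human look at every new dependency. What it costs when missed: supply chain compromise via a package the model invented, or a broken build or runtime error from a dependency that resolves to nothing.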
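For a Python manifest, the same existence check can be sketched as below. This is a minimal sketch under stated assumptions, not a definitive implementation: `req_names` and `check_requirements` are hypothetical helper names, the parser only handles plain `name==version` style lines (it skips comments, blank lines, option lines that start with `-`, and URL requirements), and it assumes `pip index versions` exits non-zero for an unknown name.

```shell
#!/usr/bin/env bash
# Sketch: verify that every package named in requirements.txt exists in the index.
# req_names and check_requirements are hypothetical helpers, not part of pip.

# Print the bare package name from each requirement line on stdin.
req_names() {
  sed -E 's/[[:space:]]*#.*$//' |                        # drop comments
    grep -Ev '^[[:space:]]*(-|$)' |                      # drop option lines and blanks
    grep -Ev '://' |                                     # drop direct URL requirements
    sed -E 's/^[[:space:]]*([A-Za-z0-9._-]+).*/\1/' |    # keep the leading name only
    sort -u
}

# Report every name that cannot be found; return non-zero if any is missing.
check_requirements() {
  local file=$1 pkg status=0
  for pkg in $(req_names < "$file"); do
    # Assumption: pip index versions fails for a name the index does not know.
    if ! pip index versions "$pkg" >/dev/null 2>&1; then
      echo "Hallucinated or unregistered package: $pkg"
      status=1
    fi
  done
  return "$status"
}
```

In a workflow step this would run as `check_requirements requirements.txt` before `pip install`. Unlike the npm block it reports every missing name rather than stopping at the first, which saves a round-trip when an agent invents several at once.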


3. Duplicate utility functions the agent wrote instead of importing

Add a structural duplication check that fails the build when the diff increases duplication above a threshold.

The agent reads the file it is editing, not the whole codebase. If the project already has a formatDate helper in src/utils/format.ts, the agent often writes a second formatDate in the file it happens to be editing. Both functions now exist, both compile, both pass tests. The codebase has grown a duplicate.

This is the slop pattern at scale. GitClear's 2026 report found duplicated code blocks rising eightfold since AI tools went mainstream. Most teams find out at quarterly cleanup or when a bug fix has to be applied in three places.

The fix is to track duplication as a metric on every PR and fail when the delta exceeds a small threshold.

- name: Duplication delta
  run: |
    git fetch origin main
    # Scan main from a separate worktree so files added by the PR
    # are not counted in the baseline.
    git worktree add --detach ../base origin/main
    # No --threshold flag: jscpd exits non-zero when its threshold is
    # exceeded, which would abort the step before the delta is computed.
    npx jscpd --silent --min-lines 5 \
              --reporters json --output ./jscpd-pr ./src
    npx jscpd --silent --min-lines 5 \
              --reporters json --output ./jscpd-main ../base/src
    pr=$(jq '.statistics.total.duplicatedLines' jscpd-pr/jscpd-report.json)
    base=$(jq '.statistics.total.duplicatedLines' jscpd-main/jscpd-report.json)
    delta=$((pr - base))
    if [ "$delta" -gt 30 ]; then
      echo "PR adds $delta duplicated lines"
      exit 1
    fi

The 30-line threshold is illustrative. Set yours against your codebase's baseline. What this catches: AI-written replicas of helpers that already exist. What it costs when missed: a maintenance surface that grows with every copy, because each bug fix to a helper has to be found and applied in every duplicate.
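One way to derive the threshold from the baseline instead of hard-coding it, as a minimal sketch. The 1% allowance and the 10-line floor are assumptions chosen for illustration, and `dup_threshold` is a hypothetical helper name; its argument is the duplicated-line count measured on main.

```shell
# Sketch: scale the allowed duplication delta with the size of the baseline.
# dup_threshold is a hypothetical helper; 1% and the floor of 10 are illustrative.
dup_threshold() {
  local base=$1
  local allowed=$(( base / 100 ))   # 1% of the duplicated lines already on main
  if [ "$allowed" -lt 10 ]; then
    allowed=10                      # floor so small codebases are not blocked by noise
  fi
  echo "$allowed"
}

# Usage inside the step: replace the literal 30 with the computed value.
# if [ "$delta" -gt "$(dup_threshold "$base")" ]; then ...
```

A relative threshold keeps the check meaningful as the codebase grows, at the cost of letting a heavily duplicated codebase absorb larger additions. Tighten the percentage once the baseline starts falling.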


4. Environment file drift across .env.example, schema, and deploy config

Add a check that every new key in .env.example exists in the validation schema and in the production deploy config.

The agent adds a feature that needs a new environment variable. It updates .env.example so the local dev experience works. It does not update the Zod or Pydantic schema that validates env at boot, and it does not update the Terraform / Helm / Vercel / Render config that injects the var in production. Locally the feature works. In production, the variable is undefined and the feature silently fails or the service refuses to start at boot.

The fix is a three-way diff.

- name: Env file consistency
  run: |
    # Keys may contain digits (S3_BUCKET), so match [A-Z_][A-Z0-9_]*.
    # comm needs sorted, de-duplicated input, hence sort -u on every list.
    keys_example=$(grep -E '^[A-Z_][A-Z0-9_]*=' .env.example | cut -d= -f1 | sort -u)
    keys_schema=$(grep -oE 'process\.env\.[A-Z_][A-Z0-9_]*' src/env.ts | \
                  sed 's/process\.env\.//' | sort -u)
    keys_deploy=$(grep -E '^\s*[A-Z_][A-Z0-9_]*:' deploy/production.yaml | \
                  awk '{print $1}' | tr -d ':' | sort -u)
    # comm -23 prints lines that appear only in the first list
    missing=$(comm -23 <(echo "$keys_example") <(echo "$keys_schema"))
    if [ -n "$missing" ]; then
      echo "Keys in .env.example missing from schema: $missing"
      exit 1
    fi
    missing_deploy=$(comm -23 <(echo "$keys_example") <(echo "$keys_deploy"))
    if [ -n "$missing_deploy" ]; then
      echo "Keys in .env.example missing from deploy config: $missing_deploy"
      exit 1
    fi

Adapt the paths to your stack. The principle stays the same. What this catches: env keys that exist in one config and not another. What it costs when missed: a feature that works on every developer's machine and fails at production boot.


5. Unbatched I/O patterns the tests do not exercise

Add a query-count assertion to integration tests and a static check that flags await inside for loops on database or HTTP clients.

AI agents reliably write code in the shape of for (const id of ids) { await db.fetch(id) }. It is the most natural-looking pattern in the languages they are most fluent in. It also generates one round-trip per element. Tests pass because the test fixture has three records. Production has ten thousand records and a database connection pool that backs up under the load. The fix in code is a batch query or DataLoader (Promise.all only parallelizes the round-trips; it does not remove them). The fix in CI is to catch the pattern before merge.

The static check is short:

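- name: Detect awaited-in-loop database calls
  run: |
    # grep matches line by line, so a loop formatted across several lines
    # would be missed. -z reads each file as one record so the pattern can
    # span lines; -l lists the matching files for review.
    matches=$(grep -rlPz 'for \([^)]+\)\s*\{[^}]*await\s+(db|prisma|knex|orm)\.' src/ || true)
    if [ -n "$matches" ]; then
      echo "Awaited DB call inside loop:"
      echo "$matches"
      exit 1
    fi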

The grep is a starting point. Replace with an AST query for production use; this catches the most common syntactic shape and surfaces the file for review. Combine with a query-count assertion in your highest-traffic integration tests. Most ORMs let you record the query count per request; assert the count stays below a fixed ceiling for any endpoint that handles a list.
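Where the ORM cannot expose a counter directly, the query-count assertion can be approximated from a query log. This is a minimal sketch, assuming the integration test writes one SQL statement per line to a log file; `assert_query_ceiling`, the log path, and the ceiling value are hypothetical and should be replaced with whatever your ORM's query logging produces.

```shell
# Sketch: fail when a request logged more SQL statements than the ceiling allows.
# Assumes one statement per line in the log; assert_query_ceiling is a hypothetical helper.
assert_query_ceiling() {
  local log=$1 ceiling=$2 count
  # Count lines that start with a common SQL verb (case-insensitive).
  # "|| true" keeps a zero-match count from aborting a step that runs under bash -e.
  count=$(grep -Eic '^[[:space:]]*(select|insert|update|delete)[[:space:]]' "$log" || true)
  if [ "$count" -gt "$ceiling" ]; then
    echo "Query count $count exceeds ceiling $ceiling"
    return 1
  fi
  echo "Query count $count within ceiling $ceiling"
}
```

Called as `assert_query_ceiling ./query.log 5` after a list endpoint has been exercised, it fails the step when the endpoint issued more than five statements. A fixed ceiling is deliberate: a count that grows with the fixture size is exactly the N+1 signature.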

What this catches: the N+1 query pattern AI writes by default. What it costs when missed: a latency regression that does not show up until production traffic.


What to add this sprint

The five blocks above run in under a minute total. None of them require a new vendor or a service contract. They live in the .github/workflows/ directory next to the rest of your CI. The teams that have added them report the same pattern: the first few PRs after deployment fail one of the checks, the team fixes the underlying behavior, and the failure rate drops to near zero within two sprints. The check stays in place because the failure mode it catches is not going away.

The reason these are not in the standard CI pipeline today is that the standard pipeline was designed against a different threat model. The new threat model is an agent that reports its own success, writes code without reading the rest of the codebase, and reaches for the syntactic pattern most common in its training data. None of those characteristics are caught by linters, type systems, or unit tests. They are caught by checks that compare what the agent did to what the rest of the system expects.

Run the five. Watch which one fails first. That one is your team's biggest exposure.


Sources: Cursor Composer 2.5 incident reports, r/cursor megathreads, late May 2026. GitClear AI Code Quality Report 2026 on duplication trends. Entelligence Code Review Benchmark 2026 on AI reviewer coverage gaps. Anthropic Project Glasswing Initial Update on vulnerability discovery velocity.


Sources

  1. **r/cursor megathread on Composer 2.5 reliability.** Multiple users documenting the cached-smoke-test pattern
  2. **Anthony Maio Substack teardown.** End-to-end documentation of the agent self-report failure pattern
  3. **Matt Pocock X post on Cursor /thermo-nuclear-code-review.** Reaction to the broader pattern of age
  4. **Theo (@theo) commentary on Composer 2.5 reliability.** Independent dev voice on the same failure class
  5. **Lasso Security research on package hallucination.** Quantified study showing rates of fabricated p
  6. **Socket.dev coverage of typosquatting and AI-driven supply chain risk.** Live tracker of malicious
  7. **OWASP Top 10 for LLM Applications, LLM03 Supply Chain Vulnerabilities.**
  8. **GitClear AI Assistant Code Quality Research, 2025-2026 edition.**
  9. **The Twelve-Factor App, III**
  10. **Render and Vercel deployment failure pattern reports.** Public discussion threads documenting "wor
  11. **Zod and Pydantic schema validation docs.** Reference for the validation layer that should sit betw
  12. **Stack Overflow 2026 Developer Survey on AI code performance characteristics.**
  13. **DataLoader pattern reference (Meta open source).**
  14. **GraphQL N+1 problem original write-up (Lee Byron).** Foundational article on the failure mode
  15. **GitHub Actions docs on containerized jobs.** Reference for the parallel-container pattern in failure mode 1
  16. **jscpd documentation.** Reference for the duplication delta check in failure mode 3
  17. **Entelligence 2026 Code Review Benchmark.** Cross-reference for the "AI reviewers do not catch this
  18. **Anthropic Project Glasswing Initial Update.** Reference for the broader "discovery outpaces patchi
  19. **DORA 2025-2026 State of DevOps + AI-Assisted Software Development Report.**
  20. **The Pragmatic Engineer survey, AI Tooling for Software Engineers in 2026 (Gergely Orosz).**
  21. **Cursor Developer Habits Report, May 28 2026.**
  22. **GitHub State of Octoverse 2025.**