Every team I talk to has the same problem. AI tools tripled their PR volume, but their CI pipeline still runs the same lint-and-test suite from 2023. The result: reviewers drowning in 400-line diffs full of code nobody asked for. Here is exactly how we fixed it.
What makes AI-generated code different from human-written code?
Before building gates, you need to understand what you are gating against. AI-generated code fails differently than human-written code. Humans write bugs in logic. AI writes bugs in assumptions.
The three failure modes we see most often:
First, hallucinated imports. The model references packages that do not exist or uses API methods that were deprecated two versions ago. Your tests might still pass if the hallucinated import is in an unused code path. Your linter might not catch it if the package name looks plausible.
Second, unnecessary code generation. You ask for a login form, you get a login form plus a password reset flow, an email verification system, and a user settings page. The code works. Nobody asked for it. It inflates your diff, your bundle, and your review burden.
Third, confident duplication. The model generates a utility function that already exists in your codebase, but with a slightly different name. You now have formatDate and formatDateString doing the same thing. Neither is wrong. Both create maintenance debt.
What do you need before setting this up?
- GitHub Actions (or any CI that supports custom steps — GitLab CI and CircleCI both work with minor syntax changes)
- Node.js 18+ in your CI runner
- ESLint already configured in your project
- TypeScript (recommended but not required — some gates work with plain JS)
- About 2 hours for initial setup, 30 minutes per gate after that
If you are on a different CI platform, the concepts are identical. Only the YAML syntax changes. I will use GitHub Actions because that is what most of our projects run.
How do you detect hallucinated imports in CI?
This is the highest-value gate. A hallucinated import that slips through code review will blow up in production with a MODULE_NOT_FOUND error that your test suite probably missed because the import is in a code path your tests do not exercise.
The approach: after npm install, scan every changed file for import statements and verify each imported module actually resolves.
Create a script at scripts/check-imports.sh:
The script extracts all import paths from changed files using a simple grep, filters out relative imports (those are caught by TypeScript), and checks that every bare-specifier import resolves from node_modules. We also check for known deprecated or renamed packages against a small deny list we maintain.
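Here is a minimal sketch of that script. It handles the common `import ... from '...'` and `require('...')` forms and emits GitHub Actions annotations as described below; treat the grep patterns and the deny-list path as starting points, not a fixed contract.

```sh
# Sketch of scripts/check-imports.sh. Pass the changed-file list as
# arguments, e.g.: check_imports $(git diff --name-only origin/main...HEAD)
DENY_LIST=".ci/deprecated-packages.txt"   # hypothetical deny list, one package name per line

check_imports() {
  local status=0
  for file in "$@"; do
    # Pull module specifiers out of `import ... from '...'` and `require('...')`.
    specs=$(grep -hoE "from ['\"][^'\"]+['\"]|require\(['\"][^'\"]+['\"]\)" "$file" 2>/dev/null \
      | grep -oE "['\"][^'\"]+['\"]" | tr -d "'\"" || true)
    for spec in $specs; do
      case "$spec" in
        .*) continue ;;                   # relative imports: TypeScript's job
      esac
      pkg=${spec%%/*}                     # root package of the specifier
      case "$spec" in @*/*) pkg=$(echo "$spec" | cut -d/ -f1-2) ;; esac
      if [ ! -d "node_modules/$pkg" ]; then
        # GitHub Actions annotation: shows inline on the PR diff.
        echo "::error file=$file::hallucinated import '$spec' does not resolve"
        status=1
      elif [ -f "$DENY_LIST" ] && grep -qxF "$pkg" "$DENY_LIST"; then
        echo "::error file=$file::'$pkg' is on the deprecated-package deny list"
        status=1
      fi
    done
  done
  return $status
}
```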
In your GitHub Actions workflow:
Add a step after your install step that runs the import checker. Set continue-on-error to false. If a hallucinated import is found, the step should exit with code 1 and print the offending file and import path. We format the output as GitHub Actions annotations so the error shows inline on the PR diff.
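As a sketch, assuming the base branch is fetched (fetch-depth: 0 at checkout) and the script lives at scripts/check-imports.sh:

```yaml
# Illustrative step — github.base_ref is only populated on pull_request events.
- name: Check for hallucinated imports
  if: github.event_name == 'pull_request'
  continue-on-error: false
  run: |
    CHANGED=$(git diff --name-only --diff-filter=AM "origin/${{ github.base_ref }}...HEAD" -- '*.ts' '*.tsx' '*.js')
    [ -z "$CHANGED" ] || bash scripts/check-imports.sh $CHANGED
```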
This gate catches about 15% of AI-generated PRs at Fordel. That number surprised us. Fifteen percent of PRs reference packages that do not exist or use wrong import paths. Most of these would have been caught eventually — but "eventually" means a developer wastes 20 minutes debugging a MODULE_NOT_FOUND in staging.
How do you catch AI-generated code that nobody asked for?
This is the gate that generates the most debate. Developers push back because "the AI added useful stuff." Maybe. But unrequested code has three costs: review time, maintenance burden, and bundle size. If you did not ask for it, it should not be in the PR.
We use a diff-size heuristic with a twist. Instead of a hard line-count limit, we compare the PR description (what was requested) against the actual diff (what was delivered).
The implementation:
A GitHub Action step runs after checkout and calculates three metrics: total lines added, number of new files created, and number of new exports added. If any of these exceed configurable thresholds — we use 300 lines added, 3 new files, or 10 new exports — the step posts a comment asking the author to justify the scope. It does not block the PR. It flags it.
This is important: the gate is advisory, not blocking. Blocking on diff size creates perverse incentives. Developers will split PRs artificially or suppress AI output before committing, both of which waste time. An advisory comment that says "This PR adds 847 lines and 7 new files. The PR description mentions a login form. Is everything here intentional?" is more effective than a red X.
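A sketch of the metrics step, split so the threshold logic is testable on its own. The git invocations assume the PR base is available locally (fetch-depth: 0), the export count is approximated by grepping added lines, and the thresholds are the ones quoted above:

```sh
collect_metrics() {
  # Prints "lines_added new_files new_exports" for the diff against $1.
  local base="$1"
  local lines files exports
  lines=$(git diff --numstat "$base" | awk '$1 != "-" {s += $1} END {print s + 0}')
  files=$(git diff --name-only --diff-filter=A "$base" | wc -l | tr -d ' ')
  exports=$(git diff "$base" | grep -cE '^\+.*\bexport (function|const|class|interface|type)\b' || true)
  echo "$lines $files $exports"
}

scope_verdict() {
  # Args: lines_added new_files new_exports. Prints "flag" when any
  # threshold (300 lines / 3 files / 10 exports) is exceeded, else "ok".
  if [ "$1" -gt 300 ] || [ "$2" -gt 3 ] || [ "$3" -gt 10 ]; then
    echo "flag"
  else
    echo "ok"
  fi
}

# In CI: scope_verdict $(collect_metrics "origin/$GITHUB_BASE_REF")
# A "flag" verdict posts the advisory comment rather than failing the job.
```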
How do you detect duplicate utilities the AI just reinvented?
This one is subtle and harder to automate perfectly. The AI generates formatCurrency when you already have formatMoney in your utils. Both work. Neither is wrong. But now you have two functions doing the same thing and the next developer (or AI) will pick whichever one they find first.
Our approach: we maintain an export registry — a file that lists every exported function, class, and constant in the project, generated automatically by a script that runs on main after every merge.
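As a rough illustration, a grep-based generator is enough to bootstrap the registry. A production version would walk the TypeScript AST instead, since grep misses re-exports and `export { name }` lists; the src/ location is an assumption.

```sh
# Hypothetical registry generator: scans a source tree for top-level
# `export function|const|class Name` declarations and writes the sorted,
# deduplicated names as a JSON array.
generate_registry() {
  local src_dir="$1" out_file="$2"
  grep -rhoE 'export (function|const|class) [A-Za-z0-9_]+' "$src_dir" \
    | awk '{print $3}' | sort -u \
    | awk 'BEGIN {print "["} {printf "%s  \"%s\"", sep, $0; sep = ",\n"} END {print "\n]"}' \
    > "$out_file"
}

# Usage: generate_registry src .ci/export-registry.json
```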
The CI gate:
When a PR adds new exports, the gate runs a fuzzy match against the existing export registry. We use a combination of Levenshtein distance and semantic similarity on function names. If a new export is suspiciously similar to an existing one — formatCurrency vs formatMoney, validateEmail vs checkEmail, parseDate vs extractDate — the gate flags it with a comment linking to the existing function.
The fuzzy matching is not perfect. It produces false positives about 20% of the time. We accept that tradeoff because the true positives save significant maintenance effort downstream. A 20% false positive rate on an advisory comment is annoying. A 20% false positive rate on a blocking gate is unacceptable. Keep this advisory.
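The semantic-similarity side of the match needs more machinery than fits here, but the Levenshtein side is small enough to sketch in shell. The lowercasing and the 50%-of-longer-name threshold are illustrative assumptions, not our production tuning:

```sh
# Textbook edit-distance recurrence in an awk DP table.
levenshtein() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    la = length(a); lb = length(b)
    for (i = 0; i <= la; i++) d[i, 0] = i
    for (j = 0; j <= lb; j++) d[0, j] = j
    for (i = 1; i <= la; i++)
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        m = d[i-1, j] + 1
        if (d[i, j-1] + 1 < m) m = d[i, j-1] + 1
        if (d[i-1, j-1] + cost < m) m = d[i-1, j-1] + cost
        d[i, j] = m
      }
    print d[la, lb]
  }'
}

# Prints "duplicate?" when two lowercased names sit within 50% of the
# longer name's length -- the threshold is a tunable assumption.
similar_names() {
  local a b dist len
  a=$(echo "$1" | tr 'A-Z' 'a-z'); b=$(echo "$2" | tr 'A-Z' 'a-z')
  dist=$(levenshtein "$a" "$b")
  len=${#a}; [ ${#b} -gt "$len" ] && len=${#b}
  [ $((dist * 100)) -lt $((len * 50)) ] && echo "duplicate?" || true
}
```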
A few operational details on the registry:
- npm run export-registry — regenerates the registry from the current codebase
- Registry is a JSON file at .ci/export-registry.json committed to the repo
- Update it in the same CI run that merges to main, so it stays current
- Exclude test files and stories from the registry — they are allowed to have duplicates
How do you enforce dead code elimination for generated code?
AI tools generate helper functions that the main code never calls. These accumulate. After a month, you have 30 unused functions scattered across your codebase, each generated by a model that was confident they would be needed.
TypeScript helps here. With noUnusedLocals and noUnusedParameters enabled in tsconfig (note that strict mode does not turn these on for you), unused locals are already caught. But unused exports — functions that are exported but never imported anywhere — slip through.
We use a two-pass approach:
First pass: ts-prune or knip identifies unused exports across the entire project. This runs on every PR but only reports findings for files changed in the PR. We do not want to surface pre-existing dead code on every PR — that creates noise fatigue.
Second pass: for new files created by the PR, we check that every exported function is actually imported somewhere in the PR or in the existing codebase. A new file with 5 exported functions where only 2 are imported is a signal that the AI generated a "utility module" that is mostly unused.
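The scoping logic of the first pass is just an intersection. This sketch assumes ts-prune's `path:line - exportName` report format; verify it against the version you pin, and adapt the matching if you use knip's output instead:

```sh
# Hypothetical filter: keep only the unused-export findings that touch
# files changed in this PR, so pre-existing dead code stays out of the report.
filter_unused_to_changed() {
  # $1: file with the full ts-prune report   $2: file listing changed paths
  while IFS= read -r path; do
    [ -n "$path" ] && grep -F "${path}:" "$1" || true
  done < "$2"
}

# In CI:
#   npx ts-prune > prune.txt
#   git diff --name-only "origin/$GITHUB_BASE_REF...HEAD" > changed.txt
#   filter_unused_to_changed prune.txt changed.txt
```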
How do you wire all five gates into one workflow without slowing CI down?
The biggest mistake: running gates sequentially. Each gate takes 5 to 20 seconds (see the table below), and run back to back the five of them add roughly a minute of wall time before you count checkout and install. Run them in parallel.
The workflow structure:
Your GitHub Actions workflow should run the gates in a job called ai-code-quality that executes in parallel with your existing test and lint jobs, not after them. The job checks out the code, installs dependencies (cached), and runs all five gates concurrently — either as a matrix of parallel jobs or, more cheaply, as background processes inside a single step.
We use a single composite action that runs all gates and collects results, then posts one consolidated comment on the PR instead of five separate comments. Nobody wants five bot comments on their PR. One comment with sections is easier to process.
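The background-process variant looks roughly like this. The per-gate script names are hypothetical placeholders, and note that a bare `wait` would swallow gate failures, hence the per-pid loop:

```yaml
# Illustrative job skeleton.
ai-code-quality:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0            # gates diff against the PR base branch
    - uses: actions/setup-node@v4
      with:
        node-version: 18
        cache: npm
    - run: npm ci
    - name: Run all gates concurrently
      run: |
        pids=""
        for gate in check-imports scope-creep duplicate-exports dead-exports complexity; do
          bash "scripts/$gate.sh" > "$gate.log" 2>&1 &
          pids="$pids $!"
        done
        # Plain `wait` discards exit codes, so collect them per pid.
        status=0
        for pid in $pids; do wait "$pid" || status=1; done
        cat ./*.log
        exit $status
```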
| Gate | Type | Runtime | False Positive Rate |
|---|---|---|---|
| Hallucinated Imports | Blocking | ~8 seconds | < 5% |
| Scope Creep Detection | Advisory | ~12 seconds | ~25% |
| Duplicate Export Detection | Advisory | ~20 seconds | ~20% |
| Dead Export Detection | Advisory | ~15 seconds | ~10% |
| Diff Complexity Score | Advisory | ~5 seconds | ~15% |
Total added CI time: about 20 seconds of gate runtime when run in parallel, since the longest gate (duplicate detection) dominates — under 90 seconds once you include the job's checkout and dependency install.
What is the fifth gate — the diff complexity score?
I mentioned five gates but only detailed four. The fifth is a complexity score that combines cyclomatic complexity, nesting depth, and cognitive complexity for every function added in the PR.
AI-generated code tends to have a specific complexity signature: low cyclomatic complexity (few branches) but high cognitive complexity (deeply nested callbacks, long function bodies, many parameters). This is because models optimize for "works on the first try" rather than "easy to understand later."
We use ESLint with the complexity and max-depth rules, but only report findings for functions added in the PR. The gate posts the complexity score for each new function alongside the project average. If a new function has 3x the average complexity, the comment highlights it. Again — advisory, not blocking.
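A config fragment along these lines covers all three signals. The thresholds are illustrative, and the cognitive-complexity rule assumes you add eslint-plugin-sonarjs, since ESLint core does not ship one:

```json
{
  "plugins": ["sonarjs"],
  "rules": {
    "complexity": ["warn", 10],
    "max-depth": ["warn", 3],
    "sonarjs/cognitive-complexity": ["warn", 15]
  }
}
```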
What are the most common mistakes when building these gates?
We made all of these. Learn from our mistakes.
Making everything blocking. This is the number one mistake. Developers will game blocking gates. They will delete code to stay under thresholds, split PRs artificially, or disable the gate with skip comments. Only the hallucinated imports gate should be blocking because it catches genuine build failures. Everything else is advisory.
Running gates on all files instead of changed files. Your CI bill will thank you for scoping gates to the PR diff. Full-repo analysis belongs in a nightly job, not in every PR.
Posting too many bot comments. Consolidate into one comment per PR. Update it on force-push instead of posting a new one. We use the peter-evans/create-or-update-comment action for this.
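The update-in-place pattern pairs that action with peter-evans/find-comment; the HTML-comment marker is an arbitrary string we control. Treat the version pins and input names as illustrative and double-check them against the actions' current READMEs:

```yaml
# Illustrative: find the bot's earlier report via a marker string,
# then edit it in place instead of posting a new comment.
- uses: peter-evans/find-comment@v3
  id: report
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-author: 'github-actions[bot]'
    body-includes: '<!-- ai-code-quality-report -->'
- uses: peter-evans/create-or-update-comment@v4
  with:
    comment-id: ${{ steps.report.outputs.comment-id }}
    issue-number: ${{ github.event.pull_request.number }}
    edit-mode: replace
    body: |
      <!-- ai-code-quality-report -->
      (consolidated gate output goes here)
```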
Not tuning thresholds for your team. Our thresholds — 300 lines, 3 new files, 10 new exports — work for our projects. Yours will be different. Start with loose thresholds and tighten based on data. Track your false positive rate for the first month.
Ignoring the output. The biggest meta-mistake: building gates nobody reads. If your team ignores the advisory comments after two weeks, the gates are too noisy. Tune them or remove them. A gate that is always ignored is worse than no gate — it teaches developers to ignore CI feedback entirely.
“The purpose of a quality gate is not to block bad code. It is to make invisible problems visible before a human reviewer spends 30 minutes discovering them manually.”
What do you actually end up with?
After implementing all five gates, here is what changes in practice.
Your reviewers stop discovering hallucinated imports during review — CI catches them. Your PR comments shift from "did we need this file?" to substantive architecture feedback because scope creep is already flagged. Your codebase stops accumulating duplicate utilities because developers see the advisory comment and consolidate before merging.
The total investment is about a day of engineering time. The ongoing cost is near zero — we have adjusted thresholds twice in three months. The return is measurable: our average PR review time dropped from 45 minutes to 28 minutes after deploying these gates, because reviewers no longer spend time on problems that automation handles.
Where does this go from here?
We are experimenting with two additions. First, an LLM-powered gate that reads the PR description and the diff, then flags when the diff does not match the description. This is the scope creep gate but smarter — instead of counting lines, it actually understands intent. Early results are promising but the 15-second latency and cost per PR make it hard to justify for smaller teams.
Second, a historical complexity tracker that shows whether your codebase complexity is trending up or down over time, broken down by human-written vs AI-generated code. We are building this as a dashboard rather than a CI gate. The goal is team awareness, not PR-level enforcement.
The underlying principle: AI coding tools changed the speed of code production. Your quality infrastructure needs to match. Not by adding friction, but by making the right information visible at the right time.





