How AI Code Review Works (And Where It Falls Short)
A technical breakdown of how AI tools review pull requests — what they catch, what they miss, and how to use them without getting burned.
AI code review tools have gone from novelty to mainstream in the past 18 months. CodeRabbit, Greptile, GitHub Copilot Code Review, and DevReview all promise the same thing: a senior engineer's eye on every PR, in 60 seconds, for less than the cost of a coffee per dev.
But how do they actually work? And — more importantly — where do they reliably fail? This post is a technical walkthrough, plus an honest assessment of when AI review is genuinely useful and when it's noise.
The pipeline (most tools follow this shape)
Strip away the marketing and almost every AI code reviewer does these six steps:
- Webhook trigger. Your VCS (GitHub, GitLab, etc.) fires an event when a PR is opened or updated.
- Diff fetch. The tool pulls the changed files — usually as a unified diff, sometimes with limited surrounding context.
- Filter & chunk. Generated files (lockfiles, minified bundles) get removed; large files are truncated to fit within the LLM's context window.
- Prompt construction. The diff is wrapped in a system prompt that instructs the model on review priorities, output format, and what not to flag.
- LLM inference. The model returns structured output (usually JSON) listing issues, severities, and line numbers.
- Comment posting. The tool maps issues back to specific lines via the diff and posts inline comments through the VCS API.
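As a rough illustration of steps 3, 4, and 6, the sketch below filters generated files, builds a prompt, and keeps only model comments that land on lines actually added in the diff. It is a minimal Python sketch under stated assumptions, not any vendor's implementation: the function names, the file patterns, the 20,000-character cutoff, and the {path, line, message} issue shape are all hypothetical, and the LLM call itself is left out.

```python
import json
import re

# Hypothetical patterns for generated files; real tools likely use longer lists.
GENERATED = re.compile(r"(package-lock\.json|yarn\.lock|\.min\.js)$")
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def filter_files(diffs, max_chars=20_000):
    """Step 3: drop generated files and truncate oversized diffs."""
    return {
        path: text[:max_chars]
        for path, text in diffs.items()
        if not GENERATED.search(path)
    }


def build_prompt(diffs):
    """Step 4: wrap the per-file diffs in review instructions."""
    parts = [f"### {path}\n{text}" for path, text in diffs.items()]
    header = "Review this diff. Reply with a JSON list of issues.\n\n"
    return header + "\n\n".join(parts)


def added_lines(diff):
    """New-file line numbers of the lines a single-file unified diff adds."""
    lines = set()
    new_line = 0
    in_hunk = False
    for row in diff.splitlines():
        header = HUNK_HEADER.match(row)
        if header:
            new_line = int(header.group(1))
            in_hunk = True
        elif not in_hunk:
            continue  # file headers such as "--- a/x" and "+++ b/x"
        elif row.startswith("+"):
            lines.add(new_line)
            new_line += 1
        elif row.startswith(" "):
            new_line += 1
        # removed lines ("-") do not advance the new-file counter
    return lines


def place_comments(raw_json, diffs):
    """Step 6: keep only issues that point at a line the diff added."""
    valid = {path: added_lines(text) for path, text in diffs.items()}
    return [
        issue
        for issue in json.loads(raw_json)
        if issue["line"] in valid.get(issue["path"], set())
    ]
```

Dropping comments that miss the diff is a deliberately strict choice for the sketch. Some VCS APIs also accept comments on unchanged context lines, and a real tool might try to re-anchor a misplaced comment instead of discarding it.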
What the LLM is actually good at
Modern frontier models (Claude Sonnet 4.5, GPT-4o, Gemini 1.5 Pro) catch a real, repeatable set of issues in code:
- Common bug patterns: off-by-one errors, missing null checks, swapped arguments, copy-paste mistakes where one variable wasn't renamed.
- Security smells: SQL injection via string concatenation, hardcoded secrets, insecure crypto (MD5, ECB), missing CSRF on state-changing endpoints.
- Resource leaks: missing defer close() in Go, unclosed file handles in Python, missing try-with-resources in Java.
- Concurrency hazards: goroutine leaks, race conditions on shared maps, missing await on Promises.
- API misuse: calling useEffect conditionally, mutating React state directly, awaiting in a synchronous Python function.
These are exactly the kinds of issues a tired senior reviewer might miss at 4pm on a Friday. AI doesn't get tired.
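To make two of these categories concrete, here is a small Python sketch of the patterns a reviewer (human or AI) would typically flag, next to the usual fix. The table name, function names, and the in-memory SQLite database are assumptions chosen for illustration.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Flaggable: user input is concatenated straight into the SQL string.
    query = "SELECT id FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Fix: a parameterized query keeps the input out of the SQL text.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


def read_config_leaky(path):
    # Flaggable: the handle is never closed explicitly.
    f = open(path)
    return f.read()


def read_config(path):
    # Fix: the context manager closes the file even if read() raises.
    with open(path) as f:
        return f.read()
```

With a payload such as x' OR '1'='1, the unsafe version matches every row in the table, while the parameterized version treats the whole payload as a literal name and matches nothing.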
What the LLM is reliably bad at
Don't kid yourself: there are entire classes of bugs that current AI reviewers cannot catch, and pretending otherwise will burn you.
Cross-file logic
Most tools only see the diff. If your PR changes auth.ts in a way that breaks middleware.ts, an unchanged file that never appears in the diff, the AI doesn't see the conflict. Greptile partially solves this with full-repo indexing, but indexing has costs: slower setup, higher prices, and stale-context issues when the index lags behind the code.
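A compact way to picture this failure mode, sketched in Python with two hypothetical modules shown in one block: the changed function looks reasonable on its own, and the breakage only appears in a caller the diff never shows.

```python
from typing import Optional


# --- auth.py (changed in the PR, so it appears in the diff) ---
def get_session(token: str) -> Optional[dict]:
    # Assumed history: before the PR this raised ValueError on a bad token.
    # The PR changes it to return None, which reads fine in isolation.
    if token != "valid-token":
        return None
    return {"user_id": 7}


# --- middleware.py (unchanged, so it never appears in the diff) ---
def current_user_id(token: str) -> int:
    session = get_session(token)
    # Still assumes a dict comes back; raises TypeError when session is None.
    return session["user_id"]
```

A diff-only reviewer sees the first function and nothing else, so it has no basis for noticing that the second one now fails on bad tokens.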
Business logic correctness
The AI doesn't know what your code is *supposed* to do. If you write a tax calculator that uses the wrong formula, the AI will check that the formula is implemented correctly — not whether the formula itself is right. Specs and intent live in your head and your tickets, not in the diff.
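For example, assume a hypothetical spec that says tax applies to the price after discounts. The Python sketch below is clean, typed, and internally consistent, so there is nothing in the code itself for a reviewer to flag, yet it charges the wrong amount. The function names and the spec are invented for illustration.

```python
def sales_tax(amount_cents: int, rate_percent: float) -> int:
    """Tax on an amount, rounded to the nearest cent."""
    return round(amount_cents * rate_percent / 100)


def checkout_total(subtotal_cents: int, discount_cents: int, rate_percent: float) -> int:
    # Implemented carefully, but taxes the full subtotal.
    tax = sales_tax(subtotal_cents, rate_percent)
    return subtotal_cents - discount_cents + tax


def checkout_total_per_spec(subtotal_cents: int, discount_cents: int, rate_percent: float) -> int:
    # What the assumed spec actually asks for: tax the discounted price.
    discounted = subtotal_cents - discount_cents
    return discounted + sales_tax(discounted, rate_percent)
```

On a 100.00 subtotal with a 20.00 discount and a 10% rate, the first version totals 90.00 and the spec version totals 88.00. Only someone who knows the spec can tell which is right.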
Architectural drift
A PR that introduces a circular dependency or violates a layering convention often looks fine in isolation. The AI flags individual lines, not patterns spanning files and modules. Architecture review remains a human job.
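Catching a circular dependency takes a view of the whole import graph, which is exactly what a diff-only reviewer lacks. The sketch below shows the kind of repo-wide check a human or a CI job might run instead. It is a plain depth-first search over a hand-written import map; the module names are hypothetical, and building the map from real source files is left out.

```python
from typing import Dict, List, Optional


def find_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of modules, or None if there is none."""
    visiting: List[str] = []  # modules on the current DFS path
    done = set()              # modules fully explored without finding a cycle

    def visit(module: str) -> Optional[List[str]]:
        if module in done:
            return None
        if module in visiting:
            # Back edge: the cycle is the path from the first visit to here.
            return visiting[visiting.index(module):] + [module]
        visiting.append(module)
        for dep in imports.get(module, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(module)
        return None

    for module in imports:
        cycle = visit(module)
        if cycle:
            return cycle
    return None
```

In this picture, a PR that adds a single import from db back to api changes one line, and that line looks harmless unless you can see the other two edges.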
Hallucinated issues
Even the best models occasionally invent problems that aren't real — a confident claim that a function is missing a null check when the call site already guarantees non-null, for example. This is the single biggest source of dev frustration with AI review. Good prompt engineering reduces it; nothing eliminates it.
Rule of thumb: if an AI comment says "this might fail when X," verify X actually happens in your codebase. Don't blindly fix what the AI flags. About 5-15% of comments from current tools are false positives.
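Here is what that verification looks like in a small, made-up Python case. A reviewer that sees only normalize_email might warn that email could be None, but the single call site already rules that out. Both function names are hypothetical.

```python
def normalize_email(email: str) -> str:
    # A typical false positive: "email may be None, add a null check".
    return email.strip().lower()


def register(form: dict) -> str:
    email = form.get("email")
    if not email:
        raise ValueError("email is required")
    # The only call site: email is guaranteed to be a non-empty string here.
    return normalize_email(email)
```

Checking the call site takes a minute and tells you whether the comment is a real gap or noise. If normalize_email later gains a second caller, the same comment could become valid, which is why the check is worth doing each time.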
The prompt is the whole product
Here's a dirty secret of the AI code review industry: the underlying model is interchangeable. Anyone can call Claude or GPT-4. What separates good tools from noisy ones is the system prompt — the instructions that tell the model:
- What severity ranking to use (BUG > WARN > NIT)
- What kinds of issues to skip (formatting a linter would catch, missing tests on minor changes)
- How many comments to produce (more is not better)
- How to format output (JSON schema, line numbers in the new file)
- Language-specific patterns to watch for
A poorly tuned prompt will return 30 comments per PR — half of them about whitespace. A well-tuned prompt will return 2-3 high-signal comments and stay quiet on the rest. Which experience do you want?
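The five instructions above could be assembled roughly like this. This is a hypothetical prompt written for illustration, not the prompt of any real tool, and the issue schema (severity, line, message), the language hints, and the helper names are all assumptions.

```python
import json

SEVERITIES = ("BUG", "WARN", "NIT")

# Hypothetical per-language guidance; a real list would be much longer.
LANGUAGE_HINTS = {
    "go": "Check for goroutine leaks and missing defer on Close.",
    "python": "Check for unclosed file handles.",
}

BASE_PROMPT = """You are reviewing a pull request diff.
Severity ranking: BUG > WARN > NIT.
Skip formatting a linter would catch and missing tests on minor changes.
Return at most 3 comments. Return [] if nothing is worth flagging.
Output a JSON list of objects with keys: severity, line, message.
"line" is the line number in the new file.
"""


def build_system_prompt(language: str) -> str:
    """Append language-specific guidance when we have any."""
    hint = LANGUAGE_HINTS.get(language, "")
    return BASE_PROMPT + hint


def parse_review(raw: str) -> list:
    """Keep only issues that match the schema the prompt asked for."""
    issues = []
    for item in json.loads(raw):
        if (
            item.get("severity") in SEVERITIES
            and isinstance(item.get("line"), int)
            and item.get("message")
        ):
            issues.append(item)
    return issues
```

Validating the output matters as much as the prompt itself: a model can ignore the schema, so anything that does not parse into a known severity and a line number is dropped here instead of posted.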
How to evaluate an AI reviewer
Before paying for any tool, run this experiment:
- Pick 5 of your team's recent merged PRs — ideally with a mix of routine work and one that had a real bug caught in human review.
- Run each tool you're evaluating on the PRs (most have free trials).
- Score each tool on: *did it catch the real bug?* and *how many noisy comments did it generate?*
- Pick the tool with the best signal-to-noise ratio for *your* codebase.
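If you want to keep the scoring honest, a few lines of Python are enough to tally the results. This is one possible scheme, not a standard metric: the field names and the noise ratio (noisy comments divided by all comments) are assumptions, and deciding which comments count as noisy is still a manual judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrResult:
    had_known_bug: bool     # did human review find a real bug in this PR?
    caught_known_bug: bool  # did the tool flag that same bug?
    total_comments: int     # every comment the tool posted
    noisy_comments: int     # comments you judged to be noise


def summarize(results: List[PrResult]) -> dict:
    """Tally bug catches and noise for one tool across the sample PRs."""
    known = [r for r in results if r.had_known_bug]
    total = sum(r.total_comments for r in results)
    noisy = sum(r.noisy_comments for r in results)
    return {
        "bugs_caught": sum(r.caught_known_bug for r in known),
        "bugs_known": len(known),
        "noise_ratio": noisy / total if total else 0.0,
    }
```

Run summarize once per tool on the same set of PRs and compare the dictionaries side by side. With only 5 PRs the numbers are coarse, so treat them as a tiebreaker for your own reading of the comments, not as a benchmark.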
Don't pick based on landing page features or pricing alone. Code review quality is stylistic and language-dependent — a tool that works great on a Go backend may be useless on a TypeScript frontend.
Where DevReview fits
DevReview uses Claude Sonnet 4.5 with a prompt tuned over hundreds of test PRs to be aggressive about real bugs and quiet about nits. We hard-cap reviews at 8 comments and tell the model to cut nits before bugs when over budget. We support all major languages with language-specific check guidance.
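The budget rule can be pictured as a small post-processing step. To be clear about assumptions: the text above says the model is told to make the cut itself, so the Python sketch below is an illustration of the same policy applied to the output, not DevReview's actual code. The comment shape and the function name are hypothetical.

```python
# Lower rank means higher priority, so lower ranks survive the cut.
RANK = {"BUG": 0, "WARN": 1, "NIT": 2}


def cap_comments(comments, budget=8):
    """Keep at most `budget` comments, dropping NITs first, then WARNs."""
    # Sort by severity (stable, so earlier comments win ties), then trim.
    ranked = sorted(enumerate(comments), key=lambda pair: RANK[pair[1]["severity"]])
    kept = ranked[:budget]
    # Restore the original order so comments still read top to bottom.
    return [comment for _, comment in sorted(kept, key=lambda pair: pair[0])]
```

Given 6 nits followed by 5 bugs, this keeps all 5 bugs and the first 3 nits, in their original order.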
We're not the right tool if you need GitLab/Bitbucket support, full-repo context, or a Fortune 500 enterprise contract. We *are* the right tool if you're a solo dev or a 2-10 person team that wants fast, focused PR review for $9-29/month.
Try DevReview free for 14 days. After the trial you get 5 free reviews/mo. No credit card required.