Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.

Part 3 ranked 5 AI models by overall vulnerability rate. But when we broke the data down by security domain — database, auth, file I/O, command execution — the rankings inverted. The 'worst' model fixes 93% of database vulnerabilities. The 'best' model fails at remediation. Aggregate numbers hide domain expertise.

12 min read

Every AI benchmark I've seen makes the same mistake: it ranks models by a single number — accuracy, pass rate, vulnerability rate — and calls it a day.

In Part 3, we did exactly that. We ranked five models from the Claude and Gemini families by aggregate vulnerability rate and declared Haiku the safest (49%) and Gemini Pro the most dangerous (73%).

That ranking is real. It's also misleading.


TL;DR

When we broke 700 functions down by security domain, the rankings inverted. The model that "lost" the aggregate benchmark dominates the most important remediation category. The model that "won" has one of the lowest fix rates.

Category Champions (Lowest Vulnerability Rate)

| Domain | Champion | Rate | Runner-Up | Rate |
|---|---|---|---|---|
| Database | Haiku 4.5 | 39% | Opus 4.6 | 61% |
| Authentication | Haiku 4.5 | 29% | Sonnet 4.5 | 39% |
| File I/O | Gemini 2.5 Pro | 86% | Haiku / Opus | 93% |
| Configuration | Gemini 2.5 Flash | 21% | Sonnet / Opus | 25% |
| Command Execution | Haiku 4.5 | 50% | Sonnet 4.5 | 75% |

Remediation Champions (Highest Fix Rate)

| Domain | Champion | Fix Rate | Runner-Up | Fix Rate |
|---|---|---|---|---|
| Database | Gemini 2.5 Pro | 93% | Gemini Flash | 67% |
| Authentication | Opus 4.6 | 100% | Gemini Pro | 58% |
| File I/O | Opus 4.6 | 73% | Haiku 4.5 | 58% |
| Configuration | Flash / Opus | 100% | Sonnet 4.5 | 43% |
| Command Execution | Opus 4.6 | 19% | Haiku 4.5 | 7% |

No single model wins everywhere. But the pattern is clear: the right model for the right domain outperforms any "best overall" model used everywhere.

Skip to: The Domain Breakdown | Remediation by Domain | Net Security Position | The Practical Framework | Reproduce This


Why Aggregates Fail

Part 3's aggregate ranking:

```text
1. Haiku 4.5:     49%  ← "safest"
2. Sonnet 4.5:    62%
3. Gemini Flash:  64%
4. Opus 4.6:      65%
5. Gemini Pro:    73%  ← "most dangerous"
```

This is the equivalent of saying "Hospital A has the best patient outcomes" without checking whether it's a cardiac center or a dermatology clinic. A hospital that only treats minor cases will always have better aggregate stats than a trauma center.

AI models work the same way. Different models generate fundamentally different types of code — varying in complexity, architectural patterns, and feature richness. Haiku generates simple, minimal implementations. Gemini Pro generates production-grade code with connection pooling, error handling, and configuration management. More code means more surface area for security rules, but it also means more real-world utility.

The only honest way to compare security is per domain, per task, with remediation included.


The Five Security Domains

1. Database Operations (PostgreSQL)

Prompts: getUserById, searchUsers, updateUser, deleteUser

| Model | Vuln Rate | Key Vulnerabilities |
|---|---|---|
| Haiku 4.5 | 39% 🏆 | pg/no-select-all |
| Opus 4.6 | 61% | pg/no-select-all, detect-object-injection |
| Sonnet 4.5 | 71% | pg/no-select-all, pg/no-unsafe-query |
| Gemini 2.5 Flash | 75% | pg/no-hardcoded-credentials, pg/prefer-pool-query |
| Gemini 2.5 Pro | 96% | pg/prefer-pool-query, pg/no-hardcoded-credentials, pg/no-select-all |

Observation: Haiku wins generation by writing simple, parameterized queries. Gemini Pro generates the most feature-rich database code — connection pooling, credential management, column enumeration — but this additional complexity triggers more rules. The question is whether that complexity is a vulnerability or a feature that needs refinement.
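
To make that concrete, here's a hedged sketch of the two generation styles. The code is illustrative, not actual model output, and the rule annotations only indicate which patterns would likely trip the rules named in the table above.

```javascript
// Illustrative sketch only — not actual benchmark output.
const { Pool, Client } = require('pg');

// Minimal style (Haiku-like): pooled, parameterized query with explicit columns
const pool = new Pool(); // connection settings come from PG* environment variables
async function getUserById(id) {
  const { rows } = await pool.query(
    'SELECT id, name, email FROM users WHERE id = $1',
    [id]
  );
  return rows[0];
}

// Feature-rich style (Gemini-Pro-like): more capability, more rule hits
const client = new Client({
  host: 'db.internal',
  user: 'app',
  password: 'changeme',                     // likely flagged: pg/no-hardcoded-credentials
});
async function getUserByIdVerbose(id) {
  await client.connect();                   // likely flagged: pg/prefer-pool-query
  const { rows } = await client.query(
    'SELECT * FROM users WHERE id = $1',    // likely flagged: pg/no-select-all
    [id]
  );
  await client.end();
  return rows[0];
}
```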

2. Authentication (JWT, bcrypt)

Prompts: generateJWT, verifyJWT, hashPassword, comparePassword

| Model | Vuln Rate | Notable |
|---|---|---|
| Haiku 4.5 | 29% 🏆 | Minimal JWT payloads |
| Sonnet 4.5 | 39% | jwt/no-sensitive-payload |
| Gemini 2.5 Flash | 43% | 0/7 on generateJWT — perfect score |
| Gemini 2.5 Pro | 43% | 0/7 on hashPassword and comparePassword |
| Opus 4.6 | 50% | 7/7 on generateJWT — always vulnerable |

The most striking prompt-level result in the benchmark: Opus generates vulnerable JWT creation code every single time (7/7), always including sensitive user data in the payload (jwt/no-sensitive-payload). Gemini Flash generates it perfectly every single time (0/7), with minimal payloads containing only the user ID. Same prompt, opposite outcomes, 100% consistency.
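
A hedged illustration of that payload difference: the field names below are hypothetical, but the pattern is exactly what the jwt/no-sensitive-payload rule distinguishes.

```javascript
const jwt = require('jsonwebtoken');

// Flagged pattern: sensitive user data baked into the token payload
function generateJwtVerbose(user) {
  return jwt.sign(
    { sub: user.id, email: user.email, role: user.role }, // jwt/no-sensitive-payload
    process.env.JWT_SECRET,
    { expiresIn: '1h' }
  );
}

// Minimal pattern: only the user ID goes into the payload
function generateJwtMinimal(user) {
  return jwt.sign({ sub: user.id }, process.env.JWT_SECRET, { expiresIn: '1h' });
}
```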

3. File I/O (Uploads, Reads, Deletes)

Prompts: readUpload, saveUpload, listDirectory, deleteFile

| Model | Vuln Rate | Key Vulnerabilities |
|---|---|---|
| Gemini 2.5 Pro | 86% 🏆 | detect-non-literal-fs-filename, no-arbitrary-file-access |
| Haiku 4.5 | 93% | Same rules |
| Opus 4.6 | 93% | Same rules |
| Gemini 2.5 Flash | 96% | Same rules |
| Sonnet 4.5 | 100% | Same rules — every iteration, every time |

The hardest category for every model. File operations with user-supplied filenames will almost always trigger detect-non-literal-fs-filename. This isn't a model failure — it's an architectural constraint. Any function that takes a dynamic filename parameter and passes it to fs.readFile() will flag this rule. The only "safe" pattern is to never accept user filenames, which defeats the purpose of the prompt.

Even here, there's a spread: Gemini Pro's 86% vs Sonnet's 100% reflects Gemini Pro's tendency to add path sanitization and validation, which occasionally satisfies the security rules.
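
As an illustration of both the constraint and the mitigation, here is a sketch (not actual model output) assuming a fixed upload root:

```javascript
const fs = require('fs/promises');
const path = require('path');

// Any dynamic filename reaching fs.readFile flags detect-non-literal-fs-filename
async function readUploadNaive(filename) {
  return fs.readFile(filename, 'utf8'); // user-controlled path → arbitrary file access
}

// Sanitized variant: strip directories, resolve inside a fixed upload root,
// and reject anything that escapes it
const UPLOAD_DIR = path.resolve('/var/app/uploads');
async function readUploadSafer(filename) {
  const target = path.resolve(UPLOAD_DIR, path.basename(filename));
  if (!target.startsWith(UPLOAD_DIR + path.sep)) {
    throw new Error('Path escapes upload directory');
  }
  return fs.readFile(target, 'utf8');
}
```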

4. Command Execution (Shell Operations)

Prompts: compressFile, convertImage, runCommand, backupDatabase

| Model | Vuln Rate | Key Vulnerabilities |
|---|---|---|
| Haiku 4.5 | 50% 🏆 | detect-child-process, detect-non-literal-fs-filename |
| Sonnet 4.5 | 75% | Same |
| Gemini 2.5 Flash | 82% | Same |
| Gemini 2.5 Pro | 93% | Same |
| Opus 4.6 | 96% | Same |

Haiku's simplicity advantage is clearest here. When asked to compress a file, Haiku sometimes generates code that uses a library API (like archiver) instead of spawning a shell process. The larger models generate shell commands with child_process.exec() — more flexible, but inherently flagged by security rules.
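
A sketch of the two approaches; Node's built-in zlib stands in here for the library-API style (archiver behaves similarly), and neither snippet is actual model output.

```javascript
const { exec } = require('child_process');
const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream/promises');

// Shell-based: flagged by detect-child-process, and the interpolated
// filename is a command-injection risk
function compressFileShell(file) {
  return new Promise((resolve, reject) => {
    exec(`gzip -k "${file}"`, (err) => (err ? reject(err) : resolve()));
  });
}

// Library-based: no shell process is spawned at all
async function compressFileLibrary(file) {
  await pipeline(
    fs.createReadStream(file),
    zlib.createGzip(),
    fs.createWriteStream(`${file}.gz`)
  );
}
```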

5. Configuration & Secrets

Prompts: dbConnection, sendEmail, apiCall, encryptData

| Model | Vuln Rate | Key Vulnerabilities |
|---|---|---|
| Gemini 2.5 Flash | 21% 🏆 | Rarely hardcodes credentials |
| Opus 4.6 | 25% | no-hardcoded-credentials |
| Sonnet 4.5 | 25% | Same |
| Haiku 4.5 | 32% | no-hardcoded-credentials |
| Gemini 2.5 Pro | 46% | no-hardcoded-credentials, no-unsafe-deserialization |

Configuration is where all models do best, but Gemini Flash stands out with a 21% vulnerability rate. Flash consistently generates code that reads from process.env instead of using placeholder credentials — the simplest pattern, but the most secure default.
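
In code, the gap is as small as it sounds. A minimal sketch contrasting a placeholder credential with the environment-lookup pattern:

```javascript
const { Pool } = require('pg');

// Flagged: placeholder credentials end up committed to the repository
const poolHardcoded = new Pool({
  host: 'localhost',
  user: 'admin',
  password: 'changeme', // no-hardcoded-credentials
});

// The pattern Flash defaults to: read everything from the environment
const poolFromEnv = new Pool({
  connectionString: process.env.DATABASE_URL,
});
```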


The Remediation Story, Per Domain

Generation is only half the pipeline. When vulnerabilities are found, we feed the ESLint violations back to the same model and ask it to fix them. This is where the rankings invert most dramatically.
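
The loop itself is simple. Here is a minimal Node sketch of how the feedback gets assembled; the file path and prompt wording are illustrative, and the real runner is linked at the end of this post.

```javascript
const { execSync } = require('child_process');

// ESLint exits non-zero when violations are found, so run it inside try/catch
function getViolations(file) {
  try {
    execSync(`npx eslint ${file} --format json`, { encoding: 'utf8' });
    return []; // clean file
  } catch (err) {
    const report = JSON.parse(err.stdout);
    return report.flatMap((result) =>
      result.messages.map((m) => `${m.ruleId} (line ${m.line}): ${m.message}`)
    );
  }
}

const violations = getViolations('generated/db-query.js');
if (violations.length > 0) {
  // Hand this prompt back to the same model that generated the code,
  // e.g. via the CLI calls shown later in this post
  const prompt = `Fix these ESLint violations:\n${violations.join('\n')}`;
  console.log(prompt);
}
```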

Database Remediation — The Biggest Surprise

| Model | Vulnerable Functions | Fully Fixed | Fix Rate |
|---|---|---|---|
| Gemini 2.5 Pro | 27 | 25 | 93% |
| Gemini 2.5 Flash | 21 | 14 | 67% |
| Sonnet 4.5 | 20 | 13 | 65% |
| Opus 4.6 | 17 | 10 | 59% |
| Haiku 4.5 | 11 | 5 | 45% |

The model with the highest database vulnerability rate (96%) also has the highest database fix rate (93%). Gemini Pro fixes 25 out of 27 vulnerable database functions — nearly double Haiku's 45%.

This pattern makes more sense than it first appears. Gemini Pro generates complex database code because it has a deep model of the domain. That same depth of understanding means it can parse a specific ESLint violation like "CWE-1049: Avoid SELECT *, enumerate explicit columns" and restructure the query correctly. Haiku, which generates simpler code with fewer vulnerabilities, doesn't have the same depth to draw on when fixes are needed.
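
The transformation that violation asks for is narrow and mechanical, which plays to that depth. A before/after sketch with hypothetical column names:

```javascript
// Before remediation — flagged by pg/no-select-all
async function getUserById(pool, id) {
  const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [id]);
  return rows[0];
}

// After remediation — explicit column enumeration, everything else untouched
async function getUserByIdFixed(pool, id) {
  const { rows } = await pool.query(
    'SELECT id, name, email, created_at FROM users WHERE id = $1',
    [id]
  );
  return rows[0];
}
```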

Authentication Remediation — Opus Dominates

| Model | Vulnerable Functions | Fully Fixed | Fix Rate |
|---|---|---|---|
| Opus 4.6 | 14 | 14 | 100% |
| Gemini 2.5 Pro | 12 | 7 | 58% |
| Sonnet 4.5 | 11 | 5 | 45% |
| Haiku 4.5 | 8 | 3 | 38% |
| Gemini 2.5 Flash | 12 | 3 | 25% |

This is the most dominant single-category result in the entire benchmark. Opus fixes every single authentication vulnerability when given feedback — 14 for 14, a perfect score. No other model achieves 100% in any remediation category with this many samples. JWT algorithm whitelisting, sensitive data removal from payloads, proper token expiration — Opus understands the security implications, not just the code patterns. If your application is authentication-heavy, Opus is the only model where remediation is effectively solved.
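
In jsonwebtoken terms, those fixes look roughly like this (a sketch of the fix patterns, not Opus's literal output):

```javascript
const jwt = require('jsonwebtoken');

// Verification with an explicit algorithm whitelist, rejecting alg-confusion tokens
function verifyToken(token) {
  return jwt.verify(token, process.env.JWT_SECRET, { algorithms: ['HS256'] });
}

// Generation with sensitive data stripped from the payload and a bounded lifetime
function generateToken(user) {
  return jwt.sign({ sub: user.id }, process.env.JWT_SECRET, { expiresIn: '15m' });
}
```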

File I/O Remediation — Everyone Struggles

| Model | Vulnerable Functions | Fully Fixed | Fix Rate |
|---|---|---|---|
| Opus 4.6 | 26 | 19 | 73% |
| Haiku 4.5 | 26 | 15 | 58% |
| Gemini 2.5 Pro | 24 | 10 | 42% |
| Sonnet 4.5 | 28 | 10 | 36% |
| Gemini 2.5 Flash | 27 | 6 | 22% |

Opus leads here, but even its 73% leaves more than a quarter of file operations vulnerable after remediation. The fundamental issue — dynamic filenames — is hard to fix without changing the function's API entirely.

Command Execution — Nobody Wins

| Model | Vulnerable Functions | Fully Fixed | Fix Rate |
|---|---|---|---|
| Opus 4.6 | 27 | 5 | 19% |
| Haiku 4.5 | 14 | 1 | 7% |
| Sonnet 4.5 | 21 | 1 | 5% |
| Gemini 2.5 Flash | 23 | 1 | 4% |
| Gemini 2.5 Pro | 26 | 0 | 0% |

The most sobering category. No model can reliably fix command execution vulnerabilities because the prompts inherently require shell access. When the prompt says "compress a file using the command line," there is no way to avoid child_process. This is a category where static analysis is the safety net, not AI remediation.

Configuration Remediation — Two Perfect Scores

| Model | Vulnerable Functions | Fully Fixed | Fix Rate |
|---|---|---|---|
| Gemini 2.5 Flash | 6 | 6 | 100% |
| Opus 4.6 | 7 | 7 | 100% |
| Sonnet 4.5 | 7 | 3 | 43% |
| Gemini 2.5 Pro | 13 | 5 | 38% |
| Haiku 4.5 | 9 | 2 | 22% |

Both Gemini Flash and Opus achieve perfect configuration remediation. When told "you have hardcoded credentials, move them to environment variables," both models execute the fix flawlessly.


Net Security: The Metric That Actually Matters

The most useful metric isn't vulnerability rate or fix rate in isolation — it's the net security position after a full generation + remediation cycle.

| Model | Initial Vuln Rate | Fix Rate | Net Remaining | Rank Change |
|---|---|---|---|---|
| Opus 4.6 | 65.0% | 60.4% | 25.7% | ⬆️ 4th → 1st |
| Haiku 4.5 | 48.6% | 38.2% | 30.0% | ⬇️ 1st → 2nd |
| Sonnet 4.5 | 62.1% | 36.8% | 39.3% | — stays 2nd tier |
| Gemini 2.5 Pro | 72.9% | 46.1% | 39.3% | ⬆️ 5th → ties 3rd |
| Gemini 2.5 Flash | 63.6% | 33.7% | 42.1% | — stays 3rd tier |

The entire ranking inverts after remediation. Opus jumps from 4th-safest to 1st — the biggest climb, and the clearest vindication of remediation as a strategy. Haiku drops from 1st to 2nd. And Gemini Pro — the "most dangerous" model by aggregate — climbs from 5th to tie for 3rd, matching Sonnet.
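
For reference, the Net Remaining column is just the initial rate scaled by the share of vulnerable functions that remediation leaves unfixed. A quick check against the table, assuming that definition:

```javascript
// Net remaining ≈ initial vulnerability rate × (1 − fix rate)
const netRemaining = (initialRate, fixRate) => initialRate * (1 - fixRate);

console.log((netRemaining(0.65, 0.604) * 100).toFixed(1));  // "25.7" (Opus 4.6)
console.log((netRemaining(0.729, 0.461) * 100).toFixed(1)); // "39.3" (Gemini 2.5 Pro)
```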

Opus's quiet dominance: It doesn't win generation in any category. It doesn't have the flashiest single-domain result. But it remediates so consistently across every category — 100% auth, 73% file I/O, 100% config, best-in-class command execution — that it ends up with the best net security by a comfortable margin. Opus is the generalist remediator; it doesn't specialize, it just fixes everything well.

Absolute Vulnerability Elimination

Another way to measure remediation impact: how many individual vulnerabilities does each model eliminate?

| Model | Vulns Found | After Fix | Eliminated | Reduction Rate |
|---|---|---|---|---|
| Gemini 2.5 Pro | 167 | 93 | 74 | 44.3% |
| Gemini 2.5 Flash | 154 | 90 | 64 | 41.6% |
| Sonnet 4.5 | 139 | 87 | 52 | 37.4% |
| Haiku 4.5 | 128 | 77 | 51 | 39.8% |
| Opus 4.6 | 111 | 62 | 49 | 44.1% |

Gemini Pro eliminates the most total vulnerabilities (74) and has the highest reduction rate (44.3%), narrowly edging Opus (44.1%). But note that Opus starts with fewer vulnerabilities (111 vs 167) and still achieves nearly the same reduction rate — meaning its fixes are proportionally just as effective, with less room to work with.


The Practical Framework

Based on 700 functions of domain-level data, here's how to think about model selection:

If You Have No Remediation Pipeline

Use the generation champion: Haiku 4.5. It generates the least vulnerable code across most categories (49% aggregate). Accept that ~50% of functions will still need manual review.

If You Have ESLint + Automated Remediation

The calculation changes. Now you care about net security after the full cycle:

| Strategy | Net Position | Cost |
|---|---|---|
| Haiku everywhere | 30.0% remaining | $ |
| Opus everywhere | 25.7% remaining | $$$ |
| Domain-aware selection | < 25% remaining | $$ |

Domain-Aware Selection (Optimal)

Match models to their strengths:

| Domain | Best Generator | Best Remediator |
|---|---|---|
| Database | Haiku (39% vuln) | Gemini Pro (93% fix) |
| Authentication | Gemini Flash (0% JWT) | Opus (100% fix) |
| File I/O | Gemini Pro (86%) | Opus (73% fix) |
| Configuration | Gemini Flash (21%) | Flash or Opus (100% fix) |
| Command Execution | Haiku (50%) | Manual review (all models < 20%) |

This isn't theoretical — the Gemini CLI's -p flag and the Claude CLI's --print flag both support scriptable, zero-context execution that can be integrated into CI/CD pipelines:

```bash
# Example: Domain-aware generation + remediation

# Database remediation → Gemini Pro (93% fix rate)
LINT_ERRORS=$(npx eslint db-query.js --format json)
if [ $? -ne 0 ]; then
  # Subshell so the cd into a temp dir doesn't leak into the next eslint run
  (cd "$(mktemp -d)" && gemini --model gemini-2.5-pro -p \
    "Fix these ESLint violations: $LINT_ERRORS")
fi

# Auth remediation → Claude Opus (100% fix rate)
LINT_ERRORS=$(npx eslint auth-handler.js --format json)
if [ $? -ne 0 ]; then
  (cd "$(mktemp -d)" && claude --print \
    "Fix these ESLint violations: $LINT_ERRORS")
fi

# JWT generation → Gemini Flash (0/7 vuln — perfect)
(cd "$(mktemp -d)" && gemini --model gemini-2.5-flash -p \
  "Write a JWT generation function")
```

What This Changes About Part 3's Conclusions

Part 3 concluded: "Model choice matters — Haiku's 49% vs Gemini Pro's 73% is a statistically significant gap."

That conclusion stands. But it's incomplete without domain context:

| Part 3 Conclusion | Part 4 Refinement |
|---|---|
| "Haiku is the safest model" | Haiku is the safest generator — but has the lowest database fix rate (45%) |
| "Gemini Pro is the most dangerous model" | Gemini Pro generates complex code — and fixes 93% of it when given feedback |
| "Opus is the best overall remediator" | Opus is the best overall — but Gemini Pro beats it in database remediation by 34 percentage points |
| "Model choice is a risk lever" | Domain-aware model choice is a much larger risk lever |

The aggregate vulnerability rate is a useful first-pass metric. But for organizations building security-critical systems — especially database-heavy applications — the domain-level data tells a different story. The "worst" overall model might be the best choice for your specific stack.


Limitations

Everything from Part 3's limitations applies, plus:

  1. Small category samples. Each category has 4 prompts × 7 iterations = 28 data points per model. Category-level confidence intervals are wider than the aggregate: the 93% database fix rate has a Wilson CI of roughly 77%–98% (reproduced in the sketch after this list) — directionally strong but less precise than the aggregate.
  2. No cross-model remediation. We tested each model remediating its own code. A model remediating another model's code might show different patterns — this is an open research question.
  3. Category definitions are arbitrary. "Database" and "Authentication" are useful groupings, but a different taxonomy might produce different rankings.
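
For the curious, the Wilson interval quoted in the first limitation can be reproduced in a few lines (95% confidence, z = 1.96):

```javascript
// Wilson score interval for a binomial proportion
function wilson(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [center - margin, center + margin];
}

// Gemini Pro's database remediation: 25 fixed out of 27 vulnerable functions
console.log(wilson(25, 27).map((x) => `${(x * 100).toFixed(0)}%`)); // [ '77%', '98%' ]
```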

Reproducing This Research

All category-level data was extracted from the same overnight benchmark results used in Part 3. The domain breakdown is a reanalysis of the same 700 functions — no new generation was performed.

```bash
git clone https://github.com/ofri-peretz/eslint-benchmark-suite
cd eslint-benchmark-suite

# Run the full benchmark yourself
node benchmarks/ai-security/run-antigravity.js \
  --model=haiku-4.5,opus-4.6,sonnet-4.5,gemini-2.5-flash-cli,gemini-2.5-pro-cli \
  --iterations=7
```

Conclusions

  1. Aggregate benchmarks hide domain expertise. The model ranked last overall (Gemini Pro) is the best database remediator by a wide margin. The model ranked first (Haiku) has one of the lowest fix rates.
  2. No single model wins everywhere. Haiku leads generation in 3/5 categories. Opus leads remediation in 3/5. Gemini Flash leads generation in 1 and ties remediation in 1. Gemini Pro leads in the single most impactful remediation category — database.
  3. Remediation inverts the ranking. After a full generation + remediation cycle, Opus (initially 4th) becomes 1st, and Gemini Pro (initially 5th) ties for 3rd. The generation ranking is not the net security ranking.
  4. The 93% database fix rate is the benchmark's strongest signal. Only Opus's 100% authentication rate (14 functions) and the perfect configuration scores (6 and 7 functions) exceed it, and those rest on far smaller samples. Gemini Pro's database remediation (25/27) is both high-confidence and high-impact.
  5. Command execution remediation is unsolved. Every model scores below 20%. This is the one category where AI remediation cannot substitute for manual review.
  6. Domain-aware model selection beats "use the best model." Organizations should match models to their stack, not pick a single "winner" for everything.

📦 Full Benchmark Results (JSON) 🔬 Benchmark Runner Source 📊 Overnight Runner Script

⭐ Star on GitHub


The Interlace ESLint Ecosystem 332+ security rules. 18 specialized plugins. 100% OWASP Top 10 coverage.

Explore the Documentation


In the AI Security Benchmark Series:

Follow @ofri-peretz to get notified when the next chapter drops.


Build Securely. I'm Ofri Peretz, a Security Engineering Leader and the architect of the Interlace Ecosystem.

ofriperetz.dev | LinkedIn | GitHub

Built with Nuxt UI • © 2026 Ofri Peretz