Real-world results · Updated 2026-04-20

Benchmarks & Model Selection.

Which model should you pick, and with what flags? The guidance below was measured against five vulnerable-app benchmarks (DVNA, PyGoat, REALCODE, vuln-php, DSVW), three on-device models, and three scan configurations.

Pick a preset.

Match the first row that describes your situation. Each includes the exact command.

Default

Balanced full-project audit

Fastest stack that runs Phase 6 end-to-end. Evidence-heuristic relabel recovers category drift.

SecureReview-7B (M7)

$ foil scan --deep ./project

CI/CD

Hundreds-of-files sweep, speed matters

2–3× faster than Qwen 7B at comparable recall on common OWASP classes.

SecureReview-7B (M7)

$ foil scan ./project

Coverage

First scan of an unknown codebase

Catches rare classes M7 misses: SSTI, MD5-as-crypto, Insufficient Logging, Vulnerable Components.

Qwen 2.5-Coder-7B

$ foil model activate qwen-coder-7b && foil scan --deep ./project

Deep audit

Critical module, want every edge case

Best reasoning quality, worth the latency for targeted work.

Qwen 2.5-Coder-14B

$ foil model activate qwen-coder-14b && foil scan --deep ./module

16 GB RAM

Mac with 16 GB unified memory

14B model won't fit. M7 works but misses rare classes — pick Qwen 7B for breadth.

Qwen 2.5-Coder-7B

$ foil model activate qwen-coder-7b

Reference

Reproducing published benchmark numbers

Qwen 7B is the reference model for every documented number in the full report.

Qwen 2.5-Coder-7B

$ foil scan ./project

Three models, compared.

All measurements were taken end-to-end on identical benchmark projects. Speed is per LLM call.

Axis | Qwen 7B (reference) | Qwen 14B (deepest reasoning) | SecureReview-7B (fastest · default)
RAM footprint | ~5 GB | ~9 GB | ~5 GB
Min Mac RAM | 16 GB | 32 GB | 16 GB
Speed per call | 8–15s | 15–30s | 4–5s
DVNA (19 vulns, HIGH conf≥0.9) | 14/19 | not re-measured | 25 HIGH
PyGoat (17 classes) | 15/17 | historically highest | 13/17
REALCODE (5 IDORs, strict category) | 3/5 | not tested | 4/5
Rare-class recall (SSTI, logging, MD5) | Strong | Strongest | Weak
Category drift risk | Low | Lowest | Medium

The three-axis trade-off.

Throughput, breadth of vulnerability classes, and precision. No model wins on all three.

Throughput

Scans per hour

SecureReview-7B 95%
Qwen 7B 55%
Qwen 14B 25%

Breadth

Vuln class coverage

Qwen 14B 95%
Qwen 7B 92%
SecureReview-7B 70%

Precision

Correct category + low FP

Qwen 14B 95%
Qwen 7B 80%
SecureReview-7B 75%

Percentages are relative positions derived from measured metrics (speed spread, benchmark class counts, precision stack effectiveness). Absolute numbers in the model table above.
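As a sanity check on the throughput axis: assuming scan wall time is dominated by LLM calls, the per-call latencies in the model table convert into a calls-per-hour spread. The midpoints below are taken from the measured ranges; the chart's exact percentage scaling is not specified, so this is arithmetic illustration only.

```shell
# Calls per hour from per-call latency midpoints (model table above):
# SecureReview-7B ≈ 4.5s, Qwen 7B ≈ 11.5s, Qwen 14B ≈ 22.5s.
for pair in "SecureReview-7B:4.5" "Qwen-7B:11.5" "Qwen-14B:22.5"; do
  name=${pair%%:*}; sec=${pair##*:}
  awk -v n="$name" -v s="$sec" 'BEGIN{printf "%s %.0f calls/hour\n", n, 3600/s}'
done
```

The roughly 5× spread between SecureReview-7B and Qwen 14B is what the throughput bars above are visualizing.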

DSVW 5-way head-to-head.

Same target, five scan configurations. DSVW is a 98-line Python file with 26 documented vulns spanning SQLi, XXE, SSRF, XSS, XPath, CSRF, deserialization, and more. Small target — fast iteration and real differentiation.

Config | HIGH conf≥0.9 | Classes | Duration | Best for
M7 --deep | 6 | 4 | 1m41s | Balanced — Phase 6 relabels IDORs
M7 simple | 1 | 1 | 12s | Speed sanity check (expect under-reporting)
Qwen 7B --deep | 8 | 6 | 2m56s | Max recall — catches XXE, Path Traversal
Qwen 7B simple | 7 | 5 | 57s | Reference benchmarking
Qwen 7B --no-guided-json | 8 | 5 | 58s | When guided_json is suspected to hurt recall

Key takeaway:

M7 --deep catches the common OWASP classes fast and auto-relabels access-control bugs (3 Injection → IDOR on DSVW). Qwen 7B --deep is the highest-recall config — it's the only one that finds XXE and Path Traversal on this target, at 1.8× the latency.
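The 1.8× figure follows directly from the durations in the table: 2m56s for Qwen 7B --deep against 1m41s for M7 --deep. A quick check:

```shell
# 2m56s = 176s, 1m41s = 101s → 176/101 ≈ 1.74, i.e. roughly 1.8× the latency.
qwen=$((2 * 60 + 56))
m7=$((1 * 60 + 41))
awk -v a="$qwen" -v b="$m7" 'BEGIN{printf "%.2fx\n", a/b}'
```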

Scan configurations.

Three flavours, same model. Pick by the kind of audit you're running.

simple

Default scan · Phases 1–5

Code map → function review → auth logic → attack surface → data flow. Auth context and app summary injected into every handler prompt.

JSON schema: on

When: Fast scan, want to see what's there before investing in Phase 6.

--deep (Recommended)

Full audit · Phases 1–6

Adds Phase 6 ReAct investigation on HIGH findings. Tools inspect callees, trace variable origins, verify or dismiss with citations. +30–90s per investigated finding.

JSON schema: on

When: Full audit. Auto-relabels category drift (Injection → IDOR) via evidence heuristics.

--no-guided-json

Diagnostic · Phases 1–5

Disables JSON schema enforcement. Closest to pre-M10 V2 behaviour. Measured DSVW impact is marginal (+1 finding over simple).

JSON schema: off

When: You suspect the schema is fighting the model's category vocabulary.

Phase 6 covers both logic (IDOR, broken auth, broken access) and taint-flow (SQLi, command injection, path traversal, SSRF, XXE, insecure deserialization) categories. It can relabel or dismiss a finding with a concrete code citation. Full CLI reference: docsfoil.peachstudio.be/cli/scan.
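To make the relabel concrete, a Phase 6 verdict on a DSVW finding might look something like the sketch below. All field names here are illustrative — the actual finding schema is not documented on this page.

```json
{
  "id": "dsvw-007",
  "original_category": "Injection",
  "relabeled_category": "IDOR",
  "confidence": 0.92,
  "phase6_verdict": "confirmed",
  "citation": "dsvw.py:41 — object id taken from query string, no ownership check"
}
```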

Full benchmark board.

End-of-day head-to-head, 2026-04-19. Best result per benchmark highlighted.

Benchmark | Qwen 7B | M7 full stack | Winner
DVNA | 14/19 HIGH, 50 findings | 25 HIGH, 39 findings | M7 — detection up, noise down
PyGoat | 15/17 classes | 13/17 (12 HIGH + 1 MED) | Qwen 7B — broader category coverage
REALCODE | 3/5 strict | 4/5 strict | M7 — Phase 6 auto-relabel
vuln-php | — | 24 findings / 6 HIGH / all 11 levels | M7 — Qwen not re-run
Scan speed | ~8–15s | ~4–5s | M7 — 2–3× faster

Methodology & caveats.

  • Measurement window: all numbers are end-of-day 2026-04-19. Qwen 7B is the reference model used for every previously published metric; M7 = SecureReview-7B quantized 4-bit MLX.
  • Non-determinism: DVNA results vary ±2–3 findings per run at temperature 0.1. Other benchmarks are similarly noisy. Trends matter; exact counts don't.
  • Out of scope: known-CVE dependency scanning (use npm audit / Snyk), runtime DAST, business-logic design flaws, dynamic runtime routing.
  • Upgrade path: foil model activate <name> then foil server restart-engine.
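The upgrade-path sequence, as a paste-ready sketch. The guard is an addition so the snippet degrades gracefully on a machine where foil isn't installed; the model name is one of the examples from the preset table.

```shell
# Swap the active model, then restart the inference engine so it loads.
if command -v foil >/dev/null 2>&1; then
  foil model activate qwen-coder-14b
  foil server restart-engine
else
  echo "foil not installed"
fi
```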

Ready to scan?

Install via Homebrew and try the default stack on one of your own projects. Swap models when your situation calls for it.