Real-world results · Updated 2026-04-20

Benchmarks & Model Selection.

Which model should you pick, and with what flags? The guidance below was measured against five vulnerable-app benchmarks (DVNA, PyGoat, REALCODE, vuln-php, DSVW), three on-device models, and three scan configurations.

Pick a preset.

Match the first row that describes your situation. Each includes the exact command.

Default

Balanced full-project audit

Fastest stack that runs Phase 6 end-to-end. Evidence-heuristic relabel recovers category drift.

SecureReview-7B (M7)

$ foil scan --deep ./project

CI/CD

Hundreds-of-files sweep, speed matters

2–3× faster than Qwen 7B at comparable recall on common OWASP classes.

SecureReview-7B (M7)

$ foil scan ./project

Coverage

First scan of an unknown codebase

Catches rare classes M7 misses: SSTI, MD5-as-crypto, Insufficient Logging, Vulnerable Components.

Qwen 2.5-Coder-7B

$ foil model activate qwen-coder-7b && foil scan --deep ./project

Deep audit

Critical module, want every edge case

Best reasoning quality, worth the latency for targeted work.

Qwen 2.5-Coder-14B

$ foil model activate qwen-coder-14b && foil scan --deep ./module

16 GB RAM

Mac with 16 GB unified memory

14B model won't fit. M7 works but misses rare classes — pick Qwen 7B for breadth.

Qwen 2.5-Coder-7B

$ foil model activate qwen-coder-7b

Reference

Reproducing published benchmark numbers

Qwen 7B is the reference model for every documented number in the full report.

Qwen 2.5-Coder-7B

$ foil scan ./project

Three models, compared.

All measurements were taken end-to-end on identical benchmark projects. Speed is per LLM call.

Axis | Qwen 7B (reference) | Qwen 14B (deepest reasoning) | SecureReview-7B (fastest · default)
RAM footprint | ~5 GB | ~9 GB | ~5 GB
Min Mac RAM | 16 GB | 32 GB | 16 GB
Speed per call | 8–15s | 15–30s | 4–5s
DVNA (19 vulns, HIGH conf≥0.9) | 14/19 | not re-measured | 25 HIGH
PyGoat (17 classes) | 15/17 | historically highest | 13/17
REALCODE (5 IDORs, strict category) | 3/5 | not tested | 4/5
Rare-class recall (SSTI, logging, MD5) | Strong | Strongest | Weak
Category drift risk | Low | Lowest | Medium

The three-axis trade-off.

Throughput, breadth of vulnerability classes, and precision. No model wins on all three.

Throughput

Scans per hour

SecureReview-7B 95%
Qwen 7B 55%
Qwen 14B 25%

Breadth

Vuln class coverage

Qwen 14B 95%
Qwen 7B 92%
SecureReview-7B 70%

Precision

Correct category + low FP

Qwen 14B 95%
Qwen 7B 80%
SecureReview-7B 75%

Percentages are relative positions derived from measured metrics (speed spread, benchmark class counts, precision stack effectiveness). Absolute numbers in the model table above.
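As a sanity check on the throughput axis: assuming scan wall time is dominated by LLM calls, the per-call latencies in the model table convert into a calls-per-hour spread. The midpoints below are taken from the measured ranges; the chart's exact percentage scaling is not specified, so this is arithmetic illustration only.

```shell
# Calls per hour from per-call latency midpoints (model table above):
# SecureReview-7B ≈ 4.5s, Qwen 7B ≈ 11.5s, Qwen 14B ≈ 22.5s.
for pair in "SecureReview-7B:4.5" "Qwen-7B:11.5" "Qwen-14B:22.5"; do
  name=${pair%%:*}; sec=${pair##*:}
  awk -v n="$name" -v s="$sec" 'BEGIN{printf "%s %.0f calls/hour\n", n, 3600/s}'
done
```

The roughly 5× spread between SecureReview-7B and Qwen 14B is what the throughput bars above are visualizing.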

DSVW 5-way head-to-head.

Same target, five scan configurations. DSVW is a 98-line Python file with 26 documented vulns spanning SQLi, XXE, SSRF, XSS, XPath, CSRF, deserialization, and more. Small target — fast iteration and real differentiation.

Config | HIGH conf≥0.9 | Classes | Duration | Best for
M7 --deep | 6 | 4 | 1m41s | Balanced — Phase 6 relabels IDORs
M7 simple | 1 | 1 | 12s | Speed sanity check (expect under-reporting)
Qwen 7B --deep | 8 | 6 | 2m56s | Max recall — catches XXE, Path Traversal
Qwen 7B simple | 7 | 5 | 57s | Reference benchmarking
Qwen 7B --no-guided-json | 8 | 5 | 58s | When guided_json is suspected to hurt recall

Key takeaway:

M7 --deep catches the common OWASP classes fast and auto-relabels access-control bugs (3 Injection → IDOR on DSVW). Qwen 7B --deep is the highest-recall config — it's the only one that finds XXE and Path Traversal on this target, at 1.8× the latency.
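The 1.8× figure follows directly from the durations in the table: 2m56s for Qwen 7B --deep against 1m41s for M7 --deep. A quick check:

```shell
# 2m56s = 176s, 1m41s = 101s → 176/101 ≈ 1.74, i.e. roughly 1.8× the latency.
qwen=$((2 * 60 + 56))
m7=$((1 * 60 + 41))
awk -v a="$qwen" -v b="$m7" 'BEGIN{printf "%.2fx\n", a/b}'
```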

Scan configurations.

Three flavours, same model. Pick by the kind of audit you're running.

simple

Default scan · Phases 1–5

Code map → function review → auth logic → attack surface → data flow. Auth context and app summary injected into every handler prompt.

JSON schema: on

When: Fast scan, want to see what's there before investing in Phase 6.

--deep (Recommended)

Full audit · Phases 1–6

Adds Phase 6 ReAct investigation on HIGH findings. Tools inspect callees, trace variable origins, verify or dismiss with citations. +30–90s per investigated finding.

JSON schema: on

When: Full audit. Auto-relabels category drift (Injection → IDOR) via evidence heuristics.

--no-guided-json

Diagnostic · Phases 1–5

Disables JSON schema enforcement. Closest to pre-M10 V2 behaviour. Measured DSVW impact is marginal (+1 finding over simple).

JSON schema: off

When: You suspect the schema is fighting the model's category vocabulary.

Phase 6 covers both logic (IDOR, broken auth, broken access) and taint-flow (SQLi, command injection, path traversal, SSRF, XXE, insecure deserialization) categories. It can relabel or dismiss a finding with a concrete code citation. Full CLI reference: docsfoil.peachstudio.be/cli/scan.
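To make the relabel concrete, a Phase 6 verdict on a DSVW finding might look something like the sketch below. All field names here are illustrative — the actual finding schema is not documented on this page.

```json
{
  "id": "dsvw-007",
  "original_category": "Injection",
  "relabeled_category": "IDOR",
  "confidence": 0.92,
  "phase6_verdict": "confirmed",
  "citation": "dsvw.py:41 — object id taken from query string, no ownership check"
}
```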

Full benchmark board.

End-of-day head-to-head, 2026-04-19. Best result per benchmark highlighted.

Benchmark | Qwen 7B | M7 full stack | Winner
DVNA | 14/19 HIGH, 50 findings | 25 HIGH, 39 findings | M7 — detection up, noise down
PyGoat | 15/17 classes | 13/17 (12 HIGH + 1 MED) | Qwen 7B — broader category coverage
REALCODE | 3/5 strict | 4/5 strict | M7 — Phase 6 auto-relabel
vuln-php | — | 24 findings / 6 HIGH / all 11 levels | M7 — Qwen not re-run
Scan speed | ~8–15s | ~4–5s | M7 — 2–3× faster

Methodology & caveats.

  • Measurement window: all numbers are end-of-day 2026-04-19. Qwen 7B is the reference model used for every previously published metric; M7 = SecureReview-7B quantized 4-bit MLX.
  • Non-determinism: DVNA results vary ±2–3 findings per run at temperature 0.1. Other benchmarks are similarly noisy. Trends matter; exact counts don't.
  • Out of scope: known-CVE dependency scanning (use npm audit / Snyk), runtime DAST, business-logic design flaws, dynamic runtime routing.
  • Upgrade path: foil model activate <name> then foil server restart-engine.
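The upgrade-path sequence, as a paste-ready sketch. The guard is an addition so the snippet degrades gracefully on a machine where foil isn't installed; the model name is one of the examples from the preset table.

```shell
# Swap the active model, then restart the inference engine so it loads.
if command -v foil >/dev/null 2>&1; then
  foil model activate qwen-coder-14b
  foil server restart-engine
else
  echo "foil not installed"
fi
```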

Ready to scan?

Install via Homebrew and try the default stack on one of your own projects. Swap models when your situation calls for it.