Qwen3-Coder-30B-A3B β€” DCR (Drupal Code Review) QLoRA adapter, v2

A LoRA adapter that specializes Qwen3-Coder-30B-A3B-Instruct for reviewing Drupal 10/11 PHP diffs and emitting structured JSON findings (security, logic, architecture, Drupal-API).

This is round 2. The v1 adapter (qwen3coder-30b-dcr-lora) scored well on synthetic held-out data but, when finally tested on real Drupal security defects, caught only 18.8% of them β€” it had over-learned "looks like clean merged Drupal β†’ clean". v2 adds 38 real, objective security defects (pre-fix code from Drupal security advisories, SA-CORE-*) plus low-severity contrastive pairs, and recovers real-defect recall to 56.2% while keeping 100% specificity.

Results: base vs v1 vs v2 (real-defect eval, n=32)

16 real CVE-grade defects (advisory fix commits, inverted so the diff reintroduces the vuln; objective ground truth) + 16 matched clean fixes. Same base weights, LoRA hot-swapped, temperature 0.

Metric Base v1 v2
Verdict accuracy 71.9% 59.4% 78.1%
Positive recall (caught the real defect) 87.5% (14/16) 18.8% (3/16) 56.2% (9/16)
Negative specificity (quiet on clean) 56.2% 100% 100%
Category match 56.2% β€” 43.8%
Invalid JSON 0/32 0/32 0/32

Honest read: v2 roughly tripled v1's real-defect recall without giving back specificity, and has the best overall verdict accuracy. It is not strictly better than base β€” base still out-recalls it (14/16 vs 9/16) on subtle logic bypasses, and v2's category labelling regressed. But base false-alarms on 7 of 16 clean fixes (specificity 56%), where v1 and v2 raise zero. Pick v2 for a low-false-positive pipeline; pick base if you want maximum recall and will triage the noise. Full report with verbatim side-by-side outputs (wins and losses) ships in the project repo under docs/eval/.

Training data

v1's 400 pairs + 38 real security positives (inverted SA-CORE fix commits, objective category/severity from the advisory) + matched clean negatives + 11 low-severity contrastive pairs (e.g. O(nΒ²) array_merge-in-loop with a near-miss clean form). 498 train rows; the real-defect eval set was held out by advisory ID. Teacher for the synthetic half: Claude Opus 4.x.

Usage (with the base model)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
m = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="bfloat16")
m = PeftModel.from_pretrained(m, "bartek-flp/qwen3coder-30b-dcr-lora-v2")

Prompt with the DCR system message (review a diff, output JSON findings only).

Limitations

QLoRA on attention projections only (q/k/v/o, r=16). Real-defect recall is 56%, with the remaining gap mostly subtle logic-level access bypasses that the base model catches but v2 does not. Category labelling is weaker than base. The eval is small (n=32) and security-skewed. Always keep a human in the loop for security findings.

Downloads last month
28
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for bartek-flp/qwen3coder-30b-dcr-lora-v2

Adapter
(46)
this model