vuln-analyzer-qwen-lora

QLoRA fine-tune of Qwen2.5-Coder-7B-Instruct for vulnerability description generation.

Part of the CodeSentinel project.

What this model does

Takes a raw code snippet and produces a structured one-sentence vulnerability description in a consistent format that a downstream classifier (RoBERTa) can reliably classify into a CWE category.

Output format:

This function performs <operation> on <input> without <missing check>,
which may allow an attacker to <impact>.

This model is not a CWE classifier — it's a code-to-description translator. Classification is handled by martynattakit/vuln-classifier-roberta.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter    = "martynattakit/vuln-analyzer-qwen-lora"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model     = PeftModel.from_pretrained(model, adapter)

SYSTEM = (
    "You are a security analyst. Given a code snippet, produce exactly one "
    "structured sentence describing the vulnerability it contains.\n\n"
    "Format: \"This function performs <operation> on <input> without "
    "<missing check>, which may allow an attacker to <impact>.\"\n\n"
    "Be specific about the operation and the missing check. Do not add any other text."
)

code = '''
def get_user(username):
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return db.execute(query)
'''

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": f"Analyze this code:\n\n```\n{code}\n```"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=120, do_sample=False)

new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# → This function constructs a SQL query by concatenating user-controlled input
#   without parameterization, which may allow an attacker to inject arbitrary
#   SQL commands and access or modify the database.

Training

Parameter	Value
Base model	Qwen2.5-Coder-7B-Instruct
Method	QLoRA (4-bit NF4)
LoRA rank	16
LoRA alpha	32
Training samples	1,596 (BigVul, filtered to MITRE Top 25)
Epochs	3
Learning rate	2e-4
Batch size	2 (grad accum 8, effective 16)
Hardware	Kaggle T4 x2
Final eval loss	0.067

Training data

BigVul — real-world C/C++ vulnerable functions from open source projects, filtered to MITRE CWE Top 25 classes.

Template-generated target descriptions were used as training outputs — this ensures consistent output format that the downstream RoBERTa classifier can reliably process.

Limitations

Training data is primarily C/C++ — descriptions for Python/JS/Go may be less precise
Template-based targets mean descriptions follow a fixed pattern — may not capture all nuances
This model alone does not classify CWEs — use with vuln-classifier-roberta for full pipeline
Requires GPU for practical inference (4-bit quantization via bitsandbytes)

Full pipeline

Code input → vuln-analyzer-qwen-lora → structured description
                                      → vuln-classifier-roberta → CWE ID + confidence

See CodeSentinel for the full working demo.

License

MIT

Downloads last month: 33

Model tree for martynattakit/vuln-analyzer-qwen-lora

Base model

Qwen/Qwen2.5-7B

Finetuned

Qwen/Qwen2.5-Coder-7B

Finetuned

Qwen/Qwen2.5-Coder-7B-Instruct

Adapter

(631)

this model

martynattakit
/

vuln-analyzer-qwen-lora