vuln-analyzer-qwen-lora

QLoRA fine-tune of Qwen2.5-Coder-7B-Instruct for vulnerability description generation.

Part of the CodeSentinel project.

What this model does

Takes a raw code snippet and produces a structured one-sentence vulnerability description in a consistent format that a downstream classifier (RoBERTa) can reliably classify into a CWE category.

Output format:

This function performs <operation> on <input> without <missing check>,
which may allow an attacker to <impact>.

This model is not a CWE classifier; it is a code-to-description translator. Classification is handled by martynattakit/vuln-classifier-roberta.
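Because the downstream classifier depends on the fixed sentence shape, it can help to check a generated description before passing it on. The following is a minimal sketch, not part of the released code: the regex, the group names, and `parse_description` are illustrative assumptions. It matches only the fixed scaffolding ("This function ... without ..., which may allow an attacker to ...") rather than the literal "performs ... on ..." wording, since the sample output in the Usage section phrases the operation more freely.

```python
import re

# Regex mirroring the fixed parts of the documented template.
# The named groups are an illustrative choice, not something the model card specifies.
DESCRIPTION_RE = re.compile(
    r"^This function (?P<action>.+?) without (?P<missing_check>.+?), "
    r"which may allow an attacker to (?P<impact>.+?)\.?$"
)


def parse_description(text):
    """Return the template fields as a dict, or None if the text does not match."""
    # Collapse line breaks and repeated spaces so wrapped output still matches.
    normalized = " ".join(text.split())
    match = DESCRIPTION_RE.match(normalized)
    return match.groupdict() if match else None


example = (
    "This function constructs a SQL query by concatenating user-controlled input\n"
    "without parameterization, which may allow an attacker to inject arbitrary\n"
    "SQL commands and access or modify the database."
)
fields = parse_description(example)
print(fields["missing_check"])
```

A `None` result could be used to trigger a retry or to flag the snippet for manual review.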

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter    = "martynattakit/vuln-analyzer-qwen-lora"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model     = PeftModel.from_pretrained(model, adapter)

SYSTEM = (
    "You are a security analyst. Given a code snippet, produce exactly one "
    "structured sentence describing the vulnerability it contains.\n\n"
    "Format: \"This function performs <operation> on <input> without "
    "<missing check>, which may allow an attacker to <impact>.\"\n\n"
    "Be specific about the operation and the missing check. Do not add any other text."
)

code = '''
def get_user(username):
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return db.execute(query)
'''

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": f"Analyze this code:\n\n```\n{code}\n```"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=120, do_sample=False)

new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# → This function constructs a SQL query by concatenating user-controlled input
#   without parameterization, which may allow an attacker to inject arbitrary
#   SQL commands and access or modify the database.

Training

Parameter          Value
Base model         Qwen2.5-Coder-7B-Instruct
Method             QLoRA (4-bit NF4)
LoRA rank          16
LoRA alpha         32
Training samples   1,596 (BigVul, filtered to MITRE CWE Top 25)
Epochs             3
Learning rate      2e-4
Batch size         2 (gradient accumulation 8, effective 16)
Hardware           Kaggle T4 x2
Final eval loss    0.067
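The batch-size and epoch figures above imply a fairly short run. The arithmetic below is a rough sketch under two assumptions that the model card does not state: that all 1,596 samples are in the training split, and that a trailing partial batch counts as one optimizer update.

```python
import math

# Values copied from the training table.
TRAIN_SAMPLES = 1596
EPOCHS = 3
BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 8

# Effective batch size: samples seen per optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

# Assumption: every sample is used for training and the last,
# partially filled batch still produces an update.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)
# → 16 100 300
```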

Training data

BigVul: real-world C/C++ vulnerable functions from open-source projects, filtered to MITRE CWE Top 25 classes.

Template-generated target descriptions were used as training outputs, which keeps the output format consistent so that the downstream RoBERTa classifier can process it reliably.
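To illustrate what a template-generated target might look like, here is a minimal sketch. The helper names, the example field values, and the chat-style record layout are hypothetical; the model card does not describe how the slots were filled or how training records were serialized.

```python
# Template taken from the documented output format.
TEMPLATE = (
    "This function performs {operation} on {input_desc} without "
    "{missing_check}, which may allow an attacker to {impact}."
)

# Built programmatically so this example does not contain a literal code fence.
FENCE = "`" * 3


def build_target(operation, input_desc, missing_check, impact):
    """Fill the fixed sentence template with per-sample fields."""
    return TEMPLATE.format(
        operation=operation,
        input_desc=input_desc,
        missing_check=missing_check,
        impact=impact,
    )


def build_training_example(system_prompt, code, target):
    """Assumed record layout: one system, one user, and one assistant turn."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Analyze this code:\n\n{FENCE}\n{code}\n{FENCE}"},
            {"role": "assistant", "content": target},
        ]
    }


# Hypothetical field values for a buffer-overflow sample.
target = build_target(
    operation="a memory copy",
    input_desc="a caller-supplied buffer",
    missing_check="validating the length against the destination size",
    impact="overwrite adjacent memory and execute arbitrary code",
)
record = build_training_example(
    "You are a security analyst.", "memcpy(dst, src, len);", target
)
print(target)
```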

Limitations

  • Training data is primarily C/C++, so descriptions for Python/JS/Go may be less precise
  • Template-based targets mean descriptions follow a fixed pattern and may not capture all nuances
  • This model alone does not classify CWEs; use it with vuln-classifier-roberta for the full pipeline
  • Requires GPU for practical inference (4-bit quantization via bitsandbytes)
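The Usage snippet loads the base model in float16. For the 4-bit path mentioned in the last limitation, a loader along the following lines may work. This is a sketch, not the author's published code: the function name `load_4bit_model` and the choice of float16 as compute dtype are assumptions, and it requires `bitsandbytes` and a CUDA GPU. Imports are deferred into the function so the module can be imported on a machine without those libraries.

```python
BASE_MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "martynattakit/vuln-analyzer-qwen-lora"


def load_4bit_model(base_model=BASE_MODEL, adapter=ADAPTER):
    """Load the base model in 4-bit NF4 and attach the LoRA adapter."""
    # Deferred imports: these need torch, transformers, peft, and bitsandbytes.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # NF4 matches the quantization type listed in the training table.
    # Assumption: float16 compute, since T4 GPUs lack bfloat16 support.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=quant_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_4bit_model()` would then stand in for the three loading lines in the Usage section; the rest of that snippet stays the same.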

Full pipeline

Code input → vuln-analyzer-qwen-lora → structured description
                                     → vuln-classifier-roberta → CWE ID + confidence

See CodeSentinel for the full working demo.
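The two stages can be chained with a small wrapper. The sketch below is illustrative only: `analyze` and the two stub functions are hypothetical, and the stub outputs are placeholders rather than real model predictions. In practice `describe` would wrap the generation code from the Usage section and `classify` would wrap the RoBERTa classifier.

```python
def analyze(code, describe, classify):
    """Chain the two stages: code -> description -> CWE prediction.

    describe: callable taking source code and returning the one-sentence description.
    classify: callable taking a description and returning a dict with
              "label" and "score" keys (an assumed interface).
    """
    description = describe(code).strip()
    prediction = classify(description)
    return {
        "description": description,
        "cwe": prediction["label"],
        "confidence": prediction["score"],
    }


# Placeholder stages so the wiring can be exercised without loading any model.
def fake_describe(code):
    return (
        "This function performs a SQL query on user input without "
        "parameterization, which may allow an attacker to inject SQL commands.\n"
    )


def fake_classify(description):
    return {"label": "CWE-89", "score": 0.5}  # placeholder values, not real output


result = analyze("db.execute(query)", fake_describe, fake_classify)
print(result["cwe"], result["confidence"])
```

If the classifier repository loads with the standard `transformers` text-classification pipeline (an assumption), `classify` could be as simple as `lambda text: clf(text)[0]`, since that pipeline returns a list of `{"label", "score"}` dicts.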

License

MIT
