qwen-1.5b-sft-v1

Supervised fine-tuned LoRA adapter on Qwen-2.5-Coder-1.5B-Instruct, trained on rejection-sampled MBPP-train completions for verifiable Python code generation.

Highlights

+1.1 pts HumanEval+ pass@1 over base (0.6378 vs 0.6268)
17 MB LoRA adapter (rank 16, applied to q/k/v/o projections)
Trained on 2,580 rejection-sampled solutions from 319 MBPP-train tasks, contamination-filtered against the MBPP+ test set
Used as the warm-start policy for GRPO post-training in the verifiable-rl-coder project

Evaluation

Evaluated with evalplus at temperature 0.2, n=5 samples per task.

Benchmark	Metric	Base Qwen-1.5B	This model	Δ (pts)
HumanEval+	pass@1	0.6268	0.6378	+1.10
HumanEval+	pass@5	0.7073	0.6951	−1.22

The pass@1 lift is modest but consistent with what targeted LoRA SFT on a small (319-prompt) dataset is expected to deliver; pass@5 drops by 1.22 pts over the same comparison. Most of the pass@1 improvement comes from a small subset of problems requiring specific structured-knowledge patterns (e.g. Roman numeral subtractive notation, edge-case-aware list operations); see DEMO_EXAMPLES.md for side-by-side comparisons.
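For reference, pass@k scores like those in the table are usually computed with the unbiased estimator introduced alongside the original HumanEval benchmark. The sketch below assumes evalplus follows that convention; `pass_at_k` is an illustrative name, not an evalplus function.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the probability that a random k-subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example task: 2 of the 5 sampled completions pass the tests.
print(pass_at_k(n=5, c=2, k=1))
print(pass_at_k(n=5, c=2, k=5))
```

With n=5, pass@1 for a task is just the fraction of its five samples that pass, while pass@5 is 1 whenever at least one sample passes. The benchmark score is the mean of these per-task values.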

Training data

Source	Count
MBPP train + validation + prompt splits	974 raw tasks
Disjoint from MBPP+ (contamination-filtered)	319 unique prompts
Rejection-sampled completions kept (passed all hidden tests)	2,580 (prompt, solution) pairs
Overall sandbox pass rate during sampling	50.5%

Rejection sampling settings:

N=16 completions per prompt
Temperature 0.8
Sandboxed Python executor with 5s CPU / 256MB RAM limits
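The filtering step can be sketched as below, assuming a POSIX host and using `resource` limits in a child process as a stand-in for the sandbox. This is not the project's executor: `passes_tests`, `rejection_sample`, and the wall-clock timeout are hypothetical, and rlimits alone are not a real isolation boundary.

```python
import resource
import subprocess
import sys

CPU_SECONDS = 5                    # CPU limit from the settings above
MEMORY_BYTES = 256 * 1024 * 1024   # 256MB limit from the settings above


def _apply_limits() -> None:
    """Runs in the child process just before exec (POSIX only)."""
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))


def passes_tests(solution: str, tests: str, wall_timeout: float = 10.0) -> bool:
    """Return True if the solution plus its assert-style tests exits cleanly."""
    program = solution + "\n\n" + tests + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            preexec_fn=_apply_limits,
            capture_output=True,
            timeout=wall_timeout,  # assumed wall-clock guard, not from the card
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def rejection_sample(candidates: list[str], tests: str) -> list[str]:
    """Keep only the sampled completions that pass every test."""
    return [c for c in candidates if passes_tests(c, tests)]
```

At N=16 completions for each of 319 prompts, 5,104 completions are sampled in total; keeping 2,580 of them gives the 50.5% pass rate reported in the table.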

Hyperparameters

Parameter	Value
LoRA rank	16
LoRA alpha	32
LoRA target modules	q_proj, k_proj, v_proj, o_proj
Epochs	3
Effective batch size	16
Learning rate	2.0e-5
LR scheduler	cosine
Warmup ratio	0.03
Optimizer	AdamW
Precision	bfloat16
Hardware	1× A100 40GB
Wall clock	4.5 minutes
Final train loss	0.632
Final token accuracy	0.902
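As a sanity check on the 17 MB adapter size, the trainable parameter count can be derived from the rank and target modules above. The attention shapes below are assumptions taken from the published Qwen2.5-1.5B configuration rather than from this card, and float32 storage of the adapter weights is also assumed.

```python
# Assumed Qwen2.5-1.5B attention shapes (not stated in this card).
HIDDEN = 1536
NUM_LAYERS = 28
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS        # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM      # 256 (grouped-query attention)

RANK = 16    # LoRA rank from the table above
ALPHA = 32   # LoRA alpha from the table above

# (in_features, out_features) of each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total_params = per_layer * NUM_LAYERS
size_mb = total_params * 4 / 1e6   # 4 bytes per float32 weight
scaling = ALPHA / RANK             # factor applied to the low-rank update

print(f"{total_params:,} trainable parameters, ~{size_mb:.1f} MB, scaling {scaling}")
```

Under these assumptions the count comes to 4,358,144 parameters, or about 17.4 MB in float32, which lines up with the adapter size listed in the highlights.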

Usage

The adapter must be loaded on top of the base Qwen-2.5-Coder-1.5B-Instruct. Don't try to load it on its own: it's a low-rank update to the base weights, not a full model.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
ADAPTER = "dmaheshwar22/qwen-1.5b-coder-sft-v1"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

system = (
    "You are an expert Python programmer. Respond with a single Python "
    "code block containing the requested function and nothing else."
)
messages = [
    {"role": "system", "content": system},
    {"role": "user",   "content": "Write a function `fib(n: int) -> int` that returns the nth Fibonacci number."},
]

inputs = tok.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
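Because the system prompt asks for a single Python code block, downstream tooling typically needs to pull the code out of the reply before executing it. A minimal sketch follows; `extract_code` is a hypothetical helper, not part of this project or of transformers.

```python
import re

# The fence marker is built programmatically so this example nests cleanly
# inside Markdown documents.
FENCE = "`" * 3
_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if none."""
    match = _BLOCK.search(reply)
    return (match.group(1) if match else reply).strip()


reply = "Here you go:\n" + FENCE + "python\ndef fib(n):\n    return n\n" + FENCE
print(extract_code(reply))
```

Falling back to the raw reply keeps the helper usable when the model omits the fence, at the cost of occasionally passing stray prose to the executor.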

For faster inference, merge the adapter into the base once and serve via vLLM:

python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct')
m = PeftModel.from_pretrained(base, 'dmaheshwar22/qwen-1.5b-coder-sft-v1')
m = m.merge_and_unload()
m.save_pretrained('./qwen-1.5b-sft-v1-merged')
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct').save_pretrained('./qwen-1.5b-sft-v1-merged')
"

vllm serve ./qwen-1.5b-sft-v1-merged --max-model-len 4096
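Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the default host and port (`localhost:8000`) and the default served model name, which is the path passed to `vllm serve`; `build_request` and `complete` are illustrative helpers.

```python
import json
import urllib.request

# Assumptions: default vLLM host/port and default served model name.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "./qwen-1.5b-sft-v1-merged"

SYSTEM = (
    "You are an expert Python programmer. Respond with a single Python "
    "code block containing the requested function and nothing else."
)


def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions request using the sampling settings shown above."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `complete("Write a function `fib(n: int) -> int` ...")` should then return the same kind of reply as the transformers example, served from the merged weights.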

Intended use

Short-form Python function generation in the style of MBPP / HumanEval problems. Designed as the warm-start policy for GRPO RL training that uses sandboxed test-execution as the reward signal.

Reasonable use cases:

Code completion / generation in interactive coding tools
Baseline for further fine-tuning experiments
Comparison against base model in research on small-scale post-training

Limitations

Small model (1.5B) — competent on isolated Python functions, weak on multi-file repositories or large-context reasoning
Narrow training distribution — trained only on MBPP-style algorithmic problems; less reliable on competitive programming, systems code, or domain-specific libraries
No RLHF for general instruction-following — outputs assume the MBPP-style chat format shown above
Inherits any biases / safety properties of Qwen-2.5-Coder base
The +1.1 pt HumanEval+ pass@1 improvement is statistically modest: it falls within the ~3.8 pt noise floor for n=164 tasks, so it should be read as a small, bounded gain rather than a decisive one; see the project repo for the analysis methodology
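One way to arrive at a figure close to the ~3.8 pt noise floor is the binomial standard error of a pass rate measured on 164 tasks. This is an assumption about where the number comes from, not the project's documented methodology, and `binomial_se_pts` is an illustrative name.

```python
from math import sqrt

N_TASKS = 164        # HumanEval+ task count cited above
BASE_PASS1 = 0.6268  # base model pass@1 from the evaluation table
SFT_PASS1 = 0.6378   # this model's pass@1 from the evaluation table


def binomial_se_pts(p: float, n: int) -> float:
    """Standard error of a proportion p over n tasks, in percentage points."""
    return 100.0 * sqrt(p * (1.0 - p) / n)


base_se = binomial_se_pts(BASE_PASS1, N_TASKS)
sft_se = binomial_se_pts(SFT_PASS1, N_TASKS)
delta_pts = 100.0 * (SFT_PASS1 - BASE_PASS1)

print(f"base SE: {base_se:.2f} pts, SFT SE: {sft_se:.2f} pts, delta: {delta_pts:.2f} pts")
```

Both standard errors come out near 3.8 pts, several times larger than the 1.1 pt difference, which is why the gain is described as modest.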

Citation

@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}

Acknowledgements

Built on:

Qwen-2.5-Coder — base model
TRL — SFT trainer
PEFT — LoRA implementation
evalplus — benchmark harness

Trained on the UW-Madison CHTC cluster.
