qwen-1.5b-sft-v1
Supervised fine-tuned LoRA adapter on Qwen-2.5-Coder-1.5B-Instruct, trained on rejection-sampled MBPP-train completions for verifiable Python code generation.
Highlights
- +1.1 pts HumanEval+ pass@1 over base (0.6378 vs 0.6268)
- 17 MB LoRA adapter (rank 16, applied to q/k/v/o projections)
- Trained on 2,580 rejection-sampled solutions to 319 MBPP-train tasks, contamination-filtered against the MBPP+ test set
- Used as the warm-start policy for GRPO post-training in the verifiable-rl-coder project
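The 17 MB adapter size can be sanity-checked from the LoRA rank and target modules. The sketch below is a back-of-the-envelope count, not the author's script. The base-model dimensions (hidden size 1536, 28 layers, 12 attention heads, 2 key/value heads) are assumptions taken from the published Qwen2.5-1.5B configuration and do not appear in this card.

```python
# Assumed base-model shape (from the published Qwen2.5-1.5B config, not from this card).
HIDDEN = 1536
LAYERS = 28
HEADS = 12
KV_HEADS = 2
HEAD_DIM = HIDDEN // HEADS      # 128
KV_DIM = KV_HEADS * HEAD_DIM    # 256, because of grouped-query attention

RANK = 16  # LoRA rank stated in this card

# (in_features, out_features) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"size at 4 bytes/param: {total * 4 / 1e6:.1f} MB")
print(f"size at 2 bytes/param: {total * 2 / 1e6:.1f} MB")
```

Under these assumptions the count is about 4.36M parameters, which gives roughly 17.4 MB at 4 bytes per parameter. That would be consistent with the adapter being stored in fp32, although the card does not state the storage dtype.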
Evaluation
Evaluated with evalplus at temperature 0.2, n=5 samples per task.
| Benchmark | Metric | Base Qwen-1.5B | This model | Δ (pts) |
|---|---|---|---|---|
| HumanEval+ | pass@1 | 0.6268 | 0.6378 | +1.10 |
| HumanEval+ | pass@5 | 0.7073 | 0.6951 | −1.22 |
The pass@1 lift is modest, and pass@5 drops by 1.22 pts, but the pass@1 gain is consistent with what targeted LoRA SFT on a small (319-prompt) dataset is expected to deliver. Most of the improvement comes from a small subset of problems requiring specific structured-knowledge patterns (e.g., Roman numeral subtractive notation, edge-case-aware list operations); see DEMO_EXAMPLES.md for side-by-side comparisons.
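For readers unfamiliar with how pass@k is derived from n samples per task, here is a minimal sketch of the standard unbiased estimator, 1 − C(n−c, k) / C(n, k), where c is the number of passing samples. It is assumed here to match what evalplus reports. The function names and the per-task counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task with n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the result is 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical numbers of passing samples (out of n=5) for four tasks.
counts = [5, 3, 0, 1]
print(benchmark_pass_at_k(counts, n=5, k=1))  # mean fraction of passing samples
print(benchmark_pass_at_k(counts, n=5, k=5))  # share of tasks with at least one pass
```

With n=5, pass@1 reduces to the mean fraction of passing samples and pass@5 to the share of tasks solved at least once, which is why the two metrics can move in opposite directions.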
Training data
| Source | Count |
|---|---|
| MBPP train + validation + prompt splits | 974 raw tasks |
| Disjoint from MBPP+ (contamination-filtered) | 319 unique prompts |
| Rejection-sampled completions kept (passed all hidden tests) | 2,580 (prompt, solution) pairs |
| Overall sandbox pass rate during sampling | 50.5% |
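The counts in this table can be cross-checked against the sampling budget. The short calculation below assumes every one of the 319 prompts received the full N=16 completions given in the sampling settings of this section, and that no deduplication was applied to the kept solutions. Neither assumption is stated explicitly in the card.

```python
PROMPTS = 319        # contamination-filtered prompts
N_PER_PROMPT = 16    # completions sampled per prompt
KEPT = 2580          # completions that passed all hidden tests

total_sampled = PROMPTS * N_PER_PROMPT
pass_rate = KEPT / total_sampled
kept_per_prompt = KEPT / PROMPTS

print(f"{total_sampled} sampled, {pass_rate:.1%} kept, "
      f"{kept_per_prompt:.2f} kept per prompt on average")
```

Under these assumptions, 5,104 completions were sampled and the kept fraction rounds to the 50.5% pass rate reported in the table.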
Rejection sampling settings:
- N=16 completions per prompt
- Temperature 0.8
- Sandboxed Python executor with 5s CPU / 256MB RAM limits
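The settings above can be illustrated with a small rejection-sampling filter. This is a minimal sketch and not the project's executor: it assumes a POSIX system, enforces the CPU and memory caps with `resource.setrlimit` in a child process, and treats a zero exit code as "passed all tests". The function names and the example candidates are hypothetical, and process limits alone are not a security boundary.

```python
import resource  # POSIX only
import subprocess
import sys

CPU_SECONDS = 5                    # "5s CPU" from the settings above
MEMORY_BYTES = 256 * 1024 * 1024   # "256MB RAM" from the settings above


def _apply_limits(cpu_seconds, memory_bytes):
    """Build a hook that caps CPU time and address space in the child process."""
    def hook():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return hook


def passes_tests(solution, tests, cpu_seconds=CPU_SECONDS, memory_bytes=MEMORY_BYTES):
    """Return True if `solution` followed by every assert in `tests` exits cleanly."""
    program = solution + "\n\n" + "\n".join(tests) + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", program],
            capture_output=True,
            timeout=2 * cpu_seconds,  # wall-clock backstop for code that sleeps or blocks
            preexec_fn=_apply_limits(cpu_seconds, memory_bytes),
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def rejection_sample(candidates, tests):
    """Keep only the candidate solutions that pass all tests."""
    return [c for c in candidates if passes_tests(c, tests)]


# Hypothetical candidates for one prompt: one correct, one wrong.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
print(len(rejection_sample(candidates, tests)), "of", len(candidates), "kept")
```

A production setup would more likely add filesystem and network isolation (for example containers) on top of these limits.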
Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Epochs | 3 |
| Effective batch size | 16 |
| Learning rate | 2.0e-5 |
| LR scheduler | cosine |
| Warmup ratio | 0.03 |
| Optimizer | AdamW |
| Precision | bfloat16 |
| Hardware | 1× A100 40GB |
| Wall clock | 4.5 minutes |
| Final train loss | 0.632 |
| Final token accuracy | 0.902 |
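To make the schedule in this table concrete, the sketch below derives the number of optimizer steps and evaluates a linear-warmup plus cosine-decay learning rate. It assumes all 2,580 pairs were used for training with no held-out split or sequence packing, that the last partial batch of each epoch is kept, and that the scheduler decays to zero over half a cosine cycle, as the common Hugging Face cosine scheduler does. None of these details are confirmed by the card.

```python
import math

EXAMPLES = 2580        # (prompt, solution) pairs
BATCH = 16             # effective batch size
EPOCHS = 3
PEAK_LR = 2.0e-5
WARMUP_RATIO = 0.03

steps_per_epoch = math.ceil(EXAMPLES / BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer updates."""
    if step < warmup_steps:
        # linear ramp from 0 to the peak learning rate
        return PEAK_LR * step / max(1, warmup_steps)
    # cosine decay from the peak down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"{steps_per_epoch} steps/epoch, {total_steps} total, {warmup_steps} warmup")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run is short (a few hundred updates), which fits the reported wall-clock time of a few minutes on a single GPU.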
Usage
The adapter must be loaded on top of the base Qwen-2.5-Coder-1.5B-Instruct model. Don't try to load it directly: it's a low-rank update to the base weights, not a full model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
ADAPTER = "dmaheshwar22/qwen-1.5b-coder-sft-v1"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)  # attach the LoRA weights to the base model
model.eval()

system = (
    "You are an expert Python programmer. Respond with a single Python "
    "code block containing the requested function and nothing else."
)
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Write a function `fib(n: int) -> int` that returns the nth Fibonacci number."},
]
inputs = tok.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tok.eos_token_id,
    )
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))  # decode only the new tokens
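Because the system prompt asks for a single Python code block, the decoded text usually needs one post-processing step before it can be executed or tested. The helper below is a hypothetical sketch of that step and is not part of the released code: it pulls the first fenced block out of a response and falls back to the raw text when no fence is found.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fence

CODE_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(response: str) -> str:
    """Return the body of the first fenced code block, or the stripped response."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else response.strip()


# Hypothetical model output, for illustration only.
reply = "Sure:\n" + FENCE + "python\ndef fib(n):\n    return n\n" + FENCE + "\nDone."
print(extract_code(reply))
```

The fallback branch is a design choice: it keeps responses that omit the fence usable, at the cost of occasionally passing prose through to the executor.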
For faster inference, merge the adapter into the base once and serve via vLLM:
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct')
m = PeftModel.from_pretrained(base, 'dmaheshwar22/qwen-1.5b-coder-sft-v1')
m = m.merge_and_unload()
m.save_pretrained('./qwen-1.5b-sft-v1-merged')
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct').save_pretrained('./qwen-1.5b-sft-v1-merged')
"
vllm serve ./qwen-1.5b-sft-v1-merged --max-model-len 4096
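Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The client below is a sketch under several assumptions: the server listens on vLLM's default `localhost:8000`, the served model name defaults to the path passed to `vllm serve`, and the helper names are hypothetical. Only the standard library is used.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed default host and port
MODEL = "./qwen-1.5b-sft-v1-merged"    # assumed to equal the path given to `vllm serve`


def build_chat_request(prompt: str, system: str) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, system: str) -> str:
    """Send the request and return the assistant message text (needs a live server)."""
    with urllib.request.urlopen(build_chat_request(prompt, system), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "Write a function `fib(n: int) -> int`.",
    "You are an expert Python programmer.",
)
print(request.full_url)
```

The sampling parameters mirror the `generate` call shown earlier in this section. If the server was started with a custom served model name, `MODEL` would need to match it.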
Intended use
Short-form Python function generation in the style of MBPP / HumanEval problems. Designed as the warm-start policy for GRPO RL training that uses sandboxed test-execution as the reward signal.
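To show why a warm-start policy matters for this kind of training, here is a minimal sketch of the group-relative advantage used in GRPO-style methods: each completion's reward is normalized against the other completions sampled for the same prompt. The binary pass/fail reward and the function name are assumptions for illustration. The project's actual reward shaping is not described in this card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups: 1.0 = passed all sandboxed tests, 0.0 = failed.
mixed_group = [1.0, 0.0, 0.0, 1.0]
all_failed = [0.0, 0.0, 0.0, 0.0]

print(group_relative_advantages(mixed_group))  # positive for passes, negative for failures
print(group_relative_advantages(all_failed))   # all zeros: no learning signal
```

When every sample in a group fails, the advantages are all zero and the prompt contributes no gradient. That is one plausible reason to start RL from a policy that already solves a reasonable share of tasks.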
Reasonable use cases:
- Code completion / generation in interactive coding tools
- Baseline for further fine-tuning experiments
- Comparison against base model in research on small-scale post-training
Limitations
- Small model (1.5B): competent on isolated Python functions, weak on multi-file repositories or large-context reasoning
- Narrow training distribution: trained only on MBPP-style algorithmic problems; less reliable on competitive programming, systems code, or domain-specific libraries
- No RLHF for general instruction-following: outputs assume the MBPP-style chat format shown above
- Inherits any biases / safety properties of Qwen-2.5-Coder base
- The +1.1 pt HumanEval+ improvement is statistically modest (within the ~3.8 pt noise floor for n=164 tasks). Gains are real but bounded; see the project repo for the analysis methodology
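One way to arrive at a noise floor of about 3.8 pt is the binomial standard error of a pass rate measured on 164 tasks. The sketch below is an assumption about where that figure comes from, not the project's documented methodology. It treats each task as a single pass/fail trial and ignores both the n=5 averaging and the pairing between the two models.

```python
import math

N_TASKS = 164
BASE_PASS1 = 0.6268
SFT_PASS1 = 0.6378


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


se_base = binomial_se(BASE_PASS1, N_TASKS)
se_sft = binomial_se(SFT_PASS1, N_TASKS)
delta = SFT_PASS1 - BASE_PASS1

print(f"SE(base) = {se_base * 100:.2f} pt, SE(SFT) = {se_sft * 100:.2f} pt")
print(f"delta = {delta * 100:.2f} pt, delta / SE(base) = {delta / se_base:.2f}")
```

Under this simplified model the observed difference is well under one standard error, which matches the caution expressed in the last bullet above.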
Citation
@misc{maheshwari2026verifiable,
title = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
author = {Maheshwari, Devesh},
year = {2026},
url = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}
Acknowledgements
Built on:
- Qwen-2.5-Coder: base model
- TRL: SFT trainer
- PEFT: LoRA implementation
- evalplus: benchmark harness
Trained on the UW-Madison CHTC cluster.