qwen-1.5b-sft-v1

Supervised fine-tuned LoRA adapter on Qwen-2.5-Coder-1.5B-Instruct, trained on rejection-sampled MBPP-train completions for verifiable Python code generation.

Highlights

  • +1.1 pts HumanEval+ pass@1 over base (0.6378 vs 0.6268)
  • 17 MB LoRA adapter (rank 16, applied to q/k/v/o projections)
  • Trained on 2,580 rejection-sampled solutions from 319 MBPP-train tasks, contamination-filtered against the MBPP+ test set
  • Used as the warm-start policy for GRPO post-training in the verifiable-rl-coder project
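
The 17 MB adapter size can be sanity-checked with a rough parameter count. The sketch below is an estimate under two assumptions that this card does not state: the architecture numbers come from the public Qwen2.5-1.5B config (hidden size 1536, 28 layers, 2 key/value heads of dimension 128), and the adapter weights are stored as 4-byte float32 values.

```python
# Rough LoRA size estimate. The architecture numbers are assumptions taken
# from the public Qwen2.5-1.5B config, not from this model card.
HIDDEN = 1536
NUM_LAYERS = 28
HEAD_DIM = 128
NUM_KV_HEADS = 2
RANK = 16

KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 256, because of grouped-query attention

# (in_features, out_features) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted module."""
    per_layer = sum(rank * (fin + fout) for fin, fout in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


params = lora_param_count()
size_mb = params * 4 / 1e6  # assumes 4-byte float32 weights
print(f"{params:,} trainable parameters, about {size_mb:.1f} MB")
```

Under these assumptions the count is roughly 4.36M parameters, which lands close to the stated 17 MB.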

Evaluation

Evaluated with evalplus at temperature 0.2, n=5 samples per task.

Benchmark | Metric | Base Qwen-1.5B | This model | Δ (pts)
HumanEval+ | pass@1 | 0.6268 | 0.6378 | +1.10
HumanEval+ | pass@5 | 0.7073 | 0.6951 | −1.22

The pass@1 lift is modest but consistent with what targeted LoRA SFT on a small (319-prompt) dataset is expected to deliver. Most of the improvement comes from a small subset of problems requiring specific structured-knowledge patterns (e.g. Roman numeral subtractive notation, edge-case-aware list operations); see DEMO_EXAMPLES.md for side-by-side comparisons.
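
For readers unfamiliar with pass@k, here is a minimal sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per task and c the number that pass. It assumes evalplus computes its scores this way. The per-task counts are hypothetical and are not results from this model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task correct counts out of n=5 samples (not real results).
correct_counts = [5, 3, 0, 1]
n = 5
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    print(f"pass@{k} = {score:.4f}")
```

With n=5, pass@5 only asks whether any of the five samples passed, while pass@1 reduces to the average fraction of passing samples. One possible reading of the table is therefore that the adapter is more consistent per sample while solving slightly fewer distinct tasks.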

Training data

Source | Count
MBPP train + validation + prompt splits | 974 raw tasks
Disjoint from MBPP+ (contamination-filtered) | 319 unique prompts
Rejection-sampled completions kept (passed all hidden tests) | 2,580 (prompt, solution) pairs
Overall sandbox pass rate during sampling | 50.5%
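
The counts above are mutually consistent, which a few lines of arithmetic confirm. This check assumes that every passing completion was kept (no deduplication) and that each of the 319 prompts received the full N=16 samples listed under the rejection sampling settings.

```python
prompts = 319
samples_per_prompt = 16  # N from the rejection sampling settings
kept = 2_580

total_samples = prompts * samples_per_prompt
pass_rate = kept / total_samples

print(total_samples)                                        # 5104
print(f"{pass_rate:.1%}")                                   # 50.5%
print(f"{kept / prompts:.1f} kept per prompt on average")   # 8.1
```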

Rejection sampling settings:

  • N=16 completions per prompt
  • Temperature 0.8
  • Sandboxed Python executor with 5s CPU / 256MB RAM limits
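
The rejection step can be sketched as below. This is an illustration, not the project's executor: `passes_tests` and `keep_passing` are hypothetical names, the limits are applied with POSIX `resource.setrlimit` (so it assumes Linux), and resource limits alone give no filesystem or network isolation.

```python
import resource
import subprocess
import sys

CPU_SECONDS = 5
MEMORY_BYTES = 256 * 1024 * 1024


def _apply_limits() -> None:
    # Runs in the child process just before exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))


def passes_tests(solution: str, tests: str) -> bool:
    """Return True if the solution plus its assert-style tests exit cleanly."""
    program = solution + "\n\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", program],
            preexec_fn=_apply_limits,
            capture_output=True,
            timeout=CPU_SECONDS + 5,  # wall-clock backstop for sleeping code
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def keep_passing(completions: list[str], tests: str) -> list[str]:
    """Rejection step: keep only completions that pass every test."""
    return [c for c in completions if passes_tests(c, tests)]


# Hypothetical example: one correct and one incorrect completion.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
print(len(keep_passing(candidates, "assert add(2, 3) == 5")))
```

Running each candidate in a fresh interpreter keeps a crash or infinite loop in one completion from affecting the others.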

Hyperparameters

Parameter | Value
LoRA rank | 16
LoRA alpha | 32
LoRA target modules | q_proj, k_proj, v_proj, o_proj
Epochs | 3
Effective batch size | 16
Learning rate | 2.0e-5
LR scheduler | cosine
Warmup ratio | 0.03
Optimizer | AdamW
Precision | bfloat16
Hardware | 1× A100 40GB
Wall clock | 4.5 minutes
Final train loss | 0.632
Final token accuracy | 0.902
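
To make the schedule concrete, the sketch below derives approximate step counts from the table and evaluates a linear-warmup, cosine-decay learning rate. It assumes all 2,580 pairs were used for training with no held-out split, that the last partial batch is kept, and that warmup steps are rounded up; the actual trainer settings may differ.

```python
import math

BASE_LR = 2.0e-5
EPOCHS = 3
BATCH_SIZE = 16
NUM_EXAMPLES = 2_580  # assumes all kept pairs were used for training
WARMUP_RATIO = 0.03

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)  # 162
total_steps = steps_per_epoch * EPOCHS                   # 486
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)     # 15


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

At under 500 optimizer steps the run is short, which fits the reported wall clock of a few minutes.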


Usage

The adapter must be loaded on top of the base Qwen-2.5-Coder-1.5B-Instruct model. Don't try to load it directly: it is a low-rank update to the base weights, not a full model.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
ADAPTER = "dmaheshwar22/qwen-1.5b-sft-v1"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

system = (
    "You are an expert Python programmer. Respond with a single Python "
    "code block containing the requested function and nothing else."
)
messages = [
    {"role": "system", "content": system},
    {"role": "user",   "content": "Write a function `fib(n: int) -> int` that returns the nth Fibonacci number."},
]

inputs = tok.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
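
Because the system prompt asks for a single Python code block, the decoded text usually needs one more step before it can be executed or tested. A minimal sketch of that step follows; `extract_code` is a hypothetical helper, not part of this repository.

```python
import re

FENCE = "`" * 3  # built this way to avoid nesting fences in this example
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the raw text if none is found."""
    match = CODE_BLOCK.search(response)
    return (match.group(1) if match else response).strip()


reply = "Here you go:\n" + FENCE + "python\ndef fib(n):\n    return n\n" + FENCE
print(extract_code(reply))
```

Falling back to the raw text covers replies where the model omits the fence entirely.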

For faster inference, merge the adapter into the base once and serve via vLLM:

python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct')
m = PeftModel.from_pretrained(base, 'dmaheshwar22/qwen-1.5b-sft-v1')
m = m.merge_and_unload()
m.save_pretrained('./qwen-1.5b-sft-v1-merged')
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-1.5B-Instruct').save_pretrained('./qwen-1.5b-sft-v1-merged')
"

vllm serve ./qwen-1.5b-sft-v1-merged --max-model-len 4096
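
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the default port (8000) and that the served model name defaults to the path passed to `vllm serve`; adjust both if your deployment differs. `build_payload` and `complete` are illustrative names.

```python
import json
import urllib.request

# Assumptions: vLLM's default port and its default behaviour of naming the
# served model after the path given to `vllm serve`.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "./qwen-1.5b-sft-v1-merged"
SYSTEM = (
    "You are an expert Python programmer. Respond with a single Python "
    "code block containing the requested function and nothing else."
)


def build_payload(task: str) -> dict:
    """Chat request mirroring the sampling settings used in the example above."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": task},
        ],
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 512,
    }


def complete(task: str) -> str:
    """Send one request to the running server and return the reply text."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(task)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example call (requires the server to be up):
# print(complete("Write a function `fib(n: int) -> int`."))
```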

Intended use

Short-form Python function generation in the style of MBPP / HumanEval problems. Designed as the warm-start policy for GRPO RL training that uses sandboxed test-execution as the reward signal.
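
To illustrate how a test-execution reward feeds GRPO, the sketch below computes group-relative advantages: each completion's reward is normalized against the other completions sampled for the same prompt. The binary reward and the population standard deviation are assumptions made for illustration; the project's actual reward shaping and normalization are not described in this card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one prompt: 1.0 = all tests passed.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

A group in which every completion passes, or every completion fails, yields zero advantage for all members, so a starting policy with a mixed pass rate gives the RL stage more signal to learn from.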

Reasonable use cases:

  • Code completion / generation in interactive coding tools
  • Baseline for further fine-tuning experiments
  • Comparison against base model in research on small-scale post-training

Limitations

  • Small model (1.5B): competent on isolated Python functions, weak on multi-file repositories or large-context reasoning
  • Narrow training distribution: trained only on MBPP-style algorithmic problems; less reliable on competitive programming, systems code, or domain-specific libraries
  • No RLHF for general instruction-following: outputs assume the MBPP-style chat format shown above
  • Inherits any biases / safety properties of the Qwen-2.5-Coder base model
  • The +1.1 pt HumanEval+ pass@1 improvement is statistically modest: it falls within the ~3.8 pt noise floor for n=164 tasks, so treat the gain as small and bounded; see the project repo for the analysis methodology
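
One way to arrive at a noise floor of about 3.8 points is the binomial standard error of a pass rate measured on 164 tasks. The sketch below treats each task as a single pass/fail trial, which is a simplification given that five samples were drawn per task; the project repo's analysis may use a different method.

```python
import math


def pass_rate_standard_error(p: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)


base, tuned, n_tasks = 0.6268, 0.6378, 164
se = pass_rate_standard_error(base, n_tasks)
print(f"standard error: {100 * se:.1f} pts")              # 3.8 pts
print(f"observed lift:  {100 * (tuned - base):.1f} pts")  # 1.1 pts
```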

Citation

@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}

Acknowledgements

Built on:

Trained on the UW-Madison CHTC cluster.
