YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Qwen3-1.7B GRPO for Python Code Generation on MBPP
This repo contains a GRPO training script for improving Qwen/Qwen3-1.7B-Base on Python code generation using executable rewards on the google-research-datasets/mbpp dataset.
Training objective
The reward function:
- executes generated Python code in a subprocess,
- scores whether it runs without errors,
- checks whether MBPP assertions pass,
- checks whether the target function has a proper docstring.
Reward weights in the script:
- run without timeout/runtime failure:
0.25 - pass assertions:
0.60 - docstring present:
0.15
Dataset
- Train/eval dataset:
google-research-datasets/mbpp(sanitizedconfig) - Verified columns:
prompt,code,test_imports,test_list - The script converts the dataset to TRL GRPO prompt-only conversational format.
Model
- Base model:
Qwen/Qwen3-1.7B-Base - Architecture verified from model config:
Qwen3ForCausalLM
Reference recipe
Published executable-feedback code RL recipes that informed this setup:
- StepCoder (
2402.01391): compiler/unit-test reward shaping on APPS+ - ACECoder (
2502.01718): large-scale synthesized test-case RLVR on code tasks - DeepSeekMath (
2402.03300): GRPO algorithmic anchor
Launch example
python train_grpo_python_mbpp.py \
--output_dir outputs/qwen3-1.7b-grpo-mbpp \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--num_generations 4 \
--learning_rate 1e-6 \
--max_prompt_length 512 \
--max_completion_length 384 \
--num_train_epochs 1 \
--eval_strategy steps \
--eval_steps 20 \
--save_steps 20 \
--logging_steps 1 \
--bf16 True \
--gradient_checkpointing True \
--report_to trackio \
--run_name grpo_qwen3_1p7b_mbpp_exec_reward \
--project grpo-qwen3-python-code \
--trackio_space_id AbhilekhMeda/mlintern-grpoqwen \
--push_to_hub True \
--hub_model_id AbhilekhMeda/qwen3-1.7b-grpo-python-mbpp
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support