Qwen3-1.7B GRPO for Python Code Generation on MBPP

This repo contains a GRPO training script for improving Qwen/Qwen3-1.7B-Base on Python code generation using executable rewards on the google-research-datasets/mbpp dataset.

Training objective

The reward function:

executes generated Python code in a subprocess,
scores whether it runs without errors,
checks whether MBPP assertions pass,
checks whether the target function has a proper docstring.

Reward weights in the script:

run without timeout/runtime failure: 0.25
pass assertions: 0.60
docstring present: 0.15

Dataset

Train/eval dataset: google-research-datasets/mbpp (sanitized config)
Verified columns: prompt, code, test_imports, test_list
The script converts the dataset to TRL GRPO prompt-only conversational format.

Model

Base model: Qwen/Qwen3-1.7B-Base
Architecture verified from model config: Qwen3ForCausalLM

Reference recipe

Published executable-feedback code RL recipes that informed this setup:

StepCoder (2402.01391): compiler/unit-test reward shaping on APPS+
ACECoder (2502.01718): large-scale synthesized test-case RLVR on code tasks
DeepSeekMath (2402.03300): GRPO algorithmic anchor

Launch example

python train_grpo_python_mbpp.py \
  --output_dir outputs/qwen3-1.7b-grpo-mbpp \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_generations 4 \
  --learning_rate 1e-6 \
  --max_prompt_length 512 \
  --max_completion_length 384 \
  --num_train_epochs 1 \
  --eval_strategy steps \
  --eval_steps 20 \
  --save_steps 20 \
  --logging_steps 1 \
  --bf16 True \
  --gradient_checkpointing True \
  --report_to trackio \
  --run_name grpo_qwen3_1p7b_mbpp_exec_reward \
  --project grpo-qwen3-python-code \
  --trackio_space_id AbhilekhMeda/mlintern-grpoqwen \
  --push_to_hub True \
  --hub_model_id AbhilekhMeda/qwen3-1.7b-grpo-python-mbpp

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support