Qwen3-0.6B RLVR-GRPO
Overview
This repository contains a Qwen3-0.6B checkpoint trained using Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) on mathematical reasoning tasks.
The checkpoint was produced as part of the Open Post-Training System project, an open-source effort focused on reproducing and studying modern reasoning-model post-training techniques.
Open Post-Training System
https://github.com/shaheennabi/open-posttraining-system
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-0.6B |
| Training Method | RLVR + GRPO |
| Domain | Mathematical Reasoning |
| Dataset | rasbt/math_full_minus_math500 |
| Training Steps | Multiple GRPO Steps |
Training Method
This model was trained using:
- Reinforcement Learning with Verifiable Rewards (RLVR)
- Group Relative Policy Optimization (GRPO)
- Multiple rollout generation per prompt
- Rule-based mathematical answer verification
- Outcome-based reward assignment
Rewards are assigned based on the correctness of the final answer rather than matching a reference reasoning trace.
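The outcome-based reward and the group-relative step can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation. It assumes the final answer is wrapped in `\boxed{...}` and compared by exact string match, and it uses the common GRPO formulation that normalizes each reward by the mean and standard deviation of its group. The function names `extract_final_answer`, `outcome_reward`, and `group_relative_advantages` are hypothetical.

```python
import re
from statistics import mean, pstdev


def extract_final_answer(text):
    """Return the content of the last boxed answer in text, or None.

    Assumption: final answers are written as \\boxed{...} without nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def outcome_reward(completion, reference):
    """Return 1.0 if the extracted final answer equals the reference, else 0.0.

    Only the final answer is checked; the reasoning trace is ignored.
    Exact string matching is a simplification of a real math verifier.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same prompt (reference answer: 60).
rollouts = [
    "Speed = 120 / 2, so the answer is \\boxed{60}.",
    "I think it is \\boxed{240}.",
    "120 km over 2 h gives \\boxed{60}.",
    "No final answer given.",
]
rewards = [outcome_reward(r, "60") for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, relative to the other rollouts for the same prompt. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.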
Usage
This repository includes:
- GRPO-trained checkpoint
- Tokenizer files
- Model configuration files
Load the model directly from Hugging Face using the Open Post-Training System loader:
from model_and_tokenizer import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer(
    checkpoint_path="devshaheen/qwen3.5_0.6B_rlvr_grpo_checkpoints"
)
Example Prompt
Solve the following problem:
If a train travels 120 km in 2 hours, what is its average speed?
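As a rough sketch, the prompt above can be assembled programmatically and paired with its expected answer for a quick sanity check. The helpers `build_prompt` and `average_speed` are hypothetical, and whether this exact template matches the one used during training is an assumption.

```python
def build_prompt(problem):
    """Prefix a problem statement with the instruction line from the example.

    Assumption: the instruction text mirrors the example prompt shown above.
    """
    return f"Solve the following problem:\n{problem.strip()}"


def average_speed(distance_km, time_h):
    """Average speed in km/h is distance divided by elapsed time."""
    return distance_km / time_h


prompt = build_prompt(
    "If a train travels 120 km in 2 hours, what is its average speed?"
)
expected = average_speed(120, 2)  # 120 km / 2 h

print(prompt)
print(expected)
```

The expected value can then be compared against the final answer the model produces for this prompt.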
Intended Uses
- Mathematical reasoning experiments
- RLVR research
- GRPO research
- Reasoning model evaluation
- Open-source post-training research
- Educational purposes
Limitations
- Trained for only a limited number of GRPO optimization steps (multiple checkpoints are available).
- Intended as an experimental research checkpoint.
- Additional RL training is expected to improve performance.
- Reasoning quality may vary across problem types.
Open Post-Training System
This model was trained using the Open Post-Training System:
https://github.com/shaheennabi/open-posttraining-system
The project provides open implementations of:
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning with Verifiable Rewards (RLVR)
- Group Relative Policy Optimization (GRPO)
- Reasoning evaluation pipelines
- Inference-time scaling
- Open reasoning model development
Citation
If you use this model, please cite or acknowledge:
- Qwen3-0.6B
- math_full_minus_math500
- Open Post-Training System
Disclaimer
This repository contains an experimental research checkpoint and is not intended for production use.