Qwen3-0.6B RLVR-GRPO

Overview

This repository contains a Qwen3-0.6B checkpoint trained using Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) on mathematical reasoning tasks.

The checkpoint was produced as part of the Open Post-Training System project, an open-source effort focused on reproducing and studying modern reasoning-model post-training techniques.

Open Post-Training System
https://github.com/shaheennabi/open-posttraining-system

Model Details

Property	Value
Base Model	Qwen/Qwen3-0.6B
Training Method	RLVR + GRPO
Domain	Mathematical Reasoning
Dataset	rasbt/math_full_minus_math500
Training Steps	Multiple GRPO Steps

Training Method

This model was trained using:

Reinforcement Learning with Verifiable Rewards (RLVR)
Group Relative Policy Optimization (GRPO)
Multiple rollout generation per prompt
Rule-based mathematical answer verification
Outcome-based reward assignment
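
The "group relative" part of GRPO can be illustrated with a minimal sketch: the rewards of all rollouts sampled for one prompt are normalized against that group's own mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; the Open Post-Training System implementation may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (one prompt).

    Hypothetical helper: eps and the population std are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage; when every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.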

Rewards are assigned based on the correctness of the final answer rather than matching a reference reasoning trace.
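
A rule-based, outcome-only reward of this kind might look like the sketch below. It assumes the model writes its final answer inside `\boxed{...}`, which is a common convention for math datasets but is not confirmed by this card; the helper names and the numeric comparison via `Fraction` are likewise illustrative assumptions.

```python
import re
from fractions import Fraction


def extract_final_answer(completion):
    """Return the content of the last \\boxed{...} in the completion, else None."""
    # Assumption: final answers are wrapped in \boxed{...} without nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion, reference):
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Only the final answer is checked; the reasoning text is ignored.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically when both sides parse ("60", "60.0", "120/2").
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to exact string comparison for non-numeric answers.
        return 1.0 if answer == reference.strip() else 0.0
```

Two completions with entirely different reasoning receive the same reward as long as their final answers agree with the reference.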

Usage

This repository already includes:

GRPO-trained checkpoint
Tokenizer files
Model configuration files

Load the model directly from Hugging Face using the Open Post-Training System loader:

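# The loader comes from the Open Post-Training System repository,
# which must be available on the Python path.
from model_and_tokenizer import load_model_and_tokenizer

# checkpoint_path is this repository's Hugging Face ID.
model, tokenizer = load_model_and_tokenizer(
    checkpoint_path="devshaheen/qwen3.5_0.6B_rlvr_grpo_checkpoints"
)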

Example Prompt

Solve the following problem:

If a train travels 120 km in 2 hours, what is its average speed?
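
For reference, the expected answer to this prompt follows from average speed = distance / time. The sketch below builds the prompt string and computes the reference value; the variable and function names are illustrative and not part of the project.

```python
def average_speed(distance_km, time_h):
    """Average speed in km/h from distance (km) and elapsed time (h)."""
    return distance_km / time_h


# The example prompt as a single string, ready to pass to a tokenizer.
prompt = (
    "Solve the following problem:\n\n"
    "If a train travels 120 km in 2 hours, what is its average speed?"
)

# Reference answer a verifier could compare against: 120 / 2 = 60 km/h.
expected_kmh = average_speed(120, 2)
```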

Intended Uses

Mathematical reasoning experiments
RLVR research
GRPO research
Reasoning model evaluation
Open-source post-training research
Educational purposes

Limitations

Trained for only a limited number of GRPO optimization steps (multiple checkpoints are available in the repository).
Intended as an experimental research checkpoint.
Additional RL training is expected to improve performance.
Reasoning quality may vary across problem types.

Open Post-Training System

This model was trained using the Open Post-Training System:

https://github.com/shaheennabi/open-posttraining-system

The project provides open implementations of:

Supervised Fine-Tuning (SFT)
Reinforcement Learning with Verifiable Rewards (RLVR)
Group Relative Policy Optimization (GRPO)
Reasoning evaluation pipelines
Inference-time scaling
Open reasoning model development

Citation

If you use this model, please cite or acknowledge:

Qwen3-0.6B
math_full_minus_math500
Open Post-Training System

Disclaimer

This repository contains an experimental research checkpoint and is not intended for production use.
