Qwen3-0.6B RLVR-GRPO

Overview

This repository contains a Qwen3-0.6B checkpoint trained using Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) on mathematical reasoning tasks.

The checkpoint was produced as part of the Open Post-Training System project, an open-source effort focused on reproducing and studying modern reasoning-model post-training techniques.

Open Post-Training System
https://github.com/shaheennabi/open-posttraining-system

Model Details

Property          Value
Base Model        Qwen/Qwen3-0.6B
Training Method   RLVR + GRPO
Domain            Mathematical Reasoning
Dataset           rasbt/math_full_minus_math500
Training Steps    Multiple GRPO steps

Training Method

This model was trained using:

  • Reinforcement Learning with Verifiable Rewards (RLVR)
  • Group Relative Policy Optimization (GRPO)
  • Multiple rollout generation per prompt
  • Rule-based mathematical answer verification
  • Outcome-based reward assignment
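As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes the rewards of several rollouts sampled for the same prompt against that group's own mean and standard deviation. It is not taken from the Open Post-Training System code: the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's rollout group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero. Actual implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that score above their group's average receive a positive advantage and those below receive a negative one. If every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal in this sketch.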

Rewards are assigned based on the correctness of the final answer rather than matching a reference reasoning trace.
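A minimal sketch of what an outcome-based, rule-checked reward could look like is shown here. The project's actual verifier is not reproduced: the helper names, the `\boxed{...}` answer convention, the fallback to the last number in the text, and the 0/1 reward values are all assumptions made for illustration.

```python
import re
from fractions import Fraction


def extract_final_answer(completion):
    """Return the last boxed answer, else the last number in the text."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", completion)
    return numbers[-1] if numbers else None


def normalize(answer):
    """Compare numerically when possible, otherwise as compact text."""
    text = answer.replace(",", "").replace(" ", "")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return text


def outcome_reward(completion, reference):
    """Score only the final answer; the reasoning text is never compared."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


print(outcome_reward(r"120 / 2 = 60, so the speed is \boxed{60}", "60"))
```

Because only the extracted final answer is checked, two completions with very different reasoning receive the same reward as long as they end in an equivalent answer.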

Usage

This repository already includes:

  • GRPO-trained checkpoint
  • Tokenizer files
  • Model configuration files

Load the model directly from Hugging Face using the Open Post-Training System loader:

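# load_model_and_tokenizer comes from the Open Post-Training System code;
# this import assumes that project's module is available on your Python path.
from model_and_tokenizer import load_model_and_tokenizer

# Pass the Hugging Face repository ID as the checkpoint path.
model, tokenizer = load_model_and_tokenizer(
    checkpoint_path="devshaheen/qwen3.5_0.6B_rlvr_grpo_checkpoints"
)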

Example Prompt

Solve the following problem:

If a train travels 120 km in 2 hours, what is its average speed?

Intended Uses

  • Mathematical reasoning experiments
  • RLVR research
  • GRPO research
  • Reasoning model evaluation
  • Open-source post-training research
  • Educational purposes

Limitations

  • Trained for only a limited number of GRPO optimization steps (multiple checkpoints are included).
  • Intended as an experimental research checkpoint.
  • Additional RL training is expected to improve performance.
  • Reasoning quality may vary across problem types.

Open Post-Training System

This model was trained using the Open Post-Training System:

https://github.com/shaheennabi/open-posttraining-system

The project provides open implementations of:

  • Supervised Fine-Tuning (SFT)
  • Reinforcement Learning with Verifiable Rewards (RLVR)
  • Group Relative Policy Optimization (GRPO)
  • Reasoning evaluation pipelines
  • Inference-time scaling
  • Open reasoning model development

Citation

If you use this model, please cite or acknowledge:

  • Qwen3-0.6B
  • math_full_minus_math500
  • Open Post-Training System

Disclaimer

This repository contains an experimental research checkpoint and is not intended for production use.
