Tina Open-RS2 Reproduce (GRPO + LoRA)

Lora+Grpo training: Open-RS2 model

Training Details

Item	Value
Base Model	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Method	GRPO + LoRA
LoRA rank / alpha	32 / 128
Dataset	knoveleng/open-rs (7,000 samples)
Reward Functions	format (w=1.0) + accuracy (w=2.0)
Learning Rate	1e-6 (cosine with min lr)
Batch Size	6 per device × 4 grad accum = effective 24
Max Completion Length	3584
Hardware	2× NVIDIA A100-SXM4-80GB (RunPod)
Training Steps	600 / 850
Trainable Parameters	36.9M

Checkpoints

12 checkpoints from step 50 to step 600 (every 50 steps). Paper reports best performance at step 450 (Avg 50.60% across 6 benchmarks).

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "whalexdfsa/open-rs2-lora-grpo", subfolder="checkpoint-450")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

Training Logs

WandB: View training curves

Code

GitHub: LYF22034/open-rs2-grpo-lora

Reference

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for whalexdfsa/open-rs2-lora-grpo

Base model

deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

Adapter

(314)

this model

whalexdfsa
/

open-rs2-lora-grpo