Tina Open-RS2 Reproduce (GRPO + LoRA)

Lora+Grpo training: Open-RS2 model

Training Details

Item Value
Base Model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Method GRPO + LoRA
LoRA rank / alpha 32 / 128
Dataset knoveleng/open-rs (7,000 samples)
Reward Functions format (w=1.0) + accuracy (w=2.0)
Learning Rate 1e-6 (cosine with min lr)
Batch Size 6 per device × 4 grad accum = effective 24
Max Completion Length 3584
Hardware 2× NVIDIA A100-SXM4-80GB (RunPod)
Training Steps 600 / 850
Trainable Parameters 36.9M

Checkpoints

12 checkpoints from step 50 to step 600 (every 50 steps). Paper reports best performance at step 450 (Avg 50.60% across 6 benchmarks).

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "whalexdfsa/open-rs2-lora-grpo", subfolder="checkpoint-450")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

Training Logs

WandB: View training curves

Code

GitHub: LYF22034/open-rs2-grpo-lora

Reference

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for whalexdfsa/open-rs2-lora-grpo

Adapter
(314)
this model

Dataset used to train whalexdfsa/open-rs2-lora-grpo