LoRA + GRPO training: Open-RS2 model
| Item | Value |
|---|---|
| Base Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| Method | GRPO + LoRA |
| LoRA rank / alpha | 32 / 128 |
| Dataset | knoveleng/open-rs (7,000 samples) |
| Reward Functions | format (w=1.0) + accuracy (w=2.0) |
| Learning Rate | 1e-6 (cosine schedule with a minimum LR) |
| Batch Size | 6 per device × 4 grad accum = effective 24 |
| Max Completion Length | 3584 |
| Hardware | 2× NVIDIA A100-SXM4-80GB (RunPod) |
| Training Steps | 600 / 850 |
| Trainable Parameters | 36.9M |
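The arithmetic behind a few of the table entries can be sketched as follows. This is only an illustration, not the training code: the helper names are hypothetical, the LoRA scaling factor assumes the standard `alpha / rank` convention, and the reward combination assumes a plain weighted sum of per-function scores.

```python
# Values copied from the table above.
LORA_RANK, LORA_ALPHA = 32, 128
PER_DEVICE_BATCH, GRAD_ACCUM = 6, 4
REWARD_WEIGHTS = {"format": 1.0, "accuracy": 2.0}


def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling applied to the low-rank update, assuming the usual alpha / rank rule."""
    return alpha / rank


def effective_batch(per_device: int, grad_accum: int) -> int:
    """Samples contributing to one optimizer step on a single training device."""
    return per_device * grad_accum


def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-function rewards (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


print(lora_scaling(LORA_RANK, LORA_ALPHA))            # 4.0
print(effective_batch(PER_DEVICE_BATCH, GRAD_ACCUM))  # 24
# Hypothetical completion that is well formatted and correct (both scores 1.0).
print(combined_reward({"format": 1.0, "accuracy": 1.0}, REWARD_WEIGHTS))  # 3.0
```

Under these assumptions, a correct answer contributes twice as much to the total reward as a well-formatted one.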
Twelve checkpoints are available, one every 50 steps from step 50 to step 600. The paper reports its best performance at step 450 (50.60% average across 6 benchmarks).
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bfloat16, then attach the LoRA adapter saved at step 450.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "whalexdfsa/open-rs2-lora-grpo", subfolder="checkpoint-450")
# The adapter does not change the vocabulary, so the base tokenizer is used as-is.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
```
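To load a different checkpoint, only the `subfolder` argument should need to change. The sketch below enumerates the candidate subfolder names. It assumes every checkpoint follows the `checkpoint-<step>` pattern seen for step 450, which is not confirmed for the other steps, and `checkpoint_subfolders` is a hypothetical helper, not part of the repository.

```python
def checkpoint_subfolders(first: int = 50, last: int = 600, every: int = 50) -> list[str]:
    """List adapter subfolder names, assuming the 'checkpoint-<step>' naming pattern."""
    return [f"checkpoint-{step}" for step in range(first, last + 1, every)]


subfolders = checkpoint_subfolders()
print(len(subfolders))                 # 12
print(subfolders[0], subfolders[-1])   # checkpoint-50 checkpoint-600
```

Any of these names could then be passed as `subfolder=` to `PeftModel.from_pretrained` to compare checkpoints against the step-450 result.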
WandB: View training curves
GitHub: LYF22034/open-rs2-grpo-lora