Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale
Multi-seed empirical study (v2) of GRPO post-training at sub-3B scale under a single-GPU LoRA constraint. It compares four arms across three random seeds on AMC-23, MATH-500, AIME-2024, and 10-language MGSM.
Author: Vu Dang (Independent Researcher) — vu.dh4494@gmail.com
Date: 2026-05-11
Three orthogonal findings
- A3 (lang-consistency reward) achieves the highest mean AIME maj@8 with +4.4pp over base, robust across seeds.
- Vanilla English GRPO (A1) exhibits σ=11.3pp variation across seeds on AMC-23, indicating that single-seed claims are unreliable at this scale.
- All training arms ≈ base on 10-language MGSM — no cross-lingual transfer observed.
- A4 constant-bias ablation partially refutes a pure reward-magnitude mechanism interpretation.
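The findings above rest on two quantities: maj@8 accuracy and the spread of a metric across seeds. The sketch below shows one way these could be computed. It is not the repository's evaluation code: the function names are hypothetical, ties in the vote are broken by first occurrence, and the use of the sample standard deviation for σ is an assumption.

```python
from collections import Counter
from statistics import mean, stdev


def majority_vote(answers):
    """Return the most common non-None answer among the samples, or None."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    # Ties are resolved in favour of the answer seen first (an assumption).
    return counts.most_common(1)[0][0]


def maj_at_k(samples_per_problem, gold_answers):
    """Fraction of problems whose majority-voted answer equals the gold answer."""
    hits = sum(
        majority_vote(samples) == gold
        for samples, gold in zip(samples_per_problem, gold_answers)
    )
    return hits / len(gold_answers)


def seed_summary(scores_by_seed):
    """Mean and sample standard deviation of one metric across seeds."""
    values = list(scores_by_seed.values())
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not results from the study.
    toy_samples = [["3", "3", "4"], ["1", "2", "2"]]
    toy_gold = ["3", "1"]
    print(maj_at_k(toy_samples, toy_gold))
    print(seed_summary({7: 40.0, 123: 50.0}))
```

With only two or three seeds per arm, the standard deviation is itself a noisy estimate, so it is best reported next to the per-seed values.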
Repository contents
| Path | Description |
|---|---|
| paper/ | Full manuscript: main.pdf, main.tex, refs.bib, figures/, tables/, appendix.tex |
| paper/MODEL_CARD.md | Detailed model card for released LoRA adapters |
| configs/ | All training/eval YAML configs (SFT, GRPO arms A1–A4, eval) |
| results/eval/ | Full eval JSON outputs with responses[] arrays (AMC23, MATH-500, AIME-2024, MGSM 10 langs) |
| results/master.csv | Aggregated metrics across all runs |
| results/grpo/{run_id}/checkpoint-50/ | LoRA adapters + merged models per seed |
Released checkpoints
All checkpoints use the base model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with LoRA r=16, α=32, dropout=0.05, and target modules q_proj, k_proj, v_proj, o_proj. Each was trained for 50 steps with an effective batch size of 96 on 1×A100 80GB.
| Run ID | Arm | Training data | Rewards | Seed |
|---|---|---|---|---|
| reproduce_openrs_rs2_7 | A1 (Open-RS RS2) | knoveleng/open-rs (7K EN) | R1+R2 | 7 |
| reproduce_openrs_rs2_123 | A1 (Open-RS RS2) | knoveleng/open-rs (7K EN) | R1+R2 | 123 |
| a2_vi_7 | A2 (VI-translated) | 5CD-AI/Vietnamese-meta-math-MetaMathQA-40K-gg-translated (5,203 filtered) | R1+R2 | 7 |
| a2_vi_123 | A2 (VI-translated) | same | R1+R2 | 123 |
| a3_enlang_7 | A3 (EN + lang reward) | knoveleng/open-rs (7K EN) | R1+R2+R5 (fastText) | 7 |
| a3_enlang_123 | A3 (EN + lang reward) | same | R1+R2+R5 | 123 |
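For a sense of adapter size, the LoRA settings listed above (r=16, α=32, four attention projections) can be turned into a rough trainable-parameter count. The sketch below is plain arithmetic. The architecture values (hidden size, layer count, head counts) are assumptions about the base model that this card does not state, so check them against the base model's config.json.

```python
def lora_param_count(in_features, out_features, rank):
    """A LoRA pair A (rank x in) and B (out x rank) adds rank * (in + out) weights."""
    return rank * (in_features + out_features)


# ASSUMED architecture values for the base model (not stated in this card).
HIDDEN = 1536
NUM_LAYERS = 28
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS

# Values taken from the card.
RANK = 16
ALPHA = 32

kv_dim = NUM_KV_HEADS * HEAD_DIM
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scales the update by alpha / r

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
print(f"Update scaling factor:  {scaling}")
```

Under these assumed dimensions the adapter holds a few million weights, which is consistent with training on a single GPU.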
Inference example
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = PeftModel.from_pretrained(base, "results/grpo/a3_enlang_7/checkpoint-50/")

# Build the chat-formatted prompt as a string.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": (
        "Solve the following math problem efficiently and clearly. "
        "The last line of your response should be of the following format: "
        "'Therefore, the final answer is: $\\boxed{ANSWER}$.' "
        "Think step by step.\n\n"
        "What is the sum of all positive integers less than 100 divisible by 7?")}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt")
# Greedy decoding: pass do_sample=False instead of temperature=0.0, which
# transformers can reject when sampling is enabled in the generation config.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0], skip_special_tokens=True))
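Because the prompt asks the model to end with a `\boxed{ANSWER}` line, the final answer can be pulled out of the decoded text with a small brace-matching parser. This is a minimal sketch, not the repository's grader: `extract_last_boxed` is a hypothetical helper, and it returns the raw string without any normalization or equivalence checking.

```python
def extract_last_boxed(text):
    """Return the contents of the last boxed expression in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are just inside the opening brace of the marker
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced, e.g. a truncated generation


if __name__ == "__main__":
    response = "Therefore, the final answer is: $\\boxed{735}$."
    print(extract_last_boxed(response))
```

Tracking brace depth rather than using a simple regular expression keeps nested LaTeX such as `\boxed{\frac{1}{2}}` intact.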
Reproducibility
- Framework: TRL 0.15.2 GRPOTrainer with in-process vLLM 0.7.2 rollout
- Hardware: 1× NVIDIA A100-SXM4-80GB (Vast.ai cloud)
- Seeds: 7, 123 (some runs include 42 for ablation)
- Decontamination: 8-gram match against test sets before training
- Code repo: https://github.com/vudang4494/xling-grpo-sub3b
- Zenodo DOI: 10.5281/zenodo.20061328
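The 8-gram decontamination step listed above can be illustrated with a short filter that drops any training example sharing an 8-gram with a test item. This is a sketch under stated assumptions: lowercased word-level tokens and exact matching are choices made here for illustration, the function names are hypothetical, and the repository's actual tokenization and matching rules may differ.

```python
import re


def ngrams(text, n=8):
    """Set of word-level n-grams in text (lowercased; tokenization is an assumption)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_test_index(test_texts, n=8):
    """Union of all n-grams that occur in any test item."""
    index = set()
    for text in test_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(train_texts, test_texts, n=8):
    """Keep only training texts that share no n-gram with the test set."""
    index = build_test_index(test_texts, n)
    return [t for t in train_texts if ngrams(t, n).isdisjoint(index)]


if __name__ == "__main__":
    test_set = ["What is the sum of all positive integers less than 100 divisible by 7?"]
    train_set = [
        "Problem: what is the sum of all positive integers less than 50?",
        "Compute the product of the first five primes.",
    ]
    print(decontaminate(train_set, test_set))
```

Texts shorter than eight tokens produce no 8-grams and are therefore always kept by this filter.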
AI assistance disclosure
Source code and manuscript prepared with AI assistance (Anthropic Claude via Claude Code CLI); all empirical results independently verified. See VERIFICATION.md for full AI-assistance disclosure.
Citation
@software{dang2026xlinggrpo,
  author  = {Dang, Vu},
  title   = {Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale},
  year    = {2026},
  version = {2.0},
  doi     = {10.5281/zenodo.20061328},
  url     = {https://huggingface.co/vudang449/xling-grpo-sub3b}
}
License
LoRA adapter weights and code: Apache-2.0. Paper text: CC-BY-4.0.