Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale

Multi-seed empirical study (v2) of GRPO post-training at sub-3B scale under a single-GPU LoRA constraint. It compares four arms across three random seeds on AMC-23, MATH-500, AIME-2024, and 10-language MGSM.

Author: Vu Dang (Independent Researcher), vu.dh4494@gmail.com
Date: 2026-05-11

Key findings

A3 (lang-consistency reward) achieves the highest mean AIME maj@8 with +4.4pp over base, robust across seeds.
Vanilla English GRPO (A1) exhibits σ=11.3pp seed variance on AMC-23, confirming that single-seed claims are unreliable at this scale.
All training arms ≈ base on 10-language MGSM — no cross-lingual transfer observed.
The A4 constant-bias ablation partially refutes an interpretation based purely on reward magnitude.
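
The two statistics quoted above, maj@8 and the across-seed σ, can be sketched as follows. This is a minimal illustration, not the evaluation code from the repository: the function names are hypothetical, the use of a sample (n-1) standard deviation is an assumption, and the numbers are placeholders rather than results from this study.

```python
from collections import Counter
from statistics import mean, stdev


def maj_at_k(answers, reference):
    """Majority vote over k sampled answers; None marks an unparseable sample."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return False
    # Ties are broken by first occurrence, which is how Counter orders equal counts.
    top_answer, _ = votes.most_common(1)[0]
    return top_answer == reference


def seed_summary(scores):
    """Mean and sample standard deviation of per-seed scores (in pp)."""
    return mean(scores), stdev(scores)


# Placeholder values for illustration only (not results from this study).
samples = ["70", "70", "68", None, "70", "72", "70", "68"]
is_correct = maj_at_k(samples, "70")
avg, sigma = seed_summary([40.0, 50.0, 60.0])
print(is_correct, avg, sigma)
```

With only two or three seeds per arm the standard deviation is itself a noisy estimate, so it is best read as a rough spread indicator.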

Repository contents

| Path | Description |
|---|---|
| `paper/` | Full manuscript: `main.pdf`, `main.tex`, `refs.bib`, `figures/`, `tables/`, `appendix.tex` |
| `paper/MODEL_CARD.md` | Detailed model card for released LoRA adapters |
| `configs/` | All training/eval YAML configs (SFT, GRPO arms A1–A4, eval) |
| `results/eval/` | Full eval JSON outputs with `responses[]` arrays (AMC23, MATH-500, AIME-2024, MGSM 10 langs) |
| `results/master.csv` | Aggregated metrics across all runs |
| `results/grpo/{run_id}/checkpoint-50/` | LoRA adapters + merged models per seed |

Released checkpoints

All checkpoints are LoRA adapters on the base model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with r=16, α=32, dropout=0.05, target=q_proj,k_proj,v_proj,o_proj. Each was trained for 50 steps with an effective batch size of 96 on 1×A100 80GB.
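
As a rough sanity check on adapter size, the trainable-parameter count implied by these LoRA settings can be worked out by hand. The architecture dimensions below (hidden size, layer count, attention heads, KV heads) are assumptions about the 1.5B base model and should be verified against its `config.json`; only r, α, the target modules, the step count, and the batch size come from this card.

```python
# Assumed base-model dimensions; verify against the model's config.json.
HIDDEN = 1536
N_LAYERS = 28
N_HEADS = 12
N_KV_HEADS = 2
HEAD_DIM = HIDDEN // N_HEADS  # 128 under the assumptions above

# LoRA settings stated in this card.
R, ALPHA = 16, 32


def lora_params(d_in, d_out, r=R):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted linear layer."""
    return r * (d_in + d_out)


kv_out = N_KV_HEADS * HEAD_DIM  # grouped-query attention shrinks k/v outputs
per_layer = (
    lora_params(HIDDEN, HIDDEN)    # q_proj
    + lora_params(HIDDEN, kv_out)  # k_proj
    + lora_params(HIDDEN, kv_out)  # v_proj
    + lora_params(HIDDEN, HIDDEN)  # o_proj
)
total_trainable = per_layer * N_LAYERS
scaling = ALPHA / R        # factor applied to the low-rank update
sequences_seen = 50 * 96   # steps x effective batch size

print(total_trainable, scaling, sequences_seen)
```

Under these assumed dimensions the adapter has roughly 4.4M trainable parameters, a small fraction of the base model.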

| Run ID | Arm | Training data | Rewards | Seed |
|---|---|---|---|---|
| `reproduce_openrs_rs2_7` | A1 (Open-RS RS2) | `knoveleng/open-rs` (7K EN) | R1+R2 | 7 |
| `reproduce_openrs_rs2_123` | A1 (Open-RS RS2) | `knoveleng/open-rs` (7K EN) | R1+R2 | 123 |
| `a2_vi_7` | A2 (VI-translated) | `5CD-AI/Vietnamese-meta-math-MetaMathQA-40K-gg-translated` (5,203 filtered) | R1+R2 | 7 |
| `a2_vi_123` | A2 (VI-translated) | same | R1+R2 | 123 |
| `a3_enlang_7` | A3 (EN + lang reward) | `knoveleng/open-rs` (7K EN) | R1+R2+R5 (fastText) | 7 |
| `a3_enlang_123` | A3 (EN + lang reward) | same | R1+R2+R5 | 123 |
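
The run IDs above end in the seed, and the adapters live under `results/grpo/{run_id}/checkpoint-50/`, so a small helper can map one to the other. This is a convenience sketch with hypothetical function names; it assumes the trailing `_<seed>` convention holds for every run ID.

```python
from pathlib import PurePosixPath


def parse_run_id(run_id):
    """Split a run ID into (name, seed), assuming the seed is the last '_' field."""
    name, _, seed = run_id.rpartition("_")
    return name, int(seed)


def adapter_dir(run_id, step=50):
    """Relative path to the adapter directory for a run, per the repository layout."""
    return PurePosixPath("results/grpo") / run_id / f"checkpoint-{step}"


print(parse_run_id("reproduce_openrs_rs2_7"))
print(adapter_dir("a3_enlang_123"))
```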

Inference example

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

model = PeftModel.from_pretrained(base, "results/grpo/a3_enlang_7/checkpoint-50/")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": (
        "Solve the following math problem efficiently and clearly. "
        "The last line of your response should be of the following format: "
        "'Therefore, the final answer is: $\\boxed{ANSWER}$.' "
        "Think step by step.\n\n"
        "What is the sum of all positive integers less than 100 divisible by 7?")}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt")
# Greedy decoding: disable sampling rather than setting temperature to 0.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0], skip_special_tokens=True))
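
Because the prompt asks for the answer inside `\boxed{...}`, the decoded text can be post-processed to pull out the final answer. The helper below is a minimal sketch with a hypothetical name, not the repository's answer parser; it takes the last `\boxed{` occurrence and matches braces so that nested LaTeX such as `\frac{1}{2}` survives.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Closing brace of \boxed reached; do not include it.
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced (e.g. truncated generation)


response = "Therefore, the final answer is: $\\boxed{735}$."
print(extract_last_boxed(response))
```

Returning `None` for a truncated generation lets a caller count it as an unparseable sample instead of a wrong answer.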

Reproducibility

Framework: TRL 0.15.2 GRPOTrainer with in-process vLLM 0.7.2 rollout
Hardware: 1× NVIDIA A100-SXM4-80GB (Vast.ai cloud)
Seeds: 7, 123 (some runs include 42 for ablation)
Decontamination: 8-gram match against test sets before training
Code repo: https://github.com/vudang4494/xling-grpo-sub3b
Zenodo DOI: 10.5281/zenodo.20061328
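
The 8-gram decontamination step can be illustrated with a short filter. This is a sketch under stated assumptions, not the repository's implementation: it tokenizes by lowercased whitespace splitting and drops any training example that shares at least one 8-gram with any test item; the actual tokenization and matching rule may differ.

```python
def ngrams(text, n=8):
    """Set of word n-grams from lowercased, whitespace-split text (assumed tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_texts, test_texts, n=8):
    """Keep only training texts that share no n-gram with any test text."""
    test_grams = set()
    for t in test_texts:
        test_grams |= ngrams(t, n)
    return [t for t in train_texts if ngrams(t, n).isdisjoint(test_grams)]


# Toy strings for illustration only.
test_set = ["What is the sum of all positive integers less than 100 divisible by 7?"]
train_set = [
    "Question: what is the sum of all positive integers less than 50?",
    "Compute the derivative of x squared with respect to x at the point one.",
]
clean = decontaminate(train_set, test_set)
print(len(clean))
```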

AI assistance disclosure

Source code and manuscript prepared with AI assistance (Anthropic Claude via Claude Code CLI); all empirical results independently verified. See VERIFICATION.md for full AI-assistance disclosure.

Citation

@software{dang2026xlinggrpo,
  author = {Dang, Vu},
  title = {Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale},
  year = {2026},
  version = {2.0},
  doi = {10.5281/zenodo.20061328},
  url = {https://huggingface.co/vudang449/xling-grpo-sub3b}
}

License

LoRA adapter weights and code: Apache-2.0. Paper text: CC-BY-4.0.
