Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale

Multi-seed empirical study (v2) of GRPO post-training at sub-3B scale under a single-GPU LoRA constraint. It compares four training arms across three random seeds on AMC-23, MATH-500, AIME-2024, and 10-language MGSM.

Author: Vu Dang (Independent Researcher), vu.dh4494@gmail.com
Date: 2026-05-11

Four findings

  1. A3 (language-consistency reward) achieves the highest mean AIME-2024 maj@8, +4.4 pp over the base model, and the gain is robust across seeds.
  2. Vanilla English GRPO (A1) shows a seed standard deviation of σ = 11.3 pp on AMC-23, indicating that single-seed claims are unreliable at this scale.
  3. All training arms score approximately the same as the base model on 10-language MGSM: no cross-lingual transfer is observed.
  4. The A4 constant-bias ablation partially refutes an interpretation based purely on reward magnitude.
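For readers unfamiliar with the maj@8 metric cited above, the following is a minimal sketch of majority voting over k sampled answers. It assumes answers have already been extracted and normalized to strings; the function names, the tie-breaking rule, and the exact-match comparison are illustrative assumptions and may differ from the evaluation code in this repository.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer among the samples, or None."""
    votes = Counter(a.strip() for a in answers if a and a.strip())
    if not votes:
        return None
    # Ties are broken by first occurrence (an assumption of this sketch).
    return votes.most_common(1)[0][0]


def maj_at_k(samples_per_problem, gold_answers, k=8):
    """Fraction of problems whose majority answer over the first k samples equals gold."""
    correct = 0
    for samples, gold in zip(samples_per_problem, gold_answers):
        if majority_answer(samples[:k]) == gold.strip():
            correct += 1
    return correct / len(gold_answers)
```

With k=8, each problem is scored once using the most common of its eight sampled answers, so a problem counts as solved only when the correct answer is also the plurality answer.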

Repository contents

| Path | Description |
|---|---|
| paper/ | Full manuscript: main.pdf, main.tex, refs.bib, figures/, tables/, appendix.tex |
| paper/MODEL_CARD.md | Detailed model card for released LoRA adapters |
| configs/ | All training/eval YAML configs (SFT, GRPO arms A1–A4, eval) |
| results/eval/ | Full eval JSON outputs with responses[] arrays (AMC23, MATH-500, AIME-2024, MGSM 10 langs) |
| results/master.csv | Aggregated metrics across all runs |
| results/grpo/{run_id}/checkpoint-50/ | LoRA adapters + merged models per seed |
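One way to summarize aggregated metrics per arm across seeds is sketched below. The column names `arm`, `benchmark`, and `accuracy` are assumptions made for illustration; the actual schema of results/master.csv may differ, and the paper may use a different standard-deviation convention than the sample standard deviation used here.

```python
import statistics
from collections import defaultdict


def seed_stats(rows, value_key="accuracy"):
    """Group rows by (arm, benchmark) and return {key: (mean, sample_std, n_seeds)}.

    `rows` is any iterable of dict-like records, e.g. csv.DictReader output.
    The keys "arm", "benchmark", and `value_key` are hypothetical column names.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["arm"], row["benchmark"])].append(float(row[value_key]))

    stats = {}
    for key, values in grouped.items():
        mean = statistics.mean(values)
        # Sample standard deviation needs at least two seeds.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        stats[key] = (mean, std, len(values))
    return stats


# Usage sketch (assumes the hypothetical column names above):
# import csv
# with open("results/master.csv", newline="") as f:
#     per_arm = seed_stats(csv.DictReader(f))
```

Reporting the number of seeds alongside the mean and standard deviation makes it easier to judge how much weight a given spread can bear.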

Released checkpoints

All checkpoints use base deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with LoRA r=16, α=32, dropout=0.05, target=q_proj,k_proj,v_proj,o_proj. Trained for 50 steps with effective batch 96 on 1×A100 80GB.

| Run ID | Arm | Training data | Rewards | Seed |
|---|---|---|---|---|
| reproduce_openrs_rs2_7 | A1 (Open-RS RS2) | knoveleng/open-rs (7K EN) | R1+R2 | 7 |
| reproduce_openrs_rs2_123 | A1 (Open-RS RS2) | knoveleng/open-rs (7K EN) | R1+R2 | 123 |
| a2_vi_7 | A2 (VI-translated) | 5CD-AI/Vietnamese-meta-math-MetaMathQA-40K-gg-translated (5,203 filtered) | R1+R2 | 7 |
| a2_vi_123 | A2 (VI-translated) | same | R1+R2 | 123 |
| a3_enlang_7 | A3 (EN + lang reward) | knoveleng/open-rs (7K EN) | R1+R2+R5 (fastText) | 7 |
| a3_enlang_123 | A3 (EN + lang reward) | same | R1+R2+R5 | 123 |
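The adapter hyperparameters stated above map onto a PEFT configuration roughly as follows. This is a sketch, not the training configuration shipped in configs/: the `task_type` value and the helper name `build_lora_config` are assumptions, and any settings not mentioned in this card are left at PEFT defaults.

```python
# Hyperparameters as stated in this card: r=16, alpha=32, dropout=0.05,
# adapters on the four attention projections.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update by alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Build a peft.LoraConfig from the values above (requires the `peft` package)."""
    from peft import LoraConfig  # imported lazily so the constants work without peft

    return LoraConfig(**LORA_KWARGS)
```

With α=32 and r=16 the update is scaled by 2.0, which is worth keeping in mind when comparing against runs that use a different rank.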

Inference example

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and its tokenizer.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# Attach the LoRA adapter for arm A3, seed 7 (path relative to the repository root).
model = PeftModel.from_pretrained(base, "results/grpo/a3_enlang_7/checkpoint-50/")
model.eval()

# Build the chat-formatted prompt; the instruction asks for a \boxed{} final answer.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": (
        "Solve the following math problem efficiently and clearly. "
        "The last line of your response should be of the following format: "
        "'Therefore, the final answer is: $\\boxed{ANSWER}$.' "
        "Think step by step.\n\n"
        "What is the sum of all positive integers less than 100 divisible by 7?")}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt")
# Greedy decoding: use do_sample=False instead of temperature=0.0, which
# transformers rejects when sampling is enabled in the generation config.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0], skip_special_tokens=True))
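Because the prompt asks for the final answer inside `\boxed{...}`, the decoded text can be post-processed to recover it. The helper below is a minimal sketch with a hypothetical name (`extract_last_boxed`); it is not necessarily the extraction logic used by the evaluation scripts in this repository.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never closed, e.g. the generation was truncated
```

Brace counting rather than a simple regular expression lets the helper handle nested LaTeX such as `\boxed{\frac{1}{2}}`. For the example prompt, the reference answer is 735, since 7 + 14 + ... + 98 = 7 · (1 + ... + 14) = 7 · 105.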

Reproducibility

AI assistance disclosure

Source code and manuscript prepared with AI assistance (Anthropic Claude via Claude Code CLI); all empirical results independently verified. See VERIFICATION.md for full AI-assistance disclosure.

Citation

@software{dang2026xlinggrpo,
  author = {Dang, Vu},
  title = {Beyond English-Only GRPO: A Multi-Seed Empirical Study at Sub-3B Scale},
  year = {2026},
  version = {2.0},
  doi = {10.5281/zenodo.20061328},
  url = {https://huggingface.co/vudang449/xling-grpo-sub3b}
}

License

LoRA adapter weights and code: Apache-2.0. Paper text: CC-BY-4.0.
