How to use ligaments-dev/autodata-policy-cs with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "ligaments-dev/autodata-policy-cs")
```

A Qwen2.5-0.5B-Instruct policy fine-tuned with GRPO (Group Relative Policy Optimization) on synthetic CS-reasoning data generated by Autodata Studio, an implementation of the agentic self-instruct loop from "Autodata: An agentic data scientist to create high-quality synthetic data" (arXiv:2606.25996v2).
This is a testing-phase / infrastructure-validation run, not a production model. Its purpose was to prove the full pipeline end to end on real hardware.
| Stage | Result |
|---|---|
| 72B challenger generates calibrated CS questions | ✅ 5/8 source docs produced a real weak/strong gap (30–60 pts) |
| Curated data pushed to the Hub | ✅ ligaments-dev/autodata-grpo-cs |
| GRPO training on HF Jobs (A10G, 24 GB) | ✅ 100 steps, LoRA, programmatic reward, ~17 min |
| Trained adapter pushed to the Hub | ✅ this repo |
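
The table lists a "programmatic reward", which this card elsewhere calls a token-overlap reward. The exact function used in the run is not shown here, so the following is only a minimal sketch of what such a reward could look like, assuming whitespace tokenization and a token-level F1 score. The function name `token_overlap_reward` and its signature are hypothetical.

```python
from collections import Counter


def token_overlap_reward(completion: str, reference: str) -> float:
    """Hypothetical reward: token-level F1 between a completion and a reference answer."""
    pred = completion.lower().split()
    gold = reference.lower().split()
    if not pred or not gold:
        return 0.0
    # Multiset intersection: each shared token counts at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A reward of this shape depends only on which tokens appear, so completions that all run to the same length cap with similar vocabulary can score almost identically.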
`completions/clipped_ratio = 1.0`: every generation hit the 256-token cap and never emitted a stop token, so the token-overlap reward stayed roughly constant and GRPO had no usable gradient.

Adjust `max_completion_length` and/or correct the chat/generation config.

Move from `a10g-small` to `a100-large` once the dataset and reward are sound.

Base model: `Qwen/Qwen2.5-0.5B-Instruct`

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from this repo.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "ligaments-dev/autodata-policy-cs")
tok = AutoTokenizer.from_pretrained("ligaments-dev/autodata-policy-cs")
```
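
To see why a near-constant reward leaves GRPO without a usable gradient, it helps to look at how group-relative advantages are formed. The sketch below normalizes rewards within one group of sampled completions. It assumes a population standard deviation and a small `eps`; the helper name `group_relative_advantages` is hypothetical, and the reward values are made up for illustration, not taken from this run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative numbers only: four completions that all score the same.
flat = group_relative_advantages([0.12, 0.12, 0.12, 0.12])

# Illustrative numbers only: four completions with distinct scores.
spread = group_relative_advantages([0.0, 0.5, 1.0, 0.5])
```

When every completion in a group receives the same reward, each advantage is zero and the policy-gradient term vanishes. A group with varied rewards yields positive and negative advantages, which is what the update needs.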