autodata-policy-cs

A Qwen2.5-0.5B-Instruct policy fine-tuned with GRPO (Group Relative Policy Optimization) on synthetic CS-reasoning data generated by Autodata Studio, an implementation of the agentic self-instruct loop from "Autodata: An agentic data scientist to create high-quality synthetic data" (arXiv:2606.25996v2).

This is a testing-phase / infrastructure-validation run, not a production model. Its purpose was to prove the full pipeline end to end on real hardware.

What this run proved

Stage | Result
72B challenger generates calibrated CS questions | ✅ 5/8 source docs produced a real weak/strong gap (30–60 pts)
Curated data pushed to the Hub | ✅ ligaments-dev/autodata-grpo-cs
GRPO training on HF Jobs (A10G, 24 GB) | ✅ 100 steps, LoRA, programmatic reward, ~17 min
Trained adapter pushed to the Hub | ✅ this repo
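The weak/strong gap in the first row can be illustrated with a small filter. This is only a sketch: the function name, the 30-point threshold, and the sample scores are placeholders made up for illustration, not the Autodata Studio implementation or data from this run.

```python
def has_useful_gap(weak_score: float, strong_score: float, min_gap: float = 30.0) -> bool:
    """Return True when the strong solver beats the weak solver by at least min_gap points.

    min_gap is an assumed threshold for illustration only.
    """
    return (strong_score - weak_score) >= min_gap


# Placeholder scores (0-100), invented for this example.
candidates = [
    {"doc": "doc-a", "weak": 20.0, "strong": 70.0},  # gap 50: keep
    {"doc": "doc-b", "weak": 60.0, "strong": 70.0},  # gap 10: too easy for the weak solver
    {"doc": "doc-c", "weak": 10.0, "strong": 15.0},  # gap 5: too hard for both
]

accepted = [c["doc"] for c in candidates if has_useful_gap(c["weak"], c["strong"])]
print(accepted)
```

A question that both solvers pass, or both fail, carries little training signal, which is why only documents with a clear gap would be kept under this assumption.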

Honest limitations of this run

  • Reward did not improve. Mean reward oscillated around 0.47–0.55 across all 100 steps (no upward trend). The model was trained, but not measurably improved.
  • Root cause: completions/clipped_ratio = 1.0. Every generation hit the 256-token cap and never emitted a stop token, so the token-overlap reward stayed roughly constant. GRPO scores each completion relative to the others sampled for the same prompt, so near-identical rewards within a group leave it no usable gradient.
  • Tiny dataset: only 5 prompts → 20 epochs of overfitting, no generalization signal.
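Why a flat reward stalls GRPO can be seen from the group-relative advantage itself. The snippet below is a minimal sketch of that normalization in plain Python; the function name, the epsilon value, the use of the population standard deviation, and the sample rewards are assumptions for illustration, not the TRL internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight completions (num_generations 8) that all earn the same reward.
flat_adv = group_relative_advantages([0.5] * 8)

# Eight completions with a real spread in quality (made-up values).
varied_adv = group_relative_advantages([0.2, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.9])

print(flat_adv)
print(varied_adv)
```

When every reward in a group is identical, every advantage is zero and the policy gradient vanishes. When rewards differ only by noise, the normalization rescales that noise instead of expressing a meaningful preference between completions.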

What a real (improving) run needs

  1. More data: hundreds of accepted prompts, not 5.
  2. Fix completion termination: investigate why EOS is never emitted; raise max_completion_length and/or correct the chat/generation config.
  3. A richer reward: swap the lexical-overlap proxy for the paper's rubric/LLM-judge reward, or add a stop-token / brevity shaping term.
  4. Scale the GPU: move from a10g-small to a100-large once the dataset and reward are sound.
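One possible shape for the stop-token / brevity term in item 3 is sketched below. It is not the reward used in this run: the function name, the penalty and weight values, and the example token ids are hypothetical.

```python
def shaped_reward(base_reward, completion_ids, eos_token_id, max_len,
                  truncation_penalty=0.2, brevity_weight=0.1):
    """Add a termination / brevity term to an existing scalar reward.

    truncation_penalty and brevity_weight are illustrative values, not tuned.
    """
    ended = eos_token_id in completion_ids
    if not ended:
        # The completion ran into the length cap without stopping: penalize it.
        return base_reward - truncation_penalty
    # Length up to and including the first EOS token.
    length = completion_ids.index(eos_token_id) + 1
    # Shorter terminated completions earn a slightly larger bonus.
    return base_reward + brevity_weight * (1.0 - length / max_len)


truncated = shaped_reward(0.5, [11, 12, 13], eos_token_id=0, max_len=256)
terminated = shaped_reward(0.5, [11, 12, 0], eos_token_id=0, max_len=256)
print(truncated, terminated)
```

A term like this gives completions in the same group different rewards depending on whether they stop, which is the kind of within-group variation GRPO needs. It does not replace finding out why EOS is never emitted in the first place.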

Training configuration

  • Base model: Qwen/Qwen2.5-0.5B-Instruct
  • Method: GRPO + LoRA (r=16, alpha=32, q/k/v/o projections)
  • Reward: token-F1 overlap vs. reference answer + length/format shaping (programmatic, no API)
  • Steps: 100, lr 1e-5, num_generations 8, max_completion_length 256, bf16
  • Hardware: 1× A10G (24 GB) via HF Jobs
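For readers unfamiliar with the token-F1 overlap named in the reward line, here is a minimal sketch. It assumes lowercased whitespace tokenization and omits the length/format shaping; the actual reward code for this run is not shown in this card, so treat the details as assumptions.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a completion and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


score = token_f1("a stack is lifo", "a stack is a lifo structure")
print(score)
```

Because precision divides by the completion length, a completion that rambles to the 256-token cap is held down by its length no matter how the answer is phrased, which is consistent with the flat reward described above.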

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from this repo.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "ligaments-dev/autodata-policy-cs")
# Load the tokenizer from the adapter repo.
tok = AutoTokenizer.from_pretrained("ligaments-dev/autodata-policy-cs")