autodata-policy-cs

A Qwen2.5-0.5B-Instruct policy fine-tuned with GRPO (Group Relative Policy Optimization) on synthetic CS-reasoning data generated by Autodata Studio, an implementation of the agentic self-instruct loop from "Autodata: An agentic data scientist to create high-quality synthetic data" (arXiv:2606.25996v2).

This is a testing-phase / infrastructure-validation run, not a production model. Its purpose was to prove the full pipeline end to end on real hardware.

What this run proved

Stage	Result
72B challenger generates calibrated CS questions	✅ 5/8 source docs produced a real weak/strong gap (30–60 pts)
Curated data pushed to the Hub	✅ `ligaments-dev/autodata-grpo-cs`
GRPO training on HF Jobs (A10G, 24 GB)	✅ 100 steps, LoRA, programmatic reward, ~17 min
Trained adapter pushed to the Hub	✅ this repo

Honest limitations of this run

Reward did not improve. Mean reward oscillated around 0.47–0.55 across all 100 steps (no upward trend). The model was trained, but not measurably improved.
Root cause: completions/clipped_ratio = 1.0. Every generation hit the 256-token cap and never emitted a stop token, so the token-overlap reward stayed roughly constant across each group of completions and GRPO had no usable gradient.
Tiny dataset: only 5 prompts, so the 100 steps amounted to 20 epochs over the same prompts. That is overfitting, with no generalization signal.
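
To illustrate why a near-constant reward leaves GRPO without a gradient, here is a minimal sketch of the group-relative advantage: each completion's reward is normalized against the other completions sampled for the same prompt. The function name, the `eps` value, and the use of a population standard deviation are assumptions for illustration, not the trainer's exact code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Illustrative only: eps and the population std are assumed choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight completions (num_generations = 8) that all score the same:
# every advantage is zero, so the policy update carries no signal.
flat = group_relative_advantages([0.5] * 8)

# Eight completions with a real spread of rewards:
# better-than-average completions get positive advantages, worse ones negative.
spread = group_relative_advantages([0.1, 0.2, 0.4, 0.5, 0.5, 0.6, 0.8, 0.9])
```

When all eight generations are truncated at the same length and score about the same, the run is in the `flat` regime rather than the `spread` one.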

What a real (improving) run needs

More data — hundreds of accepted prompts, not 5.
Fix completion termination — investigate why EOS is never emitted; raise max_completion_length and/or correct the chat/generation config.
A richer reward — swap the lexical-overlap proxy for the paper's rubric/LLM-judge reward, or add a stop-token / brevity shaping term.
Scale the GPU — move from a10g-small to a100-large once the dataset and reward are sound.
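
One possible shape for the stop-token / brevity shaping term is sketched below. It assumes the reward function can see each completion's token count and whether it ended with EOS. The names `clipped_ratio` and `shaped_reward` and the weights `eos_bonus` and `length_penalty` are hypothetical and untuned.

```python
def clipped_ratio(lengths, max_len=256):
    """Fraction of completions that ran into the length cap."""
    return sum(1 for n in lengths if n >= max_len) / len(lengths)

def shaped_reward(base, n_tokens, ended_with_eos,
                  max_len=256, eos_bonus=0.1, length_penalty=0.2):
    """Add a termination bonus and a length-proportional penalty to a base reward.

    Hypothetical shaping term: the weights are placeholders, not tuned values.
    """
    reward = base
    if ended_with_eos:
        reward += eos_bonus                          # reward clean termination
    reward -= length_penalty * (n_tokens / max_len)  # prefer shorter answers
    return reward

# A short answer that stops on its own vs. one truncated at the cap,
# both with the same base (overlap) reward:
terminated = shaped_reward(0.5, n_tokens=64, ended_with_eos=True)
truncated = shaped_reward(0.5, n_tokens=256, ended_with_eos=False)
```

A term like this separates terminated from truncated completions within a group even when their overlap scores tie, which gives the group-relative advantage something to work with.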

Training configuration

Base model: Qwen/Qwen2.5-0.5B-Instruct
Method: GRPO + LoRA (r=16, alpha=32, q/k/v/o projections)
Reward: token-F1 overlap vs. reference answer + length/format shaping (programmatic, no API)
Steps: 100, lr 1e-5, num_generations 8, max_completion_length 256, bf16
Hardware: 1× A10G (24 GB) via HF Jobs
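
For reference, a token-F1 overlap score of the kind named in the Reward line can be sketched as follows. The whitespace tokenization and lowercasing are assumptions, and the length/format shaping used in this run is not reproduced here.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a completion and a reference answer.

    Assumes lowercase whitespace tokenization; the run's tokenizer may differ.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Extra tokens in the completion lower precision while recall stays at 1.
partial = token_f1("binary search is fast", "binary search")
```

Because precision divides by the completion length, a completion that always runs to the 256-token cap pays a similar precision penalty every time, which fits the flat reward described under the limitations.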

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from this repo.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "ligaments-dev/autodata-policy-cs")
# Load the tokenizer from this repo.
tok = AutoTokenizer.from_pretrained("ligaments-dev/autodata-policy-cs")
