ablation-pymethods2test-seqmean-arm0-15-8B

RL (SkyRL GRPO) checkpoint from the sequence-mean / RLOO-n (arm0) ablation of the a3-successor study. The policy loss uses loss_reduction=sequence_mean together with advantage_estimator=rloo_n (RLOO-n), in contrast to the token-mean reduction used in the a3 series.
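To make the contrast between the two reductions concrete, here is a minimal pure-Python sketch. It assumes the usual meaning of the terms (token-mean averages over all unmasked tokens in the batch; sequence-mean averages within each sequence first, then across sequences). The function names and toy numbers are illustrative and are not taken from SkyRL's code, and the sketch does not cover the RLOO-n advantage estimator.

```python
def token_mean(losses, masks):
    """Average the per-token loss over every unmasked token in the batch."""
    total = sum(l * m for row_l, row_m in zip(losses, masks)
                for l, m in zip(row_l, row_m))
    count = sum(m for row_m in masks for m in row_m)
    return total / count


def sequence_mean(losses, masks):
    """Average within each sequence first, then average across sequences."""
    per_seq = []
    for row_l, row_m in zip(losses, masks):
        n_tokens = max(sum(row_m), 1)  # guard against fully masked rows
        per_seq.append(sum(l * m for l, m in zip(row_l, row_m)) / n_tokens)
    return sum(per_seq) / len(per_seq)


# Toy batch (hypothetical values): one 4-token response and one 1-token response.
losses = [[1.0, 1.0, 1.0, 1.0], [3.0, 0.0, 0.0, 0.0]]
masks = [[1, 1, 1, 1], [1, 0, 0, 0]]

tm = token_mean(losses, masks)     # (4 * 1.0 + 3.0) / 5 tokens
sm = sequence_mean(losses, masks)  # (1.0 + 3.0) / 2 sequences
```

Under token-mean the long response dominates because it contributes more tokens; under sequence-mean each response carries equal weight regardless of length.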

The rl_config.json in this repo is the exact launch config for this run, included for reproducibility. This is the step-15 checkpoint of the same run that produced laion/ablation-pymethods2test-seqmean-arm0-30-8B (step 30).
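One way to check the settings named above against the shipped config is to download rl_config.json and search it for the relevant keys. The sketch below assumes the file is JSON at the repo root and makes no assumption about how the keys are nested; find_keys and load_rl_config are hypothetical helper names, and the nesting in the sample dict is invented for illustration.

```python
import json


def find_keys(node, wanted, path=""):
    """Recursively collect {dotted.path: value} for every key in `wanted`."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in wanted:
                found[child] = value
            found.update(find_keys(value, wanted, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.update(find_keys(item, wanted, f"{path}[{i}]"))
    return found


def load_rl_config(repo_id="laion/ablation-pymethods2test-seqmean-arm0-15-8B"):
    """Download rl_config.json from the Hub (needs network and huggingface_hub)."""
    from huggingface_hub import hf_hub_download

    with open(hf_hub_download(repo_id=repo_id, filename="rl_config.json")) as fh:
        return json.load(fh)


# Offline demonstration on a made-up config; the real file's layout may differ.
sample = {"trainer": {"algorithm": {"loss_reduction": "sequence_mean",
                                    "advantage_estimator": "rloo_n"}}}
hits = find_keys(sample, {"loss_reduction", "advantage_estimator"})
```

With network access, find_keys(load_rl_config(), {"loss_reduction", "advantage_estimator"}) would report wherever those keys appear in the actual launch config.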

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: penfever/ablation-pymethods2test-seqmean-arm0

The dataset contains the last episode of each trial (per make_and_upload_trace_dataset --episodes last). These are the same rollouts the policy was trained on after rollback / truncation.
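The "last episode of each trial" selection can be pictured as a group-by over trials that keeps the highest episode index. This is only a sketch of the idea and not the make_and_upload_trace_dataset implementation; the field names trial_id and episode are hypothetical and may not match the dataset's actual columns.

```python
def last_episode_per_trial(rows, trial_key="trial_id", episode_key="episode"):
    """Keep, for each trial, the row with the largest episode index."""
    latest = {}
    for row in rows:
        trial = row[trial_key]
        if trial not in latest or row[episode_key] > latest[trial][episode_key]:
            latest[trial] = row
    # Dicts preserve insertion order, so trials come back in first-seen order.
    return list(latest.values())


# Made-up records standing in for per-episode traces.
rows = [
    {"trial_id": "a", "episode": 0},
    {"trial_id": "a", "episode": 2},
    {"trial_id": "b", "episode": 1},
    {"trial_id": "a", "episode": 1},
]
kept = last_episode_per_trial(rows)
```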

Format: Safetensors. Model size: 8B params. Tensor type: BF16.

Model repository: laion/ablation-pymethods2test-seqmean-arm0-15-8B