a3-rl-DCAgent_exp_rpt_pymethods2test-v3 (global_step 10, 8B)
RL (SkyRL) checkpoint trained from the base checkpoint laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink on the dataset DCAgent/exp_rpt_pymethods2test-v3.
Checkpoint selected at global_step 10 using a 5-period reward EMA (alpha=1/3) computed over the full stitched step sequence. Steps 15-25 were degenerate resume-creep links (data consumption mismatch, no real reward), so step 10, the last genuinely trained and aligned checkpoint, was chosen.
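The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the script used for this run: the function names, the seeding of the EMA with the first reward, and the "pick the highest EMA among non-excluded checkpoints" rule are all assumptions. The only details taken from the card are the 5-period window, alpha=1/3 (which matches the common convention alpha = 2 / (period + 1)), and the exclusion of the degenerate steps.

```python
def reward_ema(rewards, alpha=1 / 3):
    """Exponential moving average of a reward sequence.

    Assumption: the EMA is seeded with the first reward value.
    """
    ema = []
    for r in rewards:
        prev = ema[-1] if ema else r
        ema.append(alpha * r + (1 - alpha) * prev)
    return ema


def select_checkpoint(step_rewards, checkpoint_steps, excluded_steps=(), period=5):
    """Pick the saved checkpoint with the highest reward EMA.

    step_rewards:     dict mapping global_step -> mean reward (hypothetical input)
    checkpoint_steps: steps at which a checkpoint was saved
    excluded_steps:   steps to ignore, e.g. degenerate resume links
    """
    alpha = 2 / (period + 1)  # period=5 gives alpha=1/3
    steps = sorted(step_rewards)
    ema_values = reward_ema([step_rewards[s] for s in steps], alpha)
    ema_by_step = dict(zip(steps, ema_values))
    excluded = set(excluded_steps)
    candidates = [s for s in checkpoint_steps if s in ema_by_step and s not in excluded]
    if not candidates:
        raise ValueError("no eligible checkpoint steps")
    return max(candidates, key=lambda s: ema_by_step[s])


# Synthetic rewards for illustration only (NOT the run's actual reward curve).
synthetic_rewards = {s: 0.1 * s for s in range(1, 11)}
synthetic_rewards.update({s: 0.0 for s in range(11, 26)})
best_step = select_checkpoint(
    synthetic_rewards,
    checkpoint_steps=[5, 10, 15, 20, 25],
    excluded_steps=[15, 20, 25],
)
```

Smoothing before comparing checkpoints keeps a single noisy step from deciding the choice, and the explicit exclusion list keeps steps without real reward out of the candidate set.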
Training Traces
Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: penfever/a3-rl-DCAgent_exp_rpt_pymethods2test-v3
The dataset contains the last episode of each trial (per make_and_upload_trace_dataset --episodes last), that is, the same rollouts the policy was trained on after rollback / truncation.
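As a rough illustration of what "last episode of each trial" means, the filter below keeps one record per trial. It is only a sketch: the field names trial_id and episode are hypothetical and are not taken from the dataset schema or from make_and_upload_trace_dataset.

```python
def last_episode_per_trial(rows):
    """Keep only the highest-numbered episode for each trial.

    Assumption: each row is a dict with hypothetical keys
    "trial_id" and "episode" (an integer index).
    """
    latest = {}
    for row in rows:
        key = row["trial_id"]
        if key not in latest or row["episode"] > latest[key]["episode"]:
            latest[key] = row
    return list(latest.values())


# Toy rows for illustration only.
toy_rows = [
    {"trial_id": "a", "episode": 0},
    {"trial_id": "a", "episode": 2},
    {"trial_id": "b", "episode": 1},
    {"trial_id": "a", "episode": 1},
]
kept = last_episode_per_trial(toy_rows)
```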
Model tree for laion/a3-rl-DCAgent_exp_rpt_pymethods2test-v3-10-8B
Base model
Qwen/Qwen3-8B-Base