a3-rl-laion_exp_rpt_stack-bash-v3 (global_step_70)
RL (RLOO_n / SkyRL FSDP2) fine-tune of
laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
(Qwen3-8B) on the laion/exp_rpt_stack-bash-v3 agentic task set, using the terminus-2 agent.
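For readers unfamiliar with RLOO, the sketch below shows the standard leave-one-out baseline: each rollout's advantage is its reward minus the mean reward of the other rollouts sampled for the same task. It is illustrative only. The function name `rloo_advantages` is hypothetical, and the exact RLOO_n variant and any normalization used in this run's SkyRL configuration are not documented here and may differ.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for n rollouts of the same prompt.

    Assumption: plain RLOO, with no extra scaling or clipping.
    The RLOO_n variant used in the actual run may differ.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    # Baseline for rollout i is the mean reward of the other n - 1 rollouts.
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Illustrative rewards, not taken from this run.
    print(rloo_advantages([1.0, 0.0, 0.0, 0.0]))
```

With sparse, mostly-zero rewards, most groups of rollouts yield advantages at or near zero, which is consistent with a weak learning signal.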
PARTIAL / CANCELLED MID-RUN. This run was part of the a3 RL series, which was concluded as uninformative; SLURM job 589415 was cancelled at training step ~73 (2026-06-06). This checkpoint is the latest available HF-ready export (global_step_70) and is preserved as the run artifact, not as a converged model. The reward signal was low and flat throughout (avg_raw_reward ~0.08-0.21, with the EMA peaking at ~0.19 around step 60); grad-norm stayed healthy (<0.012) with no collapse.
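The EMA figure above refers to an exponential moving average of the per-step reward. A minimal sketch of that smoothing follows, assuming the common recurrence `ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}` seeded with the first value. The function name `ema` and the smoothing factor `alpha=0.1` are assumptions; the value used for this run's logging is not stated.

```python
from typing import Iterable, List, Optional


def ema(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average of a metric series.

    Assumption: alpha=0.1 is a placeholder, not the run's actual setting.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    current: Optional[float] = None
    for v in values:
        # Seed with the first observation, then blend in each new value.
        current = v if current is None else alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    # Illustrative per-step rewards, not taken from this run's logs.
    print(ema([0.10, 0.20, 0.15, 0.05], alpha=0.5))
```

A smaller `alpha` smooths more heavily, so the peak of the smoothed series lags and sits below the peak of the raw series.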
Training Traces
Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: DCAgent/a3-rl-laion_exp_rpt_stack-bash-v3
The dataset contains the last episode of each of the ~41,263 trials (per make_and_upload_trace_dataset --episodes last).
Model tree for laion/a3-rl-laion_exp_rpt_stack-bash-v3-70-8B
Base model
Qwen/Qwen3-8B-Base