a3-rl-laion_exp_rpt_stack-bash-v3 (global_step_70)

RL fine-tune (RLOO_n, SkyRL with FSDP2) of laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (itself a Qwen3-8B fine-tune) on the laion/exp_rpt_stack-bash-v3 agentic task set, using the terminus-2 agent.
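For readers unfamiliar with RLOO, the following is a minimal sketch of the standard leave-one-out baseline: each sampled rollout for a prompt is scored against the mean reward of the other rollouts for that prompt. The function name `rloo_advantages` is hypothetical, and the exact `RLOO_n` variant used by SkyRL in this run may differ (for example in normalization), so treat this as an illustration rather than the training code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k rollouts of a single prompt.

    Hypothetical helper: A_i = r_i - mean(r_j for j != i).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    # Baseline for rollout i is the mean reward of the other k - 1 rollouts.
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # One successful rollout among four.
    print(rloo_advantages([1.0, 0.0, 0.0, 0.0]))
```

Under this formulation, a group of rollouts that all receive the same reward yields zero advantage for every rollout, so such groups contribute no policy-gradient signal.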

PARTIAL / CANCELLED MID-RUN. This run was part of the a3 RL series, which was concluded as uninformative; SLURM job 589415 was cancelled at training step ~73 (2026-06-06). This checkpoint is the latest available HF-ready export (global_step_70). It is preserved as the run artifact and is not a converged model. The reward signal was low and flat throughout (avg_raw_reward ~0.08-0.21, EMA peak ~0.19 around step 60); grad-norm stayed healthy (<0.012) with no collapse.
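The EMA figure above is an exponentially smoothed view of the per-step reward. A minimal sketch of that smoothing is shown below; the function name `ema` and the smoothing factor `alpha=0.1` are assumptions for illustration, since the card does not state which factor was used for this run.

```python
from typing import Iterable, List


def ema(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average, seeded with the first value.

    `alpha` is a hypothetical smoothing factor, not the run's setting.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            # New point weighted by alpha, history by (1 - alpha).
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    # Illustrative values only, not logged rewards from this run.
    print(ema([0.0, 1.0], alpha=0.5))
```

A smaller `alpha` gives a smoother curve that reacts more slowly, which is why an EMA peak can sit below the highest raw per-step value.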

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: DCAgent/a3-rl-laion_exp_rpt_stack-bash-v3

The dataset contains the last episode of each of the ~41,263 trials (per make_and_upload_trace_dataset --episodes last).
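A minimal sketch of reading the trace dataset with the Hugging Face `datasets` library follows. The split name `"train"` is an assumption, as is the helper name `load_traces`; check the dataset page for the actual splits and columns before relying on them.

```python
TRACE_REPO = "DCAgent/a3-rl-laion_exp_rpt_stack-bash-v3"


def load_traces(split: str = "train", streaming: bool = True):
    """Open the companion trace dataset.

    Hypothetical helper; the split name is an assumption. Streaming avoids
    downloading all trials up front.
    """
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(TRACE_REPO, split=split, streaming=streaming)


if __name__ == "__main__":
    traces = load_traces()
    # Inspect the first record to discover the available fields.
    first = next(iter(traces))
    print(sorted(first.keys()))
```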


Safetensors

Model size: 8B params
Tensor type: BF16
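Since the weights are stored as BF16 safetensors, the checkpoint can presumably be loaded with `transformers` in the usual way. The sketch below assumes the repository ID given in the model tree heading and a hypothetical helper name `load_checkpoint`; `device_map="auto"` additionally requires the `accelerate` package. Keep in mind that this is a partial-run checkpoint, not a converged model.

```python
MODEL_ID = "laion/a3-rl-laion_exp_rpt_stack-bash-v3-70-8B"


def load_checkpoint(device_map: str = "auto"):
    """Load tokenizer and model in bfloat16.

    Hypothetical helper; assumes the repo follows the standard
    causal-LM layout expected by transformers.
    """
    # Imported lazily so the module can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # matches the stored tensor type
        device_map=device_map,
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_checkpoint()
    print(model.config.model_type)
```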


Model tree for laion/a3-rl-laion_exp_rpt_stack-bash-v3-70-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
Finetuned (27): this model