rloo-ab-arm1-tis-seqnorm-step45-8B

RLOO length-bias A/B ablation (task #212) — arm1-tis: TIS-on, seqnorm arm.

Algorithm: RLOO-n (advantage_estimator=rloo_n), n_samples_per_prompt=8
Loss reduction: seq_mean_token_sum_norm_global (sequence-mean / seqnorm — the length-bias-corrected reduction under test)
TIS (truncated importance sampling): on (use_tis=true, tis_imp_ratio_cap=2.0)
Base (pre-RL) model: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (a Qwen3-8B SFT)
Dataset: DCAgent/exp_rpt_pymethods2test-large (pymethods2test-large)
Checkpoint: global_step 45, selected by trailing-5 EMA (alpha=1/3) of reward/avg_raw_reward over steps <= 80. The run reached step 90, but selection is capped at step 80 for A/B parity.
Cluster: Jupiter (FZ-Julich), SkyRL FSDP2, jobs 653650-653656 -> 672321-672326 (resume chain).
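
To make the three settings above concrete, here is a minimal pure-Python sketch of a leave-one-out (RLOO) advantage, a truncated importance weight, and a sum-then-normalize loss reduction. It illustrates the general techniques, not the SkyRL implementation: the function names, the `norm_len` constant, and the reading of `seq_mean_token_sum_norm_global` as "sum tokens per sequence, divide by a fixed length, average over sequences" are assumptions. Whether `rloo_n` differs from plain RLOO, and what `_global` refers to, is not stated here.

```python
import math

def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each sample with the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def tis_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_rollout, truncated from above at `cap`."""
    return [min(math.exp(t - r), cap)
            for t, r in zip(train_logprobs, rollout_logprobs)]

def token_mean(per_token_losses):
    """Reference reduction: average over every token in the batch."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)

def seq_mean_token_sum_norm(per_token_losses, norm_len):
    """Sum tokens per sequence, divide by a fixed constant (assumed), mean over sequences."""
    per_seq = [sum(seq) / norm_len for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)

# Hypothetical rewards for one prompt with n_samples_per_prompt=8.
advantages = rloo_advantages([1, 0, 0, 0, 1, 0, 0, 0])
# Hypothetical log-probs; the second ratio exceeds the cap and is truncated.
weights = tis_weights([-1.0, -0.5], [-1.0, -2.0], cap=2.0)
# Two sequences of different length, all token losses equal to 1.0.
losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
reduced = seq_mean_token_sum_norm(losses, norm_len=4)
```

Under this reading, every token contributes with the same weight, 1 / (norm_len * number of sequences), regardless of how long its sequence is. That is the property a length-bias ablation would be probing.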

See rl_config.json and training_logs/ for full hyperparameters and parsed metrics.
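
The checkpoint-selection rule described above can be approximated from the parsed metrics with a sketch like the following. It assumes that "selected by" means taking the saved step with the highest smoothed reward, and that rewards are available as a step-to-value mapping. The function names and the example numbers are hypothetical. Note that alpha=1/3 matches the common span-5 convention alpha = 2 / (span + 1).

```python
def ema(values, alpha=1 / 3):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def select_checkpoint(reward_by_step, saved_steps, max_step=80, alpha=1 / 3):
    """Pick the saved step <= max_step with the highest EMA-smoothed reward."""
    steps = sorted(reward_by_step)
    smoothed = dict(zip(steps, ema([reward_by_step[s] for s in steps], alpha)))
    candidates = [s for s in saved_steps if s <= max_step and s in smoothed]
    if not candidates:
        raise ValueError("no saved checkpoint inside the comparison window")
    return max(candidates, key=lambda s: smoothed[s])

# Hypothetical reward/avg_raw_reward values, not taken from this run.
example_rewards = {10: 0.2, 20: 0.8, 30: 0.5, 40: 0.9}
best = select_checkpoint(example_rewards, saved_steps=[10, 20, 30, 40], max_step=30)
```

Capping `max_step` keeps a run that trained longer from gaining an advantage over the other arm in the comparison.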

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as companion datasets. The run produced trials in two experiment dirs over the resume chain, and both are preserved in full:

penfever/ablation-pymethods2test-seqnorm-tis (base run dir; ~43.8k trials)
penfever/ablation-pymethods2test-seqnorm-tis-part2 (_2 resume dir; ~12.9k trials)

Each dataset contains the `last` episode of every trial (per `make_and_upload_trace_dataset --episodes last`) — the same rollouts the policy was trained on after rollback / truncation. No subsampling.
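
As an illustration of what "last episode of every trial" means, the sketch below keeps only the highest-numbered episode per trial. It is not the `make_and_upload_trace_dataset` implementation, and the `trial_id` and `episode` field names are hypothetical.

```python
def last_episode_per_trial(records):
    """Keep, for each trial, the record with the largest episode index."""
    latest = {}
    for rec in records:
        key = rec["trial_id"]
        if key not in latest or rec["episode"] > latest[key]["episode"]:
            latest[key] = rec
    return list(latest.values())

# Hypothetical records: trial "a" has episodes 0 and 2, trial "b" has episode 1.
records = [
    {"trial_id": "a", "episode": 0},
    {"trial_id": "a", "episode": 2},
    {"trial_id": "b", "episode": 1},
]
kept = last_episode_per_trial(records)
```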

Model size: 8B params
Tensor type: BF16 (Safetensors)

Model tree for laion/rloo-ab-arm1-tis-seqnorm-step45-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
Finetuned: this model