rloo-ab-arm1-tis-seqnorm-step45-8B

RLOO length-bias A/B ablation (task #212), arm1-tis: TIS on, seqnorm arm.

  • Algorithm: RLOO-n (advantage_estimator=rloo_n), n_samples_per_prompt=8
  • Loss reduction: seq_mean_token_sum_norm_global (sequence-mean / seqnorm; the length-bias-corrected reduction under test)
  • TIS: on (use_tis=true, tis_imp_ratio_cap=2.0)
  • Base (pre-RL) model: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (a Qwen3-8B SFT)
  • Dataset: DCAgent/exp_rpt_pymethods2test-large (pymethods2test-large)
  • Checkpoint: global_step 45, selected by trailing-5 EMA (alpha=1/3) of reward/avg_raw_reward, capped at <= step 80 for A/B parity (the run reached step 90, but the comparison window is <= 80).
  • Cluster: Jupiter (FZ-Julich), SkyRL FSDP2, jobs 653650-653656 -> 672321-672326 (resume chain).
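
The algorithm and TIS settings listed above can be illustrated with a minimal sketch. It assumes the standard leave-one-out (RLOO) baseline and that TIS means truncated importance sampling with the ratio clipped at tis_imp_ratio_cap; the exact SkyRL rloo_n and TIS implementations may differ. The function names and the reward values are hypothetical.

```python
import math
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: each sample's reward minus the mean
    reward of the other samples drawn for the same prompt."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def tis_weight(logp_train: float, logp_rollout: float, cap: float = 2.0) -> float:
    """Truncated importance weight (assumed form): the trainer/rollout
    probability ratio, clipped from above at `cap`."""
    return min(math.exp(logp_train - logp_rollout), cap)


# Hypothetical rewards for one prompt with n_samples_per_prompt=8.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

With a leave-one-out baseline the advantages within one prompt group sum to zero, so a group in which all samples receive the same reward contributes no gradient signal.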

See rl_config.json and training_logs/ for full hyperparameters and parsed metrics.
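
The checkpoint selection rule described above can be sketched from the parsed metrics as follows. This is an illustration under assumptions, not the script used for this run: it takes "trailing-5 EMA" to mean an exponential moving average over the last five logged values, seeded with the oldest value in the window, and it treats every logged step as a candidate. The function name and the reward values are hypothetical.

```python
from typing import Dict, Optional


def select_checkpoint(
    reward_by_step: Dict[int, float],
    alpha: float = 1 / 3,
    window: int = 5,
    max_step: int = 80,
) -> Optional[int]:
    """Return the step <= max_step whose trailing-window EMA of the
    reward metric is highest (earliest step wins ties)."""
    steps = sorted(s for s in reward_by_step if s <= max_step)
    best_step, best_score = None, float("-inf")
    for i, step in enumerate(steps):
        # Last `window` logged rewards up to and including this step.
        trailing = [reward_by_step[s] for s in steps[max(0, i + 1 - window): i + 1]]
        ema = trailing[0]  # assumption: seed the EMA with the oldest value
        for r in trailing[1:]:
            ema = alpha * r + (1 - alpha) * ema
        if ema > best_score:
            best_step, best_score = step, ema
    return best_step


# Hypothetical reward/avg_raw_reward values, not taken from this run.
toy_rewards = {1: 0.1, 2: 0.2, 3: 0.9, 4: 0.8, 5: 0.85,
               6: 0.3, 7: 0.2, 8: 0.1, 9: 1.0, 10: 1.0}
capped = select_checkpoint(toy_rewards, max_step=8)
uncapped = select_checkpoint(toy_rewards, max_step=10)
print(capped, uncapped)
```

In this toy series the cap changes the outcome: a late reward spike outside the comparison window would otherwise win, which is the situation the step cap guards against for A/B parity.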

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as companion datasets. The run produced trials across two experiment directories over the resume chain, and both are preserved in full.

Each dataset contains the `last` episode of every trial (per `make_and_upload_trace_dataset --episodes last`), i.e. the same rollouts the policy was trained on after rollback / truncation. No subsampling.

Safetensors: model size 8B params, tensor type BF16