rloo-ab-arm1-tis-seqnorm-step45-8B
RLOO length-bias A/B ablation (task #212), arm1-tis: TIS-on, seqnorm arm.
- Algorithm: RLOO-n (`advantage_estimator=rloo_n`), `n_samples_per_prompt=8`
- Loss reduction: `seq_mean_token_sum_norm_global` (sequence-mean / seqnorm, the length-bias-corrected reduction under test)
- TIS: on (`use_tis=true`, `tis_imp_ratio_cap=2.0`)
- Base (pre-RL) model: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (a Qwen3-8B SFT)
- Dataset: DCAgent/exp_rpt_pymethods2test-large (pymethods2test-large)
- Checkpoint: global_step 45, selected by trailing-5 EMA (alpha=1/3) of `reward/avg_raw_reward`, capped at <= step 80 for A/B parity (the run reached step 90, but the comparison window is <= 80)
- Cluster: Jupiter (FZ-Julich), SkyRL FSDP2, jobs 653650-653656 -> 672321-672326 (resume chain)
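For readers unfamiliar with the two mechanisms named in the list above, here is a minimal sketch, not the SkyRL implementation. It assumes that `rloo_n` uses the standard leave-one-out baseline (each sample's reward minus the mean reward of the other samples for the same prompt) and that TIS means truncated importance sampling, where a per-token probability ratio is clipped from above at `tis_imp_ratio_cap`. The function names and the choice of which log-probability goes in the numerator are illustrative assumptions; the actual definitions live in the training code.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages for the samples drawn from one prompt.

    Assumption: standard RLOO, A_i = r_i - mean(r_j for j != i).
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean reward of the other n - 1 samples.
    return [r - (total - r) / (n - 1) for r in rewards]


def truncated_is_weight(logp_trainer, logp_rollout, cap=2.0):
    """Per-token importance ratio, truncated from above at `cap`.

    Assumption: the ratio is trainer probability over rollout-engine
    probability; the argument names are hypothetical.
    """
    return min(math.exp(logp_trainer - logp_rollout), cap)


# Example with 8 samples per prompt, matching n_samples_per_prompt=8.
advantages = rloo_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

With one successful sample out of eight, the successful sample gets advantage 1.0 and each of the others gets -1/7, so the advantages within a prompt group sum to zero.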
See rl_config.json and training_logs/ for full hyperparameters and parsed metrics.
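The checkpoint-selection rule described above (trailing-5 EMA with alpha=1/3, capped at step 80) can be sketched as follows. This is one possible reading, not the script actually used: it assumes the EMA is restarted over the last five logged values at each step, seeded with the oldest value in the window, and that the step with the highest smoothed reward within the cap is chosen. The function names and the toy numbers in the example are hypothetical and are not taken from this run's logs.

```python
def trailing_ema(values, window=5, alpha=1 / 3):
    """EMA over the trailing `window` values, recomputed at every index."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        ema = chunk[0]  # seed with the oldest value in the window
        for v in chunk[1:]:
            ema = alpha * v + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def select_checkpoint(steps, rewards, max_step=80, window=5, alpha=1 / 3):
    """Return the step with the highest smoothed reward among steps <= max_step."""
    smoothed = trailing_ema(rewards, window=window, alpha=alpha)
    candidates = [(s, e) for s, e in zip(steps, smoothed) if s <= max_step]
    if not candidates:
        raise ValueError("no logged step falls inside the comparison window")
    return max(candidates, key=lambda pair: pair[1])[0]


# Toy example: the raw reward peaks at step 90, but the cap excludes it.
best = select_checkpoint([70, 80, 90], [0.5, 0.4, 0.9], max_step=80)
```

Smoothing before taking the maximum keeps a single noisy reward spike from deciding the checkpoint, and the step cap keeps both A/B arms on the same comparison window.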
Training Traces
Training-time Daytona/Harbor rollouts for this run are uploaded as companion datasets (the run produced trials across two experiment dirs over the resume chain; both are preserved in full):
- penfever/ablation-pymethods2test-seqnorm-tis (base run dir; ~43.8k trials)
- penfever/ablation-pymethods2test-seqnorm-tis-part2 (`_2` resume dir; ~12.9k trials)
Each dataset contains the `last` episode of every trial (per `make_and_upload_trace_dataset --episodes last`): the same rollouts the policy was trained on after rollback / truncation. No subsampling.