explore-tis-untrunc (global_step 45, 8B)
Agentic RL checkpoint (SkyRL) finetuned from
laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (Qwen3-8B SFT base).
Run config
- Algorithm: RLOO-n advantage, seq_mean_token_sum_norm_globalloss reduction, no KL loss.
- TIS: enabled (tis_imp_ratio_cap=2.0). Truncated importance sampling corrects for the vLLM↔FSDP bf16 logprob gap.
- Sampling (untruncated exploration variant): the distinguishing feature of this ablation is untruncated sampling during exploration.
- Dataset: DCAgent/exp_rpt_pymethods2test-large.
- Harness: terminus-2 / Harbor agentic terminal-bench, n_samples_per_prompt=8.
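As a rough illustration of the two algorithmic settings above, the sketch below computes a textbook leave-one-out (RLOO) advantage for one prompt's group of samples and a per-token importance ratio truncated at the cap. It is not SkyRL code: the function names `rloo_advantages` and `tis_weights` are hypothetical, the exact normalization of the RLOO-n variant may differ from the plain leave-one-out baseline shown here, and all numbers are toy values rather than data from this run.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out baseline: each reward minus the mean of the other samples.

    Assumption: plain RLOO; the RLOO-n variant used in the run may normalize differently.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def tis_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_rollout, truncated from above at `cap`.

    `cap` plays the role of tis_imp_ratio_cap; the trainer (FSDP) and the
    sampler (vLLM) can disagree slightly on logprobs for the same tokens.
    """
    return [
        min(math.exp(lp_train - lp_rollout), cap)
        for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs)
    ]


# Toy values for illustration only (not taken from the run).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # one group of 8 samples
advantages = rloo_advantages(rewards)

# Second token has a ratio of e^1, which exceeds the cap and is truncated to 2.0.
weights = tis_weights([-1.0, -0.5, -2.0], [-1.0, -1.5, -1.9])
```

With a leave-one-out baseline the advantages within a group sum to zero, so a group in which every sample earns the same reward contributes no gradient signal.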
Checkpoint selection
This is global_step 45, selected by trailing-5 EMA (alpha=1/3) of reward/avg_raw_reward,
restricted to the genuine training region (steps <= 78). Steps 79 and later showed an
artifactual reward jump (single-block greedy / eval-checkpoint passes, not a learning event)
followed by an eternal-retry tail, so they were excluded from candidate selection.
At step 45: raw reward ~0.518, trailing-5 EMA ~0.495 (highest among saved exports in-region).
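The selection rule described above can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the script actually used: it reads "trailing-5 EMA (alpha=1/3)" as a recursive exponential moving average with alpha = 2 / (span + 1), which gives 1/3 for a span of 5, seeded with the first value. The helper names `ema` and `select_checkpoint` and the reward values are hypothetical.

```python
def ema(values, span=5):
    """Recursive EMA with alpha = 2 / (span + 1); span=5 gives alpha = 1/3."""
    alpha = 2.0 / (span + 1)
    smoothed = []
    prev = None
    for v in values:
        # Seed with the first value (an assumption), then blend new and old.
        prev = v if prev is None else alpha * v + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def select_checkpoint(rewards_by_step, saved_steps, max_step=78, span=5):
    """Pick the saved step with the highest smoothed reward among steps <= max_step."""
    steps = sorted(s for s in rewards_by_step if s <= max_step)
    smoothed = dict(zip(steps, ema([rewards_by_step[s] for s in steps], span)))
    candidates = [s for s in saved_steps if s in smoothed]
    return max(candidates, key=lambda s: smoothed[s])


# Toy reward/avg_raw_reward series for illustration only (not the run's metrics).
toy_rewards = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.3, 80: 0.9}
toy_saved = [2, 3, 4, 80]
best = select_checkpoint(toy_rewards, toy_saved)
```

In the toy series, step 80 has the highest raw reward but falls outside the allowed region, so it is never considered; smoothing also keeps a single noisy step from deciding the choice on its own.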
Training Traces
Rollout traces for this run: https://huggingface.co/datasets/penfever/explore-tis-untrunc
Training logs
Parsed SkyRL metrics, vLLM metrics, and raw trainer logs are in the training_logs/ folder of this repo.
The serialized launch config is rl_config.yaml.
Model tree for laion/explore-tis-untrunc-45-8B
Base model
Qwen/Qwen3-8B-Base