explore-tis-untrunc (global_step 45, 8B)

Agentic RL checkpoint (SkyRL) finetuned from laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (Qwen3-8B SFT base).

Run config

Algorithm: RLOO-n advantage, seq_mean_token_sum_norm_global loss reduction, no KL loss.
TIS: enabled (tis_imp_ratio_cap=2.0). Truncated importance sampling corrects for the bf16 logprob gap between vLLM (rollout) and FSDP (training).
Sampling: untruncated during exploration. This is the distinguishing feature of this ablation.
Dataset: DCAgent/exp_rpt_pymethods2test-large.
Harness: terminus-2 / Harbor agentic terminal-bench, n_samples_per_prompt=8.
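
The two algorithmic pieces in the config above can be illustrated with a minimal sketch. It assumes the standard leave-one-out baseline for RLOO and the usual truncated importance ratio min(pi_trainer / pi_rollout, cap); the exact RLOO-n estimator and TIS implementation in SkyRL may differ in normalization and detail. The function names and the reward and logprob values are hypothetical, chosen only for illustration.

```python
import math

def rloo_advantages(rewards):
    """Leave-one-out advantages for the n rollouts of a single prompt."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # The baseline for sample i is the mean reward of the other n-1 samples.
    return [r - (total - r) / (n - 1) for r in rewards]

def tis_weights(trainer_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratios pi_trainer / pi_rollout, truncated from above at cap."""
    return [
        min(math.exp(lp_train - lp_roll), cap)
        for lp_train, lp_roll in zip(trainer_logprobs, rollout_logprobs)
    ]

# Made-up rewards for one prompt with n_samples_per_prompt=8.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)

# Made-up logprobs for three tokens; the last ratio exceeds the cap and is truncated.
weights = tis_weights([-1.00, -0.50, -2.00], [-1.05, -0.50, -3.00], cap=2.0)
```

With a leave-one-out baseline the advantages within a prompt group sum to zero, and the cap bounds how much any single token with a large trainer/rollout mismatch can scale the gradient.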

Checkpoint selection

This is global_step 45, selected by trailing-5 EMA (alpha=1/3) of reward/avg_raw_reward, restricted to the genuine training region (steps <= 78). Steps 79+ exhibited an artifact reward jump (single-block greedy / eval-checkpoint passes, not a learning event) and an eternal-retry tail, so they were excluded from candidate selection. At step 45: raw reward ~0.518, trailing-5 EMA ~0.495 (highest among saved exports in-region).
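
The selection rule can be sketched as follows. This is one plausible reading, not the script used for this run: it assumes "trailing-5" refers to a span-5 EMA (alpha = 2 / (5 + 1) = 1/3), seeds the EMA with the first reward, and uses hypothetical function names and made-up reward values.

```python
def ema(values, alpha=1 / 3):
    """Exponential moving average, seeded with the first value (an assumption)."""
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def select_checkpoint(rewards_by_step, saved_steps, max_step=78, alpha=1 / 3):
    """Pick the saved step with the highest EMA reward among steps <= max_step."""
    steps = sorted(s for s in rewards_by_step if s <= max_step)
    smoothed = dict(zip(steps, ema([rewards_by_step[s] for s in steps], alpha)))
    # Only exported checkpoints inside the training region are candidates.
    candidates = [s for s in saved_steps if s in smoothed]
    return max(candidates, key=lambda s: smoothed[s])

# Toy example: step 5 has the highest raw reward but lies outside the allowed region.
toy_rewards = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.3, 5: 0.9}
best = select_checkpoint(toy_rewards, saved_steps=[2, 4, 5], max_step=4)
```

Smoothing before selection keeps a single noisy high-reward step from being chosen, and the step cutoff keeps the post-78 artifact region out of the candidate set.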

Training traces

Rollout traces for this run: https://huggingface.co/datasets/penfever/explore-tis-untrunc

Training logs

Parsed SkyRL metrics, vLLM metrics, and raw trainer logs are in the training_logs/ folder of this repo. The serialized launch config is rl_config.yaml.
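
A minimal sketch for fetching only the logs and launch config, without the model weights. It assumes the repo id shown in the model tree (laion/explore-tis-untrunc-45-8B) and that the `huggingface_hub` package is installed; the local directory name is arbitrary.

```python
REPO_ID = "laion/explore-tis-untrunc-45-8B"  # assumed from the model tree on this page
LOG_PATTERNS = ["training_logs/*", "rl_config.yaml"]

def download_training_logs(local_dir="explore-tis-untrunc-logs"):
    """Download training_logs/ and rl_config.yaml, skipping the weight shards."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=LOG_PATTERNS,
        local_dir=local_dir,
    )
```

Calling `download_training_logs()` returns the local path of the downloaded snapshot.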

Safetensors

Model size: 8B params
Tensor type: BF16
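
A hedged sketch for loading the checkpoint in its stored precision with `transformers`. The repo id is taken from the model tree on this page; the helper name is hypothetical, and `device_map="auto"` assumes the `accelerate` package is installed. Prompt format and generation settings are not covered here, since they depend on the agent harness.

```python
MODEL_ID = "laion/explore-tis-untrunc-45-8B"  # assumed from the model tree on this page

def load_checkpoint(model_id=MODEL_ID):
    """Load tokenizer and model in bfloat16, matching the stored tensor type."""
    # Imported lazily: torch and transformers are heavy optional dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # weights are stored as BF16
        device_map="auto",           # requires accelerate; places layers automatically
    )
    return tokenizer, model
```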

Model tree for laion/explore-tis-untrunc-45-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
Finetuned: this model