symclip-30-8B

RL (SkyRL GRPO) checkpoint from the symmetric-clip arm of the seqnorm + TIS ablation series. The policy loss combines a symmetric PPO clip range (0.2 / 0.2), sequence-normalized loss reduction (loss_reduction=seq_mean_token_sum_norm_global), and Truncated Importance Sampling (TIS) on the rollout logprobs, and is computed on untruncated rollouts. The other arms of the series use an asymmetric clip range instead.
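As a rough illustration of how the three pieces above fit together, here is a minimal pure-Python sketch. It is not the SkyRL implementation: the function name symclip_tis_loss, the TIS cap of 2.0, and the reading of seq_mean_token_sum_norm_global as "sum token losses per sequence, divide by one global length constant, then average over sequences" are all assumptions made for the example. Only the 0.2 / 0.2 clip range comes from this card; rl_config.yaml remains the authoritative source for the real settings.

```python
import math

def symclip_tis_loss(logp_new, logp_old, logp_rollout, advantages,
                     clip_eps=0.2, tis_cap=2.0, norm_len=None):
    """Hypothetical sketch: symmetric-clip PPO loss with TIS weights.

    Every argument before clip_eps is a list of sequences, and each
    sequence is a list of per-token floats (logprobs or advantages).
    """
    # Assumption: a single global constant normalizes every sequence.
    # Here it defaults to the longest sequence in the batch.
    if norm_len is None:
        norm_len = max(len(seq) for seq in logp_new)

    seq_losses = []
    for new, old, roll, adv in zip(logp_new, logp_old, logp_rollout, advantages):
        token_sum = 0.0
        for ln, lo, lr, a in zip(new, old, roll, adv):
            # PPO ratio between the current and the old trainer policy.
            ratio = math.exp(ln - lo)
            # Symmetric clip: the same epsilon below and above 1.
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * a, clipped * a)
            # TIS: trainer-vs-rollout importance weight, truncated from above.
            tis_weight = min(math.exp(lo - lr), tis_cap)
            token_sum += -tis_weight * surrogate
        seq_losses.append(token_sum / norm_len)

    # Mean over sequences.
    return sum(seq_losses) / len(seq_losses)
```

With identical logprobs everywhere, the ratio and the TIS weight are both 1, so the loss reduces to the negated, length-normalized sum of advantages.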

The rl_config.yaml in this repo is the exact launch config for this run, included for reproducibility.

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: penfever/symclip

The dataset contains the last episode of each trial (per make_and_upload_trace_dataset --episodes last), i.e. the same rollouts the policy was trained on after rollback / truncation.

Training Metrics

Parsed SkyRL training metrics (parse_skyrl_metrics.py output: per-step CSVs for the full chain 856270 → 856271 → 856272, vLLM serving metrics, and the trial-stats summary) plus the raw training .out logs are included under training_logs/. Reward peaked early (~0.70 at step 1) and settled into the ~0.40-0.46 band through the mid-run; the trailing-5 EMA reached its maximum at step 30 before declining to ~0.27 by step 78. TIS alignment stayed healthy throughout (exact-match fraction ~0.90-0.94).
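For readers who want to recompute a smoothed reward curve from the per-step CSVs, the following sketch shows one way to do it. It assumes "trailing-5 EMA" means an exponential moving average with span 5 (smoothing factor 2 / (span + 1)), which may differ from what parse_skyrl_metrics.py actually computes. The helper names ema and peak_index are hypothetical, and the function takes a plain list of per-step rewards rather than guessing at the CSV column names.

```python
def ema(values, span=5):
    """Exponential moving average, seeded with the first value.

    Assumption: smoothing factor alpha = 2 / (span + 1).
    """
    if not values:
        return []
    alpha = 2.0 / (span + 1)
    smoothed = [float(values[0])]
    for v in values[1:]:
        # Move the running average a fraction alpha toward the new value.
        smoothed.append(smoothed[-1] + alpha * (v - smoothed[-1]))
    return smoothed

def peak_index(values, span=5):
    """0-based index at which the EMA of `values` is largest."""
    smoothed = ema(values, span)
    return max(range(len(smoothed)), key=lambda i: smoothed[i])
```

Because peak_index is 0-based, add the first step number of the series to map its result back to a training step.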

Safetensors · Model size: 8B params · Tensor type: BF16
