SPSD-RL-Qwen3-4B-Instruct

This is the final loadable checkpoint from the Qwen3-4B-Instruct-2507 SDFT parity-400 run trained on SPSD-RL/MCTS-style supervision.

Local training artifact: outputs/qwen3-4b-instruct-2507-sdft-mctsstyle-parity400-lr1em6-cumem0-fixedfmt-20260611-194422

The trainer state in checkpoint-400 records global_step=400 and max_steps=400. Train-time evaluation was disabled after the step-100 TRL experimental SDFT eval-path crash; post-hoc reasoning and OpenReward benchmark evaluations are the source of truth for this artifact.

Uploaded Files

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja

Checkpoint directories, optimizer state, scheduler state, logs, local caches, and trainer process artifacts are intentionally excluded.

Evaluation Status

The fixed forced-boxed evaluation suite is launched separately from the local final model root using the repository sft_boxed_forced profile.

Downloads last month
21
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for LorMolf/SPSD-RL-Qwen3-4B-Instruct

Finetuned
(1750)
this model