Qwen3-4B Physics RLSD

This repository contains Physics fine-tuned Qwen3-4B checkpoints from the local SciKnowEval-style generalization setup.

  • Root checkpoint: final global_step_100 merged to Hugging Face safetensors.
  • best_avg16/: checkpoint with the highest validation avg@16 during training, merged to Hugging Face safetensors.

Checkpoints

Checkpoint Source step Validation avg@16 best@16 / pass@16 maj@16
Root final 100 0.700000 0.768713 0.718600
best_avg16/ 60 0.731250 0.816338 0.740938

Training Run

qwen3gen-physics-RLSD-Qwen-Qwen3-4B-mbs8-decay0-ema0.05-train256-rollout8-lr1e-6-vllm0.8

Base Model

  • Base model: Qwen/Qwen3-4B
  • Fine-tuning type: full-parameter FSDP RL training
  • Dataset: datasets/sciknoweval/physics
  • Train split: 720 examples
  • Validation split: 80 examples

Method

  • Method: RLSD
  • Config: rlsd
  • Policy loss mode: rlsd
  • Reward: local SciKnowEval multiple-choice reward checker
  • Rollout correction: token-level importance sampling, threshold 2.0

Hyperparameters

Field Value
Base model Qwen/Qwen3-4B
Training steps 100
Train batch size 256
Rollouts per prompt 8
Generations per step 2048
PPO mini batch size 8
Learning rate 1e-6
LR warmup steps 10
Weight decay 0.01
Grad clip 1.0
Max prompt length 2048
Max response length 8192
Max model length 10240
Train temperature 1.0
Train top_p 1.0
Validation generations 16
Validation temperature 0.6
Validation top_p 0.95
vLLM GPU memory utilization 0.8
GPUs 8 x NVIDIA H200
Save frequency every 10 steps
Validation frequency every 10 steps
Token reweight lambda 0.5
Token reweight eps_w 0.2
Token reweight decay steps 0
Teacher update rate 0.05
Max reprompt length 10240

Metrics

Metric Value
Final training step 100
Final critic/score/mean 0.890137
Final critic/rewards/mean 0.890137
Final validation avg@16 0.700000
Peak validation avg@16 0.731250
Peak validation step 60

Loading

Root final checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("SeongryongJung/Qwen3-4B-Physics-RLSD")
tokenizer = AutoTokenizer.from_pretrained("SeongryongJung/Qwen3-4B-Physics-RLSD")

Best avg@16 checkpoint:

model = AutoModelForCausalLM.from_pretrained("SeongryongJung/Qwen3-4B-Physics-RLSD", subfolder="best_avg16")
tokenizer = AutoTokenizer.from_pretrained("SeongryongJung/Qwen3-4B-Physics-RLSD", subfolder="best_avg16")

Intended Use

This model is intended for research on RL fine-tuning and self-distillation behavior on science/generalization tasks. It has not been broadly safety evaluated for production use.

Limitations

The reported scores are training-time and validation-time metrics from the local experimental setup. They should not be interpreted as broad benchmark results without independent evaluation.

Downloads last month
15
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SeongryongJung/Qwen3-4B-Physics-RLSD

Finetuned
Qwen/Qwen3-4B
Finetuned
(874)
this model