Qwen3-4B Chemistry RLSD

This repository contains Chemistry fine-tuned Qwen3-4B checkpoints from the local SciKnowEval-style generalization setup.

  • Root checkpoint: final global_step_100 merged to Hugging Face safetensors.
  • best_avg16/: checkpoint with the highest validation avg@16 during training, merged to Hugging Face safetensors.

Checkpoints

Checkpoint Source step Validation avg@16 best@16 / pass@16 maj@16
Root final 100 0.789286 0.855124 0.795124
best_avg16/ 90 0.804167 0.855981 0.806176

Training Run

qwen3gen-chemistry-RLSD-Qwen-Qwen3-4B-mbs8-decay0-ema0.05-train256-rollout8-lr1e-6-vllm0.8

Base Model

  • Base model: Qwen/Qwen3-4B
  • Fine-tuning type: full-parameter FSDP RL training
  • Dataset: datasets/sciknoweval/chemistry
  • Train split: 1,890 examples
  • Validation split: 210 examples

Method

  • Method: RLSD
  • Config: rlsd
  • Policy loss mode: rlsd
  • Reward: local SciKnowEval multiple-choice reward checker
  • Rollout correction: token-level importance sampling, threshold 2.0

Hyperparameters

Field Value
Base model Qwen/Qwen3-4B
Training steps 100
Train batch size 256
Rollouts per prompt 8
Generations per step 2048
PPO mini batch size 8
Learning rate 1e-6
LR warmup steps 10
Weight decay 0.01
Grad clip 1.0
Max prompt length 2048
Max response length 8192
Max model length 10240
Train temperature 1.0
Train top_p 1.0
Validation generations 16
Validation temperature 0.6
Validation top_p 0.95
vLLM GPU memory utilization 0.8
GPUs 8 x NVIDIA H200
Save frequency every 10 steps
Validation frequency every 10 steps
Token reweight lambda 0.5
Token reweight eps_w 0.2
Token reweight decay steps 0
Teacher update rate 0.05
Max reprompt length 10240

Metrics

Training score

CSV files:

Metric Value
Final training step 100
Final critic/score/mean 0.905762
Final critic/rewards/mean 0.905762
Final validation avg@16 0.789286
Peak validation avg@16 0.804167
Peak validation step 90

Loading

Root final checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("SeongryongJung/Qwen3-4B-Chemistry-RLSD")
tokenizer = AutoTokenizer.from_pretrained("SeongryongJung/Qwen3-4B-Chemistry-RLSD")

Best avg@16 checkpoint:

model = AutoModelForCausalLM.from_pretrained("SeongryongJung/Qwen3-4B-Chemistry-RLSD", subfolder="best_avg16")
tokenizer = AutoTokenizer.from_pretrained("SeongryongJung/Qwen3-4B-Chemistry-RLSD", subfolder="best_avg16")

Intended Use

This model is intended for research on RL fine-tuning and self-distillation behavior on science/generalization tasks. It has not been broadly safety evaluated for production use.

Limitations

The reported scores are training-time and validation-time metrics from the local experimental setup. They should not be interpreted as broad benchmark results without independent evaluation.

Downloads last month
39
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SeongryongJung/Qwen3-4B-Chemistry-RLSD

Finetuned
Qwen/Qwen3-4B
Finetuned
(874)
this model

Collection including SeongryongJung/Qwen3-4B-Chemistry-RLSD