zwhe99/DeepMath-103K
Viewer • Updated • 103k • 6.23k • 362
Qwen3-1.7B fine-tuned with ExOPD (Extended Group Relative Policy Optimization) on DeepMath-103K mathematical reasoning dataset.
| Component | Details |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Algorithm | ExOPD (GRPO + Rollout Correction) |
| Dataset | zwhe99/DeepMath-103K |
| Filter | difficulty ≥ 6 (Olympiad level) |
| Samples | 8,000 problems |
| Teacher Model | Keven16/Qwen3-4B-Non-Thinking-RL-Math-Step500 |
| Epochs | 3 |
| Batch Size | 256 |
| Learning Rate | 1e-5 |
TBD
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("jindun/Qwen3-1.7B-GOPD-DeepMath", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("jindun/Qwen3-1.7B-GOPD-DeepMath")
Supervised Fine-Tuning (SFT) degraded performance (-16.67%). This demonstrates the limitations of imitation learning. GOPD learns genuine reasoning skills through trial-and-error exploration.
Apache 2.0