math-slm-sft-dpo-v5
LoRA SFT + DPO fine-tune of DeepSeek-R1-Distill-Qwen-7B for math reasoning. The training data is self-distilled: the base model acts as its own teacher on NuminaMath-TIR prompts, and the LoRA adapters are trained on its math_verify-filtered outputs.
Training
- Base: DeepSeek-R1-Distill-Qwen-7B (frozen, BF16)
- Adapter: LoRA r=16, alpha=32, targeting all attention and MLP projections (~40M trainable parameters)
- SFT: 3663 examples (teacher self-distill, math_verify-filtered), 3 epochs, effective batch 64
- DPO: 1961 preference pairs (correct vs incorrect CoT from same prompt), 2 epochs, LR 5e-7, beta=0.1
- Infrastructure: 8x H100 SXM, DeepSpeed ZeRO-3, vLLM TP=4 for data gen and eval
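
The ~40M trainable-parameter figure in the adapter line can be sanity-checked with a short calculation. The sketch below assumes the Qwen2.5-7B layer dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128); these are not stated in this card and should be checked against the model's `config.json`. The projection names are likewise the usual Qwen module names, assumed here for illustration.

```python
# Assumed Qwen2.5-7B-style dimensions (not stated in this card; verify in config.json).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size of the MLP
NUM_LAYERS = 28        # num_hidden_layers
KV_DIM = 4 * 128       # num_key_value_heads * head_dim (grouped-query attention)
RANK = 16              # LoRA r, taken from the Training list

# (in_features, out_features) of each adapted linear layer in one decoder block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}  (~{total / 1e6:.1f}M)")
```

Under these assumed dimensions the total comes to about 40.4M, consistent with the ~40M quoted above. Most of it sits in the three MLP projections, since their LoRA matrices scale with the larger intermediate size.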
Eval (greedy, max_new_tokens=2048)
| Bench | Baseline | SFT | SFT+DPO (this model) | Delta vs Base |
|---|---|---|---|---|
| GSM8K (n=200) | 85.0% | 79.5% | 81.0% | -4.0 pp |
| MATH-500 (n=500) | 60.0% | 73.8% | 73.8% | +13.8 pp |
| AIME 2024+2025 (n=60) | 5.0% | 11.7% | 11.7% | +6.7 pp |
| MathNet (n=500) | 23.6% | 34.0% | 35.0% | +11.4 pp |
| Average | 43.4% | 49.8% | 50.4% | +7.0 pp |
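
The Average row and the Delta vs Base column can be recomputed from the per-benchmark scores. The sketch below assumes the average is an unweighted (macro) mean over the four benchmarks, which is what reproduces the reported figures; the variable and function names are illustrative only.

```python
# Accuracy in percent, copied from the table: (baseline, SFT, SFT+DPO).
SCORES = {
    "GSM8K": (85.0, 79.5, 81.0),
    "MATH-500": (60.0, 73.8, 73.8),
    "AIME 2024+2025": (5.0, 11.7, 11.7),
    "MathNet": (23.6, 34.0, 35.0),
}


def macro_average(column: int) -> float:
    """Unweighted mean over benchmarks, ignoring each benchmark's n."""
    return sum(row[column] for row in SCORES.values()) / len(SCORES)


averages = [macro_average(c) for c in range(3)]
# Delta vs Base compares the final model (SFT+DPO) with the baseline.
deltas = {name: row[2] - row[0] for name, row in SCORES.items()}

for name, row in SCORES.items():
    print(f"{name:<16} {row[0]:5.1f} {row[1]:5.1f} {row[2]:5.1f} {deltas[name]:+5.1f} pp")
print(f"{'Average':<16} " + " ".join(f"{a:5.1f}" for a in averages)
      + f" {averages[2] - averages[0]:+5.1f} pp")
```

Because the mean is unweighted, the 60-problem AIME set counts as much as each 500-problem benchmark; a mean weighted by n would give different numbers.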
Notes
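- SFT delivered the bulk of the gain (+6.4 pp avg). DPO at this conservative LR/step config was effectively a no-op (+0.6 pp avg): it moved only GSM8K (+1.5 pp, 3 problems) and MathNet (+1.0 pp, 5 problems) and left MATH-500 and AIME unchanged.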
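- The GSM8K regression (-4.0 pp vs base) is a failure mode learned from NuminaMath-TIR: verbose chains of thought that hit the max_new_tokens=2048 limit before reaching a final answer. Raising max_new_tokens or using best-of-N (BoN) voting at inference should recover it.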
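
As a rough illustration of the voting idea in the last note, the sketch below takes a majority vote over N sampled completions and skips any that were cut off before producing an answer. It is a minimal sketch under two assumptions not taken from this card: that "BoN voting" means majority voting over final answers, and that final answers appear as `\boxed{...}` without nested braces. The helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Optional, Sequence

# Assumption: final answers are written as \boxed{...} with no nested braces.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last boxed answer, or None if the completion has none (e.g. truncated)."""
    matches = _BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def majority_vote(completions: Sequence[str]) -> Optional[str]:
    """Most common extracted answer; ties go to the answer seen first."""
    answers = [a for a in map(extract_final_answer, completions) if a]
    if not answers:
        return None  # every sample was truncated or unparseable
    return Counter(answers).most_common(1)[0][0]


samples = [
    r"... so the total is \boxed{42}",
    r"... therefore \boxed{42}",
    r"... which gives \boxed{41}",
    "Let me re-check the second step once more, since",  # truncated, no answer
]
print(majority_vote(samples))
```

With voting, a single truncated sample no longer counts as a wrong answer as long as other samples for the same prompt finish; this costs N times the generation budget per problem.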