math-slm-sft-dpo-v5

LoRA SFT + DPO fine-tune of DeepSeek-R1-Distill-Qwen-7B for math reasoning. The base model acts as its own teacher: it generates solutions to NuminaMath-TIR problems, and the LoRA adapters are trained on the outputs that pass the math_verify filter.

Training

  • Base: DeepSeek-R1-Distill-Qwen-7B (frozen, BF16)
  • Adapter: LoRA r=16, alpha=32, targeting all attention and MLP projections (~40M trainable parameters)
  • SFT: 3663 examples (teacher self-distill, math_verify-filtered), 3 epochs, effective batch 64
  • DPO: 1961 preference pairs (correct vs incorrect CoT from same prompt), 2 epochs, LR 5e-7, beta=0.1
  • Infrastructure: 8x H100 SXM, DeepSpeed ZeRO-3, vLLM TP=4 for data gen and eval
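
The ~40M trainable-parameter figure can be sanity-checked with a short calculation. This is a minimal sketch, not the training code: the layer dimensions below are assumed Qwen2.5-7B-style values that this card does not state, so check them against the base model's config.json. The function names are illustrative.

```python
# Assumed Qwen2.5-7B-style dimensions (not stated in this card; verify in config.json).
HIDDEN = 3584         # hidden size
INTERMEDIATE = 18944  # MLP intermediate size
NUM_LAYERS = 28       # decoder layers
KV_DIM = 512          # 4 KV heads x 128 head dim (grouped-query attention)
RANK = 16             # LoRA r, as listed above

# (in_features, out_features) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, in_features, out_features):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted weight."""
    return rank * (in_features + out_features)


def total_lora_params(rank=RANK):
    """Sum adapter parameters over all projections and all layers."""
    per_layer = sum(lora_params(rank, i, o) for i, o in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print(f"{total_lora_params() / 1e6:.1f}M trainable parameters")
```

Under these assumed dimensions the total is about 40.4M, which agrees with the ~40M listed for the adapter.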

Eval (greedy, max_new_tokens=2048)

| Bench | Base | SFT | SFT+DPO (this model) | Delta vs. base |
| --- | --- | --- | --- | --- |
| GSM8K (n=200) | 85.0% | 79.5% | 81.0% | -4.0 pp |
| MATH-500 (n=500) | 60.0% | 73.8% | 73.8% | +13.8 pp |
| AIME 2024+2025 (n=60) | 5.0% | 11.7% | 11.7% | +6.7 pp |
| MathNet (n=500) | 23.6% | 34.0% | 35.0% | +11.4 pp |
| Average | 43.4% | 49.8% | 50.4% | +7.0 pp |
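
The Average row appears to be an unweighted mean over the four benchmarks, not weighted by n. The sketch below recomputes it from the per-benchmark scores above under that assumption. It uses Decimal so that values such as 49.75 round half-up to 49.8 without floating-point drift. The variable and function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-benchmark accuracy in percent: (base, SFT, SFT+DPO), copied from the table.
SCORES = {
    "GSM8K": ("85.0", "79.5", "81.0"),
    "MATH-500": ("60.0", "73.8", "73.8"),
    "AIME 2024+2025": ("5.0", "11.7", "11.7"),
    "MathNet": ("23.6", "34.0", "35.0"),
}


def macro_average(column):
    """Unweighted mean of one column, rounded half-up to one decimal place."""
    values = [Decimal(row[column]) for row in SCORES.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


base, sft, dpo = (macro_average(i) for i in range(3))
print(f"base={base}  sft={sft}  sft+dpo={dpo}  delta={dpo - base:+} pp")
```

The recomputed averages match the row above (43.4, 49.8, 50.4), as does the +7.0 pp delta.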

Notes

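  • SFT delivered the bulk of the gain (+6.4 pp average over base). DPO, at this conservative learning rate and step count, was effectively a no-op: +0.6 pp average over SFT (GSM8K +1.5 pp, MathNet +1.0 pp, no change on MATH-500 or AIME).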
  • The GSM8K regression (-4.0 pp vs. base) is a failure mode learned from NuminaMath-TIR: the chain of thought becomes verbose enough to hit the max_new_tokens limit. Raising max_new_tokens or using best-of-N (BoN) voting at inference should recover it.
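
One way to read "BoN voting" is a majority vote over the final answers of N sampled completions. The card does not specify the exact scheme, so the following is a sketch under that assumption. Sampling and answer extraction are out of scope here: `answers` is a hypothetical list of already-extracted final answers, with None or an empty string standing in for a completion that was cut off before producing one.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-empty answer, or None if there is none.

    Truncated completions (None or blank) are ignored, so a few samples
    that hit the token limit do not sink the whole prompt.
    """
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None
    # Ties resolve to the answer that appeared first among the samples.
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical samples for one prompt; two were truncated.
    samples = ["42", None, "42", "41", ""]
    print(majority_vote(samples))
```

This compares answers as plain strings. A real evaluation would likely normalize them or check mathematical equivalence, for example with the same verifier used for filtering.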