math-slm-sft-dpo-v5

LoRA SFT + DPO fine-tune of DeepSeek-R1-Distill-Qwen-7B for math reasoning. The base model acts as its own teacher: it generates solutions to NuminaMath-TIR problems, and the LoRA adapters are trained on the filtered outputs.

Training

Base: DeepSeek-R1-Distill-Qwen-7B (frozen, BF16)
Adapter: LoRA r=16, alpha=32, targets all attention + MLP projections (~40M trainable)
SFT: 3663 examples (teacher self-distill, math_verify-filtered), 3 epochs, effective batch 64
DPO: 1961 preference pairs (correct vs incorrect CoT from same prompt), 2 epochs, LR 5e-7, beta=0.1
Infrastructure: 8x H100 SXM, DeepSpeed ZeRO-3, vLLM TP=4 for data gen and eval
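
The "~40M trainable" figure can be sanity-checked with a rough count. The sketch below assumes the usual Qwen2-style 7B shapes (hidden size 3584, MLP size 18944, 28 layers, 28 query heads, 4 KV heads) and the seven standard projection names; none of these are stated in this card, so treat them as assumptions rather than the exact training config.

```python
# Architecture values assumed for the Qwen2-style 7B base (not stated in this card).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
NUM_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = HIDDEN // NUM_HEADS      # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM    # 512 (grouped-query attention)

RANK = 16  # LoRA r from the Training section

# (in_features, out_features) of each targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank), no bias.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 40,370,176
```

Under these assumptions the count comes to about 40.4M, consistent with the figure above. With alpha=32 and r=16, the adapter update is scaled by alpha/r = 2.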

Eval (greedy, max_new_tokens=2048)

Benchmark	Baseline	SFT	SFT+DPO (this model)	Delta vs Baseline
GSM8K (n=200)	85.0%	79.5%	81.0%	-4.0 pp
MATH-500 (n=500)	60.0%	73.8%	73.8%	+13.8 pp
AIME 2024+2025 (n=60)	5.0%	11.7%	11.7%	+6.7 pp
MathNet (n=500)	23.6%	34.0%	35.0%	+11.4 pp
Average	43.4%	49.8%	50.4%	+7.0 pp
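
To put the percentage-point deltas in perspective, the table can be converted back into solved-problem counts. The sketch below only re-derives numbers from the table; the helper names are illustrative, and the counts assume each reported accuracy is a rounded fraction of n.

```python
# (n, baseline %, SFT %, SFT+DPO %) copied from the eval table.
TABLE = {
    "GSM8K": (200, 85.0, 79.5, 81.0),
    "MATH-500": (500, 60.0, 73.8, 73.8),
    "AIME 2024+2025": (60, 5.0, 11.7, 11.7),
    "MathNet": (500, 23.6, 34.0, 35.0),
}


def solved(n, pct):
    # Convert a rounded accuracy back to a whole number of solved problems.
    return round(n * pct / 100)


def unweighted_mean(column):
    # column: 1 = baseline, 2 = SFT, 3 = SFT+DPO
    values = [row[column] for row in TABLE.values()]
    return sum(values) / len(values)


for name, (n, base, sft, dpo) in TABLE.items():
    counts = f"{solved(n, base)} -> {solved(n, sft)} -> {solved(n, dpo)} of {n}"
    print(f"{name}: {counts} ({dpo - base:+.1f} pp vs baseline)")

averages = [unweighted_mean(c) for c in (1, 2, 3)]
print("Average:", ", ".join(f"{a:.2f}%" for a in averages))
```

The Average row matches an unweighted mean over the four benchmarks, so the 60-problem AIME set counts as much as the 500-problem sets. In problem counts, DPO adds 3 GSM8K problems and 5 MathNet problems over SFT, and changes nothing on MATH-500 or AIME.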

Notes

SFT delivered the bulk of the gain (+6.4 pp avg). With this conservative learning rate and step count, DPO was effectively a no-op (+0.6 pp avg).
The GSM8K regression comes from the verbose CoT style learned from NuminaMath-TIR: long responses hit the max_new_tokens limit before reaching a final answer. A higher max_new_tokens or best-of-N (BoN) voting at inference should recover it.
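
A minimal sketch of the voting idea, assuming N completions have already been sampled for one prompt and that final answers appear in \boxed{...}. The extraction regex and function names are illustrative, not part of this model's eval harness.

```python
import re
from collections import Counter

# Matches \boxed{...} without nested braces; nested answers such as
# \boxed{\frac{1}{2}} would need a real brace matcher.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(completion):
    """Return the last boxed answer, or None if the completion has none."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def majority_vote(completions):
    """Pick the most common extracted answer, skipping truncated completions."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


samples = [
    "... so the total is \\boxed{42}",
    "... therefore \\boxed{42}",
    "... long chain of thought that hits the token limit",
    "... giving \\boxed{41}",
]
print(majority_vote(samples))  # 42
```

Completions that run out of tokens before producing an answer are dropped from the vote, so one truncated sample no longer costs the whole problem.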

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for MR0b0t/math-slm-sft-dpo-v5

Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Adapter: this model
