Qwen2.5-Math-DeepSeekR1-Sens-7B
A 7B merged model created by applying Sensitivity-aware Model Merging (Sens Merging) to:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- Qwen/Qwen2.5-Math-7B
The goal is to preserve the strong mathematical reasoning ability of DeepSeek-R1-Distill-Qwen-7B while substantially reducing reasoning verbosity and output token length.
Highlights
- Average accuracy: 66.9%
- Average output tokens: 701
- Output tokens reduced by 75.2% compared to DeepSeek-R1-Distill-Qwen-7B
- Only 2.5 points lower average accuracy than DeepSeek-R1-Distill-Qwen-7B
In short, the model trades 2.5 points of average accuracy for roughly a 4x reduction in output tokens (2826 to 701), which lowers inference cost accordingly.
Base Models and Merged Model
| Model | Avg Accuracy (%) | Avg Tokens |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 69.4 | 2826 |
| Qwen2.5-Math-7B | 45.3 | 755 |
| Sens Merge (λ=0.4) | 66.9 | 701 |
Benchmark Results
| Benchmark | DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | Sens Merge (λ=0.4) |
|---|---|---|---|
| College Math | 66.0 | 37.9 | 70.4 |
| GSM8K | 90.2 | 84.5 | 90.6 |
| MATH | 94.4 | 73.3 | 90.2 |
| Minerva Math | 41.5 | 13.6 | 36.0 |
| OlympiadBench | 55.0 | 17.3 | 47.2 |
| Avg Accuracy | 69.4 | 45.3 | 66.9 |
| Avg Tokens | 2826 | 755 | 701 |
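The Avg Accuracy row appears to be an unweighted mean over the five benchmarks. The short check below recomputes it from the per-benchmark numbers in the table; the unweighted-mean assumption is ours, although it reproduces the reported values.

```python
from statistics import mean

# Per-benchmark accuracy (%) copied from the table above, in row order:
# College Math, GSM8K, MATH, Minerva Math, OlympiadBench.
scores = {
    "DeepSeek-R1-Distill-Qwen-7B": [66.0, 90.2, 94.4, 41.5, 55.0],
    "Qwen2.5-Math-7B": [37.9, 84.5, 73.3, 13.6, 17.3],
    "Sens Merge (λ=0.4)": [70.4, 90.6, 90.2, 36.0, 47.2],
}

# Assumption: the reported average is an unweighted mean, rounded to one decimal.
averages = {model: round(mean(vals), 1) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```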
Motivation
Large reasoning models such as DeepSeek-R1-Distill often produce long chains of thought, which increases inference cost.
This model explores whether model merging can reduce reasoning verbosity without requiring additional training.
By merging a reasoning model (DeepSeek-R1-Distill-Qwen-7B) with a compact mathematical model (Qwen2.5-Math-7B) using Sensitivity-aware Model Merging, the merged model:
- Maintains competitive reasoning performance
- Produces significantly shorter outputs
- Requires no gradient-based fine-tuning
- Uses only a small calibration dataset
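To make the idea of training-free merging concrete, here is a minimal sketch of plain linear weight interpolation with an optional per-layer multiplier. This is a simpler stand-in, not the Sens Merging procedure itself: the function name `interpolate`, the `layer_scale` argument, and the way `lam` enters the formula are all hypothetical, and this card does not specify how λ or the calibration data are used in the actual method.

```python
def interpolate(state_a, state_b, lam, layer_scale=None):
    """Blend two parameter dicts as (1 - w) * a + w * b, with w = lam * scale per layer.

    Illustrative only: a generic weight-space merge, not the Sens Merging algorithm.
    """
    layer_scale = layer_scale or {}
    merged = {}
    for name, a_vals in state_a.items():
        b_vals = state_b[name]
        # Hypothetical per-layer multiplier; defaults to 1.0 (uniform merge).
        w = lam * layer_scale.get(name, 1.0)
        merged[name] = [(1.0 - w) * a + w * b for a, b in zip(a_vals, b_vals)]
    return merged


# Toy "checkpoints" with two layers of two weights each.
model_a = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
model_b = {"layer0": [3.0, 2.0], "layer1": [2.0, 0.0]}

uniform = interpolate(model_a, model_b, lam=0.5)
scaled = interpolate(model_a, model_b, lam=0.5, layer_scale={"layer1": 0.5})
print(uniform)
print(scaled)
```

A sensitivity-aware method would presumably derive something like the per-layer multipliers from the calibration data rather than setting them by hand, but that step is outside what this sketch covers.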
Comparison with DPO
We additionally compared Sens Merging with a DPO-trained model:
| Model | Avg Accuracy (%) | Avg Tokens |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 69.4 | 2826 |
| DPO | 68.55 | 2402 |
| Sens Merge (λ=0.4) | 66.9 | 701 |
Relative to DeepSeek-R1-Distill-Qwen-7B, Sens Merging cuts average output length by 75.2%, compared with 15.0% for DPO, while scoring 1.65 points lower than DPO in average accuracy.
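The comparison can be recomputed directly from the table. The sketch below derives the token reduction and accuracy drop of each candidate relative to DeepSeek-R1-Distill-Qwen-7B; the helper name `compare` is ours.

```python
# Baseline values for DeepSeek-R1-Distill-Qwen-7B, copied from the table above.
baseline_acc = 69.4
baseline_tokens = 2826

# (average accuracy in %, average output tokens) for each candidate.
candidates = {
    "DPO": (68.55, 2402),
    "Sens Merge (λ=0.4)": (66.9, 701),
}


def compare(acc, tokens):
    """Return (token reduction in %, accuracy drop in points) versus the baseline."""
    reduction = 100 * (baseline_tokens - tokens) / baseline_tokens
    acc_drop = baseline_acc - acc
    return round(reduction, 1), round(acc_drop, 2)


summary = {name: compare(acc, tok) for name, (acc, tok) in candidates.items()}

for name, (reduction, acc_drop) in summary.items():
    print(f"{name}: {reduction}% fewer tokens, {acc_drop} points lower accuracy")
```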
Usage
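```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "quangdung/Qwen2.5-Math-DeepSeekR1-Sens-7B"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Solve: If x^2 + 5x + 6 = 0, find x."
messages = [{"role": "user", "content": prompt}]

# Format the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Average output length on the benchmarks above is about 701 tokens,
# so raise max_new_tokens if answers are cut off.
outputs = model.generate(
    **inputs,
    max_new_tokens=512
)

# Note: the decoded string includes the prompt as well as the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```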
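Because `generate` returns the prompt ids followed by the newly generated ids for decoder-only models, measuring output length means slicing the prompt off first. The sketch below shows the slicing on plain Python lists so it runs without downloading the model; the helper `new_token_ids` and the toy ids are hypothetical stand-ins for `outputs[0]` and `inputs["input_ids"][0]`.

```python
def new_token_ids(output_ids, prompt_len):
    """Return only the ids generated after the prompt."""
    return output_ids[prompt_len:]


# Toy ids standing in for the tokenized prompt and the full generate() output.
prompt_ids = [101, 7, 8, 9]
full_output = prompt_ids + [42, 43, 44, 2]

generated = new_token_ids(full_output, len(prompt_ids))
print(len(generated))  # number of newly generated tokens
```

With the real objects from the usage example, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, which can then be passed to `tokenizer.decode` to print the answer without the prompt.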