mlp-surgery — broken baseline (Qwen2.5-3B)

This is the "broken" baseline used as input for the mlp-surgery project. Do not use this model for downstream tasks: it underperforms the base model on both math (GSM8K) and general reasoning (ARC Challenge). It is published only so the experiment is reproducible.

What it is

Qwen2.5-3B-Instruct with a LoRA fine-tune on the perplexity-filtered top 30% of OpenHermes 2.5 (from the sister project Perplexity-weighted-selective-finetuning). The LoRA adapter is merged into the base weights, so the published checkpoint is a plain dense model.
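
For readers unfamiliar with what "merged into the base weights" means, here is a minimal sketch of the standard LoRA merge, W' = W + (alpha / r) * B A, on a toy matrix. The shapes, values, and helper names (matmul, merge_lora) are made up for illustration. The LoRA rank, alpha, and target modules actually used for this model are not stated here.

```python
# Toy illustration of folding a LoRA update into a dense weight matrix.
# All shapes and values are invented for the example; none come from the model.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# d_out = 2, d_in = 3, rank = 1
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # rank x d_in
B = [[0.5], [-0.5]]     # d_out x rank
ALPHA, RANK = 2.0, 1

W_merged = merge_lora(W, A, B, alpha=ALPHA, rank=RANK)

# Sanity check: the merged matrix gives the same output as base + adapter branch.
x = [[1.0], [1.0], [1.0]]  # d_in x 1 input column
y_merged = matmul(W_merged, x)
y_adapter = [[base[0] + (ALPHA / RANK) * upd[0]]
             for base, upd in zip(matmul(W, x), matmul(B, matmul(A, x)))]
print(W_merged, y_merged, y_adapter)
```

With the Hugging Face peft library, the same operation is typically done per adapted layer by calling merge_and_unload() on a loaded PeftModel. Whether that exact path was used for this checkpoint is an assumption.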

Eval

Evaluated with lm-eval: GSM8K (flexible-extract, 5-shot) and ARC Challenge (acc_norm, 0-shot). No chat template, batch_size 8, single seed (2026-05-07).
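
The settings above could be reproduced with something like the following sketch, which only builds and prints the two lm_eval command lines (one per few-shot setting) without running them. The exact invocation used for these numbers is not recorded here, so treat this as an assumption: MODEL is a placeholder path, and the task names gsm8k and arc_challenge are the harness's standard names for these benchmarks.

```python
import shlex

# Placeholder, not from the source: local path or Hub id of the model under test.
MODEL = "path/to/model"

def lm_eval_cmd(task, num_fewshot, model=MODEL, batch_size=8):
    """Build an lm_eval command line.

    --apply_chat_template is deliberately left out to match "no chat template".
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]

runs = [
    lm_eval_cmd("gsm8k", 5),          # report the flexible-extract exact_match score
    lm_eval_cmd("arc_challenge", 0),  # report acc_norm
]

for cmd in runs:
    print(shlex.join(cmd))
```

The two benchmarks are kept as separate runs because they use different few-shot counts.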

Model	GSM8K	ARC Challenge
Base (Qwen2.5-3B-Instruct)	63.15%	48.12%
After SFT (broken)	61.64%	45.22%
Restore top 5	63.00%	45.73%
Restore top 15	63.46%	46.50%
Restore top 30	64.29%	48.55%
Restore specificity top 10	61.64%	45.22%
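
To make the table easier to read, the short sketch below recomputes each row as a difference from the base model, in percentage points. The scores are copied from the table; the variable names are illustrative only.

```python
# Scores copied from the table above, in percent: (GSM8K, ARC Challenge).
scores = {
    "Base (Qwen2.5-3B-Instruct)": (63.15, 48.12),
    "After SFT (broken)": (61.64, 45.22),
    "Restore top 5": (63.00, 45.73),
    "Restore top 15": (63.46, 46.50),
    "Restore top 30": (64.29, 48.55),
    "Restore specificity top 10": (61.64, 45.22),
}

BASE = "Base (Qwen2.5-3B-Instruct)"
base_gsm8k, base_arc = scores[BASE]

# Difference from base in percentage points, rounded to two decimals.
deltas = {
    name: (round(gsm8k - base_gsm8k, 2), round(arc - base_arc, 2))
    for name, (gsm8k, arc) in scores.items()
    if name != BASE
}

for name, (d_gsm8k, d_arc) in deltas.items():
    print(f"{name:<28} GSM8K {d_gsm8k:+.2f}  ARC {d_arc:+.2f}")
```

Read this way, the SFT model is 1.51 points below base on GSM8K and 2.90 points below on ARC Challenge, and "Restore top 30" is the only row at or above base on both benchmarks. Since these are single-seed numbers, small differences should be read with caution.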