Qwen2.5-3B-Instruct, fine-tuned on perplexity-filtered OpenHermes 2.5 (which damaged its reasoning), then partially restored by copying the 5 most-damaged MLP layers back from the base model. No retraining, just weight surgery.
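The copy-back step can be sketched as a plain state-dict operation. This is a minimal illustration, not the exact script used for this model: the helper name `restore_mlp_layers` is hypothetical, the key pattern assumes the usual Hugging Face Qwen2 parameter naming (`model.layers.<i>.mlp.*`), and the layer indices are passed in by the caller because the damage metric used for ranking is not described here.

```python
import re

# Assumed parameter naming: "model.layers.<index>.mlp.<proj>.weight"
MLP_KEY = re.compile(r"^model\.layers\.(\d+)\.mlp\.")


def restore_mlp_layers(tuned_state, base_state, layer_ids):
    """Return a new mapping equal to tuned_state, except that every MLP
    parameter belonging to a layer in layer_ids is taken from base_state.

    Both inputs are name -> tensor mappings (for example, state dicts).
    Neither input is modified.
    """
    wanted = set(layer_ids)
    restored = dict(tuned_state)  # shallow copy; values are shared, not cloned
    for name in tuned_state:
        match = MLP_KEY.match(name)
        if match and int(match.group(1)) in wanted:
            restored[name] = base_state[name]
    return restored
```

With `transformers`, the two mappings would typically be the `state_dict()` of the fine-tuned and base checkpoints, and the result would be loaded back into the fine-tuned model with `load_state_dict`. Attention weights, norms, and embeddings are left as fine-tuned.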
Evaluated with lm-eval: GSM8K (flexible-extract, 5-shot) and ARC Challenge (acc_norm, 0-shot), no chat template, batch_size 8, single seed (2026-05-07).
| Model | GSM8K | ARC Challenge |
|---|---|---|
| Base (Qwen2.5-3B-Instruct) | 63.15% | 48.12% |
| After SFT (broken) | 61.64% | 45.22% |
| Restore top 5 | 63.00% | 45.73% |
| Restore top 15 | 63.46% | 46.50% |
| Restore top 30 | 64.29% | 48.55% |
| Restore specificity top 10 | 61.64% | 45.22% |
This model is the "Restore top 5" row.
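One way to read the table is as the fraction of the SFT-induced drop that each restore recovers. The sketch below does that arithmetic on the scores copied from the table; the function name `recovered_fraction` and the metric itself are illustrative choices, not part of the original evaluation.

```python
# Scores in percent, copied from the results table.
SCORES = {
    "Base": {"GSM8K": 63.15, "ARC Challenge": 48.12},
    "After SFT": {"GSM8K": 61.64, "ARC Challenge": 45.22},
    "Restore top 5": {"GSM8K": 63.00, "ARC Challenge": 45.73},
    "Restore top 15": {"GSM8K": 63.46, "ARC Challenge": 46.50},
    "Restore top 30": {"GSM8K": 64.29, "ARC Challenge": 48.55},
    "Restore specificity top 10": {"GSM8K": 61.64, "ARC Challenge": 45.22},
}


def recovered_fraction(base, damaged, restored):
    """Share of the (base - damaged) drop that the restored model wins back.

    0.0 means no recovery, 1.0 means back to base, above 1.0 means the
    restored model scores higher than base.
    """
    drop = base - damaged
    if drop == 0:
        raise ValueError("no drop to recover from")
    return (restored - damaged) / drop


if __name__ == "__main__":
    base, damaged = SCORES["Base"], SCORES["After SFT"]
    for row, scores in SCORES.items():
        if row in ("Base", "After SFT"):
            continue
        parts = [
            f"{task}: {recovered_fraction(base[task], damaged[task], scores[task]):.0%}"
            for task in scores
        ]
        print(f"{row:28s} " + ", ".join(parts))
```

For the "Restore top 5" row this works out to roughly 90% of the GSM8K drop and about 18% of the ARC Challenge drop, so the recovery is much stronger on GSM8K than on ARC Challenge.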
These results come from a single seed, and the differences between rows are on the order of 1 point. Because the evals were run without a chat template, absolute numbers are below what you'd see with the chat template applied (78% GSM8K), but relative comparisons within this setup are meaningful.