Fine-tuned mistralai/Mistral-7B-v0.1 on synthetic grade-school science QA via SFT → SimPER to maximize ARC Challenge acc_norm.
| Model | acc | acc_norm |
|---|---|---|
| Mistral-7B-v0.3 | 0.5734 | 0.6024 |
| Ministral-3-8B-Base | 0.5188 | 0.5580 |
| Qwen3.5-9B-Base | 0.6578 | 0.7065 |
| Qwen3.6-35B-A3B-FP8 (teacher) | 0.6724 | 0.7253 |
| Mistral-7B-v0.1 (base) | 0.5691 | 0.6126 |
| + SFT | 0.6015 | 0.6459 |
| + SFT + SimPER (this model) | 0.6903 | 0.7244 |
Evaluated on ARC Challenge, 25-shot. This 7B model comes within 0.001 of the 35B teacher model on acc_norm (0.7244 vs. 0.7253) and surpasses it on acc (0.6903 vs. 0.6724).
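The two metric columns differ only in how the predicted answer choice is selected. The following is a minimal sketch, assuming the common evaluation-harness convention (not stated in this card) that acc picks the choice with the highest total log-likelihood while acc_norm first divides each log-likelihood by the length of the choice string. The function name `pick_choices` and the log-likelihood values are hypothetical.

```python
def pick_choices(choices, loglikelihoods):
    """Return (acc_pick, acc_norm_pick) as indices into `choices`.

    Assumption: acc uses the raw summed log-likelihood of each choice,
    acc_norm divides it by the character length of the choice string.
    """
    raw = list(loglikelihoods)
    normed = [ll / len(c) for ll, c in zip(raw, choices)]
    acc_pick = max(range(len(choices)), key=lambda i: raw[i])
    acc_norm_pick = max(range(len(choices)), key=lambda i: normed[i])
    return acc_pick, acc_norm_pick


# Hypothetical scores: the longer answer loses on raw log-likelihood
# but wins once the score is normalized by length.
choices = ["the tilt of Earth's axis", "distance"]
loglikelihoods = [-12.0, -8.0]
print(pick_choices(choices, loglikelihoods))  # prints (1, 0)
```

Length normalization matters for ARC because correct answers are often longer than distractors, which is one plausible reason acc_norm is higher than acc for every model in the table.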
Tushe grade-school STEM textbooks
→ science filtering (math removed)
→ chunking + dedup (Mistral tokenizer, min 400 tokens)
→ active QA generation (Qwen3.6-35B-A3B-FP8 teacher)
· 5 sampled QA strategies per chunk
· 5 sampled wrong-answer types per chunk
→ 55,022 synthetic triplets (question, answer_correct, answer_wrong)
→ Stage 1: SFT on correct answers (55,022 examples, 3 epochs)
→ Stage 2: SimPER on preference pairs (30,000 pairs, 3 epochs)
→ QLoRA merge → FP16
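The chunking and dedup step of the pipeline above could look roughly like the sketch below. It rests on several assumptions that the card does not confirm: that "min 400 tokens" means each chunk must reach at least 400 tokens, that dedup is exact-match on normalized text, and that short trailing remainders are dropped. `count_tokens` is a whitespace stand-in for the Mistral tokenizer, and both function names are hypothetical.

```python
import hashlib


def count_tokens(text):
    # Stand-in token counter (whitespace split); the pipeline above
    # counts tokens with the Mistral tokenizer instead.
    return len(text.split())


def chunk_and_dedup(paragraphs, min_tokens=400):
    """Greedily pack paragraphs into chunks of at least `min_tokens`,
    then drop chunks whose normalized text was already seen."""
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        current.append(para)
        current_len += count_tokens(para)
        if current_len >= min_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
    # A trailing remainder shorter than min_tokens is discarded.

    seen, unique = set(), []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

For example, four identical 250-word paragraphs yield two identical 500-token chunks, of which only one survives dedup.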
SimPER is a reference-free preference optimization method with no beta/gamma hyperparameters. It is memory-efficient under QLoRA and trains stably without a separate reference model.
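To make "no beta/gamma hyperparameters" concrete, here is a minimal sketch of a per-pair SimPER-style loss in plain Python. It follows my reading of the SimPER objective (maximize the inverse perplexity of the chosen answer, minimize that of the rejected one) and is not taken from this model's training code. The function names and the log-probability values are hypothetical; a real implementation would operate on batched tensors.

```python
import math


def inverse_perplexity(token_logprobs):
    # exp(mean token log-prob) = 1 / perplexity, a value in (0, 1].
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def simper_loss(chosen_logprobs, rejected_logprobs):
    """Per-pair SimPER-style loss: raise the inverse perplexity of the
    chosen answer and lower that of the rejected one. There is no beta,
    no gamma, and no reference-model log-probability."""
    return (-inverse_perplexity(chosen_logprobs)
            + inverse_perplexity(rejected_logprobs))


# Hypothetical per-token log-probs under the policy being trained.
chosen = [-0.2, -0.1, -0.3]
rejected = [-1.5, -2.0, -1.0]
loss = simper_loss(chosen, rejected)
```

Because both terms are inverse perplexities, the loss is bounded between -1 and 1, and only the policy's own log-probabilities are needed, so no second model has to be held in memory.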
Question: {question}
Answer:
This prompt format matches the ARC Challenge evaluation template exactly.
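As an illustration of how one synthetic triplet could feed both training stages with this template, consider the sketch below. The input field names (`question`, `answer_correct`, `answer_wrong`) come from the triplet description in this card; the output keys (`text`, `prompt`, `chosen`, `rejected`), the leading space before each answer, and the function names are assumptions for illustration.

```python
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"


def to_sft_example(triplet):
    # Stage 1 (SFT): the prompt followed by the correct answer only.
    prompt = PROMPT_TEMPLATE.format(question=triplet["question"])
    return {"text": prompt + " " + triplet["answer_correct"]}


def to_preference_pair(triplet):
    # Stage 2 (SimPER): same prompt, correct answer as "chosen",
    # wrong answer as "rejected".
    prompt = PROMPT_TEMPLATE.format(question=triplet["question"])
    return {
        "prompt": prompt,
        "chosen": " " + triplet["answer_correct"],
        "rejected": " " + triplet["answer_wrong"],
    }


# Hypothetical triplet, not taken from the dataset.
triplet = {
    "question": "What causes the seasons on Earth?",
    "answer_correct": "The tilt of Earth's axis.",
    "answer_wrong": "Earth's distance from the Sun.",
}
sft_example = to_sft_example(triplet)
pair = to_preference_pair(triplet)
```

Keeping the training prompt identical to the evaluation prompt means the model sees the same surface form at training time and at test time.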
| Item | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.1 |
| Data source | AlaminI/tushe-grade-school-stem |
| Teacher model | Qwen/Qwen3.6-35B-A3B-FP8 |
| Science chunks | 21,310 |
| Synthetic triplets | 55,022 |
| SFT examples | 55,022 |
| SimPER preference pairs | 30,000 |
| QLoRA rank | 64 |
| QLoRA alpha | 128 |
| QLoRA target | all-linear |
| Quantization | 4-bit nf4, double quantization |
| Compute dtype | bf16 |
| Optimizer | paged_adamw_8bit |
| Learning rate | 1e-5 |
| Scheduler | cosine |
| Warmup ratio | 0.05 |
| SFT epochs | 3 |
| SimPER epochs | 3 |
| Hardware | RTX A6000 48GB × 4 |
| Final weights | FP16 merged |
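The quantization and adapter rows of the table map fairly directly onto `transformers` and `peft` configuration objects. The sketch below is one plausible translation, not the author's actual training script; `task_type` and the use of a single set of optimizer settings for both stages are assumptions, since the table lists only one learning rate and schedule.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on every linear layer, rank 64, alpha 128.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",
    task_type="CAUSAL_LM",  # assumption: not listed in the table
)

# Optimizer and schedule settings from the table, written as keyword
# arguments that could be passed to transformers.TrainingArguments.
training_kwargs = dict(
    optim="paged_adamw_8bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=3,
    bf16=True,
)
```

With rank 64 and alpha 128, the LoRA update is scaled by alpha / rank = 2.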
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HwanChang0106/Mistral-7B-v0.1-ARC-SFT-SimPER"

# Load the merged FP16 weights and place them on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Use the same "Question: ...\nAnswer:" format the model was trained on.
prompt = "Question: What causes the seasons on Earth?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a short answer and decode it (the output includes the prompt).
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
| Topic | Reference |
|---|---|
| Task-specific active reading | arXiv:2508.09494 |
| SimPER | OpenReview |
| QA strategy sources | NAEP/NCES, TIMSS & PIRLS, NGSS, NSTA |
| Wrong-answer type sources | National Academies/NRC, TIMSS, NAEP, PISA/OECD, NGSS, UC Berkeley Understanding Science, AAAS Project 2061 |