Mistral-7B-v0.1-ARC-SFT-SimPER

Fine-tuned mistralai/Mistral-7B-v0.1 on synthetic grade-school science QA via SFT → SimPER to maximize ARC Challenge acc_norm.

Benchmark Results

Model	acc	acc_norm
Mistral-7B-v0.3	0.5734	0.6024
Ministral-3-8B-Base	0.5188	0.5580
Qwen3.5-9B-Base	0.6578	0.7065
Qwen3.6-35B-A3B-FP8 (teacher)	0.6724	0.7253
Mistral-7B-v0.1 (base)	0.5691	0.6126
+ SFT	0.6015	0.6459
+ SFT + SimPER (this model)	0.6903	0.7244

Evaluated on ARC Challenge, 25-shot. This 7B model comes within 0.001 of the 35B teacher model on acc_norm (0.7244 vs. 0.7253) and surpasses it on acc (0.6903 vs. 0.6724).
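The two metric columns can rank models differently because they pick the predicted answer in different ways. The sketch below assumes the numbers follow the lm-evaluation-harness convention for multiple-choice tasks, where acc takes the choice with the highest summed log-likelihood and acc_norm first divides each log-likelihood by the byte length of the choice text. The card does not name the evaluation tool, and the function name and toy scores are hypothetical.

```python
def pick_choices(choices, loglikelihoods):
    """Return (acc_pick, acc_norm_pick) as indices into `choices`."""
    indices = range(len(choices))
    # acc: the choice with the highest raw log-likelihood wins.
    acc_pick = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: divide by UTF-8 byte length so longer answers are not
    # penalized simply for containing more tokens (assumed convention).
    normalized = [
        ll / len(choice.encode("utf-8"))
        for ll, choice in zip(loglikelihoods, choices)
    ]
    acc_norm_pick = max(indices, key=lambda i: normalized[i])
    return acc_pick, acc_norm_pick


# Toy example: made-up log-likelihoods, not real model outputs.
choices = ["the tilt of Earth's axis", "distance", "the Moon", "wind"]
loglikelihoods = [-9.0, -6.0, -7.5, -8.0]
picks = pick_choices(choices, loglikelihoods)
print(picks)
```

In this toy case the short distractor wins on raw log-likelihood, while the longer correct answer wins once scores are length-normalized, which is why the two columns are reported separately.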

Method

Tushe grade-school STEM textbooks
  → science filtering (math removed)
  → chunking + dedup (Mistral tokenizer, min 400 tokens)
  → active QA generation (Qwen3.6-35B-A3B-FP8 teacher)
      · 5 sampled QA strategies per chunk
      · 5 sampled wrong-answer types per chunk
  → 55,022 synthetic triplets (question, answer_correct, answer_wrong)
  → Stage 1: SFT on correct answers (55,022 examples, 3 epochs)
  → Stage 2: SimPER on preference pairs (30,000 pairs, 3 epochs)
  → QLoRA merge → FP16
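The step from triplets to the two training sets can be sketched as follows. This is a minimal illustration, not the training code: the field names come from the triplet description above, while the output keys (`prompt`, `completion`, `chosen`, `rejected`) and the leading space before each answer are assumptions. How the 30,000 pairs were selected from the 55,022 triplets is not stated, so no sampling is shown.

```python
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"


def to_sft_example(triplet):
    """Stage 1: train only on the correct answer."""
    prompt = PROMPT_TEMPLATE.format(question=triplet["question"])
    # Leading space before the answer is an assumption.
    return {"prompt": prompt, "completion": " " + triplet["answer_correct"]}


def to_preference_pair(triplet):
    """Stage 2: the correct answer is preferred over the synthetic distractor."""
    prompt = PROMPT_TEMPLATE.format(question=triplet["question"])
    return {
        "prompt": prompt,
        "chosen": " " + triplet["answer_correct"],
        "rejected": " " + triplet["answer_wrong"],
    }


# Hypothetical triplet, written by hand for illustration.
triplet = {
    "question": "What causes the seasons on Earth?",
    "answer_correct": "The tilt of Earth's axis.",
    "answer_wrong": "Earth's changing distance from the Sun.",
}
sft_example = to_sft_example(triplet)
pair = to_preference_pair(triplet)
print(sft_example)
print(pair)
```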

Why SimPER?

SimPER is a reference-free preference optimization method with no beta or gamma hyperparameters to tune. Because it needs no separate reference model, it is memory-efficient under QLoRA, and it trains stably.
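As a rough illustration of why no reference model or extra hyperparameters are needed, the sketch below assumes the objective described in the SimPER paper: the loss is the negative inverse perplexity of the chosen response plus the inverse perplexity of the rejected one, where inverse perplexity is the exponential of the mean token log-probability. The function names are hypothetical, and the pure-Python form stands in for a batched tensor implementation.

```python
import math


def inverse_perplexity(token_logprobs):
    """exp(mean token log-prob); lies in (0, 1] for a valid distribution."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def simper_loss(chosen_logprobs, rejected_logprobs):
    """Assumed SimPER objective: raise chosen, lower rejected inverse perplexity.

    Only the policy's own log-probs appear; there is no reference model
    and no beta/gamma term.
    """
    return -inverse_perplexity(chosen_logprobs) + inverse_perplexity(rejected_logprobs)


# Toy per-token log-probs, not real model outputs.
chosen = [math.log(0.5)] * 3     # inverse perplexity 0.5
rejected = [math.log(0.25)] * 4  # inverse perplexity 0.25
loss = simper_loss(chosen, rejected)
print(round(loss, 6))
```

Because each term is length-normalized and bounded between 0 and 1, the loss stays in a fixed range regardless of answer length.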

Prompt Format

Question: {question}
Answer:

This matches the ARC Challenge evaluation template exactly.
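To make the template concrete, here is a small sketch of how one ARC-style item could be turned into scoring requests, one per answer choice. It assumes the evaluation appends each choice to the prompt after a single space, as the lm-evaluation-harness ARC task does; the helper name `build_requests` and the sample item are hypothetical.

```python
def build_requests(question, choices):
    """Return (context, continuation) pairs, one per answer choice."""
    context = f"Question: {question}\nAnswer:"
    # Each choice is scored as a continuation of the same context;
    # the single-space delimiter is an assumption.
    return [(context, " " + choice) for choice in choices]


requests = build_requests(
    "What causes the seasons on Earth?",
    ["The tilt of Earth's axis", "The distance from the Sun"],
)
for context, continuation in requests:
    print(repr(context + continuation))
```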

Training Details

Item	Value
Base model	`mistralai/Mistral-7B-v0.1`
Data source	`AlaminI/tushe-grade-school-stem`
Teacher model	`Qwen/Qwen3.6-35B-A3B-FP8`
Science chunks	21,310
Synthetic triplets	55,022
SFT examples	55,022
SimPER preference pairs	30,000
QLoRA rank	64
QLoRA alpha	128
QLoRA target	all-linear
Quantization	4-bit nf4, double quantization
Compute dtype	bf16
Optimizer	paged_adamw_8bit
Learning rate	1e-5
Scheduler	cosine
Warmup ratio	0.05
SFT epochs	3
SimPER epochs	3
Hardware	RTX A6000 48GB × 4
Final weights	FP16 merged
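The table can be expressed as a configuration sketch using `transformers` and `peft`. This is an assumption about how the values map onto those libraries, not the actual training script; settings the table does not list (batch size, LoRA dropout, sequence length) are left out rather than guessed.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit nf4 quantization with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on every linear layer, rank 64, alpha 128.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Keyword arguments in the naming used by transformers.TrainingArguments;
# the same values apply to both the SFT and SimPER stages per the table.
trainer_kwargs = {
    "optim": "paged_adamw_8bit",
    "learning_rate": 1e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "num_train_epochs": 3,
}
```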

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "HwanChang0106/Mistral-7B-v0.1-ARC-SFT-SimPER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The merged weights are stored in FP16, so load them in that dtype.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Use the same "Question: ...\nAnswer:" format the model was trained on.
prompt = "Question: What causes the seasons on Earth?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

Synthetic QA is generated from single chunks, so multi-hop reasoning across distant paragraphs may be weaker.
Wrong answers are synthetic distractors and may not perfectly match the distribution of real ARC answer choices.
Optimized specifically for ARC Challenge; may not generalize to other benchmarks without further tuning.

References

Topic	Reference
Task-specific active reading	arXiv:2508.09494
SimPER	OpenReview
QA strategy sources	NAEP/NCES, TIMSS & PIRLS, NGSS, NSTA
Wrong-answer type sources	National Academies/NRC, TIMSS, NAEP, PISA/OECD, NGSS, UC Berkeley Understanding Science, AAAS Project 2061

Paper: Learning Facts at Scale with Active Reading (arXiv:2508.09494, published Aug 13, 2025)