Mistral-7B-v0.1-ARC-SFT-SimPER

mistralai/Mistral-7B-v0.1 fine-tuned on synthetic grade-school science QA in two stages, SFT followed by SimPER, to maximize acc_norm on ARC Challenge.

Benchmark Results

Model                            acc      acc_norm
Mistral-7B-v0.3                  0.5734   0.6024
Ministral-3-8B-Base              0.5188   0.5580
Qwen3.5-9B-Base                  0.6578   0.7065
Qwen3.6-35B-A3B-FP8 (teacher)    0.6724   0.7253
Mistral-7B-v0.1 (base)           0.5691   0.6126
+ SFT                            0.6015   0.6459
+ SFT + SimPER (this model)      0.6903   0.7244

Evaluated on ARC Challenge, 25-shot. This 7B model comes within 0.001 of the 35B teacher on acc_norm (0.7244 vs. 0.7253) and surpasses it on acc (0.6903 vs. 0.6724).
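To make the difference between the two reported metrics concrete, here is a minimal sketch of how acc and acc_norm can pick different answers for the same question. It assumes the common convention of dividing each choice's summed log-likelihood by its character length for acc_norm; the evaluation tool is not named above, and the choices and log-likelihood values are hypothetical.

```python
def pick_answers(loglikelihoods, choices):
    """Return (acc_pick, acc_norm_pick) as indices into `choices`.

    acc      : highest raw summed log-likelihood wins.
    acc_norm : highest log-likelihood per character wins
               (assumed normalization; hypothetical helper).
    """
    indices = range(len(choices))
    acc_pick = max(indices, key=lambda i: loglikelihoods[i])
    acc_norm_pick = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return acc_pick, acc_norm_pick


# Hypothetical scores: the short choice has the higher raw score,
# but the long choice is more likely per character.
choices = ["gravity", "the tilt of Earth's axis relative to its orbit"]
loglikelihoods = [-6.0, -20.0]

acc_pick, acc_norm_pick = pick_answers(loglikelihoods, choices)
print(choices[acc_pick])
print(choices[acc_norm_pick])
```

Length normalization removes the bias toward short answer choices, which is one reason the two columns can differ by several points.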

Method

Tushe grade-school STEM textbooks
  → science filtering (math removed)
  → chunking + dedup (Mistral tokenizer, min 400 tokens)
  → active QA generation (Qwen3.6-35B-A3B-FP8 teacher)
      · 5 sampled QA strategies per chunk
      · 5 sampled wrong-answer types per chunk
  → 55,022 synthetic triplets (question, answer_correct, answer_wrong)
  → Stage 1: SFT on correct answers (55,022 examples, 3 epochs)
  → Stage 2: SimPER on preference pairs (30,000 pairs, 3 epochs)
  → QLoRA merge → FP16
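The step from triplets to the two training sets can be sketched as follows. This is an illustration, not the author's pipeline: the field names come from the triplet definition above, but the leading space before each answer and the random subsampling of preference pairs are assumptions (the card does not say how the 30,000 pairs were selected), and `build_datasets` is a hypothetical helper.

```python
import random

PROMPT = "Question: {question}\nAnswer:"


def to_sft_example(triplet):
    """Stage 1: prompt plus the correct answer as the completion."""
    prompt = PROMPT.format(question=triplet["question"])
    return {"prompt": prompt, "completion": " " + triplet["answer_correct"]}


def to_preference_pair(triplet):
    """Stage 2: the correct answer is chosen, the distractor is rejected."""
    prompt = PROMPT.format(question=triplet["question"])
    return {
        "prompt": prompt,
        "chosen": " " + triplet["answer_correct"],
        "rejected": " " + triplet["answer_wrong"],
    }


def build_datasets(triplets, n_pairs, seed=0):
    """Return (sft_examples, preference_pairs).

    Assumption: preference pairs are a seeded random subsample of the triplets.
    """
    sft_examples = [to_sft_example(t) for t in triplets]
    rng = random.Random(seed)
    subset = rng.sample(triplets, min(n_pairs, len(triplets)))
    preference_pairs = [to_preference_pair(t) for t in subset]
    return sft_examples, preference_pairs


# Hypothetical triplets for illustration only.
triplets = [
    {"question": "What gas do plants absorb?", "answer_correct": "carbon dioxide", "answer_wrong": "oxygen"},
    {"question": "What is ice made of?", "answer_correct": "frozen water", "answer_wrong": "frozen air"},
    {"question": "What pulls objects toward Earth?", "answer_correct": "gravity", "answer_wrong": "magnetism"},
]
sft_examples, preference_pairs = build_datasets(triplets, n_pairs=2)
print(len(sft_examples), len(preference_pairs))
```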

Why SimPER?

SimPER is a reference-free preference optimization method with no beta/gamma hyperparameters to tune. Because it needs no separate reference model, it is memory-efficient under QLoRA, and it trains stably.
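As a rough illustration of why no beta/gamma is needed, the sketch below computes a SimPER-style loss for a single preference pair in plain Python. It reflects my reading of the SimPER objective (raise the inverse perplexity of the chosen answer, lower that of the rejected one); the function names and the per-token log-probabilities are hypothetical, and real training would compute this on batched tensors.

```python
import math


def inverse_perplexity(token_logprobs):
    """exp(mean token log-prob); lies in (0, 1] and is length-normalized."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def simper_loss(chosen_token_logprobs, rejected_token_logprobs):
    """SimPER-style loss for one pair (a sketch, not a reference implementation).

    Minimizing it pushes the chosen answer's inverse perplexity up and the
    rejected answer's down. No reference model or scaling hyperparameter appears.
    """
    return inverse_perplexity(rejected_token_logprobs) - inverse_perplexity(chosen_token_logprobs)


# Hypothetical per-token log-probs under the policy model.
chosen = [-0.1, -0.2]
rejected = [-2.0, -3.0]
print(simper_loss(chosen, rejected))
```

Because both terms are bounded between 0 and 1, the loss stays in (-1, 1), which gives one intuition for the stable training noted above.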

Prompt Format

Question: {question}
Answer:

This matches the ARC Challenge evaluation template exactly.
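For reference, a few-shot evaluation context built from this template might look like the sketch below. The template itself is from this card; the blank-line separator between shots and the single space before each answer are assumptions about the evaluation tool, and `format_shot` / `build_context` are hypothetical helpers.

```python
def format_shot(question, answer=None):
    """Render one example; the final query is left open after 'Answer:'."""
    text = f"Question: {question}\nAnswer:"
    if answer is not None:
        text += f" {answer}"
    return text


def build_context(shots, question, separator="\n\n"):
    """Join solved shots and the open query into one prompt string.

    `shots` is a list of (question, answer) pairs; 25 of them would be
    used for the 25-shot setting reported above.
    """
    parts = [format_shot(q, a) for q, a in shots]
    parts.append(format_shot(question))
    return separator.join(parts)


# Two hypothetical shots instead of 25, to keep the output short.
shots = [
    ("What gas do plants absorb?", "carbon dioxide"),
    ("What pulls objects toward Earth?", "gravity"),
]
print(build_context(shots, "What causes the seasons on Earth?"))
```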

Training Details

Item                       Value
Base model                 mistralai/Mistral-7B-v0.1
Data source                AlaminI/tushe-grade-school-stem
Teacher model              Qwen/Qwen3.6-35B-A3B-FP8
Science chunks             21,310
Synthetic triplets         55,022
SFT examples               55,022
SimPER preference pairs    30,000
QLoRA rank                 64
QLoRA alpha                128
QLoRA target               all-linear
Quantization               4-bit nf4, double quantization
Compute dtype              bf16
Optimizer                  paged_adamw_8bit
Learning rate              1e-5
Scheduler                  cosine
Warmup ratio               0.05
SFT epochs                 3
SimPER epochs              3
Hardware                   RTX A6000 48GB × 4
Final weights              FP16 merged
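The settings in this table map fairly directly onto keyword arguments of the usual QLoRA stack. The sketch below keeps them as plain dictionaries so it runs without any ML libraries; the intent is that they could be unpacked into transformers' `BitsAndBytesConfig`, peft's `LoraConfig`, and transformers' `TrainingArguments`. The actual training script is not shown in this card, and `task_type` and `bf16=True` are assumptions not stated in the table.

```python
# 4-bit quantization settings (intended for BitsAndBytesConfig(**bnb_kwargs)).
bnb_kwargs = dict(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter settings (intended for LoraConfig(**lora_kwargs)).
lora_kwargs = dict(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",
    task_type="CAUSAL_LM",  # assumption: not listed in the table
)

# Optimizer and schedule settings (intended for TrainingArguments).
trainer_kwargs = dict(
    optim="paged_adamw_8bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=3,
    bf16=True,  # assumption: mixed-precision flag matching the bf16 compute dtype
)

# With standard LoRA scaling, the adapter update is multiplied by alpha / rank.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scale)
```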

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "HwanChang0106/Mistral-7B-v0.1-ARC-SFT-SimPER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the merged FP16 weights and let accelerate place them on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Use the same "Question: ... / Answer:" template the model was trained on.
prompt = "Question: What causes the seasons on Earth?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# The decoded text includes the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

  • Synthetic QA is generated from single chunks, so multi-hop reasoning across distant paragraphs may be weaker.
  • Wrong answers are synthetic distractors and may not perfectly match the distribution of real ARC answer choices.
  • Optimized specifically for ARC Challenge; may not generalize to other benchmarks without further tuning.

References

Topic                            Reference
Task-specific active reading     arXiv:2508.09494
SimPER                           OpenReview
QA strategy sources              NAEP/NCES, TIMSS & PIRLS, NGSS, NSTA
Wrong-answer type sources        National Academies/NRC, TIMSS, NAEP, PISA/OECD, NGSS, UC Berkeley Understanding Science, AAAS Project 2061