Text Generation
Safetensors
qwen2
conversational

We recommend using our successor s1.1 with better performance

Model Summary

s1 is a reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview & exhibits test-time scaling via budget forcing.

Use

The model usage is documented here.

Evaluation

Metric s1-32B s1.1-32B o1-preview o1 DeepSeek-R1 DeepSeek-R1-Distill-Qwen-32B
# examples 1K 1K ? ? >800K 800K
AIME2024 56.7 56.7 40.0 74.4 79.8 72.6
AIME2025 I 26.7 60.0 37.5 ? 65.0 46.1
MATH500 93.0 95.4 81.4 94.8 97.3 94.3
GPQA-Diamond 59.6 63.6 75.2 77.3 71.5 62.1

Note that s1-32B and s1.1-32B use budget forcing in this table; specifically ignoring end-of-thinking and appending "Wait" up to four times.

Citation

@misc{muennighoff2025s1simpletesttimescaling,
      title={s1: Simple test-time scaling}, 
      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
      year={2025},
      eprint={2501.19393},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.19393}, 
}
Downloads last month
13,003
Safetensors
Model size
32.8B params
Tensor type
F32
Β·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no library tag.

Model tree for simplescaling/s1-32B

Finetunes
4 models
Merges
2 models
Quantizations
11 models

Dataset used to train simplescaling/s1-32B

Spaces using simplescaling/s1-32B 6