## Introduction
MetaStone-L1 is the lite reasoning model of the MetaStone series, designed to enhance performance on hard downstream tasks.
On core reasoning benchmarks covering mathematics and code, MetaStone-L1-7B achieves SOTA results among models of the same scale, and performs comparably to API models such as Claude-3.5-Sonnet-1022 and GPT-4o-0513.
This repo contains the MetaStone-L1-7B model, which is trained from DeepSeek-R1-Distill-Qwen-7B with GRPO. For full details of this model, please refer to our release blog.
## Requirements
We advise you to use the latest version of `transformers` (`transformers==4.48.3`). For the best experience, please review the Usage Guidelines.
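For example, a typical install command (assuming a standard pip-based Python environment):

```bash
pip install transformers==4.48.3
```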
## Quickstart
Here is an example of how to use our model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MetaStoneTec/MetaStone-L1-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Complete the square for the following quadratic: $-x^2+7 x-11$\n\nPlease reason step by step, and put your final answer within \\boxed{}."}
]

# Render the chat template. With add_generation_prompt=True the prompt ends
# with the assistant tag and an opened <think> block (see Usage Guidelines).
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Usage Guidelines
To achieve optimal performance, we recommend the following settings:
1. Enhance the thoughtful output:

   a. Make sure the model starts with `<think>\n` to prevent it from generating empty think content. If you use `apply_chat_template` and set `add_generation_prompt=True`, this is implemented automatically, but it may result in replies not having a `<think>` tag at the beginning, which is normal.

   b. Ensure the final input to the model is in the format `<|User|> [your prompt] <|Assistant|><think>`.

2. Use a temperature of 0.6, a top sampling probability (top_p) of 0.95, and a maximum generation length of 32k; see the sketch after this list.

3. Standardize the output format: we recommend using hints to standardize model outputs when benchmarking.

   a. Math questions: add "Please reason step by step, and put your final answer within \\boxed{}." to the prompt.

   b. Code problems: add "### Format: Read the inputs from stdin solve the problem and write the answer to stdout. Enclose your code within delimiters as follows.\n ```python\n# YOUR CODE HERE\n```\n### Answer: (use the provided format with backticks)" to the prompt.
In particular, we use `latex2sympy2` and `sympy` to assist in judging complex LaTeX formats in the Math500 evaluation script. For all datasets, we generate 64 responses per query to estimate pass@1.
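For illustration, here is a minimal sketch (not the official evaluation script) of how `latex2sympy2` and `sympy` can judge whether a predicted LaTeX answer matches a reference; `latex_equal` is a hypothetical helper introduced here for the example.

```python
from latex2sympy2 import latex2sympy
from sympy import simplify

def latex_equal(pred: str, ref: str) -> bool:
    """Return True if two LaTeX expressions simplify to the same value."""
    try:
        return simplify(latex2sympy(pred) - latex2sympy(ref)) == 0
    except Exception:
        # Fall back to plain string comparison if parsing fails.
        return pred.strip() == ref.strip()

# e.g. two equivalent forms of the completed square from the Quickstart prompt:
print(latex_equal(r"-(x-\frac{7}{2})^2+\frac{5}{4}",
                  r"\frac{5}{4}-(x-\frac{7}{2})^2"))  # True
```

Pass@1 can then be estimated as the mean correctness over the 64 sampled responses per query.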
## Citation
If you find our work helpful, feel free to cite us.
```bibtex
@misc{MetaStoneL17B,
  title = {MetaStone-L1-7B},
  url = {https://huggingface.co/MetaStoneTec/MetaStone-L1-7B},
  author = {MetaStone Team},
  month = {March},
  year = {2025}
}

@article{wang2024graph,
  title = {A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions},
  author = {Wang, Jiankang and Xu, Jianjun and Wang, Xiaorui and Wang, Yuxin and Xing, Mengting and Fang, Shancheng and Chen, Zhineng and Xie, Hongtao and Zhang, Yongdong},
  journal = {arXiv preprint arXiv:2412.08864},
  year = {2024}
}
```